# Enterprise OSINT System Documentation

## Overview

The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.

## Features

### 1. Seed Website Management
- **Admin Interface**: Manage seed websites to crawl
- **Configuration**: Set crawl depth, interval, allowed domains, user agent
- **Priority Levels**: High, Medium, Low
- **Statistics**: Track pages crawled and matches found

### 2. Keyword Management
- **Multiple Types**: Exact match, regex, phrase, domain, email, phone patterns
- **Confidence Scoring**: Each keyword has a confidence score (0-100)
- **Auto-approval**: Keywords can be set to auto-approve high-confidence matches
- **Case Sensitivity**: Configurable per keyword

### 3. Automated Crawling
- **Web Scraping**: Crawls seed websites using BeautifulSoup
- **Content Analysis**: Extracts and analyzes page content
- **Keyword Matching**: Searches for configured keywords
- **Deduplication**: Uses content hashing to avoid duplicates
- **Rate Limiting**: Configurable delays between requests

### 4. Auto-Report Generation
- **Automatic Detection**: Creates reports when keywords match
- **Confidence Scoring**: Calculates confidence based on matches
- **Moderator Review**: Sends reports to moderators for approval
- **Auto-approval**: High-confidence reports with auto-approve keywords are automatically published

### 5. Moderation Interface
- **Review Queue**: Moderators can review pending auto-generated reports
- **Approve/Reject**: One-click approval or rejection with notes
- **Statistics Dashboard**: View counts by status
- **Detailed View**: See full crawled content and matched keywords

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

New dependencies added:
- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client

### 2. Run Migrations

```bash
python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate
```

### 3. Configure Seed Websites

1. Go to Django Admin → OSINT → Seed Websites
2. Click "Add Seed Website"
3. Fill in:
   - **Name**: Friendly name
   - **URL**: Base URL to crawl
   - **Crawl Depth**: How many levels deep to crawl (0 = only this page)
   - **Crawl Interval**: Hours between crawls
   - **Priority**: High/Medium/Low
   - **Allowed Domains**: List of domains to crawl (empty = same domain only)
   - **User Agent**: Custom user agent string

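Seed websites can also be created programmatically, for example from a Django shell. The sketch below is illustrative only: the field names are assumptions inferred from the admin form above, so check `osint/models.py` for the actual schema.

```python
# Hypothetical field names -- verify against osint/models.py before using.
from osint.models import SeedWebsite

seed = SeedWebsite.objects.create(
    name="Example Scam Tracker",            # friendly name
    url="https://example.com/",             # base URL to crawl
    crawl_depth=1,                          # 0 = only the base page
    crawl_interval=24,                      # hours between crawls
    priority="high",
    allowed_domains=["example.com"],        # empty = same domain only
    user_agent="EnterpriseOSINTBot/1.0",
)
```
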
### 4. Configure Keywords

1. Go to Django Admin → OSINT → OSINT Keywords
2. Click "Add OSINT Keyword"
3. Fill in:
   - **Name**: Friendly name
   - **Keyword**: The pattern to search for
   - **Keyword Type**:
     - `exact` - Exact string match
     - `regex` - Regular expression
     - `phrase` - Phrase with word boundaries
     - `domain` - Domain pattern
     - `email` - Email pattern
     - `phone` - Phone pattern
   - **Confidence Score**: Default confidence (0-100)
   - **Auto Approve**: Auto-approve if confidence >= 80

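Keywords can likewise be scripted. Again, the field names below are assumptions based on the form fields above, not the authoritative model definition.

```python
# Hypothetical field names -- verify against osint/models.py.
from osint.models import OSINTKeyword

OSINTKeyword.objects.create(
    name="Guaranteed returns phrase",
    keyword=r"guaranteed\s+returns",
    keyword_type="regex",        # exact, regex, phrase, domain, email, or phone
    confidence_score=70,         # 0-100
    auto_approve=False,          # enable only once the pattern is proven accurate
    case_sensitive=False,
)
```
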
### 5. Run Crawling

#### Manual Crawling

```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```

#### Scheduled Crawling (Celery)

Add to your Celery beat schedule:

```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```

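The schedule assumes the tasks `osint.tasks.crawl_osint_seeds` and `osint.tasks.auto_approve_high_confidence_reports` are defined in `osint/tasks.py`. If you need a starting point, a minimal sketch (not the project's actual task code) can simply wrap the management command:

```python
# osint/tasks.py -- minimal sketch; adapt to the project's real task definitions.
from celery import shared_task
from django.core.management import call_command


@shared_task
def crawl_osint_seeds():
    """Crawl every seed website whose crawl interval has elapsed."""
    call_command("crawl_osint")
```
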
## Workflow

### 1. Crawling Process

```
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
```

1. System crawls seed website starting from base URL
2. For each page:
   - Fetches HTML content
   - Extracts text content (removes scripts/styles)
   - Calculates content hash for deduplication
   - Matches against all active keywords
   - Calculates confidence score
3. If confidence >= 30, creates `CrawledContent` record
4. If confidence >= 30, creates `AutoGeneratedReport` with status 'pending'

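The per-page steps can be illustrated with a simplified, self-contained sketch. This is not the project's crawler code; the function and the keyword dictionary shape are assumptions used purely for illustration.

```python
# Simplified illustration of the per-page pipeline: fetch, extract, hash, match.
import hashlib
import re

import requests
from bs4 import BeautifulSoup


def analyze_page(url: str, keywords: list[dict]) -> dict:
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style"]):       # strip non-content markup
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)

    # Content hash used for deduplication
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()

    matches = []
    for kw in keywords:                          # kw: {"keyword", "type", "confidence"}
        pattern = kw["keyword"] if kw["type"] == "regex" else re.escape(kw["keyword"])
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            matches.append({"keyword": kw["keyword"], "count": len(hits),
                            "confidence": kw["confidence"]})
    return {"content_hash": content_hash, "matches": matches}
```
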
### 2. Confidence Calculation

```
Base Score    = Average of matched keyword confidence scores
Match Boost   = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total         = min(base_score + match_boost + keyword_boost, 100)
```

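Expressed as Python, the formula works out to roughly the following (a direct transcription of the rules above, not necessarily the exact implementation):

```python
def calculate_confidence(matched_keywords: list[dict]) -> int:
    """matched_keywords: one dict per match with 'keyword' and 'confidence' keys."""
    if not matched_keywords:
        return 0
    base_score = sum(m["confidence"] for m in matched_keywords) / len(matched_keywords)
    match_boost = min(len(matched_keywords) * 2, 30)
    keyword_boost = min(len({m["keyword"] for m in matched_keywords}) * 5, 20)
    return int(min(base_score + match_boost + keyword_boost, 100))
```
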
### 3. Auto-Approval

Reports are auto-approved if:
- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`

Auto-approved reports are immediately published to the platform.

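As a sketch, the check amounts to the condition below. The `matched_keywords` relation is an assumption about the model layout; the real rule lives in the osint app.

```python
# Sketch of the auto-approval rule; field/relation names are assumptions.
def should_auto_approve(report) -> bool:
    return (
        report.confidence_score >= 80
        and report.matched_keywords.filter(auto_approve=True).exists()
    )
```
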
### 4. Moderator Review

1. Moderator views pending reports at `/osint/auto-reports/`
2. Can filter by status (pending, approved, published, rejected)
3. Views details including:
   - Matched keywords
   - Crawled content
   - Source URL
   - Confidence score
4. Approves or rejects with optional notes
5. Approved reports are published as `ScamReport` with `is_auto_discovered=True`

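The publish step in point 5 boils down to creating a `ScamReport` flagged as auto-discovered and linking it back to the auto-generated report. The sketch below is illustrative; every field name other than `is_auto_discovered` is an assumption.

```python
# Sketch only -- the real approve view also handles permissions, notes, and redirects.
from reports.models import ScamReport


def publish_approved_report(auto_report):
    scam_report = ScamReport.objects.create(
        title=auto_report.title,            # assumed field
        description=auto_report.summary,    # assumed field
        is_auto_discovered=True,
    )
    auto_report.scam_report = scam_report   # assumed link field
    auto_report.status = "published"
    auto_report.save()
    return scam_report
```
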
## URL Routes

- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report

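These routes map to the osint app's URL configuration. A minimal sketch is shown below; the view names are assumptions and should be matched against `osint/views.py`.

```python
# osint/urls.py -- sketch; view names are assumptions.
from django.urls import path

from . import views

app_name = "osint"

urlpatterns = [
    path("auto-reports/", views.auto_report_list, name="auto_report_list"),
    path("auto-reports/<int:pk>/", views.auto_report_detail, name="auto_report_detail"),
    path("auto-reports/<int:pk>/approve/", views.auto_report_approve, name="auto_report_approve"),
    path("auto-reports/<int:pk>/reject/", views.auto_report_reject, name="auto_report_reject"),
]
```
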
## Models

### SeedWebsite
- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior

### OSINTKeyword
- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval

### CrawledContent
- Stores crawled page content
- Links matched keywords
- Tracks confidence scores

### AutoGeneratedReport
- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status

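For orientation, the relationships between these models can be summarized in a compressed sketch. The field names and types below are assumptions; the authoritative definitions live in `osint/models.py`.

```python
# Compressed sketch of the model relationships -- not the real schema.
from django.db import models

from reports.models import ScamReport


class SeedWebsite(models.Model):
    name = models.CharField(max_length=200)
    url = models.URLField()
    crawl_depth = models.PositiveIntegerField(default=0)        # 0 = only the base page
    crawl_interval = models.PositiveIntegerField(default=24)    # hours
    priority = models.CharField(max_length=10)                  # high / medium / low


class OSINTKeyword(models.Model):
    keyword = models.CharField(max_length=255)
    keyword_type = models.CharField(max_length=10)               # exact / regex / ...
    confidence_score = models.PositiveIntegerField(default=50)   # 0-100
    auto_approve = models.BooleanField(default=False)


class CrawledContent(models.Model):
    seed_website = models.ForeignKey(SeedWebsite, on_delete=models.CASCADE)
    url = models.URLField()
    content_hash = models.CharField(max_length=64, unique=True)  # deduplication key
    matched_keywords = models.ManyToManyField(OSINTKeyword)
    confidence_score = models.PositiveIntegerField(default=0)


class AutoGeneratedReport(models.Model):
    crawled_content = models.ForeignKey(CrawledContent, on_delete=models.CASCADE)
    scam_report = models.ForeignKey(ScamReport, null=True, blank=True,
                                    on_delete=models.SET_NULL)
    status = models.CharField(max_length=12, default="pending")  # pending / approved / ...
```
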
## Best Practices

1. **Start Small**: Begin with 1-2 seed websites and a few keywords
2. **Monitor Performance**: Check crawl statistics regularly
3. **Tune Keywords**: Adjust confidence scores based on false positives
4. **Respect Rate Limits**: Use appropriate delays to avoid being blocked
5. **Review Regularly**: Check pending reports daily
6. **Update Keywords**: Add new scam patterns as they emerge
7. **Test Regex**: Validate regex patterns before activating them (see the snippet after this list)

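A regex-type keyword can be sanity-checked in a few lines before it is activated; the pattern below is just an example.

```python
# Quick sanity check for a regex keyword before saving it.
import re

pattern = r"guaranteed\s+(daily|weekly)\s+returns"   # example pattern

try:
    re.compile(pattern)
except re.error as exc:
    print(f"Invalid regex: {exc}")
else:
    sample = "They promise guaranteed daily returns!"
    print(bool(re.search(pattern, sample, re.IGNORECASE)))   # True
```
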
## Troubleshooting

### Crawling Fails
- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin

### Too Many False Positives
- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)

### Too Few Matches
- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content

### Performance Issues
- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites

## Security Considerations

1. **User Agent**: Use an identifiable user agent for transparency (see the example after this list)
2. **Rate Limiting**: Respect website terms of service
3. **Content Storage**: Crawled HTML content is stored in the database and can be large
4. **API Keys**: Store OSINT service API keys securely (encrypted)
5. **Access Control**: Only moderators can review reports

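An identifiable user agent simply names the bot and gives site owners a way to reach you. The value below is a placeholder, not a project default.

```python
# Illustrative only: the bot name and contact URL are placeholders.
import requests

headers = {"User-Agent": "EnterpriseOSINTBot/1.0 (+https://example.org/osint-contact)"}
response = requests.get("https://example.com/", headers=headers, timeout=15)
```
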
## Future Enhancements

- [ ] Negative keywords to reduce false positives
- [ ] Machine learning for better pattern recognition
- [ ] Image analysis for scam detection
- [ ] Social media monitoring
- [ ] Email/phone validation services
- [ ] Automated report categorization
- [ ] Export/import keyword sets
- [ ] Crawl scheduling per seed website
- [ ] Content change detection
- [ ] Multi-language support