# Enterprise OSINT System Documentation

## Overview

The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.

## Features

### 1. Seed Website Management
- **Admin Interface**: Manage seed websites to crawl
- **Configuration**: Set crawl depth, interval, allowed domains, user agent
- **Priority Levels**: High, Medium, Low
- **Statistics**: Track pages crawled and matches found

### 2. Keyword Management
- **Multiple Types**: Exact match, regex, phrase, domain, email, phone patterns
- **Confidence Scoring**: Each keyword has a confidence score (0-100)
- **Auto-approval**: Keywords can be set to auto-approve high-confidence matches
- **Case Sensitivity**: Configurable per keyword

### 3. Automated Crawling
- **Web Scraping**: Crawls seed websites using BeautifulSoup
- **Content Analysis**: Extracts and analyzes page content
- **Keyword Matching**: Searches for configured keywords
- **Deduplication**: Uses content hashing to avoid duplicates
- **Rate Limiting**: Configurable delays between requests

### 4. Auto-Report Generation
- **Automatic Detection**: Creates reports when keywords match
- **Confidence Scoring**: Calculates confidence based on matches
- **Moderator Review**: Reports sent for approval
- **Auto-approval**: High-confidence reports with auto-approve keywords are automatically published

### 5. Moderation Interface
- **Review Queue**: Moderators can review pending auto-generated reports
- **Approve/Reject**: One-click approval or rejection with notes
- **Statistics Dashboard**: View counts by status
- **Detailed View**: See full crawled content and matched keywords

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

New dependencies added:
- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client

### 2. Run Migrations

```bash
python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate
```

### 3. Configure Seed Websites

1. Go to Django Admin → OSINT → Seed Websites
2. Click "Add Seed Website"
3. Fill in:
   - **Name**: Friendly name
   - **URL**: Base URL to crawl
   - **Crawl Depth**: How many levels deep to crawl (0 = only this page)
   - **Crawl Interval**: Hours between crawls
   - **Priority**: High/Medium/Low
   - **Allowed Domains**: List of domains to crawl (empty = same domain only)
   - **User Agent**: Custom user agent string

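
The **Allowed Domains** rule above (an empty list means "same domain only") can be sketched as a small URL filter. This is an illustration only: `is_url_allowed` is a hypothetical helper, not the project's actual function, and treating subdomains of an allowed domain as allowed is an assumption.

```python
from urllib.parse import urlparse


def is_url_allowed(url: str, seed_url: str, allowed_domains: list[str]) -> bool:
    """Return True if `url` may be crawled for this seed (hypothetical helper)."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        # Links without a host (mailto:, javascript:, ...) are never crawled.
        return False
    if not allowed_domains:
        # Empty list: stay on the seed website's own domain.
        seed_host = (urlparse(seed_url).hostname or "").lower()
        return host == seed_host
    # Assumption: an allowed domain also covers its subdomains.
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )
```

For example, with `allowed_domains=["example.org"]` the filter accepts `https://blog.example.org/x` but rejects `https://notexample.org/x`.
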
### 4. Configure Keywords

1. Go to Django Admin → OSINT → OSINT Keywords
2. Click "Add OSINT Keyword"
3. Fill in:
   - **Name**: Friendly name
   - **Keyword**: The pattern to search for
   - **Keyword Type**:
     - `exact` - Exact string match
     - `regex` - Regular expression
     - `phrase` - Phrase with word boundaries
     - `domain` - Domain pattern
     - `email` - Email pattern
     - `phone` - Phone pattern
   - **Confidence Score**: Default confidence (0-100)
   - **Auto Approve**: Auto-approve reports matching this keyword when confidence is >= 80

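
To make the difference between the `exact`, `phrase`, and `regex` keyword types concrete, here is a minimal sketch of how each could be compiled and counted with Python's `re` module. The function names are hypothetical, and falling back to a literal match for `domain`, `email`, and `phone` is a simplification; the real matcher may handle those types differently.

```python
import re


def build_pattern(keyword: str, keyword_type: str, case_sensitive: bool = False) -> re.Pattern:
    """Compile a keyword into a regex according to its type (illustrative only)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if keyword_type == "regex":
        # The keyword is already a regular expression.
        return re.compile(keyword, flags)
    if keyword_type == "phrase":
        # Lookarounds act as word boundaries on both sides of the phrase.
        return re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", flags)
    # "exact" (and, in this sketch, every other type): literal substring match.
    return re.compile(re.escape(keyword), flags)


def count_matches(text: str, keyword: str, keyword_type: str,
                  case_sensitive: bool = False) -> int:
    """Count non-overlapping occurrences of the keyword in the text."""
    return len(build_pattern(keyword, keyword_type, case_sensitive).findall(text))
```

With the text `"winners win winning"`, the keyword `win` matches three times as `exact` but only once as `phrase`, which is why `phrase` is usually the safer choice for short words.
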
### 5. Run Crawling

#### Manual Crawling

```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```

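
By default the command only crawls seed websites that are due, while `--force` ignores the crawl interval. The check behind that could look roughly like the sketch below. The parameter names `last_crawled_at` and `crawl_interval_hours` are assumptions for illustration and may not match the actual model fields.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_due(last_crawled_at: Optional[datetime], crawl_interval_hours: int,
           now: Optional[datetime] = None, force: bool = False) -> bool:
    """Decide whether a seed website should be crawled now (hypothetical helper)."""
    if force or last_crawled_at is None:
        # Forced runs and never-crawled seeds are always due.
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_crawled_at >= timedelta(hours=crawl_interval_hours)
```
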
#### Scheduled Crawling (Celery)

Add to your Celery beat schedule:

```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```

## Workflow

### 1. Crawling Process

```
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
```

1. System crawls seed website starting from base URL
2. For each page:
   - Fetches HTML content
   - Extracts text content (removes scripts/styles)
   - Calculates content hash for deduplication
   - Matches against all active keywords
   - Calculates confidence score
3. If confidence >= 30, creates `CrawledContent` record
4. If confidence >= 30, creates `AutoGeneratedReport` with status 'pending'

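
The deduplication step in the process above can be sketched as follows. This is a minimal illustration: the documentation does not say which hash function is used, so SHA-256 is an assumption, and `SeenContent` is a hypothetical in-memory stand-in for what would presumably be a database lookup on the stored hash.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash page text after collapsing whitespace, so trivial layout changes do not matter."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenContent:
    """In-memory record of hashes already processed (illustrative only)."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def is_new(self, text: str) -> bool:
        """Return True the first time a given content is seen, False for duplicates."""
        digest = content_hash(text)
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True
```

Two pages whose extracted text differs only in whitespace produce the same hash, so the second one is skipped.
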
### 2. Confidence Calculation

```
Base Score = Average of matched keyword confidence scores
Match Boost = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total = min(base_score + match_boost + keyword_boost, 100)
```

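
The formula above translates directly into Python. In this sketch, `calculate_confidence` is a hypothetical name, and it assumes the base score is averaged over unique matched keywords (one score per keyword) and that a page with no matches scores 0.

```python
def calculate_confidence(keyword_scores: list[int], match_count: int) -> float:
    """Combine keyword scores and match counts into a 0-100 confidence value.

    keyword_scores: confidence score of each unique matched keyword.
    match_count: total number of matches across all keywords.
    """
    if not keyword_scores:
        return 0.0  # Assumption: no matched keywords means no confidence.
    base_score = sum(keyword_scores) / len(keyword_scores)
    match_boost = min(match_count * 2, 30)
    keyword_boost = min(len(keyword_scores) * 5, 20)
    return float(min(base_score + match_boost + keyword_boost, 100))
```

For example, two keywords with scores 60 and 80 and five total matches give a base score of 70, a match boost of 10, and a keyword boost of 10, for a total of 90.
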
### 3. Auto-Approval

Reports are auto-approved if:
- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`

Auto-approved reports are immediately published to the platform.

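
Both conditions must hold at the same time. A minimal sketch of the rule, using a hypothetical `MatchedKeyword` dataclass in place of the real `OSINTKeyword` model:

```python
from dataclasses import dataclass

AUTO_APPROVE_THRESHOLD = 80  # Threshold stated in this documentation.


@dataclass
class MatchedKeyword:
    """Simplified stand-in for a matched keyword (illustrative only)."""
    name: str
    auto_approve: bool


def should_auto_approve(confidence: float, matched_keywords: list[MatchedKeyword]) -> bool:
    """True when confidence is high enough and any matched keyword allows auto-approval."""
    return confidence >= AUTO_APPROVE_THRESHOLD and any(
        keyword.auto_approve for keyword in matched_keywords
    )
```

A report with confidence 95 still goes to the moderator queue if none of its matched keywords has `auto_approve` enabled.
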
### 4. Moderator Review

1. Moderator views pending reports at `/osint/auto-reports/`
2. Filters by status (pending, approved, published, rejected)
3. Views details including:
   - Matched keywords
   - Crawled content
   - Source URL
   - Confidence score
4. Approves or rejects with optional notes
5. Approved reports are published as `ScamReport` with `is_auto_discovered=True`

## URL Routes

- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report

## Models

### SeedWebsite
- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior

### OSINTKeyword
- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval

### CrawledContent
- Stores crawled page content
- Links matched keywords
- Tracks confidence scores

### AutoGeneratedReport
- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status

## Best Practices

1. **Start Small**: Begin with 1-2 seed websites and a few keywords
2. **Monitor Performance**: Check crawl statistics regularly
3. **Tune Keywords**: Adjust confidence scores based on false positives
4. **Respect Rate Limits**: Use appropriate delays to avoid being blocked
5. **Review Regularly**: Check pending reports daily
6. **Update Keywords**: Add new scam patterns as they emerge
7. **Test Regex**: Validate regex patterns before activating

## Troubleshooting

### Crawling Fails
- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin

### Too Many False Positives
- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)

### Too Few Matches
- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content

### Performance Issues
- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites

## Security Considerations

1. **User Agent**: Use an identifiable user agent for transparency
2. **Rate Limiting**: Respect each website's terms of service
3. **Content Storage**: Crawled HTML content is stored in the database and can be large; plan storage accordingly
4. **API Keys**: Store OSINT service API keys securely (encrypted)
5. **Access Control**: Only moderators can review reports

## Future Enhancements

- [ ] Negative keywords to reduce false positives
- [ ] Machine learning for better pattern recognition
- [ ] Image analysis for scam detection
- [ ] Social media monitoring
- [ ] Email/phone validation services
- [ ] Automated report categorization
- [ ] Export/import keyword sets
- [ ] Crawl scheduling per seed website
- [ ] Content change detection
- [ ] Multi-language support