# Enterprise OSINT System Documentation
## Overview
The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.
## Features
### 1. Seed Website Management
- **Admin Interface**: Manage seed websites to crawl
- **Configuration**: Set crawl depth, interval, allowed domains, user agent
- **Priority Levels**: High, Medium, Low
- **Statistics**: Track pages crawled and matches found
### 2. Keyword Management
- **Multiple Types**: Exact match, regex, phrase, domain, email, phone patterns
- **Confidence Scoring**: Each keyword has a confidence score (0-100)
- **Auto-approval**: Keywords can be set to auto-approve high-confidence matches
- **Case Sensitivity**: Configurable per keyword
### 3. Automated Crawling
- **Web Scraping**: Crawls seed websites using BeautifulSoup
- **Content Analysis**: Extracts and analyzes page content
- **Keyword Matching**: Searches for configured keywords
- **Deduplication**: Uses content hashing to avoid duplicates
- **Rate Limiting**: Configurable delays between requests
### 4. Auto-Report Generation
- **Automatic Detection**: Creates reports when keywords match
- **Confidence Scoring**: Calculates confidence based on matches
- **Moderator Review**: Reports sent for approval
- **Auto-approval**: High-confidence reports with auto-approve keywords are automatically published
### 5. Moderation Interface
- **Review Queue**: Moderators can review pending auto-generated reports
- **Approve/Reject**: One-click approval or rejection with notes
- **Statistics Dashboard**: View counts by status
- **Detailed View**: See full crawled content and matched keywords
## Setup Instructions
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
New dependencies added:
- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client
### 2. Run Migrations
```bash
python manage.py makemigrations osint
python manage.py makemigrations reports # For is_auto_discovered field
python manage.py migrate
```
### 3. Configure Seed Websites
1. Go to Django Admin → OSINT → Seed Websites
2. Click "Add Seed Website"
3. Fill in (a scripted equivalent is sketched after this list):
   - **Name**: Friendly name
   - **URL**: Base URL to crawl
   - **Crawl Depth**: How many levels deep to crawl (0 = only this page)
   - **Crawl Interval**: Hours between crawls
   - **Priority**: High/Medium/Low
   - **Allowed Domains**: List of domains to crawl (empty = same domain only)
   - **User Agent**: Custom user agent string
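Seeds can also be created from a Django shell for scripted setup. The snippet below is a minimal sketch; the field names are assumptions inferred from the admin form above, not confirmed model fields:
```python
# Hypothetical example (python manage.py shell); field names are
# assumptions inferred from the admin form, not confirmed model fields.
from osint.models import SeedWebsite

SeedWebsite.objects.create(
    name="Example Scam Tracker",
    url="https://example.com/scam-listings/",
    crawl_depth=1,            # follow links one level below the base URL
    crawl_interval_hours=24,  # re-crawl once per day
    priority="high",
    allowed_domains=[],       # empty = stay on the same domain
    user_agent="MyPlatformOSINTBot/1.0 (+https://myplatform.example/bot)",
)
```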
### 4. Configure Keywords
1. Go to Django Admin → OSINT → OSINT Keywords
2. Click "Add OSINT Keyword"
3. Fill in:
   - **Name**: Friendly name
   - **Keyword**: The pattern to search for
   - **Keyword Type** (see the matching sketch after this list):
     - `exact` - Exact string match
     - `regex` - Regular expression
     - `phrase` - Phrase with word boundaries
     - `domain` - Domain pattern
     - `email` - Email pattern
     - `phone` - Phone pattern
   - **Confidence Score**: Default confidence (0-100)
   - **Auto Approve**: Auto-approve if confidence >= 80
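As a rough illustration of how these types might be evaluated, here is one plausible matcher. It is not the system's actual implementation; the function name and per-type behavior are assumptions:
```python
import re

def keyword_matches(keyword: str, keyword_type: str, text: str,
                    case_sensitive: bool = False) -> bool:
    """Hypothetical matcher illustrating the keyword types above."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if keyword_type == "exact":
        haystack = text if case_sensitive else text.lower()
        needle = keyword if case_sensitive else keyword.lower()
        return needle in haystack
    if keyword_type == "regex":
        return re.search(keyword, text, flags) is not None
    if keyword_type == "phrase":
        # Word boundaries keep "scam" from matching inside "scampered"
        pattern = r"\b" + re.escape(keyword) + r"\b"
        return re.search(pattern, text, flags) is not None
    # The domain/email/phone types would compare against domains, email
    # addresses, or phone numbers extracted from the page rather than the
    # raw text; that extraction is omitted in this sketch.
    return False
```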
### 5. Run Crawling
#### Manual Crawling
```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```
#### Scheduled Crawling (Celery)
Add to your Celery beat schedule:
```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```
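If these tasks are thin wrappers around the management command, they might look roughly like this (a sketch; the actual contents of `osint/tasks.py` may differ):
```python
# osint/tasks.py: a plausible shape for the scheduled tasks, not the
# shipped code.
from celery import shared_task
from django.core.management import call_command

@shared_task
def crawl_osint_seeds():
    # Crawl every seed website whose crawl interval has elapsed
    call_command("crawl_osint")

@shared_task
def auto_approve_high_confidence_reports():
    # Publish pending reports that meet the auto-approval criteria
    # (see "Auto-Approval" below); body omitted in this sketch.
    ...
```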
## Workflow
### 1. Crawling Process
```
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
```
1. The system crawls each seed website starting from its base URL
2. For each page (see the sketch after this list):
   - Fetches HTML content
   - Extracts text content (removes scripts/styles)
   - Calculates a content hash for deduplication
   - Matches against all active keywords
   - Calculates a confidence score
3. If confidence >= 30, creates a `CrawledContent` record and an `AutoGeneratedReport` with status `pending`
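A condensed sketch of the per-page step, using the `beautifulsoup4`, `lxml`, and `urllib3` dependencies listed above (the SHA-256 hash and helper name are assumptions):
```python
import hashlib
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def process_page(url: str, user_agent: str) -> dict:
    """Hypothetical per-page pipeline: fetch, clean, and hash a page."""
    response = http.request("GET", url, headers={"User-Agent": user_agent},
                            timeout=15.0)
    soup = BeautifulSoup(response.data, "lxml")
    for tag in soup(["script", "style"]):
        tag.decompose()  # strip non-content markup
    text = soup.get_text(separator=" ", strip=True)
    # The content hash lets the crawler skip pages it has already stored
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"text": text, "content_hash": content_hash}
```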
### 2. Confidence Calculation
```
base_score    = average of matched keyword confidence scores
match_boost   = min(match_count * 2, 30)
keyword_boost = min(unique_keywords * 5, 20)
total         = min(base_score + match_boost + keyword_boost, 100)
```
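In code, the formula transcribes directly (names are illustrative):
```python
def confidence_score(matched_confidences: list[int], match_count: int) -> int:
    """Transcription of the formula above; assumes one entry per unique
    matched keyword in matched_confidences."""
    base_score = sum(matched_confidences) / len(matched_confidences)
    match_boost = min(match_count * 2, 30)
    keyword_boost = min(len(matched_confidences) * 5, 20)  # unique keywords
    return int(min(base_score + match_boost + keyword_boost, 100))
```
For example, two unique keywords with confidences 70 and 80 and five total matches score min(75 + 10 + 10, 100) = 95.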
### 3. Auto-Approval
Reports are auto-approved if:
- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`
Auto-approved reports are immediately published to the platform.
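The two conditions combine into a simple check, roughly (attribute names are assumptions):
```python
def should_auto_approve(report) -> bool:
    # Both conditions above must hold; field names are assumed, not confirmed.
    return (
        report.confidence_score >= 80
        and report.matched_keywords.filter(auto_approve=True).exists()
    )
```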
### 4. Moderator Review
1. Moderator views pending reports at `/osint/auto-reports/`
2. Can filter by status (pending, approved, published, rejected)
3. Views details including:
- Matched keywords
- Crawled content
- Source URL
- Confidence score
4. Approves or rejects with optional notes
5. Approved reports are published as `ScamReport` with `is_auto_discovered=True`
## URL Routes
- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report
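A URL configuration matching these routes might look like this (view names are assumptions):
```python
# osint/urls.py: a sketch consistent with the routes above; the view
# names are assumptions.
from django.urls import path
from . import views

app_name = "osint"
urlpatterns = [
    path("auto-reports/", views.auto_report_list, name="auto_report_list"),
    path("auto-reports/<int:pk>/", views.auto_report_detail, name="auto_report_detail"),
    path("auto-reports/<int:pk>/approve/", views.auto_report_approve, name="auto_report_approve"),
    path("auto-reports/<int:pk>/reject/", views.auto_report_reject, name="auto_report_reject"),
]
```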
## Models
### SeedWebsite
- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior
### OSINTKeyword
- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval
### CrawledContent
- Stores crawled page content
- Links matched keywords
- Tracks confidence scores
### AutoGeneratedReport
- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status
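Abbreviated sketches consistent with the descriptions above; the concrete fields are assumptions:
```python
# osint/models.py: abbreviated sketches, not the actual definitions.
from django.db import models

class SeedWebsite(models.Model):
    name = models.CharField(max_length=200)
    url = models.URLField()
    crawl_depth = models.PositiveSmallIntegerField(default=0)
    priority = models.CharField(max_length=10)      # high / medium / low
    pages_crawled = models.PositiveIntegerField(default=0)

class OSINTKeyword(models.Model):
    keyword = models.CharField(max_length=255)
    keyword_type = models.CharField(max_length=10)  # exact / regex / ...
    confidence_score = models.PositiveSmallIntegerField(default=50)
    auto_approve = models.BooleanField(default=False)

class CrawledContent(models.Model):
    seed = models.ForeignKey(SeedWebsite, on_delete=models.CASCADE)
    url = models.URLField()
    content_hash = models.CharField(max_length=64, unique=True)
    matched_keywords = models.ManyToManyField(OSINTKeyword)
    confidence_score = models.PositiveSmallIntegerField()

class AutoGeneratedReport(models.Model):
    content = models.ForeignKey(CrawledContent, on_delete=models.CASCADE)
    status = models.CharField(max_length=10, default="pending")
    published_report = models.ForeignKey(
        "reports.ScamReport", null=True, blank=True, on_delete=models.SET_NULL
    )
```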
## Best Practices
1. **Start Small**: Begin with 1-2 seed websites and a few keywords
2. **Monitor Performance**: Check crawl statistics regularly
3. **Tune Keywords**: Adjust confidence scores based on false positives
4. **Respect Rate Limits**: Use appropriate delays to avoid being blocked
5. **Review Regularly**: Check pending reports daily
6. **Update Keywords**: Add new scam patterns as they emerge
7. **Test Regex**: Validate regex patterns before activating (a quick check is sketched below)
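For that last point, a pattern can be checked before it is activated:
```python
import re

def is_valid_regex(pattern: str) -> bool:
    """Return True if the pattern compiles; run before activating a keyword."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```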
## Troubleshooting
### Crawling Fails
- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin
### Too Many False Positives
- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)
### Too Few Matches
- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content
### Performance Issues
- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites
## Security Considerations
1. **User Agent**: Use an identifiable user agent string for transparency
2. **Rate Limiting**: Respect each website's terms of service
3. **Content Storage**: Crawled HTML is stored in the database, so monitor storage growth
4. **API Keys**: Store OSINT service API keys securely (encrypted)
5. **Access Control**: Only moderators can review reports
## Future Enhancements
- [ ] Negative keywords to reduce false positives
- [ ] Machine learning for better pattern recognition
- [ ] Image analysis for scam detection
- [ ] Social media monitoring
- [ ] Email/phone validation services
- [ ] Automated report categorization
- [ ] Export/import keyword sets
- [ ] Crawl scheduling per seed website
- [ ] Content change detection
- [ ] Multi-language support