# Enterprise OSINT System Documentation

## Overview
The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.
## Features

### 1. Seed Website Management
- Admin Interface: Manage seed websites to crawl
- Configuration: Set crawl depth, interval, allowed domains, user agent
- Priority Levels: High, Medium, Low
- Statistics: Track pages crawled and matches found
### 2. Keyword Management
- Multiple Types: Exact match, regex, phrase, domain, email, phone patterns
- Confidence Scoring: Each keyword has a confidence score (0-100)
- Auto-approval: Keywords can be set to auto-approve high-confidence matches
- Case Sensitivity: Configurable per keyword
### 3. Automated Crawling
- Web Scraping: Crawls seed websites using BeautifulSoup
- Content Analysis: Extracts and analyzes page content
- Keyword Matching: Searches for configured keywords
- Deduplication: Uses content hashing to avoid duplicates
- Rate Limiting: Configurable delays between requests
### 4. Auto-Report Generation
- Automatic Detection: Creates reports when keywords match
- Confidence Scoring: Calculates confidence based on matches
- Moderator Review: Reports sent for approval
- Auto-approval: High-confidence reports with auto-approve keywords are automatically published
### 5. Moderation Interface
- Review Queue: Moderators can review pending auto-generated reports
- Approve/Reject: One-click approval or rejection with notes
- Statistics Dashboard: View counts by status
- Detailed View: See full crawled content and matched keywords
## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```
New dependencies added:
- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client
### 2. Run Migrations

```bash
python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate
```
### 3. Configure Seed Websites
- Go to Django Admin → OSINT → Seed Websites
- Click "Add Seed Website"
- Fill in:
  - Name: Friendly name
  - URL: Base URL to crawl
  - Crawl Depth: How many levels deep to crawl (0 = only this page)
  - Crawl Interval: Hours between crawls
  - Priority: High/Medium/Low
  - Allowed Domains: List of domains to crawl (empty = same domain only)
  - User Agent: Custom user agent string
### 4. Configure Keywords
- Go to Django Admin → OSINT → OSINT Keywords
- Click "Add OSINT Keyword"
- Fill in:
  - Name: Friendly name
  - Keyword: The pattern to search for
  - Keyword Type:
    - `exact` - Exact string match
    - `regex` - Regular expression
    - `phrase` - Phrase with word boundaries
    - `domain` - Domain pattern
    - `email` - Email pattern
    - `phone` - Phone pattern
  - Confidence Score: Default confidence (0-100)
  - Auto Approve: Auto-approve if confidence >= 80
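The matching logic for the first three keyword types can be sketched roughly as follows. This is an illustration, not the actual `OSINTKeyword` API: the function and parameter names are assumptions, and the `domain`, `email`, and `phone` types are omitted.

```python
import re

def keyword_matches(text, keyword, kw_type, case_sensitive=False):
    """Illustrative matcher for the 'exact', 'regex', and 'phrase' types."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if kw_type == "exact":
        # Plain substring match, honouring case sensitivity
        needle = keyword if case_sensitive else keyword.lower()
        haystack = text if case_sensitive else text.lower()
        return needle in haystack
    if kw_type == "regex":
        # The keyword itself is treated as a regular expression
        return re.search(keyword, text, flags) is not None
    if kw_type == "phrase":
        # Escape the phrase and require word boundaries on both sides,
        # so "scam" does not match inside "scampering"
        return re.search(r"\b" + re.escape(keyword) + r"\b", text, flags) is not None
    raise ValueError(f"unsupported keyword type: {kw_type}")
```

The word-boundary escape is what distinguishes `phrase` from `exact`: a phrase keyword only fires on whole-word occurrences.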
### 5. Run Crawling

#### Manual Crawling

```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```
#### Scheduled Crawling (Celery)
Add to your Celery beat schedule:
```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```
## Workflow

### 1. Crawling Process
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
- System crawls seed website starting from base URL
- For each page:
  - Fetches HTML content
  - Extracts text content (removes scripts/styles)
  - Calculates content hash for deduplication
  - Matches against all active keywords
  - Calculates confidence score
  - If confidence >= 30, creates a `CrawledContent` record
  - If confidence >= 30, creates an `AutoGeneratedReport` with status 'pending'
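The deduplication step above can be sketched with a normalised content hash. This is a minimal illustration; the function names are assumptions, and in the real system the digest is presumably stored on the `CrawledContent` record rather than in an in-memory set.

```python
import hashlib

def content_hash(text):
    """Hash of whitespace-normalised text, so trivial layout changes
    do not defeat deduplication."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def is_duplicate(text, seen_hashes):
    """Return True if this content was already seen; otherwise record it."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Normalising whitespace before hashing means two crawls of the same page that differ only in line wrapping produce the same digest.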
### 2. Confidence Calculation

```
Base Score    = Average of matched keyword confidence scores
Match Boost   = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total         = min(base_score + match_boost + keyword_boost, 100)
```
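As a direct translation of the formula above (argument names are assumptions, not the actual service API):

```python
def calculate_confidence(keyword_scores, match_count):
    """keyword_scores: one confidence value (0-100) per unique matched
    keyword; match_count: total individual matches on the page."""
    if not keyword_scores:
        return 0
    base_score = sum(keyword_scores) / len(keyword_scores)
    match_boost = min(match_count * 2, 30)   # capped at 30
    keyword_boost = min(len(keyword_scores) * 5, 20)  # capped at 20
    return min(base_score + match_boost + keyword_boost, 100)
```

For example, two matched keywords scored 80 and 60 with five total matches give a base of 70, a match boost of 10, and a keyword boost of 10, for a total of 90.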
### 3. Auto-Approval
Reports are auto-approved if:
- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`
Auto-approved reports are immediately published to the platform.
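The rule reduces to a one-line check, sketched here with the matched keywords assumed to be dictionaries carrying an `auto_approve` flag (the actual model objects may expose it as an attribute instead):

```python
def should_auto_approve(confidence, matched_keywords):
    """Auto-approve only when both conditions above hold: the report
    scores at least 80 AND some matched keyword opts into auto-approval."""
    return confidence >= 80 and any(kw["auto_approve"] for kw in matched_keywords)
```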
### 4. Moderator Review

- Moderator views pending reports at `/osint/auto-reports/`
- Can filter by status (pending, approved, published, rejected)
- Views details including:
  - Matched keywords
  - Crawled content
  - Source URL
  - Confidence score
- Approves or rejects with optional notes
- Approved reports are published as `ScamReport` with `is_auto_discovered=True`
## URL Routes

- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report
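These routes would be wired up in the app's URLconf along these lines; this is a hypothetical `osint/urls.py`, and the view and route names are assumptions:

```python
# Hypothetical osint/urls.py matching the routes above
from django.urls import path

from . import views

urlpatterns = [
    path("auto-reports/", views.auto_report_list, name="auto_report_list"),
    path("auto-reports/<int:pk>/", views.auto_report_detail, name="auto_report_detail"),
    path("auto-reports/<int:pk>/approve/", views.auto_report_approve, name="auto_report_approve"),
    path("auto-reports/<int:pk>/reject/", views.auto_report_reject, name="auto_report_reject"),
]
```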
## Models

### SeedWebsite
- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior
### OSINTKeyword
- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval
### CrawledContent
- Stores crawled page content
- Links matched keywords
- Tracks confidence scores
### AutoGeneratedReport
- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status
## Best Practices
- Start Small: Begin with 1-2 seed websites and a few keywords
- Monitor Performance: Check crawl statistics regularly
- Tune Keywords: Adjust confidence scores based on false positives
- Respect Rate Limits: Use appropriate delays to avoid being blocked
- Review Regularly: Check pending reports daily
- Update Keywords: Add new scam patterns as they emerge
- Test Regex: Validate regex patterns before activating
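For the last point, a regex keyword can be checked with `re.compile` before it is activated. A minimal helper, not part of the actual codebase:

```python
import re

def validate_pattern(pattern):
    """Return None if the regex compiles, otherwise the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)
```

Running this (for example in the keyword admin form's `clean()` method) catches malformed patterns before the crawler trips over them mid-run.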
## Troubleshooting

### Crawling Fails
- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin
### Too Many False Positives
- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)
### Too Few Matches
- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content
### Performance Issues
- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites
## Security Considerations
- User Agent: Use identifiable user agent for transparency
- Rate Limiting: Respect website terms of service
- Content Storage: Full HTML content is stored in the database; monitor growth on large crawls
- API Keys: Store OSINT service API keys securely (encrypted)
- Access Control: Only moderators can review reports
## Future Enhancements
- Negative keywords to reduce false positives
- Machine learning for better pattern recognition
- Image analysis for scam detection
- Social media monitoring
- Email/phone validation services
- Automated report categorization
- Export/import keyword sets
- Crawl scheduling per seed website
- Content change detection
- Multi-language support