# Enterprise OSINT System Documentation

## Overview

The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.

## Features

### 1. Seed Website Management

- **Admin Interface**: Manage seed websites to crawl
- **Configuration**: Set crawl depth, interval, allowed domains, user agent
- **Priority Levels**: High, Medium, Low
- **Statistics**: Track pages crawled and matches found

### 2. Keyword Management

- **Multiple Types**: Exact match, regex, phrase, domain, email, phone patterns
- **Confidence Scoring**: Each keyword has a confidence score (0-100)
- **Auto-approval**: Keywords can be set to auto-approve high-confidence matches
- **Case Sensitivity**: Configurable per keyword

### 3. Automated Crawling

- **Web Scraping**: Crawls seed websites using BeautifulSoup
- **Content Analysis**: Extracts and analyzes page content
- **Keyword Matching**: Searches for configured keywords
- **Deduplication**: Uses content hashing to avoid duplicates
- **Rate Limiting**: Configurable delays between requests

### 4. Auto-Report Generation

- **Automatic Detection**: Creates reports when keywords match
- **Confidence Scoring**: Calculates confidence based on matches
- **Moderator Review**: Reports sent for approval
- **Auto-approval**: High-confidence reports with auto-approve keywords are automatically published

### 5. Moderation Interface

- **Review Queue**: Moderators can review pending auto-generated reports
- **Approve/Reject**: One-click approval or rejection with notes
- **Statistics Dashboard**: View counts by status
- **Detailed View**: See full crawled content and matched keywords

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

New dependencies added:

- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client

### 2. Run Migrations

```bash
python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate
```

### 3. Configure Seed Websites

1. Go to Django Admin → OSINT → Seed Websites
2. Click "Add Seed Website"
3. Fill in:
   - **Name**: Friendly name
   - **URL**: Base URL to crawl
   - **Crawl Depth**: How many levels deep to crawl (0 = only this page)
   - **Crawl Interval**: Hours between crawls
   - **Priority**: High/Medium/Low
   - **Allowed Domains**: List of domains to crawl (empty = same domain only)
   - **User Agent**: Custom user agent string

### 4. Configure Keywords

1. Go to Django Admin → OSINT → OSINT Keywords
2. Click "Add OSINT Keyword"
3. Fill in:
   - **Name**: Friendly name
   - **Keyword**: The pattern to search for
   - **Keyword Type**:
     - `exact` - Exact string match
     - `regex` - Regular expression
     - `phrase` - Phrase with word boundaries
     - `domain` - Domain pattern
     - `email` - Email pattern
     - `phone` - Phone pattern
   - **Confidence Score**: Default confidence (0-100)
   - **Auto Approve**: Auto-approve if confidence >= 80

### 5. Run Crawling

#### Manual Crawling

```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl a specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```

#### Scheduled Crawling (Celery)

Add to your Celery beat schedule:

```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```

## Workflow

### 1. Crawling Process

```
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
```

1. System crawls the seed website starting from its base URL
2. For each page:
   - Fetches HTML content
   - Extracts text content (removes scripts/styles)
   - Calculates content hash for deduplication
   - Matches against all active keywords
   - Calculates confidence score
3. If confidence >= 30, creates a `CrawledContent` record
4. If confidence >= 30, creates an `AutoGeneratedReport` with status `pending`
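For illustration, the per-page steps above might look roughly like the sketch below. This is not the project's actual crawler code: `fetch_page`, `extract_text`, and `content_hash` are hypothetical helper names, and it assumes `requests` is available alongside the listed BeautifulSoup/lxml dependencies.

```python
import hashlib

import requests
from bs4 import BeautifulSoup


def fetch_page(url: str, user_agent: str, timeout: int = 10) -> str:
    """Fetch raw HTML for one page, honoring the seed's user agent."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_text(html: str) -> str:
    """Remove scripts/styles and return the page's visible text."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def content_hash(text: str) -> str:
    """Hash the extracted text so already-seen pages can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```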
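Keyword matching could then dispatch on the keyword type configured in Setup step 4. Again a minimal sketch, not the real matcher: `match_keyword` is a hypothetical helper, and the `domain`/`email`/`phone` branches are placeholders since their exact patterns are not documented here.

```python
import re


def match_keyword(text: str, keyword: str, keyword_type: str,
                  case_sensitive: bool = False) -> int:
    """Return the number of hits for one configured keyword.

    Invalid regex patterns count as zero matches, which is one reason
    to validate patterns before activating them (see Best Practices).
    """
    flags = 0 if case_sensitive else re.IGNORECASE

    if keyword_type == "exact":
        haystack = text if case_sensitive else text.lower()
        needle = keyword if case_sensitive else keyword.lower()
        return haystack.count(needle)

    if keyword_type == "phrase":
        # Phrase match bounded by word boundaries on both ends.
        pattern = r"\b" + re.escape(keyword) + r"\b"
    elif keyword_type == "regex":
        pattern = keyword
    else:
        # domain / email / phone: placeholder branch that treats the
        # stored value as a literal, since the real rules aren't documented.
        pattern = re.escape(keyword)

    try:
        return len(re.findall(pattern, text, flags))
    except re.error:
        return 0
```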
### 2. Confidence Calculation

```
Base Score = Average of matched keyword confidence scores
Match Boost = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total = min(base_score + match_boost + keyword_boost, 100)
```
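A direct Python translation of this formula, assuming per-keyword match counts and configured scores are available as plain dicts (names and example keywords below are illustrative):

```python
def calculate_confidence(match_counts: dict[str, int],
                         keyword_scores: dict[str, int]) -> int:
    """Score a page using the formula above.

    match_counts maps each matched keyword to its number of hits;
    keyword_scores maps keywords to their configured scores (0-100).
    """
    if not match_counts:
        return 0

    base_score = sum(keyword_scores[k] for k in match_counts) / len(match_counts)
    match_boost = min(sum(match_counts.values()) * 2, 30)
    keyword_boost = min(len(match_counts) * 5, 20)
    return int(min(base_score + match_boost + keyword_boost, 100))


# Made-up example: two keywords, three hits total.
# base = (70 + 90) / 2 = 80; match boost = min(6, 30) = 6;
# keyword boost = min(10, 20) = 10; total = min(96, 100) = 96
print(calculate_confidence({"guaranteed returns": 2, "wire transfer": 1},
                           {"guaranteed returns": 70, "wire transfer": 90}))
```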
### 3. Auto-Approval

Reports are auto-approved if:

- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`

Auto-approved reports are immediately published to the platform.

### 4. Moderator Review

1. Moderator views pending reports at `/osint/auto-reports/`
2. Can filter by status (pending, approved, published, rejected)
3. Views details including:
   - Matched keywords
   - Crawled content
   - Source URL
   - Confidence score
4. Approves or rejects with optional notes
5. Approved reports are published as `ScamReport` with `is_auto_discovered=True`

## URL Routes

- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report

## Models

### SeedWebsite

- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior

### OSINTKeyword

- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval

### CrawledContent

- Stores crawled page content
- Links matched keywords
- Tracks confidence scores

### AutoGeneratedReport

- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status

## Best Practices

1. **Start Small**: Begin with 1-2 seed websites and a few keywords
2. **Monitor Performance**: Check crawl statistics regularly
3. **Tune Keywords**: Adjust confidence scores based on false positives
4. **Respect Rate Limits**: Use appropriate delays to avoid being blocked
5. **Review Regularly**: Check pending reports daily
6. **Update Keywords**: Add new scam patterns as they emerge
7. **Test Regex**: Validate regex patterns before activating them

## Troubleshooting

### Crawling Fails

- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin

### Too Many False Positives

- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)

### Too Few Matches

- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content

### Performance Issues

- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites

## Security Considerations

1. **User Agent**: Use an identifiable user agent for transparency
2. **Rate Limiting**: Respect website terms of service
3. **Content Storage**: Large HTML content is stored in the database
4. **API Keys**: Store OSINT service API keys securely (encrypted)
5. **Access Control**: Only moderators can review reports

## Future Enhancements

- [ ] Negative keywords to reduce false positives
- [ ] Machine learning for better pattern recognition
- [ ] Image analysis for scam detection
- [ ] Social media monitoring
- [ ] Email/phone validation services
- [ ] Automated report categorization
- [ ] Export/import keyword sets
- [ ] Crawl scheduling per seed website
- [ ] Content change detection
- [ ] Multi-language support