Last updated 2025-11-26 22:32:20 +02:00 by Iliyan Angelov (commit ed94dd22dd)


Enterprise OSINT System Documentation

Overview

The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.

Features

1. Seed Website Management

  • Admin Interface: Manage seed websites to crawl
  • Configuration: Set crawl depth, interval, allowed domains, user agent
  • Priority Levels: High, Medium, Low
  • Statistics: Track pages crawled and matches found

2. Keyword Management

  • Multiple Types: Exact match, regex, phrase, domain, email, phone patterns
  • Confidence Scoring: Each keyword has a confidence score (0-100)
  • Auto-approval: Keywords can be set to auto-approve high-confidence matches
  • Case Sensitivity: Configurable per keyword

3. Automated Crawling

  • Web Scraping: Crawls seed websites using BeautifulSoup
  • Content Analysis: Extracts and analyzes page content
  • Keyword Matching: Searches for configured keywords
  • Deduplication: Uses content hashing to avoid duplicates
  • Rate Limiting: Configurable delays between requests
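The extraction and deduplication steps above can be sketched roughly as follows. The function names and the SHA-256 hash choice are illustrative assumptions, not the system's actual implementation; only the use of BeautifulSoup matches the documented behavior:

```python
import hashlib

from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    # Remove scripts and styles, then collapse whitespace in the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


def content_hash(text: str) -> str:
    # Stable digest used to skip pages whose content was already seen.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


html = "<html><body><script>track()</script><p>Guaranteed returns!</p></body></html>"
text = extract_text(html)
print(text)  # Guaranteed returns!
```

Hashing the normalized text (rather than the raw HTML) makes deduplication robust against cosmetic markup changes between crawls.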

4. Auto-Report Generation

  • Automatic Detection: Creates reports when keywords match
  • Confidence Scoring: Calculates confidence based on matches
  • Moderator Review: Reports sent for approval
  • Auto-approval: High-confidence reports with auto-approve keywords are automatically published

5. Moderation Interface

  • Review Queue: Moderators can review pending auto-generated reports
  • Approve/Reject: One-click approval or rejection with notes
  • Statistics Dashboard: View counts by status
  • Detailed View: See full crawled content and matched keywords

Setup Instructions

1. Install Dependencies

pip install -r requirements.txt

New dependencies added:

  • beautifulsoup4>=4.12.2 - Web scraping
  • lxml>=4.9.3 - HTML parsing
  • urllib3>=2.0.7 - HTTP client

2. Run Migrations

python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate

3. Configure Seed Websites

  1. Go to Django Admin → OSINT → Seed Websites
  2. Click "Add Seed Website"
  3. Fill in:
    • Name: Friendly name
    • URL: Base URL to crawl
    • Crawl Depth: How many levels deep to crawl (0 = only this page)
    • Crawl Interval: Hours between crawls
    • Priority: High/Medium/Low
    • Allowed Domains: List of domains to crawl (empty = same domain only)
    • User Agent: Custom user agent string

4. Configure Keywords

  1. Go to Django Admin → OSINT → OSINT Keywords
  2. Click "Add OSINT Keyword"
  3. Fill in:
    • Name: Friendly name
    • Keyword: The pattern to search for
    • Keyword Type:
      • exact - Exact string match
      • regex - Regular expression
      • phrase - Phrase with word boundaries
      • domain - Domain pattern
      • email - Email pattern
      • phone - Phone pattern
    • Confidence Score: Default confidence (0-100)
    • Auto Approve: Auto-approve if confidence >= 80
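A minimal sketch of how the first three keyword types could behave; the function and its signature are illustrative, not the actual matcher (the domain, email, and phone types would follow the same shape with their own patterns):

```python
import re


def keyword_matches(keyword: str, keyword_type: str, text: str,
                    case_sensitive: bool = False) -> bool:
    # Illustrative matcher for the exact, regex, and phrase keyword types.
    flags = 0 if case_sensitive else re.IGNORECASE
    if keyword_type == "exact":
        return (keyword in text) if case_sensitive else (keyword.lower() in text.lower())
    if keyword_type == "regex":
        return re.search(keyword, text, flags) is not None
    if keyword_type == "phrase":
        # Word boundaries: the keyword must appear as whole words.
        return re.search(r"\b" + re.escape(keyword) + r"\b", text, flags) is not None
    raise ValueError(f"unhandled keyword type: {keyword_type}")


print(keyword_matches("free money", "phrase", "Get FREE MONEY now"))  # True
print(keyword_matches("free money", "phrase", "carefree moneybox"))   # False
print(keyword_matches("free money", "exact", "carefree moneybox"))    # True
```

The last two calls show why the phrase type exists: a bare substring match ("exact") fires on "carefree moneybox", while word boundaries do not.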

5. Run Crawling

Manual Crawling

# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0

Scheduled Crawling (Celery)

Add to your Celery beat schedule:

# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}

Workflow

1. Crawling Process

Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
  1. System crawls seed website starting from base URL
  2. For each page:
    • Fetches HTML content
    • Extracts text content (removes scripts/styles)
    • Calculates content hash for deduplication
    • Matches against all active keywords
    • Calculates confidence score
  3. If confidence >= 30, creates a CrawledContent record and an AutoGeneratedReport with status 'pending'

2. Confidence Calculation

Base Score = Average of matched keyword confidence scores
Match Boost = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total = min(base_score + match_boost + keyword_boost, 100)
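The formula translates directly into code. This sketch assumes the scores arrive as plain integers, one per unique matched keyword; it mirrors the arithmetic above but is not the production function:

```python
def calculate_confidence(keyword_scores: list[int], match_count: int) -> int:
    """Apply the scoring formula above.

    keyword_scores: one confidence score (0-100) per unique matched keyword.
    match_count:    total number of individual matches on the page.
    """
    if not keyword_scores:
        return 0
    base_score = sum(keyword_scores) / len(keyword_scores)
    match_boost = min(match_count * 2, 30)
    keyword_boost = min(len(keyword_scores) * 5, 20)
    return int(min(base_score + match_boost + keyword_boost, 100))


print(calculate_confidence([60, 70], 4))  # 83  (base 65 + match boost 8 + keyword boost 10)
```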

3. Auto-Approval

Reports are auto-approved if:

  • Confidence score >= 80
  • At least one matched keyword has auto_approve=True

Auto-approved reports are immediately published to the platform.
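Both conditions can be checked in one place; this helper is a sketch of that rule (the dict-shaped keyword records are an assumption for illustration):

```python
def should_auto_approve(confidence_score: int, matched_keywords: list[dict]) -> bool:
    # Publication requires a high score AND at least one auto-approve keyword.
    return confidence_score >= 80 and any(
        kw.get("auto_approve", False) for kw in matched_keywords
    )


print(should_auto_approve(85, [{"auto_approve": True}]))   # True
print(should_auto_approve(85, [{"auto_approve": False}]))  # False
print(should_auto_approve(79, [{"auto_approve": True}]))   # False
```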

4. Moderator Review

  1. Moderator views pending reports at /osint/auto-reports/
  2. Can filter by status (pending, approved, published, rejected)
  3. Views details including:
    • Matched keywords
    • Crawled content
    • Source URL
    • Confidence score
  4. Approves or rejects with optional notes
  5. Approved reports are published as ScamReport with is_auto_discovered=True

URL Routes

  • /osint/auto-reports/ - List auto-generated reports (moderators only)
  • /osint/auto-reports/<id>/ - View report details
  • /osint/auto-reports/<id>/approve/ - Approve report
  • /osint/auto-reports/<id>/reject/ - Reject report
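These routes would map to a urls.py along the following lines; the view names here are assumptions for illustration, not the actual module contents:

```python
# osint/urls.py (illustrative sketch; view names are assumed)
from django.urls import path

from . import views

app_name = "osint"

urlpatterns = [
    path("auto-reports/", views.auto_report_list, name="auto_report_list"),
    path("auto-reports/<int:pk>/", views.auto_report_detail, name="auto_report_detail"),
    path("auto-reports/<int:pk>/approve/", views.approve_auto_report, name="approve_auto_report"),
    path("auto-reports/<int:pk>/reject/", views.reject_auto_report, name="reject_auto_report"),
]
```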

Models

SeedWebsite

  • Manages websites to crawl
  • Tracks crawling statistics
  • Configures crawl behavior

OSINTKeyword

  • Defines patterns to search for
  • Sets confidence scores
  • Enables auto-approval

CrawledContent

  • Stores crawled page content
  • Links matched keywords
  • Tracks confidence scores

AutoGeneratedReport

  • Generated from crawled content
  • Links to ScamReport when approved
  • Tracks review status

Best Practices

  1. Start Small: Begin with 1-2 seed websites and a few keywords
  2. Monitor Performance: Check crawl statistics regularly
  3. Tune Keywords: Adjust confidence scores based on false positives
  4. Respect Rate Limits: Use appropriate delays to avoid being blocked
  5. Review Regularly: Check pending reports daily
  6. Update Keywords: Add new scam patterns as they emerge
  7. Test Regex: Validate regex patterns before activating
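For practice 7, a pattern can be checked before activating it. This standalone helper is a sketch, not part of the system:

```python
import re
from typing import Optional


def validate_pattern(pattern: str) -> Optional[str]:
    # Return None if the regex compiles, otherwise the compiler's error message.
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


print(validate_pattern(r"\bcrypto\s+giveaway\b"))  # None
print(validate_pattern("(unclosed"))               # a non-None error message
```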

Troubleshooting

Crawling Fails

  • Check network connectivity
  • Verify seed website URLs are accessible
  • Check user agent and rate limiting
  • Review error messages in admin

Too Many False Positives

  • Increase confidence score thresholds
  • Refine keyword patterns
  • Add negative keywords (future feature)

Too Few Matches

  • Lower confidence thresholds
  • Add more keywords
  • Check if seed websites are being crawled
  • Verify keyword patterns match content

Performance Issues

  • Reduce crawl depth
  • Limit max pages per crawl
  • Increase delay between requests
  • Use priority levels to focus on important sites

Security Considerations

  1. User Agent: Use identifiable user agent for transparency
  2. Rate Limiting: Respect website terms of service
  3. Content Storage: Crawled HTML is stored in the database and can grow large; monitor storage usage
  4. API Keys: Store OSINT service API keys securely (encrypted)
  5. Access Control: Only moderators can review reports

Future Enhancements

  • Negative keywords to reduce false positives
  • Machine learning for better pattern recognition
  • Image analysis for scam detection
  • Social media monitoring
  • Email/phone validation services
  • Automated report categorization
  • Export/import keyword sets
  • Crawl scheduling per seed website
  • Content change detection
  • Multi-language support