# Enterprise OSINT System Documentation

## Overview

The Enterprise OSINT (Open Source Intelligence) system automatically crawls seed websites, searches for scam-related keywords, and generates reports for moderator review. Approved reports are automatically published to the platform.

## Features

### 1. Seed Website Management

- **Admin Interface**: Manage seed websites to crawl
- **Configuration**: Set crawl depth, interval, allowed domains, user agent
- **Priority Levels**: High, Medium, Low
- **Statistics**: Track pages crawled and matches found

### 2. Keyword Management

- **Multiple Types**: Exact match, regex, phrase, domain, email, phone patterns
- **Confidence Scoring**: Each keyword has a confidence score (0-100)
- **Auto-approval**: Keywords can be set to auto-approve high-confidence matches
- **Case Sensitivity**: Configurable per keyword

### 3. Automated Crawling

- **Web Scraping**: Crawls seed websites using BeautifulSoup
- **Content Analysis**: Extracts and analyzes page content
- **Keyword Matching**: Searches for configured keywords
- **Deduplication**: Uses content hashing to avoid duplicates
- **Rate Limiting**: Configurable delays between requests

### 4. Auto-Report Generation

- **Automatic Detection**: Creates reports when keywords match
- **Confidence Scoring**: Calculates confidence based on matches
- **Moderator Review**: Reports sent for approval
- **Auto-approval**: High-confidence reports with auto-approve keywords are automatically published

### 5. Moderation Interface

- **Review Queue**: Moderators can review pending auto-generated reports
- **Approve/Reject**: One-click approval or rejection with notes
- **Statistics Dashboard**: View counts by status
- **Detailed View**: See full crawled content and matched keywords

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

New dependencies added:

- `beautifulsoup4>=4.12.2` - Web scraping
- `lxml>=4.9.3` - HTML parsing
- `urllib3>=2.0.7` - HTTP client

### 2. Run Migrations

```bash
python manage.py makemigrations osint
python manage.py makemigrations reports  # For is_auto_discovered field
python manage.py migrate
```

### 3. Configure Seed Websites

1. Go to Django Admin → OSINT → Seed Websites
2. Click "Add Seed Website"
3. Fill in:
   - **Name**: Friendly name
   - **URL**: Base URL to crawl
   - **Crawl Depth**: How many levels deep to crawl (0 = only this page)
   - **Crawl Interval**: Hours between crawls
   - **Priority**: High/Medium/Low
   - **Allowed Domains**: List of domains to crawl (empty = same domain only)
   - **User Agent**: Custom user agent string

### 4. Configure Keywords

1. Go to Django Admin → OSINT → OSINT Keywords
2. Click "Add OSINT Keyword"
3. Fill in:
   - **Name**: Friendly name
   - **Keyword**: The pattern to search for
   - **Keyword Type**:
     - `exact` - Exact string match
     - `regex` - Regular expression
     - `phrase` - Phrase with word boundaries
     - `domain` - Domain pattern
     - `email` - Email pattern
     - `phone` - Phone pattern
   - **Confidence Score**: Default confidence (0-100)
   - **Auto Approve**: Auto-approve if confidence >= 80

### 5. Run Crawling

#### Manual Crawling

```bash
# Crawl all due seed websites
python manage.py crawl_osint

# Crawl all active seed websites
python manage.py crawl_osint --all

# Crawl a specific seed website
python manage.py crawl_osint --seed-id 1

# Force crawl (ignore crawl interval)
python manage.py crawl_osint --all --force

# Limit pages per seed
python manage.py crawl_osint --max-pages 100

# Set delay between requests
python manage.py crawl_osint --delay 2.0
```

#### Scheduled Crawling (Celery)

Add to your Celery beat schedule:

```python
# In your Celery configuration (celery.py or settings)
from celery.schedules import crontab

app.conf.beat_schedule = {
    'crawl-osint-hourly': {
        'task': 'osint.tasks.crawl_osint_seeds',
        'schedule': crontab(minute=0),  # Every hour
    },
    'auto-approve-reports': {
        'task': 'osint.tasks.auto_approve_high_confidence_reports',
        'schedule': crontab(minute='*/15'),  # Every 15 minutes
    },
}
```

## Workflow

### 1. Crawling Process

```
Seed Website → Crawl Pages → Extract Content → Match Keywords → Calculate Confidence
```

1. System crawls the seed website starting from its base URL
2. For each page:
   - Fetches HTML content
   - Extracts text content (removes scripts/styles)
   - Calculates content hash for deduplication
   - Matches against all active keywords
   - Calculates confidence score
3. If confidence >= 30, creates a `CrawledContent` record
4. If confidence >= 30, creates an `AutoGeneratedReport` with status `pending`
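For illustration, the per-page steps above might look roughly like the sketch below. This is not the project's actual crawler code: `fetch_page`, `extract_text`, and `content_hash` are hypothetical helper names, and it assumes `requests` is available alongside the listed BeautifulSoup/lxml dependencies.

```python
import hashlib

import requests
from bs4 import BeautifulSoup


def fetch_page(url: str, user_agent: str, timeout: int = 10) -> str:
    """Fetch raw HTML for one page, honoring the seed's user agent."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_text(html: str) -> str:
    """Remove scripts/styles and return the page's visible text."""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def content_hash(text: str) -> str:
    """Hash the extracted text so already-seen pages can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```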
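Keyword matching could then dispatch on the keyword type configured in Setup step 4. Again a minimal sketch, not the real matcher: `match_keyword` is a hypothetical helper, and the `domain`/`email`/`phone` branches are placeholders since their exact patterns are not documented here.

```python
import re


def match_keyword(text: str, keyword: str, keyword_type: str,
                  case_sensitive: bool = False) -> int:
    """Return the number of hits for one configured keyword.

    Invalid regex patterns count as zero matches, which is one reason
    to validate patterns before activating them (see Best Practices).
    """
    flags = 0 if case_sensitive else re.IGNORECASE

    if keyword_type == "exact":
        haystack = text if case_sensitive else text.lower()
        needle = keyword if case_sensitive else keyword.lower()
        return haystack.count(needle)

    if keyword_type == "phrase":
        # Phrase match bounded by word boundaries on both ends.
        pattern = r"\b" + re.escape(keyword) + r"\b"
    elif keyword_type == "regex":
        pattern = keyword
    else:
        # domain / email / phone: placeholder branch that treats the
        # stored value as a literal, since the real rules aren't documented.
        pattern = re.escape(keyword)

    try:
        return len(re.findall(pattern, text, flags))
    except re.error:
        return 0
```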
### 2. Confidence Calculation

```
Base Score = Average of matched keyword confidence scores
Match Boost = min(match_count * 2, 30)
Keyword Boost = min(unique_keywords * 5, 20)
Total = min(base_score + match_boost + keyword_boost, 100)
```
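A direct Python translation of this formula, assuming per-keyword match counts and configured scores are available as plain dicts (names and example keywords below are illustrative):

```python
def calculate_confidence(match_counts: dict[str, int],
                         keyword_scores: dict[str, int]) -> int:
    """Score a page using the formula above.

    match_counts maps each matched keyword to its number of hits;
    keyword_scores maps keywords to their configured scores (0-100).
    """
    if not match_counts:
        return 0

    base_score = sum(keyword_scores[k] for k in match_counts) / len(match_counts)
    match_boost = min(sum(match_counts.values()) * 2, 30)
    keyword_boost = min(len(match_counts) * 5, 20)
    return int(min(base_score + match_boost + keyword_boost, 100))


# Made-up example: two keywords, three hits total.
# base = (70 + 90) / 2 = 80; match boost = min(6, 30) = 6;
# keyword boost = min(10, 20) = 10; total = min(96, 100) = 96
print(calculate_confidence({"guaranteed returns": 2, "wire transfer": 1},
                           {"guaranteed returns": 70, "wire transfer": 90}))
```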
### 3. Auto-Approval

Reports are auto-approved if:

- Confidence score >= 80
- At least one matched keyword has `auto_approve=True`

Auto-approved reports are immediately published to the platform.

### 4. Moderator Review

1. Moderator views pending reports at `/osint/auto-reports/`
2. Can filter by status (pending, approved, published, rejected)
3. Views details including:
   - Matched keywords
   - Crawled content
   - Source URL
   - Confidence score
4. Approves or rejects with optional notes
5. Approved reports are published as `ScamReport` with `is_auto_discovered=True`

## URL Routes

- `/osint/auto-reports/` - List auto-generated reports (moderators only)
- `/osint/auto-reports/<id>/` - View report details
- `/osint/auto-reports/<id>/approve/` - Approve report
- `/osint/auto-reports/<id>/reject/` - Reject report

## Models

### SeedWebsite

- Manages websites to crawl
- Tracks crawling statistics
- Configures crawl behavior

### OSINTKeyword

- Defines patterns to search for
- Sets confidence scores
- Enables auto-approval

### CrawledContent

- Stores crawled page content
- Links matched keywords
- Tracks confidence scores

### AutoGeneratedReport

- Generated from crawled content
- Links to ScamReport when approved
- Tracks review status

## Best Practices

1. **Start Small**: Begin with 1-2 seed websites and a few keywords
2. **Monitor Performance**: Check crawl statistics regularly
3. **Tune Keywords**: Adjust confidence scores based on false positives
4. **Respect Rate Limits**: Use appropriate delays to avoid being blocked
5. **Review Regularly**: Check pending reports daily
6. **Update Keywords**: Add new scam patterns as they emerge
7. **Test Regex**: Validate regex patterns before activating them

## Troubleshooting

### Crawling Fails

- Check network connectivity
- Verify seed website URLs are accessible
- Check user agent and rate limiting
- Review error messages in admin

### Too Many False Positives

- Increase confidence score thresholds
- Refine keyword patterns
- Add negative keywords (future feature)

### Too Few Matches

- Lower confidence thresholds
- Add more keywords
- Check if seed websites are being crawled
- Verify keyword patterns match content

### Performance Issues

- Reduce crawl depth
- Limit max pages per crawl
- Increase delay between requests
- Use priority levels to focus on important sites

## Security Considerations

1. **User Agent**: Use an identifiable user agent for transparency
2. **Rate Limiting**: Respect website terms of service
3. **Content Storage**: Large HTML content is stored in the database
4. **API Keys**: Store OSINT service API keys securely (encrypted)
5. **Access Control**: Only moderators can review reports

## Future Enhancements

- [ ] Negative keywords to reduce false positives
- [ ] Machine learning for better pattern recognition
- [ ] Image analysis for scam detection
- [ ] Social media monitoring
- [ ] Email/phone validation services
- [ ] Automated report categorization
- [ ] Export/import keyword sets
- [ ] Crawl scheduling per seed website
- [ ] Content change detection
- [ ] Multi-language support