# ETB-API Monitoring System Documentation

## Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

## Features

### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity

### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR, MTTA, SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates

### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, Webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts
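
The feature list does not say which statistic the anomaly detector relies on. As a minimal sketch of one common approach, a z-score test flags a value that lies more than a chosen number of standard deviations from the recent mean. The function name `is_anomaly` and the default threshold of 3.0 are illustrative assumptions, not part of the ETB-API code.

```python
from statistics import mean, pstdev


def is_anomaly(history, value, z_threshold=3.0):
    """Return True if `value` deviates strongly from `history`.

    Hypothetical helper: a plain z-score test, assumed for illustration.
    """
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu  # flat history: any change counts as unusual
    return abs(value - mu) / sigma > z_threshold


recent_response_times = [100, 102, 98, 101, 99]
spike_detected = is_anomaly(recent_response_times, 150)
```

A z-score is cheap to compute but assumes roughly symmetric data; skewed metrics such as latency may need a percentile-based rule instead.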

### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards

## API Endpoints

### Base URL
```
http://localhost:8000/api/monitoring/
```

### Authentication
All endpoints require Django REST Framework token authentication. Send the token in the `Authorization` header in the form `Token <your-token>`, as the examples below show.
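
As a rough illustration of what an authenticated call looks like from Python, the sketch below builds a request with the standard library only. The helper name `build_request` is hypothetical, and the request is constructed but not sent so the example runs without a server.

```python
import urllib.request

BASE_URL = "http://localhost:8000/api/monitoring/"


def build_request(path, token, method="GET"):
    """Build an authenticated request for a monitoring endpoint."""
    url = BASE_URL + path.lstrip("/")
    headers = {"Authorization": f"Token {token}"}
    return urllib.request.Request(url, headers=headers, method=method)


req = build_request("health-checks/summary/", "your-token-here")
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```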

### Health Checks

#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 11,
    "warning_targets": 1,
    "critical_targets": 0,
    "health_percentage": 91.67,
    "last_updated": "2024-01-15T10:30:00Z"
}
```
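
The sample values are consistent with `health_percentage` being the share of healthy targets rounded to two decimals (11 of 12 gives 91.67). The sketch below assumes that formula; the source does not confirm how the service actually computes it.

```python
def health_percentage(healthy_targets, total_targets):
    """Share of healthy targets as a percentage, rounded to 2 decimals.

    Assumed formula, inferred from the sample response.
    """
    if total_targets == 0:
        return 0.0  # avoid division by zero when nothing is monitored
    return round(healthy_targets / total_targets * 100, 2)
```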

#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```

**Response:**
```json
{
    "status": "success",
    "message": "Health checks started",
    "task_id": "celery-task-id"
}
```

#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```

### Metrics

#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```
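
A small helper can keep the query string well formed when the parameters vary. This sketch reads `hours` as the lookback window and `limit` as the maximum number of rows, which the parameter names suggest but the source does not state; the function name is hypothetical.

```python
from urllib.parse import urlencode


def measurements_url(metric_id, hours=24, limit=100):
    """Build the measurements path for one metric with its query string."""
    query = urlencode({"hours": hours, "limit": limit})
    return f"/api/monitoring/metrics/{metric_id}/measurements/?{query}"
```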

#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```

**Response:**
```json
{
    "metric_name": "API Response Time",
    "period_days": 7,
    "daily_data": [
        {
            "date": "2024-01-08",
            "value": 150.5,
            "count": 1440
        }
    ],
    "trend": "STABLE"
}
```
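
The response does not explain how `trend` is derived. One plausible way, sketched below under that caveat, is to compare the average of the first half of `daily_data` with the average of the second half. Only `STABLE` appears in the source; the labels `INCREASING` and `DECREASING` and the 5% tolerance are assumptions made for illustration.

```python
from statistics import mean


def classify_trend(daily_data, tolerance=0.05):
    """Classify a series of daily points by comparing its two halves."""
    values = [point["value"] for point in daily_data]
    if len(values) < 2:
        return "STABLE"  # a single point carries no direction
    half = len(values) // 2
    first, second = mean(values[:half]), mean(values[half:])
    if first == 0:
        if second == 0:
            return "STABLE"
        return "INCREASING" if second > 0 else "DECREASING"
    change = (second - first) / abs(first)  # relative change between halves
    if change > tolerance:
        return "INCREASING"
    if change < -tolerance:
        return "DECREASING"
    return "STABLE"
```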

### Alerts

#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
    "total_alerts": 25,
    "critical_alerts": 2,
    "high_alerts": 5,
    "medium_alerts": 8,
    "low_alerts": 10,
    "acknowledged_alerts": 15,
    "resolved_alerts": 20
}
```
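
In the sample, the four severity counts add up to `total_alerts` (2 + 5 + 8 + 10 = 25). A client could use that as a sanity check on the payload, as sketched below. Whether the relation always holds is an assumption; the source shows only this one example.

```python
SEVERITY_KEYS = ("critical_alerts", "high_alerts", "medium_alerts", "low_alerts")


def severity_counts_match(summary):
    """Return True if the per-severity counts sum to total_alerts."""
    return sum(summary[key] for key in SEVERITY_KEYS) == summary["total_alerts"]
```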

#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```

#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```

### System Overview

#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```

**Response:**
```json
{
    "system_status": {
        "status": "OPERATIONAL",
        "message": "All systems operational",
        "started_at": "2024-01-15T09:00:00Z"
    },
    "health_summary": {
        "overall_status": "HEALTHY",
        "total_targets": 12,
        "healthy_targets": 12,
        "health_percentage": 100.0
    },
    "alert_summary": {
        "total_alerts": 0,
        "critical_alerts": 0
    },
    "system_resources": {
        "cpu_percent": 45.2,
        "memory_percent": 67.8,
        "disk_percent": 34.5
    }
}
```

### Monitoring Tasks

#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
    "task_type": "health_checks"
}
```

**Available task types:**
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report
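
Validating the task type on the client side avoids a round trip for a typo. The sketch below builds the JSON body for the request above; the helper name `task_payload` is hypothetical, and how the server responds to an unknown type is not covered by the source.

```python
import json

TASK_TYPES = {
    "health_checks",
    "metrics_collection",
    "alert_evaluation",
    "system_status_report",
}


def task_payload(task_type):
    """Return the JSON request body for a monitoring task."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type!r}")
    return json.dumps({"task_type": task_type})
```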

## Data Models

### MonitoringTarget
Represents a system, service, or component to monitor.

**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled

### SystemMetric
Defines metrics to collect and monitor.

**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold
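
To show how the two thresholds might interact, the sketch below classifies a single measurement. It assumes that higher values are worse and that a missing threshold is stored as `None`; the labels `OK`, `WARNING`, and `CRITICAL` are illustrative and may not match the module's own constants.

```python
def classify_value(value, warning_threshold, critical_threshold):
    """Classify a measurement against optional warning/critical thresholds."""
    # Check the critical threshold first so it wins when both are exceeded.
    if critical_threshold is not None and value >= critical_threshold:
        return "CRITICAL"
    if warning_threshold is not None and value >= warning_threshold:
        return "WARNING"
    return "OK"
```

Metrics where lower is worse (for example SLA compliance) would need the comparisons reversed.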

### AlertRule
Defines alert conditions and notifications.

**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled

### Alert
Represents triggered alerts.

**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded
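
The three status values suggest a simple lifecycle. The sketch below encodes one plausible set of transitions: an alert can be acknowledged and then resolved, or resolved directly. Which transitions the API actually permits is an assumption here, as is the helper name `transition`.

```python
# Assumed lifecycle; the source lists the statuses but not the allowed moves.
ALLOWED_TRANSITIONS = {
    "TRIGGERED": {"ACKNOWLEDGED", "RESOLVED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


def transition(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move alert from {current} to {new}")
    return new
```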

## Configuration

### Environment Variables

```bash
# Monitoring Settings (intervals in seconds)
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds (CPU and memory in percent)
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```
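
How the application reads these variables is not shown. A minimal sketch, assuming plain `os.environ` lookups with the sample values as defaults, could look like this. The `MonitoringConfig` class and `load_config` function are hypothetical names, and the units in the comments (seconds, percent) are inferred rather than documented.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringConfig:
    enabled: bool
    health_check_interval: int  # seconds (assumed)
    metrics_collection_interval: int  # seconds (assumed)
    alert_evaluation_interval: int  # seconds (assumed)
    cpu_threshold: float  # percent (assumed)
    memory_threshold: float  # percent (assumed)


def load_config(env=None):
    """Read monitoring settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return MonitoringConfig(
        enabled=env.get("MONITORING_ENABLED", "true").strip().lower() == "true",
        health_check_interval=int(env.get("MONITORING_HEALTH_CHECK_INTERVAL", "60")),
        metrics_collection_interval=int(
            env.get("MONITORING_METRICS_COLLECTION_INTERVAL", "300")
        ),
        alert_evaluation_interval=int(
            env.get("MONITORING_ALERT_EVALUATION_INTERVAL", "60")
        ),
        cpu_threshold=float(env.get("PERFORMANCE_CPU_THRESHOLD", "80")),
        memory_threshold=float(env.get("PERFORMANCE_MEMORY_THRESHOLD", "80")),
    )
```

Passing the mapping as an argument keeps the loader easy to test without touching the real environment.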

### Celery Configuration

Add to your Celery configuration:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
```

## Setup Instructions

### 1. Install Dependencies

Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```

### 2. Run Migrations

```bash
python manage.py makemigrations monitoring
python manage.py migrate
```

### 3. Set Up Initial Configuration

```bash
python manage.py setup_monitoring --admin-user admin
```

### 4. Start Celery Workers

```bash
# Run the worker and the beat scheduler as separate processes (for example, in two terminals)
celery -A core worker -l info
celery -A core beat -l info
```

### 5. Access Monitoring

- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`

## Monitoring Best Practices

### 1. Health Checks
- Choose check intervals that catch outages promptly without overloading the target (the sample configuration uses 60 seconds)
- Use timeouts to prevent hanging checks
- Monitor dependencies and external services
- Implement graceful degradation
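
To illustrate the timeout advice, the sketch below probes an HTTP endpoint and treats any network error or timeout as a failed check instead of letting it hang. It is not the module's `HealthCheckService`; the function name, the 5-second default, and the returned labels are assumptions. The `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return 'HEALTHY' for a 2xx response, 'CRITICAL' otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return "HEALTHY" if 200 <= response.status < 300 else "CRITICAL"
    except OSError:
        # URLError, HTTPError, and TimeoutError are all OSError subclasses,
        # so connection failures, HTTP errors, and timeouts land here.
        return "CRITICAL"
```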

### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics

### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Prevent alert fatigue, for example by suppressing repeat notifications for the same rule
- Regularly review and tune alert thresholds
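
One simple way to limit alert fatigue is a per-rule cooldown: after a notification goes out, further notifications for the same rule are suppressed until the cooldown expires. The class below is a hypothetical sketch of that idea, not a description of how ETB-API handles it; the 300-second default is arbitrary.

```python
class NotificationThrottle:
    """Suppress repeat notifications for the same rule within a cooldown."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # rule name -> timestamp of last notification

    def should_notify(self, rule_name, now):
        """Return True and record `now` if the rule is outside its cooldown."""
        last = self._last_sent.get(rule_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[rule_name] = now
        return True
```

Here `now` is a timestamp in seconds, such as the value of `time.time()`; passing it in explicitly keeps the class deterministic under test.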

### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable

## Troubleshooting

### Common Issues

1. **Health Checks Failing**
   - Check network connectivity
   - Verify endpoint URLs
   - Check authentication credentials
   - Review timeout settings

2. **Metrics Not Collecting**
   - Verify Celery workers are running
   - Check metric configuration
   - Review collection intervals
   - Check for errors in logs

3. **Alerts Not Triggering**
   - Verify alert rules are enabled
   - Check threshold values
   - Review notification channel configuration
   - Check alert evaluation task is running

4. **Performance Issues**
   - Monitor system resources
   - Check database query performance
   - Review metric retention settings
   - Optimize collection intervals

### Debug Commands

```bash
# Open a Django shell; the >>> lines below are Python statements to run at its prompt
python manage.py shell

# Check monitoring status
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```

## Integration with Other Modules

### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies

### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes

### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures

### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches

### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections

## Future Enhancements

### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting

### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps