# ETB-API Monitoring System Documentation

## Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

## Features

### 1. Health Monitoring

- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity

### 2. Metrics Collection

- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR, MTTA, SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates

### 3. Intelligent Alerting

- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, Webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts

### 4. Monitoring Dashboards

- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards

## API Endpoints

### Base URL

```
http://localhost:8000/api/monitoring/
```

### Authentication

All endpoints require authentication using Django REST Framework token authentication.
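The token is passed in the `Authorization` header on every request. The snippet below is a minimal sketch using the Python `requests` library against the health-check summary endpoint documented in the next section; the base URL and token are the placeholder values shown throughout this document.

```python
import requests

BASE_URL = "http://localhost:8000/api/monitoring/"
TOKEN = "your-token-here"  # placeholder: substitute a real DRF token

# Every endpoint expects the DRF token in the Authorization header.
headers = {"Authorization": f"Token {TOKEN}"}

# Fetch the overall health-check summary (see "Get Health Check Summary" below).
response = requests.get(
    f"{BASE_URL}health-checks/summary/", headers=headers, timeout=10
)
response.raise_for_status()

summary = response.json()
print(summary["overall_status"], f"({summary['health_percentage']}% healthy)")
```

The same header works for every endpoint below, including the POST endpoints used to run health checks, execute monitoring tasks, or acknowledge alerts.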
### Health Checks

#### Get Health Check Summary

```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```

**Response:**

```json
{
  "overall_status": "HEALTHY",
  "total_targets": 12,
  "healthy_targets": 11,
  "warning_targets": 1,
  "critical_targets": 0,
  "health_percentage": 91.67,
  "last_updated": "2024-01-15T10:30:00Z"
}
```

#### Run All Health Checks

```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```

**Response:**

```json
{
  "status": "success",
  "message": "Health checks started",
  "task_id": "celery-task-id"
}
```

#### Test Target Connection

```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```

### Metrics

#### Get Metric Measurements

```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```

#### Get Metric Trends

```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```

**Response:**

```json
{
  "metric_name": "API Response Time",
  "period_days": 7,
  "daily_data": [
    {
      "date": "2024-01-08",
      "value": 150.5,
      "count": 1440
    }
  ],
  "trend": "STABLE"
}
```

### Alerts

#### Get Alert Summary

```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```

**Response:**

```json
{
  "total_alerts": 25,
  "critical_alerts": 2,
  "high_alerts": 5,
  "medium_alerts": 8,
  "low_alerts": 10,
  "acknowledged_alerts": 15,
  "resolved_alerts": 20
}
```

#### Acknowledge Alert

```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```

#### Resolve Alert

```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```

### System Overview

#### Get System Overview

```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```

**Response:**

```json
{
  "system_status": {
    "status": "OPERATIONAL",
    "message": "All systems operational",
    "started_at": "2024-01-15T09:00:00Z"
  },
  "health_summary": {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 12,
    "health_percentage": 100.0
  },
  "alert_summary": {
    "total_alerts": 0,
    "critical_alerts": 0
  },
  "system_resources": {
    "cpu_percent": 45.2,
    "memory_percent": 67.8,
    "disk_percent": 34.5
  }
}
```

### Monitoring Tasks

#### Execute Monitoring Tasks

```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```

**Available task types:**

- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report

## Data Models

### MonitoringTarget

Represents a system, service, or component to monitor.

**Fields:**

- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled

### SystemMetric

Defines metrics to collect and monitor.

**Fields:**

- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold
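To illustrate how these two models fit together, the sketch below creates a target and a metric from the Django shell. It is a minimal sketch that assumes the field names listed above; the `SystemMetric` import path, the example endpoint URL, the `AVG` aggregation method, the `ms` unit, and the threshold values are illustrative assumptions, and your installation may require additional fields or different choice strings.

```python
# python manage.py shell
from monitoring.models import MonitoringTarget, SystemMetric  # import path assumed

# Register the main application as a monitoring target.
target = MonitoringTarget.objects.create(
    name="ETB-API Application",
    target_type="APPLICATION",
    endpoint_url="http://localhost:8000/health/",  # hypothetical health endpoint
    status="ACTIVE",
    health_check_enabled=True,
)

# Define an API response-time metric; thresholds in milliseconds are
# illustrative and mirror PERFORMANCE_API_RESPONSE_THRESHOLD below.
metric = SystemMetric.objects.create(
    name="API Response Time",
    metric_type="PERFORMANCE",
    category="API_RESPONSE_TIME",
    unit="ms",                   # assumed unit string
    aggregation_method="AVG",    # assumed aggregation choice
    warning_threshold=1000,
    critical_threshold=2000,
)
```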
### AlertRule

Defines alert conditions and notifications.

**Fields:**

- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether the rule is enabled

### Alert

Represents triggered alerts.

**Fields:**

- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded

## Configuration

### Environment Variables

```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```

### Celery Configuration

Add to your Celery configuration:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
```

## Setup Instructions

### 1. Install Dependencies

Add to `requirements.txt`:

```
psutil>=5.9.0
requests>=2.31.0
```

### 2. Run Migrations

```bash
python manage.py makemigrations monitoring
python manage.py migrate
```

### 3. Set Up Initial Configuration

```bash
python manage.py setup_monitoring --admin-user admin
```

### 4. Start Celery Workers

```bash
celery -A core worker -l info
celery -A core beat -l info
```

### 5. Access Monitoring

- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`

## Monitoring Best Practices

### 1. Health Checks

- Set appropriate check intervals (not too frequent)
- Use timeouts to prevent hanging checks
- Monitor dependencies and external services
- Implement graceful degradation

### 2. Metrics Collection

- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics

### 3. Alerting

- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds

### 4. Dashboards

- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable

## Troubleshooting

### Common Issues

1. **Health Checks Failing**
   - Check network connectivity
   - Verify endpoint URLs
   - Check authentication credentials
   - Review timeout settings

2. **Metrics Not Collecting**
   - Verify Celery workers are running
   - Check metric configuration
   - Review collection intervals
   - Check for errors in logs

3. **Alerts Not Triggering** (see the sketch after this list)
   - Verify alert rules are enabled
   - Check threshold values
   - Review notification channel configuration
   - Check that the alert evaluation task is running

4. **Performance Issues**
   - Monitor system resources
   - Check database query performance
   - Review metric retention settings
   - Optimize collection intervals
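For the "Alerts Not Triggering" case, the quickest check is usually to inspect the rules themselves. The sketch below assumes the `AlertRule` and `Alert` models are importable from `monitoring.models` with the fields documented earlier (`is_enabled`, `condition`, `notification_channels`, `status`); adjust names if your installation differs.

```python
# python manage.py shell
from monitoring.models import AlertRule, Alert  # import path assumed

# Rules that can never fire because they are disabled.
for rule in AlertRule.objects.filter(is_enabled=False):
    print(f"DISABLED: {rule.name} ({rule.alert_type}, severity={rule.severity})")

# Review condition and notification channels of the enabled rules.
for rule in AlertRule.objects.filter(is_enabled=True):
    print(rule.name, rule.condition, rule.notification_channels)

# Confirm whether any alerts have actually been triggered.
print("Open alerts:", Alert.objects.filter(status="TRIGGERED").count())
```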
### Debug Commands

```bash
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```

## Integration with Other Modules

### Security Module

- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies

### Incident Intelligence

- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes

### Automation & Orchestration

- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures

### SLA & On-Call

- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches

### Analytics & Predictive Insights

- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections

## Future Enhancements

### Planned Features

1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting

### Integration Roadmap

1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps