ETB-API Monitoring System Documentation
Overview
The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.
Features
1. Health Monitoring
- System Health Checks: Monitor application, database, cache, and queue health
- Module Health: Individual module status and dependency tracking
- External Integrations: Third-party service health monitoring
- Infrastructure Monitoring: Server resources and network connectivity
2. Metrics Collection
- Performance Metrics: API response times, throughput, error rates
- Business Metrics: Incident counts, MTTR, MTTA, SLA compliance
- Security Metrics: Security events, failed logins, risk assessments
- Infrastructure Metrics: CPU, memory, disk usage
- AI/ML Metrics: Model accuracy, automation success rates
3. Intelligent Alerting
- Threshold Alerts: Configurable thresholds for all metrics
- Anomaly Detection: Statistical anomaly detection
- Pattern Alerts: Pattern-based alerting
- Multi-Channel Notifications: Email, Slack, Webhook support
- Alert Management: Acknowledge, resolve, and track alerts
4. Monitoring Dashboards
- System Overview: High-level system status
- Performance Dashboard: Performance metrics visualization
- Business Metrics: Operational metrics dashboard
- Security Dashboard: Security monitoring dashboard
- Custom Dashboards: User-configurable dashboards
API Endpoints
Base URL
http://localhost:8000/api/monitoring/
Authentication
All endpoints require authentication using Django REST Framework token authentication.
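As a rough illustration of the header format, the sketch below builds an authenticated request with the Python standard library without sending it. The helper name `build_request` is hypothetical and not part of ETB-API; the base URL and placeholder token are the ones shown in this document.

```python
from urllib.request import Request

# Base URL as documented above; adjust for your deployment.
BASE_URL = "http://localhost:8000/api/monitoring/"


def build_request(path: str, token: str, method: str = "GET") -> Request:
    """Build (but do not send) an authenticated request.

    build_request is a hypothetical helper used only for illustration.
    """
    url = BASE_URL + path.lstrip("/")
    # DRF token authentication expects "Token <key>" in the Authorization header.
    return Request(url, method=method, headers={"Authorization": f"Token {token}"})


request = build_request("health-checks/summary/", "your-token-here")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

To actually send the request, pass it to `urllib.request.urlopen(request, timeout=10)` against a running instance.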
Health Checks
Get Health Check Summary
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
Response:
{
  "overall_status": "HEALTHY",
  "total_targets": 12,
  "healthy_targets": 11,
  "warning_targets": 1,
  "critical_targets": 0,
  "health_percentage": 91.67,
  "last_updated": "2024-01-15T10:30:00Z"
}
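The counts and percentage in this response are consistent with a simple tally over per-target statuses (11 of 12 healthy is 91.67%). A minimal sketch of that arithmetic follows; the function name and the per-target status strings are assumptions, and how the service derives `overall_status` is not documented here, so it is left out.

```python
from collections import Counter


def summarize_targets(statuses):
    """Tally per-target statuses into summary fields (hypothetical helper)."""
    counts = Counter(statuses)
    total = len(statuses)
    healthy = counts["HEALTHY"]
    return {
        "total_targets": total,
        "healthy_targets": healthy,
        "warning_targets": counts["WARNING"],
        "critical_targets": counts["CRITICAL"],
        # Share of healthy targets, rounded to two decimals; 0.0 when there are no targets.
        "health_percentage": round(100.0 * healthy / total, 2) if total else 0.0,
    }


summary = summarize_targets(["HEALTHY"] * 11 + ["WARNING"])
print(summary)
```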
Run All Health Checks
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
Response:
{
  "status": "success",
  "message": "Health checks started",
  "task_id": "celery-task-id"
}
Test Target Connection
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
Metrics
Get Metric Measurements
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
Get Metric Trends
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
Response:
{
  "metric_name": "API Response Time",
  "period_days": 7,
  "daily_data": [
    {
      "date": "2024-01-08",
      "value": 150.5,
      "count": 1440
    }
  ],
  "trend": "STABLE"
}
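This document does not say how the `trend` field is computed. One plausible approach, shown here purely as a sketch, compares the average of the first half of the daily values with the average of the second half. The 10% tolerance and the labels `INCREASING` and `DECREASING` are assumptions; only `STABLE` appears in the sample response.

```python
from statistics import mean


def classify_trend(daily_values, tolerance=0.10):
    """Classify a series by comparing first-half and second-half averages.

    Hypothetical helper: the tolerance and the non-STABLE labels are assumptions.
    """
    if len(daily_values) < 2:
        return "STABLE"
    mid = len(daily_values) // 2
    first = mean(daily_values[:mid])
    second = mean(daily_values[mid:])
    if first == 0:
        # Relative change is undefined; fall back to the sign of the later average.
        if second > 0:
            return "INCREASING"
        return "DECREASING" if second < 0 else "STABLE"
    change = (second - first) / abs(first)
    if change > tolerance:
        return "INCREASING"
    if change < -tolerance:
        return "DECREASING"
    return "STABLE"


print(classify_trend([150.5, 151.0, 149.8, 150.2]))
print(classify_trend([100.0, 100.0, 150.0, 160.0]))
```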
Alerts
Get Alert Summary
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
Response:
{
  "total_alerts": 25,
  "critical_alerts": 2,
  "high_alerts": 5,
  "medium_alerts": 8,
  "low_alerts": 10,
  "acknowledged_alerts": 15,
  "resolved_alerts": 20
}
Acknowledge Alert
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
Resolve Alert
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
System Overview
Get System Overview
GET /api/monitoring/overview/
Authorization: Token your-token-here
Response:
{
  "system_status": {
    "status": "OPERATIONAL",
    "message": "All systems operational",
    "started_at": "2024-01-15T09:00:00Z"
  },
  "health_summary": {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 12,
    "health_percentage": 100.0
  },
  "alert_summary": {
    "total_alerts": 0,
    "critical_alerts": 0
  },
  "system_resources": {
    "cpu_percent": 45.2,
    "memory_percent": 67.8,
    "disk_percent": 34.5
  }
}
Monitoring Tasks
Execute Monitoring Tasks
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
Available task types:
- health_checks: Execute health checks for all targets
- metrics_collection: Collect metrics from all sources
- alert_evaluation: Evaluate alert rules and send notifications
- system_status_report: Generate system status report
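A client may want to validate the task type before posting. The sketch below builds the JSON request body for the four task types listed above; `build_task_payload` is a hypothetical helper, and the server's behavior for unknown task types is not documented here.

```python
import json

# Task types as listed in this document.
TASK_TYPES = {
    "health_checks",
    "metrics_collection",
    "alert_evaluation",
    "system_status_report",
}


def build_task_payload(task_type: str) -> str:
    """Return the JSON body for POST /api/monitoring/tasks/ (hypothetical helper)."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"Unknown task type: {task_type!r}")
    return json.dumps({"task_type": task_type})


print(build_task_payload("health_checks"))
```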
Data Models
MonitoringTarget
Represents a system, service, or component to monitor.
Fields:
- name: Target name
- target_type: Type (APPLICATION, DATABASE, CACHE, etc.)
- endpoint_url: Health check endpoint
- status: Current status (ACTIVE, INACTIVE, etc.)
- last_status: Last health check result
- health_check_enabled: Whether health checks are enabled
SystemMetric
Defines metrics to collect and monitor.
Fields:
- name: Metric name
- metric_type: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- category: Category (API_RESPONSE_TIME, MTTR, etc.)
- unit: Unit of measurement
- aggregation_method: How to aggregate values
- warning_threshold: Warning threshold
- critical_threshold: Critical threshold
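To show how the two thresholds might be applied to a measurement, here is a minimal sketch. It assumes higher values are worse (as with response times or CPU usage) and that an unset threshold is `None`; the function name and the `OK` label are illustrative and not taken from the ETB-API code.

```python
def evaluate_threshold(value, warning_threshold=None, critical_threshold=None):
    """Map a measurement to a level, assuming higher values are worse.

    Hypothetical helper; the "OK" label is an assumption.
    """
    # Check the critical threshold first so it takes precedence over warning.
    if critical_threshold is not None and value >= critical_threshold:
        return "CRITICAL"
    if warning_threshold is not None and value >= warning_threshold:
        return "WARNING"
    return "OK"


print(evaluate_threshold(150.5, warning_threshold=1000, critical_threshold=2000))
print(evaluate_threshold(1500, warning_threshold=1000, critical_threshold=2000))
print(evaluate_threshold(2500, warning_threshold=1000, critical_threshold=2000))
```

Metrics where lower is worse (for example, SLA compliance) would need the comparisons inverted.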
AlertRule
Defines alert conditions and notifications.
Fields:
- name: Rule name
- alert_type: Type (THRESHOLD, ANOMALY, etc.)
- severity: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- condition: Alert condition configuration
- notification_channels: Notification channels
- is_enabled: Whether rule is enabled
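For rules of type ANOMALY, this document only states that detection is statistical. One common technique is a z-score test against recent history, sketched below. The function name, the threshold of 3 standard deviations, and the sample values are assumptions rather than the system's actual method.

```python
from statistics import mean, stdev


def is_anomaly(history, value, z_threshold=3.0):
    """Flag a value that lies more than z_threshold standard deviations from the mean.

    Hypothetical helper illustrating a z-score test.
    """
    if len(history) < 2:
        return False  # Not enough data to estimate spread.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical; any deviation is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_threshold


response_times = [150, 152, 149, 151, 150, 148, 151]
print(is_anomaly(response_times, 151))
print(is_anomaly(response_times, 400))
```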
Alert
Represents triggered alerts.
Fields:
- title: Alert title
- description: Alert description
- severity: Alert severity
- status: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- triggered_value: Value that triggered the alert
- threshold_value: Threshold that was exceeded
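The three status values suggest a simple lifecycle. The sketch below encodes one plausible set of transitions; it assumes an alert can be resolved without first being acknowledged and that RESOLVED is final. Neither assumption is confirmed by this document, and the names are illustrative.

```python
# Allowed status transitions (an assumption, not the documented model behavior).
ALLOWED_TRANSITIONS = {
    "TRIGGERED": {"ACKNOWLEDGED", "RESOLVED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status if the change is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move alert from {current} to {new}")
    return new


status = "TRIGGERED"
status = transition(status, "ACKNOWLEDGED")
status = transition(status, "RESOLVED")
print(status)
```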
Configuration
Environment Variables
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60
# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts
# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
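How the application reads these variables is not shown in this document. As a sketch, a settings module could parse them as follows, using the sample values above as defaults; the function name and the dictionary keys are hypothetical.

```python
import os


def load_monitoring_settings(env=None):
    """Parse monitoring-related environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Environment values are strings, so booleans and integers are parsed explicitly.
        "enabled": env.get("MONITORING_ENABLED", "true").strip().lower() == "true",
        "health_check_interval": int(env.get("MONITORING_HEALTH_CHECK_INTERVAL", "60")),
        "metrics_collection_interval": int(
            env.get("MONITORING_METRICS_COLLECTION_INTERVAL", "300")
        ),
        "alert_evaluation_interval": int(env.get("MONITORING_ALERT_EVALUATION_INTERVAL", "60")),
        "api_response_threshold": int(env.get("PERFORMANCE_API_RESPONSE_THRESHOLD", "2000")),
        "cpu_threshold": int(env.get("PERFORMANCE_CPU_THRESHOLD", "80")),
        "memory_threshold": int(env.get("PERFORMANCE_MEMORY_THRESHOLD", "80")),
    }


# Pass an explicit mapping to keep the example independent of the real environment.
settings = load_monitoring_settings({"MONITORING_ENABLED": "false", "PERFORMANCE_CPU_THRESHOLD": "90"})
print(settings)
```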
Celery Configuration
Add to your Celery configuration:
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
Setup Instructions
1. Install Dependencies
Add to requirements.txt:
psutil>=5.9.0
requests>=2.31.0
2. Run Migrations
python manage.py makemigrations monitoring
python manage.py migrate
3. Set Up Initial Configuration
python manage.py setup_monitoring --admin-user admin
4. Start Celery Workers
celery -A core worker -l info
celery -A core beat -l info
5. Access Monitoring
- Admin Interface: http://localhost:8000/admin/monitoring/
- API Documentation: http://localhost:8000/api/monitoring/
- System Overview: http://localhost:8000/api/monitoring/overview/
Monitoring Best Practices
1. Health Checks
- Set appropriate check intervals (not too frequent)
- Use timeouts to prevent hanging checks
- Monitor dependencies and external services
- Implement graceful degradation
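To illustrate the timeout advice, the sketch below wraps an HTTP health check so that a slow or unreachable endpoint yields a failed result instead of blocking the caller. `check_endpoint`, its result keys, and the status strings are assumptions and do not describe `HealthCheckService`. The `opener` parameter exists only so the example can be run with a stub instead of a live endpoint.

```python
import time
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Run one HTTP health check with a hard timeout (hypothetical helper)."""
    started = time.monotonic()
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            status = "HEALTHY" if 200 <= response.status < 300 else "CRITICAL"
    except OSError as exc:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        status, error = "CRITICAL", str(exc)
    elapsed_ms = round((time.monotonic() - started) * 1000, 1)
    return {"status": status, "response_time_ms": elapsed_ms, "error": error}


def simulate_timeout(url, timeout):
    """Stub opener that behaves like an endpoint which never answers."""
    raise TimeoutError("timed out")


result = check_endpoint("http://localhost:8000/health/", timeout=0.1, opener=simulate_timeout)
print(result["status"], result["error"])
```

In real use, omit `opener` so that `urllib.request.urlopen` performs the request.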
2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics
3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds
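One simple way to reduce alert fatigue is a per-rule cooldown that suppresses repeat notifications for a fixed period. The sketch below shows the idea; the class name, the 300-second default, and keying by rule name are assumptions, not features documented for this module.

```python
class NotificationThrottle:
    """Suppress repeat notifications for the same rule within a cooldown window.

    Hypothetical helper; timestamps are plain seconds (for example, time.time()).
    """

    def __init__(self, cooldown_seconds=300.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # rule name -> timestamp of the last notification sent

    def should_notify(self, rule_name, now):
        last = self._last_sent.get(rule_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # Still inside the cooldown window.
        self._last_sent[rule_name] = now
        return True


throttle = NotificationThrottle(cooldown_seconds=300.0)
print(throttle.should_notify("High CPU", now=0.0))    # first alert is sent
print(throttle.should_notify("High CPU", now=120.0))  # suppressed, within cooldown
print(throttle.should_notify("High CPU", now=400.0))  # sent again after cooldown
```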
4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable
Troubleshooting
Common Issues
Health Checks Failing
- Check network connectivity
- Verify endpoint URLs
- Check authentication credentials
- Review timeout settings
Metrics Not Collecting
- Verify Celery workers are running
- Check metric configuration
- Review collection intervals
- Check for errors in logs
Alerts Not Triggering
- Verify alert rules are enabled
- Check threshold values
- Review notification channel configuration
- Check alert evaluation task is running
Performance Issues
- Monitor system resources
- Check database query performance
- Review metric retention settings
- Optimize collection intervals
Debug Commands
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()
# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')
# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
Integration with Other Modules
Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies
Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes
Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures
SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches
Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections
Future Enhancements
Planned Features
- Advanced Anomaly Detection: Machine learning-based anomaly detection
- Predictive Alerting: Predict and prevent issues before they occur
- Custom Metrics: User-defined custom metrics
- Advanced Dashboards: Interactive dashboards with drill-down capabilities
- Mobile App: Mobile monitoring application
- Integration APIs: APIs for external monitoring tools
- Cost Optimization: Resource usage optimization recommendations
- Compliance Reporting: Automated compliance reporting
Integration Roadmap
- APM Tools: New Relic, DataDog, AppDynamics
- Log Aggregation: ELK Stack, Splunk, Fluentd
- Infrastructure Monitoring: Prometheus, Grafana, InfluxDB
- Cloud Platforms: AWS CloudWatch, Azure Monitor, GCP Monitoring
- Communication Platforms: PagerDuty, OpsGenie, VictorOps