# ETB-API Monitoring System Documentation
## Overview
The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.
## Features
### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity
### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR, MTTA, SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates
### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, and webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts
### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards
## API Endpoints
### Base URL
```
http://localhost:8000/api/monitoring/
```
### Authentication
All endpoints require Django REST Framework token authentication.
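A minimal client sketch using the `requests` library (the token below is a placeholder; substitute a real DRF token for your user):
```python
import requests

BASE_URL = "http://localhost:8000/api/monitoring"
HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token

# Every monitoring endpoint expects the token in the Authorization header
response = requests.get(f"{BASE_URL}/health-checks/summary/", headers=HEADERS, timeout=10)
response.raise_for_status()
print(response.json())
```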
### Health Checks
#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
  "overall_status": "HEALTHY",
  "total_targets": 12,
  "healthy_targets": 11,
  "warning_targets": 1,
  "critical_targets": 0,
  "health_percentage": 91.67,
  "last_updated": "2024-01-15T10:30:00Z"
}
```
#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```
**Response:**
```json
{
  "status": "success",
  "message": "Health checks started",
  "task_id": "celery-task-id"
}
```
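Because the checks run asynchronously in Celery, the client only receives the task id. A sketch of triggering the run (placeholder token):
```python
import requests

HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token

# Queue health checks for every target; the work itself runs in a Celery worker
resp = requests.post(
    "http://localhost:8000/api/monitoring/health-checks/run_all_checks/",
    headers=HEADERS,
    timeout=10,
)
print("Queued as Celery task", resp.json().get("task_id"))
```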
#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```
### Metrics
#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```
#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```
**Response:**
```json
{
  "metric_name": "API Response Time",
  "period_days": 7,
  "daily_data": [
    {
      "date": "2024-01-08",
      "value": 150.5,
      "count": 1440
    }
  ],
  "trend": "STABLE"
}
```
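For example, a client could pull a week of trend data and compute a simple average of the daily values (sketch; the token and `metric_id` are placeholders):
```python
import requests

HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token
metric_id = 1  # hypothetical metric id

resp = requests.get(
    f"http://localhost:8000/api/monitoring/metrics/{metric_id}/trends/",
    params={"days": 7},
    headers=HEADERS,
    timeout=10,
)
trends = resp.json()
values = [point["value"] for point in trends["daily_data"]]
average = sum(values) / len(values) if values else 0.0
print(f"{trends['metric_name']}: average {average:.1f} over {trends['period_days']} days ({trends['trend']})")
```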
### Alerts
#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
  "total_alerts": 25,
  "critical_alerts": 2,
  "high_alerts": 5,
  "medium_alerts": 8,
  "low_alerts": 10,
  "acknowledged_alerts": 15,
  "resolved_alerts": 20
}
```
#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```
#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```
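A typical workflow acknowledges an alert while it is being investigated and resolves it once the underlying issue is fixed. A sketch (placeholder token and alert id):
```python
import requests

HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token
alert_id = 42  # hypothetical alert id
base = "http://localhost:8000/api/monitoring/alerts"

# Acknowledge first, then resolve once the underlying issue is fixed
requests.post(f"{base}/{alert_id}/acknowledge/", headers=HEADERS, timeout=10).raise_for_status()
requests.post(f"{base}/{alert_id}/resolve/", headers=HEADERS, timeout=10).raise_for_status()
```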
### System Overview
#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```
**Response:**
```json
{
  "system_status": {
    "status": "OPERATIONAL",
    "message": "All systems operational",
    "started_at": "2024-01-15T09:00:00Z"
  },
  "health_summary": {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 12,
    "health_percentage": 100.0
  },
  "alert_summary": {
    "total_alerts": 0,
    "critical_alerts": 0
  },
  "system_resources": {
    "cpu_percent": 45.2,
    "memory_percent": 67.8,
    "disk_percent": 34.5
  }
}
```
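The overview is convenient for quick resource checks. For instance, a script might flag any resource above an assumed 80% warning level, mirroring the example thresholds in the Configuration section (sketch; placeholder token):
```python
import requests

HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token

overview = requests.get(
    "http://localhost:8000/api/monitoring/overview/", headers=HEADERS, timeout=10
).json()

# Flag any resource running above an assumed 80% warning level
for name, value in overview["system_resources"].items():
    if value >= 80:
        print(f"WARNING: {name} at {value}%")
```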
### Monitoring Tasks
#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```
**Available task types:**
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report
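For example, a one-off metrics collection run can be queued outside the regular Celery beat schedule (sketch; placeholder token):
```python
import requests

HEADERS = {"Authorization": "Token your-token-here"}  # placeholder token

resp = requests.post(
    "http://localhost:8000/api/monitoring/tasks/",
    json={"task_type": "metrics_collection"},
    headers=HEADERS,
    timeout=10,
)
print(resp.json())
```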
## Data Models
### MonitoringTarget
Represents a system, service, or component to monitor.
**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled
### SystemMetric
Defines metrics to collect and monitor.
**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold
### AlertRule
Defines alert conditions and notifications.
**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled
### Alert
Represents triggered alerts.
**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded
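For orientation, a target model might be declared along these lines; this is an illustrative sketch only, and the actual definitions in `monitoring/models.py` may use different field options and choices:
```python
# Illustrative sketch only -- not the project's actual model definition
from django.db import models

class MonitoringTarget(models.Model):
    TARGET_TYPES = [
        ("APPLICATION", "Application"),
        ("DATABASE", "Database"),
        ("CACHE", "Cache"),
    ]

    name = models.CharField(max_length=255)
    target_type = models.CharField(max_length=50, choices=TARGET_TYPES)
    endpoint_url = models.URLField(blank=True)
    status = models.CharField(max_length=20, default="ACTIVE")
    last_status = models.CharField(max_length=20, blank=True)
    health_check_enabled = models.BooleanField(default=True)
```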
## Configuration
### Environment Variables
```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60
# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts
# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```
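One way a Django settings module could consume these variables, falling back to the documented defaults (a sketch, not the project's actual `settings.py`):
```python
# settings.py (sketch) -- read monitoring configuration from the environment
import os

MONITORING_ENABLED = os.getenv("MONITORING_ENABLED", "true").lower() == "true"
MONITORING_HEALTH_CHECK_INTERVAL = int(os.getenv("MONITORING_HEALTH_CHECK_INTERVAL", "60"))
MONITORING_METRICS_COLLECTION_INTERVAL = int(os.getenv("MONITORING_METRICS_COLLECTION_INTERVAL", "300"))
MONITORING_ALERT_EVALUATION_INTERVAL = int(os.getenv("MONITORING_ALERT_EVALUATION_INTERVAL", "60"))
PERFORMANCE_API_RESPONSE_THRESHOLD = int(os.getenv("PERFORMANCE_API_RESPONSE_THRESHOLD", "2000"))
```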
### Celery Configuration
Add to your Celery configuration:
```python
from celery.schedules import crontab
CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
```
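Each scheduled entry maps to a Celery task. As a rough sketch (the real implementations in `monitoring/tasks.py` may differ), a periodic task such as `execute_health_checks` could iterate over the enabled targets:
```python
# Sketch only -- the actual task in monitoring/tasks.py may differ
from celery import shared_task

@shared_task
def execute_health_checks():
    from monitoring.models import MonitoringTarget
    from monitoring.services.health_checks import HealthCheckService

    service = HealthCheckService()
    for target in MonitoringTarget.objects.filter(health_check_enabled=True):
        service.execute_health_check(target, "HTTP")
```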
## Setup Instructions
### 1. Install Dependencies
Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```
### 2. Run Migrations
```bash
python manage.py makemigrations monitoring
python manage.py migrate
```
### 3. Set Up Initial Configuration
```bash
python manage.py setup_monitoring --admin-user admin
```
### 4. Start Celery Workers
```bash
celery -A core worker -l info
celery -A core beat -l info
```
### 5. Access Monitoring
- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`
## Monitoring Best Practices
### 1. Health Checks
- Set check intervals that catch failures quickly without overloading the target
- Use timeouts to prevent hanging checks (see the sketch after this list)
- Monitor dependencies and external services
- Implement graceful degradation
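As an illustration of the timeout guidance, a custom HTTP check might look like this (a sketch using `requests`; `check_endpoint` is a hypothetical helper, not part of the monitoring module):
```python
import requests

def check_endpoint(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        response = requests.get(url, timeout=timeout_seconds)
        return response.ok
    except requests.RequestException:
        # Connection errors and timeouts both count as an unhealthy result
        return False
```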
### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics
### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds
### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable
## Troubleshooting
### Common Issues
1. **Health Checks Failing**
   - Check network connectivity
   - Verify endpoint URLs
   - Check authentication credentials
   - Review timeout settings
2. **Metrics Not Collecting**
   - Verify Celery workers are running
   - Check metric configuration
   - Review collection intervals
   - Check for errors in logs
3. **Alerts Not Triggering**
   - Verify alert rules are enabled
   - Check threshold values
   - Review notification channel configuration
   - Verify the alert evaluation task is running
4. **Performance Issues**
   - Monitor system resources
   - Check database query performance
   - Review metric retention settings
   - Optimize collection intervals
### Debug Commands
```bash
# Open a Django shell
python manage.py shell
```
```python
# Check monitoring status
from monitoring.services.health_checks import HealthCheckService
service = HealthCheckService()
service.get_system_health_summary()

# Test a health check against the first configured target
from monitoring.models import MonitoringTarget
target = MonitoringTarget.objects.first()
service.execute_health_check(target, 'HTTP')

# Check metrics collection
from monitoring.services.metrics_collector import MetricsCollector
collector = MetricsCollector()
collector.collect_all_metrics()
```
## Integration with Other Modules
### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies
### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes
### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures
### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches
### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections
## Future Enhancements
### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting
### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps