# ETB-API Monitoring System Documentation

## Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

## Features

### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity

### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR, MTTA, SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates

### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, Webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts

### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards

## API Endpoints

### Base URL
```
http://localhost:8000/api/monitoring/
```

### Authentication
All endpoints require Django REST Framework token authentication. Send the token in the `Authorization` header in the form `Token <your-token>`, as in the requests in this section.

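As a minimal sketch of the header format described above, the following builds (but does not send) an authenticated request using only the Python standard library. The helper name `build_request` is an assumption made for illustration; the base URL and placeholder token are the ones used in this document.

```python
import urllib.request

# Base URL as given in this document.
BASE_URL = "http://localhost:8000/api/monitoring/"


def build_request(path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build an authenticated request for a monitoring endpoint without sending it."""
    url = BASE_URL + path.lstrip("/")
    return urllib.request.Request(
        url,
        method=method,
        headers={
            # DRF token authentication expects the literal prefix "Token ".
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )


request = build_request("health-checks/summary/", "your-token-here")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a live server.
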
### Health Checks

#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "overall_status": "HEALTHY",
  "total_targets": 12,
  "healthy_targets": 11,
  "warning_targets": 1,
  "critical_targets": 0,
  "health_percentage": 91.67,
  "last_updated": "2024-01-15T10:30:00Z"
}
```

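The `health_percentage` in the sample response is consistent with healthy targets divided by total targets (11 of 12). The sketch below reproduces that arithmetic; it assumes this is how the field is derived, and the zero-target fallback of `0.0` is likewise an assumption rather than documented behaviour.

```python
def health_percentage(healthy_targets: int, total_targets: int) -> float:
    """Share of healthy targets as a percentage, rounded to two decimals."""
    if total_targets == 0:
        # Assumption: report 0.0 when no targets are configured.
        return 0.0
    return round(healthy_targets / total_targets * 100, 2)


summary = {"total_targets": 12, "healthy_targets": 11}
print(health_percentage(summary["healthy_targets"], summary["total_targets"]))
```
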
#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "status": "success",
  "message": "Health checks started",
  "task_id": "celery-task-id"
}
```

#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```

### Metrics

#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```

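A small sketch of assembling the measurements URL with its `hours` and `limit` query parameters. The function name `measurements_url` and the metric ID `42` are hypothetical, and the default values simply mirror the example request; they are not necessarily the server's defaults.

```python
from urllib.parse import urlencode, urljoin

# Base URL as given in this document.
BASE_URL = "http://localhost:8000/api/monitoring/"


def measurements_url(metric_id: str, hours: int = 24, limit: int = 100) -> str:
    """Return the measurements URL for one metric with its query string."""
    path = f"metrics/{metric_id}/measurements/"
    query = urlencode({"hours": hours, "limit": limit})
    return f"{urljoin(BASE_URL, path)}?{query}"


# "42" is a placeholder metric ID used only for illustration.
print(measurements_url("42"))
```
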
#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```

**Response:**
```json
{
  "metric_name": "API Response Time",
  "period_days": 7,
  "daily_data": [
    {
      "date": "2024-01-08",
      "value": 150.5,
      "count": 1440
    }
  ],
  "trend": "STABLE"
}
```

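This document does not say how the `trend` field is computed. Purely as an illustration, one simple approach compares the mean of the later half of the daily values with the mean of the earlier half. Everything in the sketch below is an assumption: the function name, the 5% tolerance, and the `INCREASING` and `DECREASING` labels (only `STABLE` appears in the sample response).

```python
from statistics import mean


def classify_trend(daily_values, tolerance=0.05):
    """Label a series by the relative change between its two halves."""
    if len(daily_values) < 2:
        return "STABLE"
    midpoint = len(daily_values) // 2
    earlier = mean(daily_values[:midpoint])
    later = mean(daily_values[midpoint:])
    if earlier == 0:
        # Avoid dividing by zero; any growth from zero counts as an increase.
        return "STABLE" if later == 0 else "INCREASING"
    change = (later - earlier) / abs(earlier)
    if change > tolerance:
        return "INCREASING"
    if change < -tolerance:
        return "DECREASING"
    return "STABLE"


# Hypothetical daily averages, in the same spirit as the "value" field above.
print(classify_trend([150.5, 151.0, 149.8, 150.2]))
```
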
### Alerts

#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "total_alerts": 25,
  "critical_alerts": 2,
  "high_alerts": 5,
  "medium_alerts": 8,
  "low_alerts": 10,
  "acknowledged_alerts": 15,
  "resolved_alerts": 20
}
```

#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```

#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```

### System Overview

#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "system_status": {
    "status": "OPERATIONAL",
    "message": "All systems operational",
    "started_at": "2024-01-15T09:00:00Z"
  },
  "health_summary": {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 12,
    "health_percentage": 100.0
  },
  "alert_summary": {
    "total_alerts": 0,
    "critical_alerts": 0
  },
  "system_resources": {
    "cpu_percent": 45.2,
    "memory_percent": 67.8,
    "disk_percent": 34.5
  }
}
```

### Monitoring Tasks

#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```

**Available task types:**
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report

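A client can reject unknown task types before calling the endpoint. The sketch below builds the JSON body for the request above; the helper name `build_task_payload` is an assumption, and the set of task types is copied from the list in this section.

```python
import json

# Task types listed in this document.
TASK_TYPES = {
    "health_checks",
    "metrics_collection",
    "alert_evaluation",
    "system_status_report",
}


def build_task_payload(task_type: str) -> str:
    """Return the JSON request body for POST /api/monitoring/tasks/."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"Unknown task_type: {task_type!r}")
    return json.dumps({"task_type": task_type})


print(build_task_payload("health_checks"))
```
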
## Data Models

### MonitoringTarget
Represents a system, service, or component to monitor.

**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled

### SystemMetric
Defines metrics to collect and monitor.

**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold

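To illustrate how `warning_threshold` and `critical_threshold` might be applied to a measured value, here is a plain-Python sketch. It is not the actual Django model: the `MetricThresholds` class, the `classify` method, the `OK`/`WARNING`/`CRITICAL` labels, and the rule that higher values are worse are all assumptions, as are the threshold values in the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricThresholds:
    """Hypothetical stand-in for the threshold fields of SystemMetric."""

    name: str
    unit: str
    warning_threshold: Optional[float] = None
    critical_threshold: Optional[float] = None

    def classify(self, value: float) -> str:
        # Assumption: a value at or above a threshold breaches it.
        if self.critical_threshold is not None and value >= self.critical_threshold:
            return "CRITICAL"
        if self.warning_threshold is not None and value >= self.warning_threshold:
            return "WARNING"
        return "OK"


# Threshold values chosen only for illustration.
cpu = MetricThresholds("CPU Usage", "percent", warning_threshold=70, critical_threshold=90)
for sample in (45.2, 75.0, 95.0):
    print(sample, cpu.classify(sample))
```

Checking the critical threshold first ensures that a value breaching both thresholds is reported at the higher severity.
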
### AlertRule
Defines alert conditions and notifications.

**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled

### Alert
Represents triggered alerts.

**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded

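The three `status` values suggest a simple lifecycle that the acknowledge and resolve endpoints move an alert through. The sketch below models it as a small state machine. The transition table is an assumption (in particular, that a triggered alert may be resolved without being acknowledged first and that a resolved alert is final); the names `AlertStatus`, `ALLOWED_TRANSITIONS`, and `transition` are illustrative.

```python
from enum import Enum


class AlertStatus(Enum):
    TRIGGERED = "TRIGGERED"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Assumed transitions; the real rules may differ.
ALLOWED_TRANSITIONS = {
    AlertStatus.TRIGGERED: {AlertStatus.ACKNOWLEDGED, AlertStatus.RESOLVED},
    AlertStatus.ACKNOWLEDGED: {AlertStatus.RESOLVED},
    AlertStatus.RESOLVED: set(),
}


def transition(current: AlertStatus, target: AlertStatus) -> AlertStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move alert from {current.value} to {target.value}")
    return target


status = transition(AlertStatus.TRIGGERED, AlertStatus.ACKNOWLEDGED)
status = transition(status, AlertStatus.RESOLVED)
print(status.value)
```
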
## Configuration

### Environment Variables

```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```

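One way a settings module might read some of these variables is sketched below. The helper names (`env_bool`, `env_int`, `load_monitoring_settings`) and the returned dictionary keys are assumptions, and the fallback values simply mirror the sample values above rather than any documented defaults.

```python
import os


def env_bool(environ, name: str, default: bool) -> bool:
    """Interpret common truthy strings such as "true" or "1"."""
    value = environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def env_int(environ, name: str, default: int) -> int:
    """Read an integer setting, falling back to the default when unset."""
    value = environ.get(name)
    return default if value is None else int(value)


def load_monitoring_settings(environ=os.environ) -> dict:
    # Fallbacks mirror the sample values in this document (an assumption).
    return {
        "enabled": env_bool(environ, "MONITORING_ENABLED", True),
        "health_check_interval": env_int(environ, "MONITORING_HEALTH_CHECK_INTERVAL", 60),
        "metrics_interval": env_int(environ, "MONITORING_METRICS_COLLECTION_INTERVAL", 300),
        "alert_interval": env_int(environ, "MONITORING_ALERT_EVALUATION_INTERVAL", 60),
        "cpu_threshold": env_int(environ, "PERFORMANCE_CPU_THRESHOLD", 80),
    }


print(load_monitoring_settings({"MONITORING_HEALTH_CHECK_INTERVAL": "30"}))
```

Accepting the environment as a parameter keeps the loader easy to exercise with a plain dictionary, as in the last line.
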
### Celery Configuration

Add to your Celery configuration:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
```

## Setup Instructions

### 1. Install Dependencies

Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```

### 2. Run Migrations

```bash
python manage.py makemigrations monitoring
python manage.py migrate
```

### 3. Set Up Initial Configuration

```bash
python manage.py setup_monitoring --admin-user admin
```

### 4. Start Celery Workers

```bash
# Worker: executes the monitoring tasks
celery -A core worker -l info
# Beat scheduler: triggers the periodic tasks (run in a separate terminal)
celery -A core beat -l info
```

### 5. Access Monitoring

- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`

## Monitoring Best Practices

### 1. Health Checks
- Set appropriate check intervals (not too frequent)
- Use timeouts to prevent hanging checks
- Monitor dependencies and external services
- Implement graceful degradation

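The timeout advice above can be sketched with the standard library as follows. This is not the project's `HealthCheckService`; the function name `check_endpoint`, the 5-second default, the injectable `opener` parameter, and the example URL are assumptions made so the sketch can run without a live server.

```python
import urllib.request


def check_endpoint(url, timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Return "HEALTHY" for a 2xx response and "CRITICAL" otherwise."""
    try:
        # The timeout bounds how long an unresponsive target can block the check.
        with opener(url, timeout=timeout_seconds) as response:
            return "HEALTHY" if 200 <= response.status < 300 else "CRITICAL"
    except OSError:
        # URLError, HTTPError, and TimeoutError are all subclasses of OSError.
        return "CRITICAL"


def simulated_hang(url, timeout):
    """Stand-in opener that behaves like a target that never answers."""
    raise TimeoutError("simulated unresponsive target")


# Hypothetical health endpoint; the stub opener avoids real network traffic.
print(check_endpoint("http://localhost:8000/health/", opener=simulated_hang))
```

Treating a timeout as a failed check, rather than letting the exception propagate, keeps one slow target from stalling the checks that follow it.
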
### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics

### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds

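One common way to limit alert fatigue is a per-alert cooldown: after a notification is sent for a given alert key, repeats are suppressed until the cooldown has elapsed. The sketch below shows the idea; the class name `NotificationThrottle`, the alert keys, and the 300-second cooldown are illustrative assumptions, not part of the monitoring module.

```python
class NotificationThrottle:
    """Suppress repeat notifications for the same alert key within a cooldown."""

    def __init__(self, cooldown_seconds: float = 300.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of the last notification

    def should_notify(self, alert_key: str, now: float) -> bool:
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window
        self._last_sent[alert_key] = now
        return True


throttle = NotificationThrottle(cooldown_seconds=300.0)
# Timestamps are seconds on an arbitrary clock, passed in explicitly.
for moment in (0.0, 60.0, 300.0):
    print(moment, throttle.should_notify("cpu-high", moment))
```

Passing `now` explicitly, instead of calling `time.time()` inside the method, keeps the behaviour deterministic and easy to verify.
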
### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable

## Troubleshooting

### Common Issues

1. **Health Checks Failing**
   - Check network connectivity
   - Verify endpoint URLs
   - Check authentication credentials
   - Review timeout settings

2. **Metrics Not Collecting**
   - Verify that Celery workers are running
   - Check metric configuration
   - Review collection intervals
   - Check for errors in logs

3. **Alerts Not Triggering**
   - Verify that alert rules are enabled
   - Check threshold values
   - Review notification channel configuration
   - Check that the alert evaluation task is running

4. **Performance Issues**
   - Monitor system resources
   - Check database query performance
   - Review metric retention settings
   - Optimize collection intervals

### Debug Commands

```bash
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```

## Integration with Other Modules

### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies

### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes

### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures

### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches

### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections

## Future Enhancements

### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting

### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps