Updates
459
ETB-API/monitoring/Documentations/MONITORING_SYSTEM_API.md
Normal file
@@ -0,0 +1,459 @@
# ETB-API Monitoring System Documentation

## Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

## Features

### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity

### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR, MTTA, SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates

### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, Webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts

### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards

## API Endpoints

### Base URL
```
http://localhost:8000/api/monitoring/
```

### Authentication
All endpoints require authentication using Django REST Framework token authentication.
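
For example, a client built on the `requests` library (already listed under Setup Instructions) attaches the token in the `Authorization` header. This is a minimal sketch; the base URL and token value are placeholders:

```python
import requests

BASE_URL = "http://localhost:8000/api/monitoring/"
TOKEN = "your-token-here"  # placeholder: provision a DRF token for your user

# Every monitoring endpoint expects the token in the Authorization header.
headers = {"Authorization": f"Token {TOKEN}"}

response = requests.get(f"{BASE_URL}health-checks/summary/", headers=headers, timeout=10)
response.raise_for_status()
print(response.json())
```
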
### Health Checks

#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "overall_status": "HEALTHY",
  "total_targets": 12,
  "healthy_targets": 11,
  "warning_targets": 1,
  "critical_targets": 0,
  "health_percentage": 91.67,
  "last_updated": "2024-01-15T10:30:00Z"
}
```

#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "status": "success",
  "message": "Health checks started",
  "task_id": "celery-task-id"
}
```

#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```

### Metrics

#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```

#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```

**Response:**
```json
{
  "metric_name": "API Response Time",
  "period_days": 7,
  "daily_data": [
    {
      "date": "2024-01-08",
      "value": 150.5,
      "count": 1440
    }
  ],
  "trend": "STABLE"
}
```
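
To consume the trends payload programmatically, a short sketch (endpoint and response fields exactly as documented above; the metric ID and token are placeholders):

```python
import requests

headers = {"Authorization": "Token your-token-here"}
metric_id = "REPLACE-WITH-A-METRIC-ID"

resp = requests.get(
    f"http://localhost:8000/api/monitoring/metrics/{metric_id}/trends/?days=7",
    headers=headers,
    timeout=10,
)
resp.raise_for_status()
trends = resp.json()

print(trends["metric_name"], trends["trend"])
for day in trends["daily_data"]:
    print(day["date"], day["value"], day["count"])
```
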
### Alerts

#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "total_alerts": 25,
  "critical_alerts": 2,
  "high_alerts": 5,
  "medium_alerts": 8,
  "low_alerts": 10,
  "acknowledged_alerts": 15,
  "resolved_alerts": 20
}
```

#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```

#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```

### System Overview

#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```

**Response:**
```json
{
  "system_status": {
    "status": "OPERATIONAL",
    "message": "All systems operational",
    "started_at": "2024-01-15T09:00:00Z"
  },
  "health_summary": {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 12,
    "health_percentage": 100.0
  },
  "alert_summary": {
    "total_alerts": 0,
    "critical_alerts": 0
  },
  "system_resources": {
    "cpu_percent": 45.2,
    "memory_percent": 67.8,
    "disk_percent": 34.5
  }
}
```

### Monitoring Tasks

#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```

**Available task types:**
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report
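
The same request can be issued from Python; the sketch below queues metrics collection (token and host are placeholders, and the response body is simply printed since its exact shape is not documented here):

```python
import requests

headers = {"Authorization": "Token your-token-here"}

resp = requests.post(
    "http://localhost:8000/api/monitoring/tasks/",
    json={"task_type": "metrics_collection"},  # any task type from the list above
    headers=headers,
    timeout=10,
)
resp.raise_for_status()
print(resp.status_code, resp.json())
```
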
## Data Models

### MonitoringTarget
Represents a system, service, or component to monitor.

**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled
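
As an illustration, a target can be registered from the Django shell with the fields listed above (values are hypothetical; depending on the model definition, `created_by` may also need to be set, as the `setup_monitoring` command does):

```python
from django.contrib.auth import get_user_model
from monitoring.models import MonitoringTarget

admin_user = get_user_model().objects.get(username="admin")  # placeholder user

target = MonitoringTarget.objects.create(
    name="Payments Service",                          # hypothetical target
    target_type="APPLICATION",
    endpoint_url="http://payments.internal/health/",  # placeholder URL
    status="ACTIVE",
    health_check_enabled=True,
    created_by=admin_user,
)
print(target.last_status)  # populated after the first health check runs
```
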
### SystemMetric
Defines metrics to collect and monitor.

**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold

### AlertRule
Defines alert conditions and notifications.

**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled
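
A comparable sketch for a threshold rule against one of the default metrics; the exact schema of `condition` and `notification_channels` is defined by the model, so the values below are illustrative only:

```python
from django.contrib.auth import get_user_model
from monitoring.models import AlertRule, SystemMetric

admin_user = get_user_model().objects.get(username="admin")   # placeholder user
metric = SystemMetric.objects.get(name="API Response Time")   # created by setup_monitoring

rule = AlertRule.objects.create(
    name="API response time above critical threshold",
    alert_type="THRESHOLD",
    severity="HIGH",
    condition={"operator": ">", "value": 2000},   # illustrative condition payload
    notification_channels=["email", "slack"],     # illustrative channel list
    metric=metric,
    is_enabled=True,
    created_by=admin_user,
)
```
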
### Alert
Represents triggered alerts.

**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded

## Configuration

### Environment Variables

```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```
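
How these variables are consumed is project-specific; a minimal `settings.py` sketch, assuming plain `os.environ` lookups with the defaults shown above:

```python
# settings.py (sketch)
import os

MONITORING_ENABLED = os.getenv("MONITORING_ENABLED", "true").lower() == "true"
MONITORING_HEALTH_CHECK_INTERVAL = int(os.getenv("MONITORING_HEALTH_CHECK_INTERVAL", "60"))
MONITORING_METRICS_COLLECTION_INTERVAL = int(os.getenv("MONITORING_METRICS_COLLECTION_INTERVAL", "300"))
MONITORING_ALERT_EVALUATION_INTERVAL = int(os.getenv("MONITORING_ALERT_EVALUATION_INTERVAL", "60"))

ALERTING_EMAIL_FROM = os.getenv("ALERTING_EMAIL_FROM", "monitoring@etb-api.com")
ALERTING_SLACK_WEBHOOK_URL = os.getenv("ALERTING_SLACK_WEBHOOK_URL", "")
ALERTING_WEBHOOK_URL = os.getenv("ALERTING_WEBHOOK_URL", "")

PERFORMANCE_API_RESPONSE_THRESHOLD = int(os.getenv("PERFORMANCE_API_RESPONSE_THRESHOLD", "2000"))
PERFORMANCE_CPU_THRESHOLD = int(os.getenv("PERFORMANCE_CPU_THRESHOLD", "80"))
PERFORMANCE_MEMORY_THRESHOLD = int(os.getenv("PERFORMANCE_MEMORY_THRESHOLD", "80"))
```
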
### Celery Configuration

Add to your Celery configuration:

```python
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}
```

## Setup Instructions

### 1. Install Dependencies

Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```

### 2. Run Migrations

```bash
python manage.py makemigrations monitoring
python manage.py migrate
```

### 3. Set Up Initial Configuration

```bash
python manage.py setup_monitoring --admin-user admin
```

### 4. Start Celery Workers

```bash
celery -A core worker -l info
celery -A core beat -l info
```

### 5. Access Monitoring

- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`

## Monitoring Best Practices

### 1. Health Checks
- Set appropriate check intervals (not too frequent)
- Use timeouts to prevent hanging checks
- Monitor dependencies and external services
- Implement graceful degradation
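
A sketch of what "use timeouts" looks like in practice for an HTTP check; the real implementation lives in `monitoring.services.health_checks`, so this is illustrative only:

```python
import requests

def check_http_target(url, timeout_seconds=5, retry_count=2, expected_status_codes=(200,)):
    """Return (healthy, response_time_ms) for one HTTP health check, never hanging."""
    for attempt in range(retry_count + 1):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            healthy = response.status_code in expected_status_codes
            return healthy, response.elapsed.total_seconds() * 1000
        except requests.RequestException:
            if attempt == retry_count:
                return False, None
    return False, None
```
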
### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics

### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds

### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable

## Troubleshooting

### Common Issues

1. **Health Checks Failing**
   - Check network connectivity
   - Verify endpoint URLs
   - Check authentication credentials
   - Review timeout settings

2. **Metrics Not Collecting**
   - Verify Celery workers are running (see the quick check after this list)
   - Check metric configuration
   - Review collection intervals
   - Check for errors in logs

3. **Alerts Not Triggering**
   - Verify alert rules are enabled
   - Check threshold values
   - Review notification channel configuration
   - Check that the alert evaluation task is running

4. **Performance Issues**
   - Monitor system resources
   - Check database query performance
   - Review metric retention settings
   - Optimize collection intervals
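
For the "Metrics Not Collecting" case, a quick worker check from `python manage.py shell` (a sketch; it assumes the Celery application is the `core` instance referenced by the worker commands above):

```python
from core.celery import app  # assumption: the Celery app is defined in core/celery.py

# One entry per live worker; an empty list means no workers are responding.
print(app.control.ping(timeout=2.0))

# Tasks registered on the responding workers (monitoring.tasks.* should appear).
print(app.control.inspect().registered())
```
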
### Debug Commands

```bash
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```

## Integration with Other Modules

### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies

### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes

### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures

### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches

### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections

## Future Enhancements

### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting

### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps
1
ETB-API/monitoring/__init__.py
Normal file
@@ -0,0 +1 @@
# Monitoring module for ETB-API system
BIN    ETB-API/monitoring/__pycache__/__init__.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/admin.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/apps.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/models.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/serializers.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/signals.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/tasks.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/urls.cpython-312.pyc    Normal file (binary file not shown)
BIN    ETB-API/monitoring/__pycache__/views.cpython-312.pyc    Normal file (binary file not shown)
289
ETB-API/monitoring/admin.py
Normal file
@@ -0,0 +1,289 @@
"""
Admin configuration for monitoring models
"""
from django.contrib import admin
from django.utils.html import format_html
from django.urls import reverse
from django.utils import timezone

from monitoring.models import (
    MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
    AlertRule, Alert, MonitoringDashboard, SystemStatus
)


@admin.register(MonitoringTarget)
class MonitoringTargetAdmin(admin.ModelAdmin):
    """Admin for MonitoringTarget model"""

    list_display = [
        'name', 'target_type', 'status', 'last_status', 'last_checked',
        'health_check_enabled', 'related_module', 'created_at'
    ]
    list_filter = ['target_type', 'status', 'last_status', 'health_check_enabled', 'related_module']
    search_fields = ['name', 'description', 'endpoint_url']
    readonly_fields = ['id', 'created_at', 'updated_at', 'last_checked']

    fieldsets = (
        ('Basic Information', {
            'fields': ('id', 'name', 'description', 'target_type', 'related_module')
        }),
        ('Connection Details', {
            'fields': ('endpoint_url', 'connection_config')
        }),
        ('Monitoring Configuration', {
            'fields': (
                'check_interval_seconds', 'timeout_seconds', 'retry_count',
                'health_check_enabled', 'health_check_endpoint', 'expected_status_codes'
            )
        }),
        ('Status', {
            'fields': ('status', 'last_checked', 'last_status')
        }),
        ('Metadata', {
            'fields': ('created_by', 'created_at', 'updated_at'),
            'classes': ('collapse',)
        })
    )

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('created_by')


@admin.register(HealthCheck)
class HealthCheckAdmin(admin.ModelAdmin):
    """Admin for HealthCheck model"""

    list_display = [
        'target_name', 'check_type', 'status', 'response_time_ms',
        'status_code', 'checked_at'
    ]
    list_filter = ['check_type', 'status', 'target__target_type']
    search_fields = ['target__name', 'error_message']
    readonly_fields = ['id', 'checked_at']
    date_hierarchy = 'checked_at'

    def target_name(self, obj):
        return obj.target.name
    target_name.short_description = 'Target'

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('target')


@admin.register(SystemMetric)
class SystemMetricAdmin(admin.ModelAdmin):
    """Admin for SystemMetric model"""

    list_display = [
        'name', 'metric_type', 'category', 'unit', 'is_active',
        'is_system_metric', 'related_module', 'created_at'
    ]
    list_filter = ['metric_type', 'category', 'is_active', 'is_system_metric', 'related_module']
    search_fields = ['name', 'description']
    readonly_fields = ['id', 'created_at', 'updated_at']

    fieldsets = (
        ('Basic Information', {
            'fields': ('id', 'name', 'description', 'metric_type', 'category', 'unit')
        }),
        ('Configuration', {
            'fields': (
                'aggregation_method', 'collection_interval_seconds', 'retention_days',
                'warning_threshold', 'critical_threshold'
            )
        }),
        ('Status', {
            'fields': ('is_active', 'is_system_metric', 'related_module')
        }),
        ('Metadata', {
            'fields': ('created_by', 'created_at', 'updated_at'),
            'classes': ('collapse',)
        })
    )

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('created_by')


@admin.register(MetricMeasurement)
class MetricMeasurementAdmin(admin.ModelAdmin):
    """Admin for MetricMeasurement model"""

    list_display = [
        'metric_name', 'value', 'unit', 'timestamp'
    ]
    list_filter = ['metric__metric_type', 'metric__category', 'timestamp']
    search_fields = ['metric__name']
    readonly_fields = ['id', 'timestamp']
    date_hierarchy = 'timestamp'

    def metric_name(self, obj):
        return obj.metric.name
    metric_name.short_description = 'Metric'

    def unit(self, obj):
        return obj.metric.unit
    unit.short_description = 'Unit'

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('metric')


@admin.register(AlertRule)
class AlertRuleAdmin(admin.ModelAdmin):
    """Admin for AlertRule model"""

    list_display = [
        'name', 'alert_type', 'severity', 'status', 'is_enabled',
        'metric_name', 'target_name', 'created_at'
    ]
    list_filter = ['alert_type', 'severity', 'status', 'is_enabled']
    search_fields = ['name', 'description']
    readonly_fields = ['id', 'created_at', 'updated_at']

    fieldsets = (
        ('Basic Information', {
            'fields': ('id', 'name', 'description', 'alert_type', 'severity')
        }),
        ('Rule Configuration', {
            'fields': ('condition', 'evaluation_interval_seconds')
        }),
        ('Related Objects', {
            'fields': ('metric', 'target')
        }),
        ('Notifications', {
            'fields': ('notification_channels', 'notification_template')
        }),
        ('Status', {
            'fields': ('status', 'is_enabled')
        }),
        ('Metadata', {
            'fields': ('created_by', 'created_at', 'updated_at'),
            'classes': ('collapse',)
        })
    )

    def metric_name(self, obj):
        return obj.metric.name if obj.metric else '-'
    metric_name.short_description = 'Metric'

    def target_name(self, obj):
        return obj.target.name if obj.target else '-'
    target_name.short_description = 'Target'

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('metric', 'target', 'created_by')


@admin.register(Alert)
class AlertAdmin(admin.ModelAdmin):
    """Admin for Alert model"""

    list_display = [
        'title', 'severity', 'status', 'rule_name', 'triggered_value',
        'threshold_value', 'triggered_at', 'acknowledged_by', 'resolved_by'
    ]
    list_filter = ['severity', 'status', 'rule__alert_type', 'triggered_at']
    search_fields = ['title', 'description', 'rule__name']
    readonly_fields = ['id', 'triggered_at']
    date_hierarchy = 'triggered_at'

    fieldsets = (
        ('Alert Information', {
            'fields': ('id', 'rule', 'title', 'description', 'severity', 'status')
        }),
        ('Values', {
            'fields': ('triggered_value', 'threshold_value', 'context_data')
        }),
        ('Timestamps', {
            'fields': ('triggered_at', 'acknowledged_at', 'resolved_at')
        }),
        ('Assignment', {
            'fields': ('acknowledged_by', 'resolved_by')
        })
    )

    def rule_name(self, obj):
        return obj.rule.name
    rule_name.short_description = 'Rule'

    def get_queryset(self, request):
        return super().get_queryset(request).select_related(
            'rule', 'acknowledged_by', 'resolved_by'
        )


@admin.register(MonitoringDashboard)
class MonitoringDashboardAdmin(admin.ModelAdmin):
    """Admin for MonitoringDashboard model"""

    list_display = [
        'name', 'dashboard_type', 'is_active', 'is_public',
        'auto_refresh_enabled', 'created_by', 'created_at'
    ]
    list_filter = ['dashboard_type', 'is_active', 'is_public', 'auto_refresh_enabled']
    search_fields = ['name', 'description']
    readonly_fields = ['id', 'created_at', 'updated_at']
    filter_horizontal = ['allowed_users']

    fieldsets = (
        ('Basic Information', {
            'fields': ('id', 'name', 'description', 'dashboard_type')
        }),
        ('Configuration', {
            'fields': ('layout_config', 'widget_configs')
        }),
        ('Access Control', {
            'fields': ('is_public', 'allowed_users', 'allowed_roles')
        }),
        ('Refresh Settings', {
            'fields': ('auto_refresh_enabled', 'refresh_interval_seconds')
        }),
        ('Status', {
            'fields': ('is_active',)
        }),
        ('Metadata', {
            'fields': ('created_by', 'created_at', 'updated_at'),
            'classes': ('collapse',)
        })
    )

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('created_by')


@admin.register(SystemStatus)
class SystemStatusAdmin(admin.ModelAdmin):
    """Admin for SystemStatus model"""

    list_display = [
        'status', 'message', 'started_at', 'resolved_at', 'is_resolved',
        'created_by'
    ]
    list_filter = ['status', 'started_at', 'resolved_at']
    search_fields = ['message', 'affected_services']
    readonly_fields = ['id', 'started_at', 'updated_at', 'is_resolved']
    date_hierarchy = 'started_at'

    fieldsets = (
        ('Status Information', {
            'fields': ('id', 'status', 'message', 'affected_services')
        }),
        ('Timeline', {
            'fields': ('started_at', 'updated_at', 'resolved_at', 'estimated_resolution')
        }),
        ('Metadata', {
            'fields': ('created_by', 'is_resolved'),
            'classes': ('collapse',)
        })
    )

    def get_queryset(self, request):
        return super().get_queryset(request).select_related('created_by')


# Custom admin site configuration
admin.site.site_header = "ETB-API Monitoring Administration"
admin.site.site_title = "ETB-API Monitoring"
admin.site.index_title = "Monitoring System Administration"
12
ETB-API/monitoring/apps.py
Normal file
@@ -0,0 +1,12 @@
from django.apps import AppConfig


class MonitoringConfig(AppConfig):
    default_auto_field = 'django.db.models.BigAutoField'
    name = 'monitoring'
    verbose_name = 'System Monitoring'

    def ready(self):
        """Initialize monitoring when Django starts"""
        import monitoring.signals
        import monitoring.tasks
795
ETB-API/monitoring/enterprise_monitoring.py
Normal file
@@ -0,0 +1,795 @@
|
||||
"""
|
||||
Enterprise Monitoring System for ETB-API
|
||||
Advanced monitoring with metrics, alerting, and observability
|
||||
"""
|
||||
import logging
|
||||
import time
|
||||
import psutil
|
||||
import json
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Optional, Any, Union
|
||||
from django.http import HttpRequest, JsonResponse
|
||||
from django.conf import settings
|
||||
from django.utils import timezone
|
||||
from django.core.cache import cache
|
||||
from django.db import connection
|
||||
from django.core.management import call_command
|
||||
from rest_framework import status
|
||||
from rest_framework.response import Response
|
||||
from rest_framework.views import APIView
|
||||
from rest_framework.decorators import api_view, permission_classes
|
||||
from rest_framework.permissions import IsAuthenticated
|
||||
from django.core.management.base import BaseCommand
|
||||
import requests
|
||||
import redis
|
||||
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
|
||||
from prometheus_client.core import CollectorRegistry
|
||||
import threading
|
||||
import queue
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class MetricsCollector:
|
||||
"""Enterprise metrics collection system"""
|
||||
|
||||
def __init__(self):
|
||||
self.registry = CollectorRegistry()
|
||||
self.metrics = self._initialize_metrics()
|
||||
self.collection_interval = 60 # seconds
|
||||
self.is_running = False
|
||||
self.collection_thread = None
|
||||
|
||||
def _initialize_metrics(self):
|
||||
"""Initialize Prometheus metrics"""
|
||||
metrics = {}
|
||||
|
||||
# Application metrics
|
||||
metrics['http_requests_total'] = Counter(
|
||||
'http_requests_total',
|
||||
'Total HTTP requests',
|
||||
['method', 'endpoint', 'status_code'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['http_request_duration_seconds'] = Histogram(
|
||||
'http_request_duration_seconds',
|
||||
'HTTP request duration in seconds',
|
||||
['method', 'endpoint'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['active_users'] = Gauge(
|
||||
'active_users',
|
||||
'Number of active users',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['incident_count'] = Gauge(
|
||||
'incident_count',
|
||||
'Total number of incidents',
|
||||
['status', 'priority'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['sla_breach_count'] = Gauge(
|
||||
'sla_breach_count',
|
||||
'Number of SLA breaches',
|
||||
['sla_type'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
# System metrics
|
||||
metrics['system_cpu_usage'] = Gauge(
|
||||
'system_cpu_usage_percent',
|
||||
'System CPU usage percentage',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['system_memory_usage'] = Gauge(
|
||||
'system_memory_usage_percent',
|
||||
'System memory usage percentage',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['system_disk_usage'] = Gauge(
|
||||
'system_disk_usage_percent',
|
||||
'System disk usage percentage',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['database_connections'] = Gauge(
|
||||
'database_connections_active',
|
||||
'Active database connections',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['cache_hit_ratio'] = Gauge(
|
||||
'cache_hit_ratio',
|
||||
'Cache hit ratio',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
# Business metrics
|
||||
metrics['incident_resolution_time'] = Histogram(
|
||||
'incident_resolution_time_seconds',
|
||||
'Incident resolution time in seconds',
|
||||
['priority', 'category'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['automation_success_rate'] = Gauge(
|
||||
'automation_success_rate',
|
||||
'Automation success rate',
|
||||
['automation_type'],
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
metrics['user_satisfaction_score'] = Gauge(
|
||||
'user_satisfaction_score',
|
||||
'User satisfaction score',
|
||||
registry=self.registry
|
||||
)
|
||||
|
||||
return metrics
|
||||
|
||||
def start_collection(self):
|
||||
"""Start metrics collection in background thread"""
|
||||
if self.is_running:
|
||||
return
|
||||
|
||||
self.is_running = True
|
||||
self.collection_thread = threading.Thread(target=self._collect_metrics_loop)
|
||||
self.collection_thread.daemon = True
|
||||
self.collection_thread.start()
|
||||
logger.info("Metrics collection started")
|
||||
|
||||
def stop_collection(self):
|
||||
"""Stop metrics collection"""
|
||||
self.is_running = False
|
||||
if self.collection_thread:
|
||||
self.collection_thread.join()
|
||||
logger.info("Metrics collection stopped")
|
||||
|
||||
def _collect_metrics_loop(self):
|
||||
"""Main metrics collection loop"""
|
||||
while self.is_running:
|
||||
try:
|
||||
self._collect_system_metrics()
|
||||
self._collect_application_metrics()
|
||||
self._collect_business_metrics()
|
||||
time.sleep(self.collection_interval)
|
||||
except Exception as e:
|
||||
logger.error(f"Error collecting metrics: {str(e)}")
|
||||
time.sleep(self.collection_interval)
|
||||
|
||||
def _collect_system_metrics(self):
|
||||
"""Collect system-level metrics"""
|
||||
try:
|
||||
# CPU usage
|
||||
cpu_percent = psutil.cpu_percent(interval=1)
|
||||
self.metrics['system_cpu_usage'].set(cpu_percent)
|
||||
|
||||
# Memory usage
|
||||
memory = psutil.virtual_memory()
|
||||
self.metrics['system_memory_usage'].set(memory.percent)
|
||||
|
||||
# Disk usage
|
||||
disk = psutil.disk_usage('/')
|
||||
disk_percent = (disk.used / disk.total) * 100
|
||||
self.metrics['system_disk_usage'].set(disk_percent)
|
||||
|
||||
# Database connections
|
||||
with connection.cursor() as cursor:
|
||||
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
|
||||
db_connections = cursor.fetchone()[0]
|
||||
self.metrics['database_connections'].set(db_connections)
|
||||
|
||||
# Cache hit ratio
|
||||
cache_stats = cache._cache.get_stats()
|
||||
if cache_stats:
|
||||
hit_ratio = cache_stats.get('hit_ratio', 0)
|
||||
self.metrics['cache_hit_ratio'].set(hit_ratio)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error collecting system metrics: {str(e)}")
|
||||
|
||||
def _collect_application_metrics(self):
|
||||
"""Collect application-level metrics"""
|
||||
try:
|
||||
# Active users (from cache)
|
||||
active_users = cache.get('active_users_count', 0)
|
||||
self.metrics['active_users'].set(active_users)
|
||||
|
||||
# Incident counts
|
||||
from incident_intelligence.models import Incident
|
||||
from django.db import models
|
||||
incident_counts = Incident.objects.values('status', 'priority').annotate(
|
||||
count=models.Count('id')
|
||||
)
|
||||
|
||||
for incident in incident_counts:
|
||||
self.metrics['incident_count'].labels(
|
||||
status=incident['status'],
|
||||
priority=incident['priority']
|
||||
).set(incident['count'])
|
||||
|
||||
# SLA breach counts
|
||||
from sla_oncall.models import SLAInstance
|
||||
sla_breaches = SLAInstance.objects.filter(
|
||||
status='breached'
|
||||
).values('sla_type').annotate(
|
||||
count=models.Count('id')
|
||||
)
|
||||
|
||||
for breach in sla_breaches:
|
||||
self.metrics['sla_breach_count'].labels(
|
||||
sla_type=breach['sla_type']
|
||||
).set(breach['count'])
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error collecting application metrics: {str(e)}")
|
||||
|
||||
def _collect_business_metrics(self):
|
||||
"""Collect business-level metrics"""
|
||||
try:
|
||||
# Incident resolution times
|
||||
from incident_intelligence.models import Incident
|
||||
from django.db import models
|
||||
resolved_incidents = Incident.objects.filter(
|
||||
status='resolved',
|
||||
resolved_at__isnull=False
|
||||
).values('priority', 'category')
|
||||
|
||||
for incident in resolved_incidents:
|
||||
resolution_time = (incident['resolved_at'] - incident['created_at']).total_seconds()
|
||||
self.metrics['incident_resolution_time'].labels(
|
||||
priority=incident['priority'],
|
||||
category=incident['category']
|
||||
).observe(resolution_time)
|
||||
|
||||
# Automation success rates
|
||||
from automation_orchestration.models import AutomationExecution
|
||||
from django.db import models
|
||||
automation_stats = AutomationExecution.objects.values('automation_type').annotate(
|
||||
total=models.Count('id'),
|
||||
successful=models.Count('id', filter=models.Q(status='success'))
|
||||
)
|
||||
|
||||
for stat in automation_stats:
|
||||
success_rate = (stat['successful'] / stat['total']) * 100 if stat['total'] > 0 else 0
|
||||
self.metrics['automation_success_rate'].labels(
|
||||
automation_type=stat['automation_type']
|
||||
).set(success_rate)
|
||||
|
||||
# User satisfaction score (from feedback)
|
||||
from knowledge_learning.models import UserFeedback
|
||||
from django.db import models
|
||||
feedback_scores = UserFeedback.objects.values('rating').annotate(
|
||||
count=models.Count('id')
|
||||
)
|
||||
|
||||
total_feedback = sum(f['count'] for f in feedback_scores)
|
||||
if total_feedback > 0:
|
||||
weighted_score = sum(f['rating'] * f['count'] for f in feedback_scores) / total_feedback
|
||||
self.metrics['user_satisfaction_score'].set(weighted_score)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error collecting business metrics: {str(e)}")
|
||||
|
||||
def record_http_request(self, method: str, endpoint: str, status_code: int, duration: float):
|
||||
"""Record HTTP request metrics"""
|
||||
self.metrics['http_requests_total'].labels(
|
||||
method=method,
|
||||
endpoint=endpoint,
|
||||
status_code=str(status_code)
|
||||
).inc()
|
||||
|
||||
self.metrics['http_request_duration_seconds'].labels(
|
||||
method=method,
|
||||
endpoint=endpoint
|
||||
).observe(duration)
|
||||
|
||||
def get_metrics(self) -> str:
|
||||
"""Get metrics in Prometheus format"""
|
||||
return generate_latest(self.registry)
|
||||
|
||||
|
||||
class AlertManager:
|
||||
"""Enterprise alert management system"""
|
||||
|
||||
def __init__(self):
|
||||
self.alert_rules = self._load_alert_rules()
|
||||
self.notification_channels = self._load_notification_channels()
|
||||
self.alert_queue = queue.Queue()
|
||||
self.is_running = False
|
||||
self.alert_thread = None
|
||||
|
||||
def _load_alert_rules(self) -> List[Dict[str, Any]]:
|
||||
"""Load alert rules from configuration"""
|
||||
return [
|
||||
{
|
||||
'name': 'high_cpu_usage',
|
||||
'condition': 'system_cpu_usage > 80',
|
||||
'severity': 'warning',
|
||||
'duration': 300, # 5 minutes
|
||||
'enabled': True,
|
||||
},
|
||||
{
|
||||
'name': 'high_memory_usage',
|
||||
'condition': 'system_memory_usage > 85',
|
||||
'severity': 'warning',
|
||||
'duration': 300,
|
||||
'enabled': True,
|
||||
},
|
||||
{
|
||||
'name': 'disk_space_low',
|
||||
'condition': 'system_disk_usage > 90',
|
||||
'severity': 'critical',
|
||||
'duration': 60,
|
||||
'enabled': True,
|
||||
},
|
||||
{
|
||||
'name': 'database_connections_high',
|
||||
'condition': 'database_connections > 50',
|
||||
'severity': 'warning',
|
||||
'duration': 300,
|
||||
'enabled': True,
|
||||
},
|
||||
{
|
||||
'name': 'incident_volume_high',
|
||||
'condition': 'incident_count > 100',
|
||||
'severity': 'warning',
|
||||
'duration': 600,
|
||||
'enabled': True,
|
||||
},
|
||||
{
|
||||
'name': 'sla_breach_detected',
|
||||
'condition': 'sla_breach_count > 0',
|
||||
'severity': 'critical',
|
||||
'duration': 0,
|
||||
'enabled': True,
|
||||
},
|
||||
]
|
||||
|
||||
def _load_notification_channels(self) -> List[Dict[str, Any]]:
|
||||
"""Load notification channels"""
|
||||
return [
|
||||
{
|
||||
'name': 'email',
|
||||
'type': 'email',
|
||||
'enabled': True,
|
||||
'config': {
|
||||
'recipients': ['admin@company.com'],
|
||||
'template': 'alert_email.html',
|
||||
}
|
||||
},
|
||||
{
|
||||
'name': 'slack',
|
||||
'type': 'slack',
|
||||
'enabled': True,
|
||||
'config': {
|
||||
'webhook_url': os.getenv('SLACK_WEBHOOK_URL'),
|
||||
'channel': '#alerts',
|
||||
}
|
||||
},
|
||||
{
|
||||
'name': 'webhook',
|
||||
'type': 'webhook',
|
||||
'enabled': True,
|
||||
'config': {
|
||||
'url': os.getenv('ALERT_WEBHOOK_URL'),
|
||||
'headers': {'Authorization': f'Bearer {os.getenv("ALERT_WEBHOOK_TOKEN")}'},
|
||||
}
|
||||
},
|
||||
]
|
||||
|
||||
def start_monitoring(self):
|
||||
"""Start alert monitoring"""
|
||||
if self.is_running:
|
||||
return
|
||||
|
||||
self.is_running = True
|
||||
self.alert_thread = threading.Thread(target=self._monitor_alerts)
|
||||
self.alert_thread.daemon = True
|
||||
self.alert_thread.start()
|
||||
logger.info("Alert monitoring started")
|
||||
|
||||
def stop_monitoring(self):
|
||||
"""Stop alert monitoring"""
|
||||
self.is_running = False
|
||||
if self.alert_thread:
|
||||
self.alert_thread.join()
|
||||
logger.info("Alert monitoring stopped")
|
||||
|
||||
def _monitor_alerts(self):
|
||||
"""Main alert monitoring loop"""
|
||||
while self.is_running:
|
||||
try:
|
||||
self._check_alert_rules()
|
||||
time.sleep(60) # Check every minute
|
||||
except Exception as e:
|
||||
logger.error(f"Error monitoring alerts: {str(e)}")
|
||||
time.sleep(60)
|
||||
|
||||
def _check_alert_rules(self):
|
||||
"""Check all alert rules"""
|
||||
for rule in self.alert_rules:
|
||||
if not rule['enabled']:
|
||||
continue
|
||||
|
||||
try:
|
||||
if self._evaluate_rule(rule):
|
||||
self._trigger_alert(rule)
|
||||
except Exception as e:
|
||||
logger.error(f"Error checking rule {rule['name']}: {str(e)}")
|
||||
|
||||
def _evaluate_rule(self, rule: Dict[str, Any]) -> bool:
|
||||
"""Evaluate alert rule condition"""
|
||||
condition = rule['condition']
|
||||
|
||||
# Parse condition (simplified)
|
||||
if 'system_cpu_usage' in condition:
|
||||
cpu_usage = psutil.cpu_percent()
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return cpu_usage > threshold
|
||||
|
||||
elif 'system_memory_usage' in condition:
|
||||
memory = psutil.virtual_memory()
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return memory.percent > threshold
|
||||
|
||||
elif 'system_disk_usage' in condition:
|
||||
disk = psutil.disk_usage('/')
|
||||
disk_percent = (disk.used / disk.total) * 100
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return disk_percent > threshold
|
||||
|
||||
elif 'database_connections' in condition:
|
||||
with connection.cursor() as cursor:
|
||||
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
|
||||
connections = cursor.fetchone()[0]
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return connections > threshold
|
||||
|
||||
elif 'incident_count' in condition:
|
||||
from incident_intelligence.models import Incident
|
||||
count = Incident.objects.count()
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return count > threshold
|
||||
|
||||
elif 'sla_breach_count' in condition:
|
||||
from sla_oncall.models import SLAInstance
|
||||
count = SLAInstance.objects.filter(status='breached').count()
|
||||
threshold = float(condition.split('>')[1].strip())
|
||||
return count > threshold
|
||||
|
||||
return False
|
||||
|
||||
def _trigger_alert(self, rule: Dict[str, Any]):
|
||||
"""Trigger alert for rule violation"""
|
||||
alert = {
|
||||
'rule_name': rule['name'],
|
||||
'severity': rule['severity'],
|
||||
'message': f"Alert: {rule['name']} - {rule['condition']}",
|
||||
'timestamp': timezone.now().isoformat(),
|
||||
'metadata': {
|
||||
'condition': rule['condition'],
|
||||
'duration': rule['duration'],
|
||||
}
|
||||
}
|
||||
|
||||
# Send notifications
|
||||
self._send_notifications(alert)
|
||||
|
||||
# Store alert
|
||||
self._store_alert(alert)
|
||||
|
||||
logger.warning(f"Alert triggered: {rule['name']}")
|
||||
|
||||
def _send_notifications(self, alert: Dict[str, Any]):
|
||||
"""Send alert notifications"""
|
||||
for channel in self.notification_channels:
|
||||
if not channel['enabled']:
|
||||
continue
|
||||
|
||||
try:
|
||||
if channel['type'] == 'email':
|
||||
self._send_email_notification(alert, channel)
|
||||
elif channel['type'] == 'slack':
|
||||
self._send_slack_notification(alert, channel)
|
||||
elif channel['type'] == 'webhook':
|
||||
self._send_webhook_notification(alert, channel)
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending notification via {channel['name']}: {str(e)}")
|
||||
|
||||
def _send_email_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
|
||||
"""Send email notification"""
|
||||
from django.core.mail import send_mail
|
||||
|
||||
subject = f"ETB-API Alert: {alert['rule_name']}"
|
||||
message = f"""
|
||||
Alert: {alert['rule_name']}
|
||||
Severity: {alert['severity']}
|
||||
Message: {alert['message']}
|
||||
Time: {alert['timestamp']}
|
||||
"""
|
||||
|
||||
send_mail(
|
||||
subject=subject,
|
||||
message=message,
|
||||
from_email=settings.DEFAULT_FROM_EMAIL,
|
||||
recipient_list=channel['config']['recipients'],
|
||||
fail_silently=False,
|
||||
)
|
||||
|
||||
def _send_slack_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
|
||||
"""Send Slack notification"""
|
||||
webhook_url = channel['config']['webhook_url']
|
||||
if not webhook_url:
|
||||
return
|
||||
|
||||
payload = {
|
||||
'channel': channel['config']['channel'],
|
||||
'text': f"🚨 ETB-API Alert: {alert['rule_name']}",
|
||||
'attachments': [
|
||||
{
|
||||
'color': 'danger' if alert['severity'] == 'critical' else 'warning',
|
||||
'fields': [
|
||||
{'title': 'Severity', 'value': alert['severity'], 'short': True},
|
||||
{'title': 'Message', 'value': alert['message'], 'short': False},
|
||||
{'title': 'Time', 'value': alert['timestamp'], 'short': True},
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
response = requests.post(webhook_url, json=payload, timeout=10)
|
||||
response.raise_for_status()
|
||||
|
||||
def _send_webhook_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
|
||||
"""Send webhook notification"""
|
||||
webhook_url = channel['config']['url']
|
||||
if not webhook_url:
|
||||
return
|
||||
|
||||
headers = channel['config'].get('headers', {})
|
||||
response = requests.post(webhook_url, json=alert, headers=headers, timeout=10)
|
||||
response.raise_for_status()
|
||||
|
||||
def _store_alert(self, alert: Dict[str, Any]):
|
||||
"""Store alert in database"""
|
||||
try:
|
||||
from monitoring.models import Alert
|
||||
Alert.objects.create(
|
||||
rule_name=alert['rule_name'],
|
||||
severity=alert['severity'],
|
||||
message=alert['message'],
|
||||
metadata=alert['metadata'],
|
||||
timestamp=timezone.now(),
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error storing alert: {str(e)}")
|
||||
|
||||
|
||||
class PerformanceProfiler:
|
||||
"""Enterprise performance profiling system"""
|
||||
|
||||
def __init__(self):
|
||||
self.profiles = {}
|
||||
self.is_enabled = True
|
||||
|
||||
def start_profile(self, name: str) -> str:
|
||||
"""Start profiling a function or operation"""
|
||||
if not self.is_enabled:
|
||||
return None
|
||||
|
||||
profile_id = f"{name}_{int(time.time() * 1000)}"
|
||||
self.profiles[profile_id] = {
|
||||
'name': name,
|
||||
'start_time': time.time(),
|
||||
'start_memory': psutil.Process().memory_info().rss,
|
||||
'start_cpu': psutil.cpu_percent(),
|
||||
}
|
||||
|
||||
return profile_id
|
||||
|
||||
def end_profile(self, profile_id: str) -> Dict[str, Any]:
|
||||
"""End profiling and return results"""
|
||||
if not profile_id or profile_id not in self.profiles:
|
||||
return None
|
||||
|
||||
profile = self.profiles.pop(profile_id)
|
||||
|
||||
end_time = time.time()
|
||||
end_memory = psutil.Process().memory_info().rss
|
||||
end_cpu = psutil.cpu_percent()
|
||||
|
||||
result = {
|
||||
'name': profile['name'],
|
||||
'duration': end_time - profile['start_time'],
|
||||
'memory_delta': end_memory - profile['start_memory'],
|
||||
'cpu_delta': end_cpu - profile['start_cpu'],
|
||||
'timestamp': timezone.now().isoformat(),
|
||||
}
|
||||
|
||||
# Log slow operations
|
||||
if result['duration'] > 1.0: # 1 second
|
||||
logger.warning(f"Slow operation detected: {result['name']} took {result['duration']:.2f}s")
|
||||
|
||||
return result
|
||||
|
||||
def profile_function(self, func):
|
||||
"""Decorator to profile function execution"""
|
||||
def wrapper(*args, **kwargs):
|
||||
profile_id = self.start_profile(func.__name__)
|
||||
try:
|
||||
result = func(*args, **kwargs)
|
||||
return result
|
||||
finally:
|
||||
if profile_id:
|
||||
self.end_profile(profile_id)
|
||||
return wrapper
|
||||
|
||||
|
||||
# Global instances
|
||||
metrics_collector = MetricsCollector()
|
||||
alert_manager = AlertManager()
|
||||
performance_profiler = PerformanceProfiler()
|
||||
|
||||
|
||||
# API Views for monitoring
|
||||
@api_view(['GET'])
|
||||
@permission_classes([IsAuthenticated])
|
||||
def metrics_endpoint(request):
|
||||
"""Prometheus metrics endpoint"""
|
||||
try:
|
||||
metrics_data = metrics_collector.get_metrics()
|
||||
return Response(metrics_data, content_type=CONTENT_TYPE_LATEST)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting metrics: {str(e)}")
|
||||
return Response(
|
||||
{'error': 'Failed to get metrics'},
|
||||
status=status.HTTP_500_INTERNAL_SERVER_ERROR
|
||||
)
|
||||
|
||||
|
||||
@api_view(['GET'])
|
||||
@permission_classes([IsAuthenticated])
|
||||
def monitoring_dashboard(request):
|
||||
"""Get monitoring dashboard data"""
|
||||
try:
|
||||
# Get system metrics
|
||||
system_metrics = {
|
||||
'cpu_usage': psutil.cpu_percent(),
|
||||
'memory_usage': psutil.virtual_memory().percent,
|
||||
'disk_usage': (psutil.disk_usage('/').used / psutil.disk_usage('/').total) * 100,
|
||||
'load_average': psutil.getloadavg() if hasattr(psutil, 'getloadavg') else [0, 0, 0],
|
||||
}
|
||||
|
||||
# Get application metrics
|
||||
from incident_intelligence.models import Incident
|
||||
from sla_oncall.models import SLAInstance
|
||||
|
||||
application_metrics = {
|
||||
'total_incidents': Incident.objects.count(),
|
||||
'active_incidents': Incident.objects.filter(status='active').count(),
|
||||
'resolved_incidents': Incident.objects.filter(status='resolved').count(),
|
||||
'sla_breaches': SLAInstance.objects.filter(status='breached').count(),
|
||||
'active_users': cache.get('active_users_count', 0),
|
||||
}
|
||||
|
||||
# Get recent alerts
|
||||
from monitoring.models import Alert
|
||||
recent_alerts = Alert.objects.filter(
|
||||
timestamp__gte=timezone.now() - timedelta(hours=24)
|
||||
).order_by('-timestamp')[:10]
|
||||
|
||||
return Response({
|
||||
'system_metrics': system_metrics,
|
||||
'application_metrics': application_metrics,
|
||||
'recent_alerts': [
|
||||
{
|
||||
'rule_name': alert.rule_name,
|
||||
'severity': alert.severity,
|
||||
'message': alert.message,
|
||||
'timestamp': alert.timestamp.isoformat(),
|
||||
}
|
||||
for alert in recent_alerts
|
||||
],
|
||||
})
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Monitoring dashboard error: {str(e)}")
|
||||
return Response(
|
||||
{'error': 'Failed to load monitoring dashboard'},
|
||||
status=status.HTTP_500_INTERNAL_SERVER_ERROR
|
||||
)
|
||||
|
||||
|
||||
@api_view(['POST'])
|
||||
@permission_classes([IsAuthenticated])
|
||||
def test_alert(request):
|
||||
"""Test alert notification"""
|
||||
try:
|
||||
test_alert = {
|
||||
'rule_name': 'test_alert',
|
||||
'severity': 'info',
|
||||
'message': 'This is a test alert',
|
||||
'timestamp': timezone.now().isoformat(),
|
||||
'metadata': {'test': True},
|
||||
}
|
||||
|
||||
alert_manager._send_notifications(test_alert)
|
||||
|
||||
return Response({
|
||||
'message': 'Test alert sent successfully',
|
||||
'alert': test_alert,
|
||||
})
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Test alert error: {str(e)}")
|
||||
return Response(
|
||||
{'error': 'Failed to send test alert'},
|
||||
status=status.HTTP_500_INTERNAL_SERVER_ERROR
|
||||
)
|
||||
|
||||
|
||||
class MonitoringMiddleware:
|
||||
"""Middleware for request monitoring and metrics collection"""
|
||||
|
||||
def __init__(self, get_response):
|
||||
self.get_response = get_response
|
||||
|
||||
def __call__(self, request):
|
||||
start_time = time.time()
|
||||
|
||||
response = self.get_response(request)
|
||||
|
||||
# Calculate request duration
|
||||
duration = time.time() - start_time
|
||||
|
||||
# Record metrics
|
||||
metrics_collector.record_http_request(
|
||||
method=request.method,
|
||||
endpoint=request.path,
|
||||
status_code=response.status_code,
|
||||
duration=duration
|
||||
)
|
||||
|
||||
# Add performance headers
|
||||
response['X-Response-Time'] = f"{duration:.3f}s"
|
||||
response['X-Request-ID'] = request.META.get('HTTP_X_REQUEST_ID', 'unknown')
|
||||
|
||||
return response
|
||||
|
||||
|
||||
# Management command for starting monitoring services
|
||||
class StartMonitoringCommand(BaseCommand):
|
||||
"""Django management command to start monitoring services"""
|
||||
|
||||
help = 'Start monitoring services (metrics collection and alerting)'
|
||||
|
||||
def handle(self, *args, **options):
|
||||
self.stdout.write('Starting monitoring services...')
|
||||
|
||||
# Start metrics collection
|
||||
metrics_collector.start_collection()
|
||||
self.stdout.write(self.style.SUCCESS('Metrics collection started'))
|
||||
|
||||
# Start alert monitoring
|
||||
alert_manager.start_monitoring()
|
||||
self.stdout.write(self.style.SUCCESS('Alert monitoring started'))
|
||||
|
||||
self.stdout.write(self.style.SUCCESS('All monitoring services started successfully'))
|
||||
|
||||
# Keep running
|
||||
try:
|
||||
while True:
|
||||
time.sleep(1)
|
||||
except KeyboardInterrupt:
|
||||
self.stdout.write('Stopping monitoring services...')
|
||||
metrics_collector.stop_collection()
|
||||
alert_manager.stop_monitoring()
|
||||
self.stdout.write(self.style.SUCCESS('Monitoring services stopped'))
|
||||
1
ETB-API/monitoring/management/__init__.py
Normal file
@@ -0,0 +1 @@
# Management commands for monitoring
1
ETB-API/monitoring/management/commands/__init__.py
Normal file
@@ -0,0 +1 @@
# Management commands
665
ETB-API/monitoring/management/commands/setup_monitoring.py
Normal file
@@ -0,0 +1,665 @@
|
||||
"""
|
||||
Management command to set up initial monitoring configuration
|
||||
"""
|
||||
from django.core.management.base import BaseCommand
|
||||
from django.contrib.auth import get_user_model
|
||||
from monitoring.models import (
|
||||
MonitoringTarget, SystemMetric, AlertRule, MonitoringDashboard
|
||||
)
|
||||
|
||||
User = get_user_model()
|
||||
|
||||
|
||||
class Command(BaseCommand):
|
||||
help = 'Set up initial monitoring configuration'
|
||||
|
||||
def add_arguments(self, parser):
|
||||
parser.add_argument(
|
||||
'--admin-user',
|
||||
type=str,
|
||||
help='Username of admin user to create monitoring objects',
|
||||
default='admin'
|
||||
)
|
||||
|
||||
def handle(self, *args, **options):
|
||||
admin_username = options['admin_user']
|
||||
|
||||
try:
|
||||
admin_user = User.objects.get(username=admin_username)
|
||||
except User.DoesNotExist:
|
||||
self.stdout.write(
|
||||
self.style.ERROR(f'Admin user "{admin_username}" not found')
|
||||
)
|
||||
return
|
||||
|
||||
self.stdout.write('Setting up monitoring configuration...')
|
||||
|
||||
# Create default monitoring targets
|
||||
self.create_default_targets(admin_user)
|
||||
|
||||
# Create default metrics
|
||||
self.create_default_metrics(admin_user)
|
||||
|
||||
# Create default alert rules
|
||||
self.create_default_alert_rules(admin_user)
|
||||
|
||||
# Create default dashboards
|
||||
self.create_default_dashboards(admin_user)
|
||||
|
||||
self.stdout.write(
|
||||
self.style.SUCCESS('Monitoring configuration setup completed!')
|
||||
)
|
||||
|
||||
def create_default_targets(self, admin_user):
|
||||
"""Create default monitoring targets"""
|
||||
self.stdout.write('Creating default monitoring targets...')
|
||||
|
||||
targets = [
|
||||
{
|
||||
'name': 'Django Application',
|
||||
'description': 'Main Django application health check',
|
||||
'target_type': 'APPLICATION',
|
||||
'endpoint_url': 'http://localhost:8000/health/',
|
||||
'related_module': 'core',
|
||||
'health_check_enabled': True,
|
||||
'expected_status_codes': [200]
|
||||
},
|
||||
{
|
||||
'name': 'Database',
|
||||
'description': 'Database connection health check',
|
||||
'target_type': 'DATABASE',
|
||||
'related_module': 'core',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Cache System',
|
||||
'description': 'Cache system health check',
|
||||
'target_type': 'CACHE',
|
||||
'related_module': 'core',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Celery Workers',
|
||||
'description': 'Celery worker health check',
|
||||
'target_type': 'QUEUE',
|
||||
'related_module': 'core',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Security Module',
|
||||
'description': 'Security module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'security',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Incident Intelligence Module',
|
||||
'description': 'Incident Intelligence module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'incident_intelligence',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Automation Orchestration Module',
|
||||
'description': 'Automation Orchestration module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'automation_orchestration',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'SLA OnCall Module',
|
||||
'description': 'SLA OnCall module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'sla_oncall',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Collaboration War Rooms Module',
|
||||
'description': 'Collaboration War Rooms module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'collaboration_war_rooms',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Compliance Governance Module',
|
||||
'description': 'Compliance Governance module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'compliance_governance',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Analytics Predictive Insights Module',
|
||||
'description': 'Analytics Predictive Insights module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'analytics_predictive_insights',
|
||||
'health_check_enabled': True
|
||||
},
|
||||
{
|
||||
'name': 'Knowledge Learning Module',
|
||||
'description': 'Knowledge Learning module health check',
|
||||
'target_type': 'MODULE',
|
||||
'related_module': 'knowledge_learning',
|
||||
'health_check_enabled': True
|
||||
}
|
||||
]
|
||||
|
||||
for target_data in targets:
|
||||
target, created = MonitoringTarget.objects.get_or_create(
|
||||
name=target_data['name'],
|
||||
defaults={
|
||||
**target_data,
|
||||
'created_by': admin_user
|
||||
}
|
||||
)
|
||||
if created:
|
||||
self.stdout.write(f' Created target: {target.name}')
|
||||
else:
|
||||
self.stdout.write(f' Target already exists: {target.name}')
|
||||
|
||||
def create_default_metrics(self, admin_user):
|
||||
"""Create default system metrics"""
|
||||
self.stdout.write('Creating default system metrics...')
|
||||
|
||||
metrics = [
|
||||
{
|
||||
'name': 'API Response Time',
|
||||
'description': 'Average API response time in milliseconds',
|
||||
'metric_type': 'PERFORMANCE',
|
||||
'category': 'API_RESPONSE_TIME',
|
||||
'unit': 'milliseconds',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 300,
|
||||
'warning_threshold': 1000,
|
||||
'critical_threshold': 2000,
|
||||
'is_system_metric': True
|
||||
},
|
||||
{
|
||||
'name': 'Request Throughput',
|
||||
'description': 'Number of requests per minute',
|
||||
'metric_type': 'PERFORMANCE',
|
||||
'category': 'THROUGHPUT',
|
||||
'unit': 'requests/minute',
|
||||
'aggregation_method': 'SUM',
|
||||
'collection_interval_seconds': 60,
|
||||
'warning_threshold': 1000,
|
||||
'critical_threshold': 2000,
|
||||
'is_system_metric': True
|
||||
},
|
||||
{
|
||||
'name': 'Error Rate',
|
||||
'description': 'Percentage of failed requests',
|
||||
'metric_type': 'PERFORMANCE',
|
||||
'category': 'ERROR_RATE',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 300,
|
||||
'warning_threshold': 5.0,
|
||||
'critical_threshold': 10.0,
|
||||
'is_system_metric': True
|
||||
},
|
||||
{
|
||||
'name': 'System Availability',
|
||||
'description': 'System availability percentage',
|
||||
'metric_type': 'INFRASTRUCTURE',
|
||||
'category': 'AVAILABILITY',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 300,
|
||||
'warning_threshold': 99.0,
|
||||
'critical_threshold': 95.0,
|
||||
'is_system_metric': True
|
||||
},
|
||||
{
|
||||
'name': 'Incident Count',
|
||||
'description': 'Number of incidents in the last 24 hours',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'INCIDENT_COUNT',
|
||||
'unit': 'count',
|
||||
'aggregation_method': 'COUNT',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 10,
|
||||
'critical_threshold': 20,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'incident_intelligence'
|
||||
},
|
||||
{
|
||||
'name': 'Mean Time to Resolve',
|
||||
'description': 'Average time to resolve incidents in minutes',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'MTTR',
|
||||
'unit': 'minutes',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 120,
|
||||
'critical_threshold': 240,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'incident_intelligence'
|
||||
},
|
||||
{
|
||||
'name': 'Mean Time to Acknowledge',
|
||||
'description': 'Average time to acknowledge incidents in minutes',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'MTTA',
|
||||
'unit': 'minutes',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 15,
|
||||
'critical_threshold': 30,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'incident_intelligence'
|
||||
},
|
||||
{
|
||||
'name': 'SLA Compliance',
|
||||
'description': 'SLA compliance percentage',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'SLA_COMPLIANCE',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 95.0,
|
||||
'critical_threshold': 90.0,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'sla_oncall'
|
||||
},
|
||||
{
|
||||
'name': 'Security Events',
|
||||
'description': 'Number of security events in the last hour',
|
||||
'metric_type': 'SECURITY',
|
||||
'category': 'SECURITY_EVENTS',
|
||||
'unit': 'count',
|
||||
'aggregation_method': 'COUNT',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 5,
|
||||
'critical_threshold': 10,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'security'
|
||||
},
|
||||
{
|
||||
'name': 'Automation Success Rate',
|
||||
'description': 'Percentage of successful automation executions',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'AUTOMATION_SUCCESS',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 90.0,
|
||||
'critical_threshold': 80.0,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'automation_orchestration'
|
||||
},
|
||||
{
|
||||
'name': 'AI Model Accuracy',
|
||||
'description': 'AI model accuracy percentage',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'AI_ACCURACY',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 85.0,
|
||||
'critical_threshold': 75.0,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'incident_intelligence'
|
||||
},
|
||||
{
|
||||
'name': 'Cost Impact',
|
||||
'description': 'Total cost impact in USD for the last 30 days',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'COST_IMPACT',
|
||||
'unit': 'USD',
|
||||
'aggregation_method': 'SUM',
|
||||
'collection_interval_seconds': 86400,
|
||||
'warning_threshold': 10000,
|
||||
'critical_threshold': 50000,
|
||||
'is_system_metric': True,
|
||||
'related_module': 'analytics_predictive_insights'
|
||||
},
|
||||
{
|
||||
'name': 'User Activity',
|
||||
'description': 'Number of active users in the last hour',
|
||||
'metric_type': 'BUSINESS',
|
||||
'category': 'USER_ACTIVITY',
|
||||
'unit': 'count',
|
||||
'aggregation_method': 'COUNT',
|
||||
'collection_interval_seconds': 3600,
|
||||
'warning_threshold': 50,
|
||||
'critical_threshold': 100,
|
||||
'is_system_metric': True
|
||||
},
|
||||
{
|
||||
'name': 'CPU Usage',
|
||||
'description': 'System CPU usage percentage',
|
||||
'metric_type': 'INFRASTRUCTURE',
|
||||
'category': 'SYSTEM_RESOURCES',
|
||||
'unit': 'percentage',
|
||||
'aggregation_method': 'AVERAGE',
|
||||
'collection_interval_seconds': 300,
|
||||
'warning_threshold': 80.0,
|
||||
'critical_threshold': 90.0,
|
||||
'is_system_metric': True
|
||||
}
|
||||
]
|
||||
|
||||
for metric_data in metrics:
|
||||
metric, created = SystemMetric.objects.get_or_create(
|
||||
name=metric_data['name'],
|
||||
defaults={
|
||||
**metric_data,
|
||||
'created_by': admin_user
|
||||
}
|
||||
)
|
||||
if created:
|
||||
self.stdout.write(f' Created metric: {metric.name}')
|
||||
else:
|
||||
self.stdout.write(f' Metric already exists: {metric.name}')
|
||||
|
||||
def create_default_alert_rules(self, admin_user):
|
||||
"""Create default alert rules"""
|
||||
self.stdout.write('Creating default alert rules...')
|
||||
|
||||
# Get metrics for alert rules
|
||||
api_response_metric = SystemMetric.objects.filter(name='API Response Time').first()
|
||||
error_rate_metric = SystemMetric.objects.filter(name='Error Rate').first()
|
||||
availability_metric = SystemMetric.objects.filter(name='System Availability').first()
|
||||
incident_count_metric = SystemMetric.objects.filter(name='Incident Count').first()
|
||||
mttr_metric = SystemMetric.objects.filter(name='Mean Time to Resolve').first()
|
||||
security_events_metric = SystemMetric.objects.filter(name='Security Events').first()
|
||||
cpu_metric = SystemMetric.objects.filter(name='CPU Usage').first()
|
||||
|
||||
alert_rules = [
|
||||
{
|
||||
'name': 'High API Response Time',
|
||||
'description': 'Alert when API response time exceeds threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'HIGH',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 2000
|
||||
},
|
||||
'metric': api_response_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'High Error Rate',
|
||||
'description': 'Alert when error rate exceeds threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'CRITICAL',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 10.0
|
||||
},
|
||||
'metric': error_rate_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'Low System Availability',
|
||||
'description': 'Alert when system availability drops below threshold',
|
||||
'alert_type': 'AVAILABILITY',
|
||||
'severity': 'CRITICAL',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '<',
|
||||
'threshold': 95.0
|
||||
},
|
||||
'metric': availability_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'High Incident Count',
|
||||
'description': 'Alert when incident count exceeds threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'HIGH',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 20
|
||||
},
|
||||
'metric': incident_count_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'High MTTR',
|
||||
'description': 'Alert when mean time to resolve exceeds threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'MEDIUM',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 240
|
||||
},
|
||||
'metric': mttr_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'High Security Events',
|
||||
'description': 'Alert when security events exceed threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'HIGH',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 10
|
||||
},
|
||||
'metric': security_events_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'High CPU Usage',
|
||||
'description': 'Alert when CPU usage exceeds threshold',
|
||||
'alert_type': 'THRESHOLD',
|
||||
'severity': 'HIGH',
|
||||
'condition': {
|
||||
'type': 'THRESHOLD',
|
||||
'operator': '>',
|
||||
'threshold': 90.0
|
||||
},
|
||||
'metric': cpu_metric,
|
||||
'notification_channels': [
|
||||
{
|
||||
'type': 'EMAIL',
|
||||
'recipients': ['admin@example.com']
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
for rule_data in alert_rules:
|
||||
if rule_data['metric']: # Only create if metric exists
|
||||
rule, created = AlertRule.objects.get_or_create(
|
||||
name=rule_data['name'],
|
||||
defaults={
|
||||
**rule_data,
|
||||
'created_by': admin_user
|
||||
}
|
||||
)
|
||||
if created:
|
||||
self.stdout.write(f' Created alert rule: {rule.name}')
|
||||
else:
|
||||
self.stdout.write(f' Alert rule already exists: {rule.name}')
|
||||
|
||||
def create_default_dashboards(self, admin_user):
|
||||
"""Create default monitoring dashboards"""
|
||||
self.stdout.write('Creating default monitoring dashboards...')
|
||||
|
||||
dashboards = [
|
||||
{
|
||||
'name': 'System Overview',
|
||||
'description': 'High-level system overview dashboard',
|
||||
'dashboard_type': 'SYSTEM_OVERVIEW',
|
||||
'is_public': True,
|
||||
'auto_refresh_enabled': True,
|
||||
'refresh_interval_seconds': 30,
|
||||
'layout_config': {
|
||||
'columns': 3,
|
||||
'rows': 4
|
||||
},
|
||||
'widget_configs': [
|
||||
{
|
||||
'type': 'system_status',
|
||||
'position': {'x': 0, 'y': 0, 'width': 3, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'health_summary',
|
||||
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'alert_summary',
|
||||
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'system_resources',
|
||||
'position': {'x': 2, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'recent_incidents',
|
||||
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 2}
|
||||
},
|
||||
{
|
||||
'type': 'metric_trends',
|
||||
'position': {'x': 2, 'y': 2, 'width': 1, 'height': 2}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'Performance Dashboard',
|
||||
'description': 'System performance metrics dashboard',
|
||||
'dashboard_type': 'PERFORMANCE',
|
||||
'is_public': True,
|
||||
'auto_refresh_enabled': True,
|
||||
'refresh_interval_seconds': 60,
|
||||
'layout_config': {
|
||||
'columns': 2,
|
||||
'rows': 3
|
||||
},
|
||||
'widget_configs': [
|
||||
{
|
||||
'type': 'api_response_time',
|
||||
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'throughput',
|
||||
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'error_rate',
|
||||
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'availability',
|
||||
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'system_resources',
|
||||
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'Business Metrics Dashboard',
|
||||
'description': 'Business and operational metrics dashboard',
|
||||
'dashboard_type': 'BUSINESS_METRICS',
|
||||
'is_public': True,
|
||||
'auto_refresh_enabled': True,
|
||||
'refresh_interval_seconds': 300,
|
||||
'layout_config': {
|
||||
'columns': 2,
|
||||
'rows': 3
|
||||
},
|
||||
'widget_configs': [
|
||||
{
|
||||
'type': 'incident_count',
|
||||
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'mttr',
|
||||
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'mtta',
|
||||
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'sla_compliance',
|
||||
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'cost_impact',
|
||||
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
'name': 'Security Dashboard',
|
||||
'description': 'Security monitoring dashboard',
|
||||
'dashboard_type': 'SECURITY',
|
||||
'is_public': False,
|
||||
'auto_refresh_enabled': True,
|
||||
'refresh_interval_seconds': 60,
|
||||
'layout_config': {
|
||||
'columns': 2,
|
||||
'rows': 2
|
||||
},
|
||||
'widget_configs': [
|
||||
{
|
||||
'type': 'security_events',
|
||||
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'failed_logins',
|
||||
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'risk_assessments',
|
||||
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
|
||||
},
|
||||
{
|
||||
'type': 'device_posture',
|
||||
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
|
||||
for dashboard_data in dashboards:
|
||||
dashboard, created = MonitoringDashboard.objects.get_or_create(
|
||||
name=dashboard_data['name'],
|
||||
defaults={
|
||||
**dashboard_data,
|
||||
'created_by': admin_user
|
||||
}
|
||||
)
|
||||
if created:
|
||||
self.stdout.write(f' Created dashboard: {dashboard.name}')
|
||||
else:
|
||||
self.stdout.write(f' Dashboard already exists: {dashboard.name}')
|
||||
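Because every object in the command above is created with `get_or_create`, it is safe to re-run. A minimal sketch of invoking it programmatically (for example from a deployment script or a test), assuming the monitoring app is installed and a user named `admin` already exists — otherwise the command exits early with an error:

```python
# Hypothetical bootstrap snippet; assumes Django settings are already configured
# and that the "admin" user exists.
from django.core.management import call_command

# Equivalent to: python manage.py setup_monitoring --admin-user admin
call_command("setup_monitoring", admin_user="admin")
```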
252
ETB-API/monitoring/migrations/0001_initial.py
Normal file
@@ -0,0 +1,252 @@
|
||||
# Generated by Django 5.2.6 on 2025-09-18 19:44
|
||||
|
||||
import django.db.models.deletion
|
||||
import uuid
|
||||
from django.conf import settings
|
||||
from django.db import migrations, models
|
||||
|
||||
|
||||
class Migration(migrations.Migration):
|
||||
|
||||
initial = True
|
||||
|
||||
dependencies = [
|
||||
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
|
||||
]
|
||||
|
||||
operations = [
|
||||
migrations.CreateModel(
|
||||
name='MonitoringTarget',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('name', models.CharField(max_length=200, unique=True)),
|
||||
('description', models.TextField()),
|
||||
('target_type', models.CharField(choices=[('APPLICATION', 'Application'), ('DATABASE', 'Database'), ('CACHE', 'Cache'), ('QUEUE', 'Message Queue'), ('EXTERNAL_API', 'External API'), ('SERVICE', 'Internal Service'), ('INFRASTRUCTURE', 'Infrastructure'), ('MODULE', 'Django Module')], max_length=20)),
|
||||
('endpoint_url', models.URLField(blank=True, null=True)),
|
||||
('connection_config', models.JSONField(default=dict, help_text='Connection configuration (credentials, timeouts, etc.)')),
|
||||
('check_interval_seconds', models.PositiveIntegerField(default=60)),
|
||||
('timeout_seconds', models.PositiveIntegerField(default=30)),
|
||||
('retry_count', models.PositiveIntegerField(default=3)),
|
||||
('health_check_enabled', models.BooleanField(default=True)),
|
||||
('health_check_endpoint', models.CharField(blank=True, max_length=200, null=True)),
|
||||
('expected_status_codes', models.JSONField(default=list, help_text='Expected HTTP status codes for health checks')),
|
||||
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance'), ('ERROR', 'Error')], default='ACTIVE', max_length=20)),
|
||||
('last_checked', models.DateTimeField(blank=True, null=True)),
|
||||
('last_status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], default='UNKNOWN', max_length=20)),
|
||||
('related_module', models.CharField(blank=True, help_text="Related Django module (e.g., 'security', 'incident_intelligence')", max_length=50, null=True)),
|
||||
('created_at', models.DateTimeField(auto_now_add=True)),
|
||||
('updated_at', models.DateTimeField(auto_now=True)),
|
||||
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
|
||||
],
|
||||
options={
|
||||
'ordering': ['name'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='HealthCheck',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('check_type', models.CharField(choices=[('HTTP', 'HTTP Health Check'), ('DATABASE', 'Database Connection'), ('CACHE', 'Cache Connection'), ('QUEUE', 'Message Queue'), ('CUSTOM', 'Custom Check'), ('PING', 'Network Ping'), ('SSL', 'SSL Certificate')], max_length=20)),
|
||||
('status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], max_length=20)),
|
||||
('response_time_ms', models.PositiveIntegerField(blank=True, null=True)),
|
||||
('status_code', models.PositiveIntegerField(blank=True, null=True)),
|
||||
('response_body', models.TextField(blank=True, null=True)),
|
||||
('error_message', models.TextField(blank=True, null=True)),
|
||||
('cpu_usage_percent', models.FloatField(blank=True, null=True)),
|
||||
('memory_usage_percent', models.FloatField(blank=True, null=True)),
|
||||
('disk_usage_percent', models.FloatField(blank=True, null=True)),
|
||||
('checked_at', models.DateTimeField(auto_now_add=True)),
|
||||
('target', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='health_checks', to='monitoring.monitoringtarget')),
|
||||
],
|
||||
options={
|
||||
'ordering': ['-checked_at'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='SystemMetric',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('name', models.CharField(max_length=200)),
|
||||
('description', models.TextField()),
|
||||
('metric_type', models.CharField(choices=[('PERFORMANCE', 'Performance Metric'), ('BUSINESS', 'Business Metric'), ('SECURITY', 'Security Metric'), ('INFRASTRUCTURE', 'Infrastructure Metric'), ('CUSTOM', 'Custom Metric')], max_length=20)),
|
||||
('category', models.CharField(choices=[('API_RESPONSE_TIME', 'API Response Time'), ('THROUGHPUT', 'Throughput'), ('ERROR_RATE', 'Error Rate'), ('AVAILABILITY', 'Availability'), ('INCIDENT_COUNT', 'Incident Count'), ('MTTR', 'Mean Time to Resolve'), ('MTTA', 'Mean Time to Acknowledge'), ('SLA_COMPLIANCE', 'SLA Compliance'), ('SECURITY_EVENTS', 'Security Events'), ('AUTOMATION_SUCCESS', 'Automation Success Rate'), ('AI_ACCURACY', 'AI Model Accuracy'), ('COST_IMPACT', 'Cost Impact'), ('USER_ACTIVITY', 'User Activity'), ('SYSTEM_RESOURCES', 'System Resources')], max_length=30)),
|
||||
('unit', models.CharField(help_text='Unit of measurement', max_length=50)),
|
||||
('aggregation_method', models.CharField(choices=[('AVERAGE', 'Average'), ('SUM', 'Sum'), ('COUNT', 'Count'), ('MIN', 'Minimum'), ('MAX', 'Maximum'), ('PERCENTILE_95', '95th Percentile'), ('PERCENTILE_99', '99th Percentile')], max_length=20)),
|
||||
('collection_interval_seconds', models.PositiveIntegerField(default=300)),
|
||||
('retention_days', models.PositiveIntegerField(default=90)),
|
||||
('warning_threshold', models.FloatField(blank=True, null=True)),
|
||||
('critical_threshold', models.FloatField(blank=True, null=True)),
|
||||
('is_active', models.BooleanField(default=True)),
|
||||
('is_system_metric', models.BooleanField(default=False)),
|
||||
('related_module', models.CharField(blank=True, help_text='Related Django module', max_length=50, null=True)),
|
||||
('created_at', models.DateTimeField(auto_now_add=True)),
|
||||
('updated_at', models.DateTimeField(auto_now=True)),
|
||||
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
|
||||
],
|
||||
options={
|
||||
'ordering': ['name'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='MetricMeasurement',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('value', models.DecimalField(decimal_places=4, max_digits=15)),
|
||||
('timestamp', models.DateTimeField(auto_now_add=True)),
|
||||
('tags', models.JSONField(default=dict, help_text='Additional tags for this measurement')),
|
||||
('metadata', models.JSONField(default=dict, help_text='Additional metadata')),
|
||||
('metric', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='measurements', to='monitoring.systemmetric')),
|
||||
],
|
||||
options={
|
||||
'ordering': ['-timestamp'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='AlertRule',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('name', models.CharField(max_length=200)),
|
||||
('description', models.TextField()),
|
||||
('alert_type', models.CharField(choices=[('THRESHOLD', 'Threshold Alert'), ('ANOMALY', 'Anomaly Alert'), ('PATTERN', 'Pattern Alert'), ('AVAILABILITY', 'Availability Alert'), ('PERFORMANCE', 'Performance Alert')], max_length=20)),
|
||||
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
|
||||
('condition', models.JSONField(help_text='Alert condition configuration')),
|
||||
('evaluation_interval_seconds', models.PositiveIntegerField(default=60)),
|
||||
('notification_channels', models.JSONField(default=list, help_text='List of notification channels (email, slack, webhook, etc.)')),
|
||||
('notification_template', models.TextField(blank=True, help_text='Custom notification template', null=True)),
|
||||
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance')], default='ACTIVE', max_length=20)),
|
||||
('is_enabled', models.BooleanField(default=True)),
|
||||
('created_at', models.DateTimeField(auto_now_add=True)),
|
||||
('updated_at', models.DateTimeField(auto_now=True)),
|
||||
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
|
||||
('target', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.monitoringtarget')),
|
||||
('metric', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.systemmetric')),
|
||||
],
|
||||
options={
|
||||
'ordering': ['name'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='SystemStatus',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('status', models.CharField(choices=[('OPERATIONAL', 'Operational'), ('DEGRADED', 'Degraded'), ('PARTIAL_OUTAGE', 'Partial Outage'), ('MAJOR_OUTAGE', 'Major Outage'), ('MAINTENANCE', 'Maintenance')], max_length=20)),
|
||||
('message', models.TextField(help_text='Status message for users')),
|
||||
('affected_services', models.JSONField(default=list, help_text='List of affected services')),
|
||||
('estimated_resolution', models.DateTimeField(blank=True, null=True)),
|
||||
('started_at', models.DateTimeField(auto_now_add=True)),
|
||||
('updated_at', models.DateTimeField(auto_now=True)),
|
||||
('resolved_at', models.DateTimeField(blank=True, null=True)),
|
||||
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
|
||||
],
|
||||
options={
|
||||
'ordering': ['-started_at'],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='Alert',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('title', models.CharField(max_length=200)),
|
||||
('description', models.TextField()),
|
||||
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
|
||||
('status', models.CharField(choices=[('TRIGGERED', 'Triggered'), ('ACKNOWLEDGED', 'Acknowledged'), ('RESOLVED', 'Resolved'), ('SUPPRESSED', 'Suppressed')], default='TRIGGERED', max_length=20)),
|
||||
('triggered_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
|
||||
('threshold_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
|
||||
('context_data', models.JSONField(default=dict, help_text='Additional context data for the alert')),
|
||||
('triggered_at', models.DateTimeField(auto_now_add=True)),
|
||||
('acknowledged_at', models.DateTimeField(blank=True, null=True)),
|
||||
('resolved_at', models.DateTimeField(blank=True, null=True)),
|
||||
('acknowledged_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='acknowledged_alerts', to=settings.AUTH_USER_MODEL)),
|
||||
('resolved_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='resolved_alerts', to=settings.AUTH_USER_MODEL)),
|
||||
('rule', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='alerts', to='monitoring.alertrule')),
|
||||
],
|
||||
options={
|
||||
'ordering': ['-triggered_at'],
|
||||
'indexes': [models.Index(fields=['rule', 'status'], name='monitoring__rule_id_0ff7d3_idx'), models.Index(fields=['severity', 'status'], name='monitoring__severit_1e6a2c_idx'), models.Index(fields=['triggered_at'], name='monitoring__trigger_743dcf_idx')],
|
||||
},
|
||||
),
|
||||
migrations.CreateModel(
|
||||
name='MonitoringDashboard',
|
||||
fields=[
|
||||
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
|
||||
('name', models.CharField(max_length=200)),
|
||||
('description', models.TextField()),
|
||||
('dashboard_type', models.CharField(choices=[('SYSTEM_OVERVIEW', 'System Overview'), ('PERFORMANCE', 'Performance'), ('BUSINESS_METRICS', 'Business Metrics'), ('SECURITY', 'Security'), ('INFRASTRUCTURE', 'Infrastructure'), ('CUSTOM', 'Custom')], max_length=20)),
|
||||
('layout_config', models.JSONField(default=dict, help_text='Dashboard layout configuration')),
|
||||
('widget_configs', models.JSONField(default=list, help_text='Configuration for dashboard widgets')),
|
||||
('is_public', models.BooleanField(default=False)),
|
||||
('allowed_roles', models.JSONField(default=list, help_text='List of roles that can access this dashboard')),
|
||||
('auto_refresh_enabled', models.BooleanField(default=True)),
|
||||
('refresh_interval_seconds', models.PositiveIntegerField(default=30)),
|
||||
('is_active', models.BooleanField(default=True)),
|
||||
('created_at', models.DateTimeField(auto_now_add=True)),
|
||||
('updated_at', models.DateTimeField(auto_now=True)),
|
||||
('allowed_users', models.ManyToManyField(blank=True, related_name='accessible_monitoring_dashboards', to=settings.AUTH_USER_MODEL)),
|
||||
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
|
||||
],
|
||||
options={
|
||||
'ordering': ['name'],
|
||||
'indexes': [models.Index(fields=['dashboard_type', 'is_active'], name='monitoring__dashboa_2e7a27_idx'), models.Index(fields=['is_public'], name='monitoring__is_publ_811f62_idx')],
|
||||
},
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='monitoringtarget',
|
||||
index=models.Index(fields=['target_type', 'status'], name='monitoring__target__f37347_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='monitoringtarget',
|
||||
index=models.Index(fields=['related_module'], name='monitoring__related_0c51fc_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='monitoringtarget',
|
||||
index=models.Index(fields=['last_checked'], name='monitoring__last_ch_83ce18_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='healthcheck',
|
||||
index=models.Index(fields=['target', 'checked_at'], name='monitoring__target__8d1cd6_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='healthcheck',
|
||||
index=models.Index(fields=['status', 'checked_at'], name='monitoring__status_636b2b_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='healthcheck',
|
||||
index=models.Index(fields=['check_type'], name='monitoring__check_t_b442f3_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='systemmetric',
|
||||
index=models.Index(fields=['metric_type', 'category'], name='monitoring__metric__df4606_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='systemmetric',
|
||||
index=models.Index(fields=['related_module'], name='monitoring__related_7b383b_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='systemmetric',
|
||||
index=models.Index(fields=['is_active'], name='monitoring__is_acti_c90676_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='metricmeasurement',
|
||||
index=models.Index(fields=['metric', 'timestamp'], name='monitoring__metric__216cac_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='metricmeasurement',
|
||||
index=models.Index(fields=['timestamp'], name='monitoring__timesta_75a739_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='alertrule',
|
||||
index=models.Index(fields=['alert_type', 'severity'], name='monitoring__alert_t_915b15_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='alertrule',
|
||||
index=models.Index(fields=['status', 'is_enabled'], name='monitoring__status_e905cc_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='systemstatus',
|
||||
index=models.Index(fields=['status', 'started_at'], name='monitoring__status_18966f_idx'),
|
||||
),
|
||||
migrations.AddIndex(
|
||||
model_name='systemstatus',
|
||||
index=models.Index(fields=['started_at'], name='monitoring__started_d85786_idx'),
|
||||
),
|
||||
]
|
||||
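The initial migration above creates all eight monitoring tables (targets, health checks, metrics, measurements, alert rules, alerts, dashboards, system status) together with their indexes. A short sketch of applying it, assuming a standard Django project layout for ETB-API:

```python
# Hypothetical snippet; assumes DJANGO_SETTINGS_MODULE points at the ETB-API settings.
from django.core.management import call_command

# Equivalent to: python manage.py migrate monitoring
call_command("migrate", "monitoring")

# List which monitoring migrations have been applied
call_command("showmigrations", "monitoring")
```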
0
ETB-API/monitoring/migrations/__init__.py
Normal file
Binary file not shown.
Binary file not shown.
515
ETB-API/monitoring/models.py
Normal file
@@ -0,0 +1,515 @@
|
||||
"""
|
||||
Monitoring models for comprehensive system observability
|
||||
"""
|
||||
import uuid
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, Any, Optional, List
|
||||
from decimal import Decimal
|
||||
|
||||
from django.db import models
|
||||
from django.contrib.auth import get_user_model
|
||||
from django.core.validators import MinValueValidator, MaxValueValidator
|
||||
from django.utils import timezone
|
||||
from django.core.exceptions import ValidationError
|
||||
|
||||
User = get_user_model()
|
||||
|
||||
|
||||
class MonitoringTarget(models.Model):
|
||||
"""Target systems, services, or components to monitor"""
|
||||
|
||||
TARGET_TYPES = [
|
||||
('APPLICATION', 'Application'),
|
||||
('DATABASE', 'Database'),
|
||||
('CACHE', 'Cache'),
|
||||
('QUEUE', 'Message Queue'),
|
||||
('EXTERNAL_API', 'External API'),
|
||||
('SERVICE', 'Internal Service'),
|
||||
('INFRASTRUCTURE', 'Infrastructure'),
|
||||
('MODULE', 'Django Module'),
|
||||
]
|
||||
|
||||
STATUS_CHOICES = [
|
||||
('ACTIVE', 'Active'),
|
||||
('INACTIVE', 'Inactive'),
|
||||
('MAINTENANCE', 'Maintenance'),
|
||||
('ERROR', 'Error'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
name = models.CharField(max_length=200, unique=True)
|
||||
description = models.TextField()
|
||||
target_type = models.CharField(max_length=20, choices=TARGET_TYPES)
|
||||
|
||||
# Connection details
|
||||
endpoint_url = models.URLField(blank=True, null=True)
|
||||
connection_config = models.JSONField(
|
||||
default=dict,
|
||||
help_text="Connection configuration (credentials, timeouts, etc.)"
|
||||
)
|
||||
|
||||
# Monitoring configuration
|
||||
check_interval_seconds = models.PositiveIntegerField(default=60)
|
||||
timeout_seconds = models.PositiveIntegerField(default=30)
|
||||
retry_count = models.PositiveIntegerField(default=3)
|
||||
|
||||
# Health check configuration
|
||||
health_check_enabled = models.BooleanField(default=True)
|
||||
health_check_endpoint = models.CharField(max_length=200, blank=True, null=True)
|
||||
expected_status_codes = models.JSONField(
|
||||
default=list,
|
||||
help_text="Expected HTTP status codes for health checks"
|
||||
)
|
||||
|
||||
# Status and metadata
|
||||
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
|
||||
last_checked = models.DateTimeField(null=True, blank=True)
|
||||
last_status = models.CharField(max_length=20, choices=[
|
||||
('HEALTHY', 'Healthy'),
|
||||
('WARNING', 'Warning'),
|
||||
('CRITICAL', 'Critical'),
|
||||
('UNKNOWN', 'Unknown'),
|
||||
], default='UNKNOWN')
|
||||
|
||||
# Related module (if applicable)
|
||||
related_module = models.CharField(
|
||||
max_length=50,
|
||||
blank=True,
|
||||
null=True,
|
||||
help_text="Related Django module (e.g., 'security', 'incident_intelligence')"
|
||||
)
|
||||
|
||||
# Metadata
|
||||
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
|
||||
created_at = models.DateTimeField(auto_now_add=True)
|
||||
updated_at = models.DateTimeField(auto_now=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['name']
|
||||
indexes = [
|
||||
models.Index(fields=['target_type', 'status']),
|
||||
models.Index(fields=['related_module']),
|
||||
models.Index(fields=['last_checked']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.name} ({self.target_type})"
|
||||
|
||||
|
||||
class HealthCheck(models.Model):
|
||||
"""Individual health check results"""
|
||||
|
||||
CHECK_TYPES = [
|
||||
('HTTP', 'HTTP Health Check'),
|
||||
('DATABASE', 'Database Connection'),
|
||||
('CACHE', 'Cache Connection'),
|
||||
('QUEUE', 'Message Queue'),
|
||||
('CUSTOM', 'Custom Check'),
|
||||
('PING', 'Network Ping'),
|
||||
('SSL', 'SSL Certificate'),
|
||||
]
|
||||
|
||||
STATUS_CHOICES = [
|
||||
('HEALTHY', 'Healthy'),
|
||||
('WARNING', 'Warning'),
|
||||
('CRITICAL', 'Critical'),
|
||||
('UNKNOWN', 'Unknown'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
target = models.ForeignKey(MonitoringTarget, on_delete=models.CASCADE, related_name='health_checks')
|
||||
|
||||
# Check details
|
||||
check_type = models.CharField(max_length=20, choices=CHECK_TYPES)
|
||||
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
|
||||
response_time_ms = models.PositiveIntegerField(null=True, blank=True)
|
||||
|
||||
# Response details
|
||||
status_code = models.PositiveIntegerField(null=True, blank=True)
|
||||
response_body = models.TextField(blank=True, null=True)
|
||||
error_message = models.TextField(blank=True, null=True)
|
||||
|
||||
# Metrics
|
||||
cpu_usage_percent = models.FloatField(null=True, blank=True)
|
||||
memory_usage_percent = models.FloatField(null=True, blank=True)
|
||||
disk_usage_percent = models.FloatField(null=True, blank=True)
|
||||
|
||||
# Timestamps
|
||||
checked_at = models.DateTimeField(auto_now_add=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['-checked_at']
|
||||
indexes = [
|
||||
models.Index(fields=['target', 'checked_at']),
|
||||
models.Index(fields=['status', 'checked_at']),
|
||||
models.Index(fields=['check_type']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.target.name} - {self.status} ({self.checked_at})"
|
||||
|
||||
|
||||
class SystemMetric(models.Model):
|
||||
"""System performance and operational metrics"""
|
||||
|
||||
METRIC_TYPES = [
|
||||
('PERFORMANCE', 'Performance Metric'),
|
||||
('BUSINESS', 'Business Metric'),
|
||||
('SECURITY', 'Security Metric'),
|
||||
('INFRASTRUCTURE', 'Infrastructure Metric'),
|
||||
('CUSTOM', 'Custom Metric'),
|
||||
]
|
||||
|
||||
METRIC_CATEGORIES = [
|
||||
('API_RESPONSE_TIME', 'API Response Time'),
|
||||
('THROUGHPUT', 'Throughput'),
|
||||
('ERROR_RATE', 'Error Rate'),
|
||||
('AVAILABILITY', 'Availability'),
|
||||
('INCIDENT_COUNT', 'Incident Count'),
|
||||
('MTTR', 'Mean Time to Resolve'),
|
||||
('MTTA', 'Mean Time to Acknowledge'),
|
||||
('SLA_COMPLIANCE', 'SLA Compliance'),
|
||||
('SECURITY_EVENTS', 'Security Events'),
|
||||
('AUTOMATION_SUCCESS', 'Automation Success Rate'),
|
||||
('AI_ACCURACY', 'AI Model Accuracy'),
|
||||
('COST_IMPACT', 'Cost Impact'),
|
||||
('USER_ACTIVITY', 'User Activity'),
|
||||
('SYSTEM_RESOURCES', 'System Resources'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
name = models.CharField(max_length=200)
|
||||
description = models.TextField()
|
||||
metric_type = models.CharField(max_length=20, choices=METRIC_TYPES)
|
||||
category = models.CharField(max_length=30, choices=METRIC_CATEGORIES)
|
||||
|
||||
# Metric configuration
|
||||
unit = models.CharField(max_length=50, help_text="Unit of measurement")
|
||||
aggregation_method = models.CharField(
|
||||
max_length=20,
|
||||
choices=[
|
||||
('AVERAGE', 'Average'),
|
||||
('SUM', 'Sum'),
|
||||
('COUNT', 'Count'),
|
||||
('MIN', 'Minimum'),
|
||||
('MAX', 'Maximum'),
|
||||
('PERCENTILE_95', '95th Percentile'),
|
||||
('PERCENTILE_99', '99th Percentile'),
|
||||
]
|
||||
)
|
||||
|
||||
# Collection configuration
|
||||
collection_interval_seconds = models.PositiveIntegerField(default=300) # 5 minutes
|
||||
retention_days = models.PositiveIntegerField(default=90)
|
||||
|
||||
# Thresholds
|
||||
warning_threshold = models.FloatField(null=True, blank=True)
|
||||
critical_threshold = models.FloatField(null=True, blank=True)
|
||||
|
||||
# Status
|
||||
is_active = models.BooleanField(default=True)
|
||||
is_system_metric = models.BooleanField(default=False)
|
||||
|
||||
# Related module
|
||||
related_module = models.CharField(
|
||||
max_length=50,
|
||||
blank=True,
|
||||
null=True,
|
||||
help_text="Related Django module"
|
||||
)
|
||||
|
||||
# Metadata
|
||||
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
|
||||
created_at = models.DateTimeField(auto_now_add=True)
|
||||
updated_at = models.DateTimeField(auto_now=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['name']
|
||||
indexes = [
|
||||
models.Index(fields=['metric_type', 'category']),
|
||||
models.Index(fields=['related_module']),
|
||||
models.Index(fields=['is_active']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.name} ({self.category})"
|
||||
|
||||
|
||||
class MetricMeasurement(models.Model):
|
||||
"""Individual metric measurements"""
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
metric = models.ForeignKey(SystemMetric, on_delete=models.CASCADE, related_name='measurements')
|
||||
|
||||
# Measurement details
|
||||
value = models.DecimalField(max_digits=15, decimal_places=4)
|
||||
timestamp = models.DateTimeField(auto_now_add=True)
|
||||
|
||||
# Context
|
||||
tags = models.JSONField(
|
||||
default=dict,
|
||||
help_text="Additional tags for this measurement"
|
||||
)
|
||||
metadata = models.JSONField(
|
||||
default=dict,
|
||||
help_text="Additional metadata"
|
||||
)
|
||||
|
||||
class Meta:
|
||||
ordering = ['-timestamp']
|
||||
indexes = [
|
||||
models.Index(fields=['metric', 'timestamp']),
|
||||
models.Index(fields=['timestamp']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.metric.name}: {self.value} ({self.timestamp})"
|
||||
|
||||
|
||||
class AlertRule(models.Model):
|
||||
"""Alert rules for monitoring thresholds"""
|
||||
|
||||
ALERT_TYPES = [
|
||||
('THRESHOLD', 'Threshold Alert'),
|
||||
('ANOMALY', 'Anomaly Alert'),
|
||||
('PATTERN', 'Pattern Alert'),
|
||||
('AVAILABILITY', 'Availability Alert'),
|
||||
('PERFORMANCE', 'Performance Alert'),
|
||||
]
|
||||
|
||||
SEVERITY_CHOICES = [
|
||||
('LOW', 'Low'),
|
||||
('MEDIUM', 'Medium'),
|
||||
('HIGH', 'High'),
|
||||
('CRITICAL', 'Critical'),
|
||||
]
|
||||
|
||||
STATUS_CHOICES = [
|
||||
('ACTIVE', 'Active'),
|
||||
('INACTIVE', 'Inactive'),
|
||||
('MAINTENANCE', 'Maintenance'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
name = models.CharField(max_length=200)
|
||||
description = models.TextField()
|
||||
alert_type = models.CharField(max_length=20, choices=ALERT_TYPES)
|
||||
severity = models.CharField(max_length=20, choices=SEVERITY_CHOICES)
|
||||
|
||||
# Rule configuration
|
||||
condition = models.JSONField(
|
||||
help_text="Alert condition configuration"
|
||||
)
|
||||
evaluation_interval_seconds = models.PositiveIntegerField(default=60)
|
||||
|
||||
# Related objects
|
||||
metric = models.ForeignKey(
|
||||
SystemMetric,
|
||||
on_delete=models.CASCADE,
|
||||
null=True,
|
||||
blank=True,
|
||||
related_name='alert_rules'
|
||||
)
|
||||
target = models.ForeignKey(
|
||||
MonitoringTarget,
|
||||
on_delete=models.CASCADE,
|
||||
null=True,
|
||||
blank=True,
|
||||
related_name='alert_rules'
|
||||
)
|
||||
|
||||
# Notification configuration
|
||||
notification_channels = models.JSONField(
|
||||
default=list,
|
||||
help_text="List of notification channels (email, slack, webhook, etc.)"
|
||||
)
|
||||
notification_template = models.TextField(
|
||||
blank=True,
|
||||
null=True,
|
||||
help_text="Custom notification template"
|
||||
)
|
||||
|
||||
# Status
|
||||
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
|
||||
is_enabled = models.BooleanField(default=True)
|
||||
|
||||
# Metadata
|
||||
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
|
||||
created_at = models.DateTimeField(auto_now_add=True)
|
||||
updated_at = models.DateTimeField(auto_now=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['name']
|
||||
indexes = [
|
||||
models.Index(fields=['alert_type', 'severity']),
|
||||
models.Index(fields=['status', 'is_enabled']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.name} ({self.severity})"
|
||||
|
||||
|
||||
class Alert(models.Model):
|
||||
"""Alert instances"""
|
||||
|
||||
STATUS_CHOICES = [
|
||||
('TRIGGERED', 'Triggered'),
|
||||
('ACKNOWLEDGED', 'Acknowledged'),
|
||||
('RESOLVED', 'Resolved'),
|
||||
('SUPPRESSED', 'Suppressed'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
rule = models.ForeignKey(AlertRule, on_delete=models.CASCADE, related_name='alerts')
|
||||
|
||||
# Alert details
|
||||
title = models.CharField(max_length=200)
|
||||
description = models.TextField()
|
||||
severity = models.CharField(max_length=20, choices=AlertRule.SEVERITY_CHOICES)
|
||||
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='TRIGGERED')
|
||||
|
||||
# Context
|
||||
triggered_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
|
||||
threshold_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
|
||||
context_data = models.JSONField(
|
||||
default=dict,
|
||||
help_text="Additional context data for the alert"
|
||||
)
|
||||
|
||||
# Timestamps
|
||||
triggered_at = models.DateTimeField(auto_now_add=True)
|
||||
acknowledged_at = models.DateTimeField(null=True, blank=True)
|
||||
resolved_at = models.DateTimeField(null=True, blank=True)
|
||||
|
||||
# Assignment
|
||||
acknowledged_by = models.ForeignKey(
|
||||
User,
|
||||
on_delete=models.SET_NULL,
|
||||
null=True,
|
||||
blank=True,
|
||||
related_name='acknowledged_alerts'
|
||||
)
|
||||
resolved_by = models.ForeignKey(
|
||||
User,
|
||||
on_delete=models.SET_NULL,
|
||||
null=True,
|
||||
blank=True,
|
||||
related_name='resolved_alerts'
|
||||
)
|
||||
|
||||
class Meta:
|
||||
ordering = ['-triggered_at']
|
||||
indexes = [
|
||||
models.Index(fields=['rule', 'status']),
|
||||
models.Index(fields=['severity', 'status']),
|
||||
models.Index(fields=['triggered_at']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.title} ({self.severity}) - {self.status}"
|
||||
|
||||
|
||||
class MonitoringDashboard(models.Model):
|
||||
"""Monitoring dashboard configurations"""
|
||||
|
||||
DASHBOARD_TYPES = [
|
||||
('SYSTEM_OVERVIEW', 'System Overview'),
|
||||
('PERFORMANCE', 'Performance'),
|
||||
('BUSINESS_METRICS', 'Business Metrics'),
|
||||
('SECURITY', 'Security'),
|
||||
('INFRASTRUCTURE', 'Infrastructure'),
|
||||
('CUSTOM', 'Custom'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
name = models.CharField(max_length=200)
|
||||
description = models.TextField()
|
||||
dashboard_type = models.CharField(max_length=20, choices=DASHBOARD_TYPES)
|
||||
|
||||
# Dashboard configuration
|
||||
layout_config = models.JSONField(
|
||||
default=dict,
|
||||
help_text="Dashboard layout configuration"
|
||||
)
|
||||
widget_configs = models.JSONField(
|
||||
default=list,
|
||||
help_text="Configuration for dashboard widgets"
|
||||
)
|
||||
|
||||
# Access control
|
||||
is_public = models.BooleanField(default=False)
|
||||
allowed_users = models.ManyToManyField(
|
||||
User,
|
||||
blank=True,
|
||||
related_name='accessible_monitoring_dashboards'
|
||||
)
|
||||
allowed_roles = models.JSONField(
|
||||
default=list,
|
||||
help_text="List of roles that can access this dashboard"
|
||||
)
|
||||
|
||||
# Refresh configuration
|
||||
auto_refresh_enabled = models.BooleanField(default=True)
|
||||
refresh_interval_seconds = models.PositiveIntegerField(default=30)
|
||||
|
||||
# Status
|
||||
is_active = models.BooleanField(default=True)
|
||||
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
|
||||
created_at = models.DateTimeField(auto_now_add=True)
|
||||
updated_at = models.DateTimeField(auto_now=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['name']
|
||||
indexes = [
|
||||
models.Index(fields=['dashboard_type', 'is_active']),
|
||||
models.Index(fields=['is_public']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.name} ({self.dashboard_type})"
|
||||
|
||||
|
||||
class SystemStatus(models.Model):
|
||||
"""Overall system status tracking"""
|
||||
|
||||
STATUS_CHOICES = [
|
||||
('OPERATIONAL', 'Operational'),
|
||||
('DEGRADED', 'Degraded'),
|
||||
('PARTIAL_OUTAGE', 'Partial Outage'),
|
||||
('MAJOR_OUTAGE', 'Major Outage'),
|
||||
('MAINTENANCE', 'Maintenance'),
|
||||
]
|
||||
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
|
||||
message = models.TextField(help_text="Status message for users")
|
||||
|
||||
# Impact details
|
||||
affected_services = models.JSONField(
|
||||
default=list,
|
||||
help_text="List of affected services"
|
||||
)
|
||||
estimated_resolution = models.DateTimeField(null=True, blank=True)
|
||||
|
||||
# Timestamps
|
||||
started_at = models.DateTimeField(auto_now_add=True)
|
||||
updated_at = models.DateTimeField(auto_now=True)
|
||||
resolved_at = models.DateTimeField(null=True, blank=True)
|
||||
|
||||
# Metadata
|
||||
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
|
||||
|
||||
class Meta:
|
||||
ordering = ['-started_at']
|
||||
indexes = [
|
||||
models.Index(fields=['status', 'started_at']),
|
||||
models.Index(fields=['started_at']),
|
||||
]
|
||||
|
||||
def __str__(self):
|
||||
return f"System Status: {self.status} ({self.started_at})"
|
||||
|
||||
@property
|
||||
def is_resolved(self):
|
||||
return self.resolved_at is not None
|
||||
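A minimal sketch of how these models fit together at runtime: register a target, record a health check against it, and mirror the result onto the target's cached status. The field names come from the models above; the values and the `admin` user are illustrative assumptions only:

```python
# Illustrative only; assumes an existing user instance `admin` and a configured Django ORM.
from django.utils import timezone
from monitoring.models import MonitoringTarget, HealthCheck

target, _ = MonitoringTarget.objects.get_or_create(
    name="Django Application",
    defaults={
        "description": "Main Django application health check",
        "target_type": "APPLICATION",
        "endpoint_url": "http://localhost:8000/health/",
        "expected_status_codes": [200],
        "created_by": admin,
    },
)

# Record one health check result for the target
check = HealthCheck.objects.create(
    target=target,
    check_type="HTTP",
    status="HEALTHY",
    response_time_ms=42,
    status_code=200,
)

# Keep the target's cached status in sync with the latest check
target.last_status = check.status
target.last_checked = timezone.now()
target.save(update_fields=["last_status", "last_checked"])
```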
200
ETB-API/monitoring/serializers.py
Normal file
@@ -0,0 +1,200 @@
"""
Serializers for monitoring models
"""
from rest_framework import serializers
from monitoring.models import (
    MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
    AlertRule, Alert, MonitoringDashboard, SystemStatus
)


class MonitoringTargetSerializer(serializers.ModelSerializer):
    """Serializer for MonitoringTarget model"""

    last_status_display = serializers.CharField(source='get_last_status_display', read_only=True)
    target_type_display = serializers.CharField(source='get_target_type_display', read_only=True)

    class Meta:
        model = MonitoringTarget
        fields = [
            'id', 'name', 'description', 'target_type', 'target_type_display',
            'endpoint_url', 'connection_config', 'check_interval_seconds',
            'timeout_seconds', 'retry_count', 'health_check_enabled',
            'health_check_endpoint', 'expected_status_codes', 'status',
            'last_checked', 'last_status', 'last_status_display',
            'related_module', 'created_by', 'created_at', 'updated_at'
        ]
        read_only_fields = ['id', 'created_at', 'updated_at', 'last_checked']


class HealthCheckSerializer(serializers.ModelSerializer):
    """Serializer for HealthCheck model"""

    target_name = serializers.CharField(source='target.name', read_only=True)
    status_display = serializers.CharField(source='get_status_display', read_only=True)
    check_type_display = serializers.CharField(source='get_check_type_display', read_only=True)

    class Meta:
        model = HealthCheck
        fields = [
            'id', 'target', 'target_name', 'check_type', 'check_type_display',
            'status', 'status_display', 'response_time_ms', 'status_code',
            'response_body', 'error_message', 'cpu_usage_percent',
            'memory_usage_percent', 'disk_usage_percent', 'checked_at'
        ]
        read_only_fields = ['id', 'checked_at']


class SystemMetricSerializer(serializers.ModelSerializer):
    """Serializer for SystemMetric model"""

    metric_type_display = serializers.CharField(source='get_metric_type_display', read_only=True)
    category_display = serializers.CharField(source='get_category_display', read_only=True)
    aggregation_method_display = serializers.CharField(source='get_aggregation_method_display', read_only=True)

    class Meta:
        model = SystemMetric
        fields = [
            'id', 'name', 'description', 'metric_type', 'metric_type_display',
            'category', 'category_display', 'unit', 'aggregation_method',
            'aggregation_method_display', 'collection_interval_seconds',
            'retention_days', 'warning_threshold', 'critical_threshold',
            'is_active', 'is_system_metric', 'related_module',
            'created_by', 'created_at', 'updated_at'
        ]
        read_only_fields = ['id', 'created_at', 'updated_at']


class MetricMeasurementSerializer(serializers.ModelSerializer):
    """Serializer for MetricMeasurement model"""

    metric_name = serializers.CharField(source='metric.name', read_only=True)
    metric_unit = serializers.CharField(source='metric.unit', read_only=True)

    class Meta:
        model = MetricMeasurement
        fields = [
            'id', 'metric', 'metric_name', 'metric_unit', 'value',
            'timestamp', 'tags', 'metadata'
        ]
        read_only_fields = ['id', 'timestamp']


class AlertRuleSerializer(serializers.ModelSerializer):
    """Serializer for AlertRule model"""

    alert_type_display = serializers.CharField(source='get_alert_type_display', read_only=True)
    severity_display = serializers.CharField(source='get_severity_display', read_only=True)
    status_display = serializers.CharField(source='get_status_display', read_only=True)
    metric_name = serializers.CharField(source='metric.name', read_only=True)
    target_name = serializers.CharField(source='target.name', read_only=True)

    class Meta:
        model = AlertRule
        fields = [
            'id', 'name', 'description', 'alert_type', 'alert_type_display',
            'severity', 'severity_display', 'condition', 'evaluation_interval_seconds',
            'metric', 'metric_name', 'target', 'target_name',
            'notification_channels', 'notification_template', 'status',
            'status_display', 'is_enabled', 'created_by', 'created_at', 'updated_at'
        ]
        read_only_fields = ['id', 'created_at', 'updated_at']


class AlertSerializer(serializers.ModelSerializer):
    """Serializer for Alert model"""

    rule_name = serializers.CharField(source='rule.name', read_only=True)
    severity_display = serializers.CharField(source='get_severity_display', read_only=True)
    status_display = serializers.CharField(source='get_status_display', read_only=True)
    acknowledged_by_username = serializers.CharField(source='acknowledged_by.username', read_only=True)
    resolved_by_username = serializers.CharField(source='resolved_by.username', read_only=True)

    class Meta:
        model = Alert
        fields = [
            'id', 'rule', 'rule_name', 'title', 'description', 'severity',
            'severity_display', 'status', 'status_display', 'triggered_value',
            'threshold_value', 'context_data', 'triggered_at', 'acknowledged_at',
            'resolved_at', 'acknowledged_by', 'acknowledged_by_username',
            'resolved_by', 'resolved_by_username'
        ]
        read_only_fields = ['id', 'triggered_at']


class MonitoringDashboardSerializer(serializers.ModelSerializer):
    """Serializer for MonitoringDashboard model"""

    dashboard_type_display = serializers.CharField(source='get_dashboard_type_display', read_only=True)
    created_by_username = serializers.CharField(source='created_by.username', read_only=True)

    class Meta:
        model = MonitoringDashboard
        fields = [
            'id', 'name', 'description', 'dashboard_type', 'dashboard_type_display',
            'layout_config', 'widget_configs', 'is_public', 'allowed_users',
            'allowed_roles', 'auto_refresh_enabled', 'refresh_interval_seconds',
            'is_active', 'created_by', 'created_by_username', 'created_at', 'updated_at'
        ]
        read_only_fields = ['id', 'created_at', 'updated_at']


class SystemStatusSerializer(serializers.ModelSerializer):
    """Serializer for SystemStatus model"""

    status_display = serializers.CharField(source='get_status_display', read_only=True)
    created_by_username = serializers.CharField(source='created_by.username', read_only=True)
    is_resolved = serializers.BooleanField(read_only=True)

    class Meta:
        model = SystemStatus
        fields = [
            'id', 'status', 'status_display', 'message', 'affected_services',
            'estimated_resolution', 'started_at', 'updated_at', 'resolved_at',
            'created_by', 'created_by_username', 'is_resolved'
        ]
        read_only_fields = ['id', 'started_at', 'updated_at']


class HealthCheckSummarySerializer(serializers.Serializer):
    """Serializer for health check summary"""

    overall_status = serializers.CharField()
    total_targets = serializers.IntegerField()
    healthy_targets = serializers.IntegerField()
    warning_targets = serializers.IntegerField()
    critical_targets = serializers.IntegerField()
    health_percentage = serializers.FloatField()
    last_updated = serializers.DateTimeField()


class MetricTrendSerializer(serializers.Serializer):
    """Serializer for metric trends"""

    metric_name = serializers.CharField()
    period_days = serializers.IntegerField()
    daily_data = serializers.ListField()
    trend = serializers.CharField()


class AlertSummarySerializer(serializers.Serializer):
    """Serializer for alert summary"""

    total_alerts = serializers.IntegerField()
    critical_alerts = serializers.IntegerField()
    high_alerts = serializers.IntegerField()
    medium_alerts = serializers.IntegerField()
    low_alerts = serializers.IntegerField()
    acknowledged_alerts = serializers.IntegerField()
    resolved_alerts = serializers.IntegerField()


class SystemOverviewSerializer(serializers.Serializer):
    """Serializer for system overview"""

    system_status = SystemStatusSerializer()
    health_summary = HealthCheckSummarySerializer()
    alert_summary = AlertSummarySerializer()
    recent_incidents = serializers.ListField()
    top_metrics = serializers.ListField()
    system_resources = serializers.DictField()
|
||||
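A minimal sketch of how the plain `Serializer` classes above could validate and render the summary dict returned by the health check service (this snippet is illustrative, not part of the commit; it assumes a configured Django settings module and that the module is importable as `monitoring.serializers`):

```python
# Illustrative sketch: validate a health summary payload before returning it from a view.
from monitoring.serializers import HealthCheckSummarySerializer

summary = {
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 11,
    "warning_targets": 1,
    "critical_targets": 0,
    "health_percentage": 91.67,
    "last_updated": "2024-01-15T10:30:00Z",
}

serializer = HealthCheckSummarySerializer(data=summary)
serializer.is_valid(raise_exception=True)   # raises if a field is missing or mistyped
print(serializer.validated_data["overall_status"])  # "HEALTHY"
```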
1
ETB-API/monitoring/services/__init__.py
Normal file
@@ -0,0 +1 @@
# Monitoring services
BIN
ETB-API/monitoring/services/__pycache__/__init__.cpython-312.pyc
Normal file
Binary file not shown.
BIN
ETB-API/monitoring/services/__pycache__/alerting.cpython-312.pyc
Normal file
Binary file not shown.
449
ETB-API/monitoring/services/alerting.py
Normal file
@@ -0,0 +1,449 @@
"""
Alerting service for monitoring system
"""
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.core.mail import send_mail
from django.conf import settings
from django.contrib.auth import get_user_model

from monitoring.models import AlertRule, Alert, SystemMetric, MetricMeasurement, MonitoringTarget

User = get_user_model()
logger = logging.getLogger(__name__)


class AlertEvaluator:
    """Service for evaluating alert conditions"""

    def __init__(self):
        self.aggregator = None  # Will be imported to avoid circular imports

    def evaluate_alert_rules(self) -> List[Dict[str, Any]]:
        """Evaluate all active alert rules"""
        triggered_alerts = []

        active_rules = AlertRule.objects.filter(
            status='ACTIVE',
            is_enabled=True
        )

        for rule in active_rules:
            try:
                if self._evaluate_rule(rule):
                    alert_data = self._create_alert(rule)
                    triggered_alerts.append(alert_data)
            except Exception as e:
                logger.error(f"Failed to evaluate alert rule {rule.name}: {e}")

        return triggered_alerts

    def _evaluate_rule(self, rule: AlertRule) -> bool:
        """Evaluate if an alert rule condition is met"""
        condition = rule.condition
        condition_type = condition.get('type')

        if condition_type == 'THRESHOLD':
            return self._evaluate_threshold_condition(rule, condition)
        elif condition_type == 'ANOMALY':
            return self._evaluate_anomaly_condition(rule, condition)
        elif condition_type == 'AVAILABILITY':
            return self._evaluate_availability_condition(rule, condition)
        elif condition_type == 'PATTERN':
            return self._evaluate_pattern_condition(rule, condition)
        else:
            logger.warning(f"Unknown condition type: {condition_type}")
            return False

    def _evaluate_threshold_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
        """Evaluate threshold-based alert conditions"""
        if not rule.metric:
            return False

        # Get latest metric value
        latest_measurement = MetricMeasurement.objects.filter(
            metric=rule.metric
        ).order_by('-timestamp').first()

        if not latest_measurement:
            return False

        current_value = float(latest_measurement.value)
        threshold_value = condition.get('threshold')
        operator = condition.get('operator', '>')

        if operator == '>':
            return current_value > threshold_value
        elif operator == '>=':
            return current_value >= threshold_value
        elif operator == '<':
            return current_value < threshold_value
        elif operator == '<=':
            return current_value <= threshold_value
        elif operator == '==':
            return current_value == threshold_value
        elif operator == '!=':
            return current_value != threshold_value
        else:
            logger.warning(f"Unknown operator: {operator}")
            return False

    def _evaluate_anomaly_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
        """Evaluate anomaly-based alert conditions"""
        # This would integrate with anomaly detection models
        # For now, implement a simple statistical anomaly detection

        if not rule.metric:
            return False

        # Get recent measurements
        since = timezone.now() - timedelta(hours=24)
        measurements = MetricMeasurement.objects.filter(
            metric=rule.metric,
            timestamp__gte=since
        ).order_by('-timestamp')[:100]  # Last 100 measurements

        if len(measurements) < 10:  # Need minimum data points
            return False

        values = [float(m.value) for m in measurements]

        # Calculate mean and standard deviation
        mean = sum(values) / len(values)
        variance = sum((x - mean) ** 2 for x in values) / len(values)
        std_dev = variance ** 0.5

        # Check if latest value is an anomaly (more than 2 standard deviations)
        latest_value = values[0]
        anomaly_threshold = condition.get('threshold', 2.0)  # Default 2 sigma

        return abs(latest_value - mean) > (anomaly_threshold * std_dev)

    def _evaluate_availability_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
        """Evaluate availability-based alert conditions"""
        if not rule.target:
            return False

        # Check if target is in critical state
        return rule.target.last_status == 'CRITICAL'

    def _evaluate_pattern_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
        """Evaluate pattern-based alert conditions"""
        # This would integrate with pattern detection algorithms
        # For now, return False as placeholder
        return False

    def _create_alert(self, rule: AlertRule) -> Dict[str, Any]:
        """Create an alert instance"""
        # Get current value for context
        current_value = None
        threshold_value = None

        if rule.metric:
            latest_measurement = MetricMeasurement.objects.filter(
                metric=rule.metric
            ).order_by('-timestamp').first()
            if latest_measurement:
                current_value = float(latest_measurement.value)
                threshold_value = rule.metric.critical_threshold

        # Create alert
        alert = Alert.objects.create(
            rule=rule,
            title=f"{rule.name} - {rule.severity}",
            description=self._generate_alert_description(rule, current_value, threshold_value),
            severity=rule.severity,
            triggered_value=current_value,
            threshold_value=threshold_value,
            context_data={
                'rule_id': str(rule.id),
                'metric_name': rule.metric.name if rule.metric else None,
                'target_name': rule.target.name if rule.target else None,
                'condition': rule.condition
            }
        )

        return {
            'alert_id': str(alert.id),
            'rule_id': str(rule.id),  # included so NotificationService can resolve the rule's channels
            'rule_name': rule.name,
            'severity': rule.severity,
            'title': alert.title,
            'description': alert.description,
            'current_value': current_value,
            'threshold_value': threshold_value
        }

    def _generate_alert_description(self, rule: AlertRule, current_value: Optional[float], threshold_value: Optional[float]) -> str:
        """Generate alert description"""
        description = f"Alert rule '{rule.name}' has been triggered.\n"

        if rule.metric and current_value is not None:
            description += f"Current value: {current_value} {rule.metric.unit}\n"

        if threshold_value is not None:
            description += f"Threshold: {threshold_value} {rule.metric.unit if rule.metric else ''}\n"

        if rule.target:
            description += f"Target: {rule.target.name}\n"

        description += f"Severity: {rule.severity}\n"
        description += f"Time: {timezone.now().strftime('%Y-%m-%d %H:%M:%S')}"

        return description


class NotificationService:
    """Service for sending alert notifications"""

    def __init__(self):
        self.evaluator = AlertEvaluator()

    def send_alert_notifications(self, alert_data: Dict[str, Any]) -> Dict[str, Any]:
        """Send notifications for an alert"""
        results = {}

        # Get alert rule to determine notification channels
        rule_id = alert_data.get('rule_id')
        if not rule_id:
            return {'error': 'No rule ID provided'}

        try:
            rule = AlertRule.objects.get(id=rule_id)
        except AlertRule.DoesNotExist:
            return {'error': 'Alert rule not found'}

        notification_channels = rule.notification_channels or []

        for channel in notification_channels:
            try:
                if channel['type'] == 'EMAIL':
                    result = self._send_email_notification(alert_data, channel)
                elif channel['type'] == 'SLACK':
                    result = self._send_slack_notification(alert_data, channel)
                elif channel['type'] == 'WEBHOOK':
                    result = self._send_webhook_notification(alert_data, channel)
                else:
                    result = {'error': f'Unknown notification channel type: {channel["type"]}'}

                results[channel['type']] = result

            except Exception as e:
                logger.error(f"Failed to send {channel['type']} notification: {e}")
                results[channel['type']] = {'error': str(e)}

        return results

    def _send_email_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
        """Send email notification"""
        try:
            recipients = channel.get('recipients', [])
            if not recipients:
                return {'error': 'No email recipients configured'}

            subject = f"[{alert_data.get('severity', 'ALERT')}] {alert_data.get('title', 'System Alert')}"
            message = alert_data.get('description', '')

            send_mail(
                subject=subject,
                message=message,
                from_email=settings.DEFAULT_FROM_EMAIL,
                recipient_list=recipients,
                fail_silently=False
            )

            return {'status': 'sent', 'recipients': recipients}

        except Exception as e:
            return {'error': str(e)}

    def _send_slack_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
        """Send Slack notification"""
        try:
            webhook_url = channel.get('webhook_url')
            if not webhook_url:
                return {'error': 'No Slack webhook URL configured'}

            # Create Slack message
            color = self._get_slack_color(alert_data.get('severity', 'MEDIUM'))

            slack_message = {
                "text": alert_data.get('title', 'System Alert'),
                "attachments": [
                    {
                        "color": color,
                        "fields": [
                            {
                                "title": "Description",
                                "value": alert_data.get('description', ''),
                                "short": False
                            },
                            {
                                "title": "Severity",
                                "value": alert_data.get('severity', 'UNKNOWN'),
                                "short": True
                            },
                            {
                                "title": "Time",
                                "value": timezone.now().strftime('%Y-%m-%d %H:%M:%S'),
                                "short": True
                            }
                        ]
                    }
                ]
            }

            # Send to Slack (would use requests in real implementation)
            # requests.post(webhook_url, json=slack_message)

            return {'status': 'sent', 'channel': channel.get('channel', '#alerts')}

        except Exception as e:
            return {'error': str(e)}

    def _send_webhook_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
        """Send webhook notification"""
        try:
            webhook_url = channel.get('url')
            if not webhook_url:
                return {'error': 'No webhook URL configured'}

            # Prepare webhook payload
            payload = {
                'alert': alert_data,
                'timestamp': timezone.now().isoformat(),
                'source': 'ETB-API-Monitoring'
            }

            # Send webhook (would use requests in real implementation)
            # requests.post(webhook_url, json=payload)

            return {'status': 'sent', 'url': webhook_url}

        except Exception as e:
            return {'error': str(e)}

    def _get_slack_color(self, severity: str) -> str:
        """Get Slack color based on severity"""
        color_map = {
            'LOW': 'good',
            'MEDIUM': 'warning',
            'HIGH': 'danger',
            'CRITICAL': 'danger'
        }
        return color_map.get(severity, 'warning')


class AlertingService:
    """Main alerting service that coordinates alert evaluation and notification"""

    def __init__(self):
        self.evaluator = AlertEvaluator()
        self.notification_service = NotificationService()

    def run_alert_evaluation(self) -> Dict[str, Any]:
        """Run alert evaluation and send notifications"""
        results = {
            'evaluated_rules': 0,
            'triggered_alerts': 0,
            'notifications_sent': 0,
            'errors': []
        }

        try:
            # Evaluate all alert rules
            triggered_alerts = self.evaluator.evaluate_alert_rules()
            results['triggered_alerts'] = len(triggered_alerts)

            # Send notifications for triggered alerts
            for alert_data in triggered_alerts:
                try:
                    notification_results = self.notification_service.send_alert_notifications(alert_data)
                    results['notifications_sent'] += 1
                except Exception as e:
                    logger.error(f"Failed to send notifications for alert {alert_data.get('alert_id')}: {e}")
                    results['errors'].append(str(e))

            # Count evaluated rules
            results['evaluated_rules'] = AlertRule.objects.filter(
                status='ACTIVE',
                is_enabled=True
            ).count()

        except Exception as e:
            logger.error(f"Alert evaluation failed: {e}")
            results['errors'].append(str(e))

        return results

    def acknowledge_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
        """Acknowledge an alert"""
        try:
            alert = Alert.objects.get(id=alert_id)
            alert.status = 'ACKNOWLEDGED'
            alert.acknowledged_by = user
            alert.acknowledged_at = timezone.now()
            alert.save()

            return {
                'status': 'success',
                'message': f'Alert {alert_id} acknowledged by {user.username}'
            }

        except Alert.DoesNotExist:
            return {
                'status': 'error',
                'message': f'Alert {alert_id} not found'
            }
        except Exception as e:
            return {
                'status': 'error',
                'message': str(e)
            }

    def resolve_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
        """Resolve an alert"""
        try:
            alert = Alert.objects.get(id=alert_id)
            alert.status = 'RESOLVED'
            alert.resolved_by = user
            alert.resolved_at = timezone.now()
            alert.save()

            return {
                'status': 'success',
                'message': f'Alert {alert_id} resolved by {user.username}'
            }

        except Alert.DoesNotExist:
            return {
                'status': 'error',
                'message': f'Alert {alert_id} not found'
            }
        except Exception as e:
            return {
                'status': 'error',
                'message': str(e)
            }

    def get_active_alerts(self, severity: Optional[str] = None) -> List[Dict[str, Any]]:
        """Get active alerts"""
        alerts = Alert.objects.filter(status='TRIGGERED')

        if severity:
            alerts = alerts.filter(severity=severity)

        return [
            {
                'id': str(alert.id),
                'title': alert.title,
                'description': alert.description,
                'severity': alert.severity,
                'triggered_at': alert.triggered_at,
                'rule_name': alert.rule.name,
                'current_value': float(alert.triggered_value) if alert.triggered_value else None,
                'threshold_value': float(alert.threshold_value) if alert.threshold_value else None
            }
            for alert in alerts.order_by('-triggered_at')
        ]
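A minimal sketch of the `condition` dict that the threshold evaluator above expects, and of driving one evaluation pass by hand. This is illustrative only: the choice values (`"THRESHOLD"`, `"HIGH"`, `"ACTIVE"`) and the metric name are assumptions based on the field names used in this commit, not verified model defaults.

```python
# Illustrative sketch: create a threshold rule and run one evaluation pass.
from monitoring.models import AlertRule, SystemMetric
from monitoring.services.alerting import AlertingService

metric = SystemMetric.objects.get(name="api_error_rate")  # assumed existing metric

AlertRule.objects.create(
    name="High error rate",
    alert_type="THRESHOLD",        # assumed choice value
    severity="HIGH",
    status="ACTIVE",
    is_enabled=True,
    metric=metric,
    evaluation_interval_seconds=60,
    # Shape read by AlertEvaluator._evaluate_threshold_condition:
    condition={"type": "THRESHOLD", "operator": ">", "threshold": 0.05},
    notification_channels=[{"type": "EMAIL", "recipients": ["oncall@example.com"]}],
)

results = AlertingService().run_alert_evaluation()
print(results["triggered_alerts"], results["notifications_sent"], results["errors"])
```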
372
ETB-API/monitoring/services/health_checks.py
Normal file
@@ -0,0 +1,372 @@
"""
Health check services for monitoring system components
"""
import time
import requests
import psutil
import logging
from typing import Dict, Any, Optional, Tuple
from django.conf import settings
from django.db import connection
from django.core.cache import cache
from django.utils import timezone
from celery import current_app as celery_app

logger = logging.getLogger(__name__)


class BaseHealthCheck:
    """Base class for health checks"""

    def __init__(self, target):
        self.target = target
        self.start_time = None
        self.end_time = None

    def execute(self) -> Dict[str, Any]:
        """Execute the health check and return results"""
        self.start_time = time.time()
        try:
            result = self._perform_check()
            self.end_time = time.time()

            result.update({
                'response_time_ms': int((self.end_time - self.start_time) * 1000),
                'checked_at': timezone.now(),
                'error_message': None
            })

            return result
        except Exception as e:
            self.end_time = time.time()
            logger.error(f"Health check failed for {self.target.name}: {e}")
            return {
                'status': 'CRITICAL',
                'response_time_ms': int((self.end_time - self.start_time) * 1000),
                'checked_at': timezone.now(),
                'error_message': str(e)
            }

    def _perform_check(self) -> Dict[str, Any]:
        """Override in subclasses to implement specific checks"""
        raise NotImplementedError


class HTTPHealthCheck(BaseHealthCheck):
    """HTTP-based health check"""

    def _perform_check(self) -> Dict[str, Any]:
        url = self.target.endpoint_url
        if not url:
            raise ValueError("No endpoint URL configured")

        timeout = self.target.timeout_seconds
        expected_codes = self.target.expected_status_codes or [200]

        response = requests.get(url, timeout=timeout)

        if response.status_code in expected_codes:
            status = 'HEALTHY'
        elif response.status_code >= 500:
            status = 'CRITICAL'
        else:
            status = 'WARNING'

        return {
            'status': status,
            'status_code': response.status_code,
            'response_body': response.text[:1000]  # Limit response body size
        }


class DatabaseHealthCheck(BaseHealthCheck):
    """Database connection health check"""

    def _perform_check(self) -> Dict[str, Any]:
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1")
                result = cursor.fetchone()

            if result and result[0] == 1:
                return {
                    'status': 'HEALTHY',
                    'status_code': 200
                }
            else:
                return {
                    'status': 'CRITICAL',
                    'status_code': 500,
                    'error_message': 'Database query returned unexpected result'
                }
        except Exception as e:
            return {
                'status': 'CRITICAL',
                'status_code': 500,
                'error_message': f'Database connection failed: {str(e)}'
            }


class CacheHealthCheck(BaseHealthCheck):
    """Cache system health check"""

    def _perform_check(self) -> Dict[str, Any]:
        try:
            # Test cache write/read
            test_key = f"health_check_{int(time.time())}"
            test_value = "health_check_value"

            cache.set(test_key, test_value, timeout=10)
            retrieved_value = cache.get(test_key)

            if retrieved_value == test_value:
                cache.delete(test_key)  # Clean up
                return {
                    'status': 'HEALTHY',
                    'status_code': 200
                }
            else:
                return {
                    'status': 'CRITICAL',
                    'status_code': 500,
                    'error_message': 'Cache read/write test failed'
                }
        except Exception as e:
            return {
                'status': 'CRITICAL',
                'status_code': 500,
                'error_message': f'Cache operation failed: {str(e)}'
            }


class CeleryHealthCheck(BaseHealthCheck):
    """Celery worker health check"""

    def _perform_check(self) -> Dict[str, Any]:
        try:
            # Check if Celery workers are active
            inspect = celery_app.control.inspect()
            active_workers = inspect.active()

            if active_workers:
                worker_count = len(active_workers)
                return {
                    'status': 'HEALTHY',
                    'status_code': 200,
                    'response_body': f'Active workers: {worker_count}'
                }
            else:
                return {
                    'status': 'CRITICAL',
                    'status_code': 500,
                    'error_message': 'No active Celery workers found'
                }
        except Exception as e:
            return {
                'status': 'CRITICAL',
                'status_code': 500,
                'error_message': f'Celery health check failed: {str(e)}'
            }


class SystemResourceHealthCheck(BaseHealthCheck):
    """System resource health check"""

    def _perform_check(self) -> Dict[str, Any]:
        try:
            # Get system metrics
            cpu_percent = psutil.cpu_percent(interval=1)
            memory = psutil.virtual_memory()
            disk = psutil.disk_usage('/')

            # Determine status based on thresholds
            status = 'HEALTHY'
            if cpu_percent > 90 or memory.percent > 90 or disk.percent > 90:
                status = 'CRITICAL'
            elif cpu_percent > 80 or memory.percent > 80 or disk.percent > 80:
                status = 'WARNING'

            return {
                'status': status,
                'status_code': 200,
                'cpu_usage_percent': cpu_percent,
                'memory_usage_percent': memory.percent,
                'disk_usage_percent': disk.percent,
                'response_body': f'CPU: {cpu_percent}%, Memory: {memory.percent}%, Disk: {disk.percent}%'
            }
        except Exception as e:
            return {
                'status': 'CRITICAL',
                'status_code': 500,
                'error_message': f'System resource check failed: {str(e)}'
            }


class ModuleHealthCheck(BaseHealthCheck):
    """Django module health check"""

    def _perform_check(self) -> Dict[str, Any]:
        try:
            module_name = self.target.related_module
            if not module_name:
                raise ValueError("No module specified for module health check")

            # Import the module to check if it's accessible
            __import__(module_name)

            # Check if module has required models/views
            from django.apps import apps
            app_config = apps.get_app_config(module_name)

            if app_config:
                return {
                    'status': 'HEALTHY',
                    'status_code': 200,
                    'response_body': f'Module {module_name} is accessible'
                }
            else:
                return {
                    'status': 'WARNING',
                    'status_code': 200,
                    'error_message': f'Module {module_name} not found in Django apps'
                }
        except Exception as e:
            return {
                'status': 'CRITICAL',
                'status_code': 500,
                'error_message': f'Module health check failed: {str(e)}'
            }


class HealthCheckFactory:
    """Factory for creating health check instances"""

    CHECK_CLASSES = {
        'HTTP': HTTPHealthCheck,
        'DATABASE': DatabaseHealthCheck,
        'CACHE': CacheHealthCheck,
        'QUEUE': CeleryHealthCheck,
        'CUSTOM': BaseHealthCheck,
        'PING': HTTPHealthCheck,  # Use HTTP for ping
        'SSL': HTTPHealthCheck,   # Use HTTP for SSL
    }

    @classmethod
    def create_health_check(cls, target, check_type: str) -> BaseHealthCheck:
        """Create a health check instance based on type"""
        check_class = cls.CHECK_CLASSES.get(check_type, BaseHealthCheck)
        return check_class(target)

    @classmethod
    def get_available_check_types(cls) -> list:
        """Get list of available health check types"""
        return list(cls.CHECK_CLASSES.keys())


class HealthCheckService:
    """Service for managing health checks"""

    def __init__(self):
        self.factory = HealthCheckFactory()

    def execute_health_check(self, target, check_type: str) -> Dict[str, Any]:
        """Execute a health check for a target"""
        health_check = self.factory.create_health_check(target, check_type)
        return health_check.execute()

    def execute_all_health_checks(self) -> Dict[str, Any]:
        """Execute health checks for all active targets"""
        from monitoring.models import MonitoringTarget, HealthCheck

        results = {}
        active_targets = MonitoringTarget.objects.filter(
            status='ACTIVE',
            health_check_enabled=True
        )

        for target in active_targets:
            try:
                # Determine check type based on target type
                check_type = self._get_check_type_for_target(target)

                # Execute health check
                result = self.execute_health_check(target, check_type)

                # Save result to database
                HealthCheck.objects.create(
                    target=target,
                    check_type=check_type,
                    status=result['status'],
                    response_time_ms=result.get('response_time_ms'),
                    status_code=result.get('status_code'),
                    response_body=result.get('response_body'),
                    error_message=result.get('error_message'),
                    cpu_usage_percent=result.get('cpu_usage_percent'),
                    memory_usage_percent=result.get('memory_usage_percent'),
                    disk_usage_percent=result.get('disk_usage_percent')
                )

                # Update target status
                target.last_checked = timezone.now()
                target.last_status = result['status']
                target.save(update_fields=['last_checked', 'last_status'])

                results[target.name] = result

            except Exception as e:
                logger.error(f"Failed to execute health check for {target.name}: {e}")
                results[target.name] = {
                    'status': 'CRITICAL',
                    'error_message': str(e)
                }

        return results

    def _get_check_type_for_target(self, target) -> str:
        """Determine the appropriate check type for a target"""
        target_type_mapping = {
            'APPLICATION': 'HTTP',
            'DATABASE': 'DATABASE',
            'CACHE': 'CACHE',
            'QUEUE': 'QUEUE',
            'EXTERNAL_API': 'HTTP',
            'SERVICE': 'HTTP',
            'INFRASTRUCTURE': 'HTTP',
            'MODULE': 'CUSTOM',
        }

        return target_type_mapping.get(target.target_type, 'HTTP')

    def get_system_health_summary(self) -> Dict[str, Any]:
        """Get overall system health summary"""
        from monitoring.models import HealthCheck, MonitoringTarget

        # Get latest health check for each target
        latest_checks = HealthCheck.objects.filter(
            target__status='ACTIVE'
        ).order_by('target', '-checked_at').distinct('target')

        total_targets = MonitoringTarget.objects.filter(status='ACTIVE').count()
        healthy_targets = latest_checks.filter(status='HEALTHY').count()
        warning_targets = latest_checks.filter(status='WARNING').count()
        critical_targets = latest_checks.filter(status='CRITICAL').count()

        # Calculate overall status
        if critical_targets > 0:
            overall_status = 'CRITICAL'
        elif warning_targets > 0:
            overall_status = 'WARNING'
        elif healthy_targets == total_targets:
            overall_status = 'HEALTHY'
        else:
            overall_status = 'UNKNOWN'

        return {
            'overall_status': overall_status,
            'total_targets': total_targets,
            'healthy_targets': healthy_targets,
            'warning_targets': warning_targets,
            'critical_targets': critical_targets,
            'health_percentage': (healthy_targets / total_targets * 100) if total_targets > 0 else 0,
            'last_updated': timezone.now()
        }
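A short usage sketch for the service above (illustrative, not part of the commit; it assumes a configured Django environment and an existing `MonitoringTarget` row with an illustrative name):

```python
# Illustrative sketch: run one HTTP check for a single target, then summarise everything.
from monitoring.models import MonitoringTarget
from monitoring.services.health_checks import HealthCheckService

target = MonitoringTarget.objects.get(name="billing-service")  # assumed target name
service = HealthCheckService()

result = service.execute_health_check(target, "HTTP")
print(result["status"], result.get("response_time_ms"))

# Or check every ACTIVE target with health_check_enabled=True and read the rollup:
service.execute_all_health_checks()
summary = service.get_system_health_summary()
print(summary["overall_status"], round(summary["health_percentage"], 2))
```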
364
ETB-API/monitoring/services/metrics_collector.py
Normal file
@@ -0,0 +1,364 @@
"""
Metrics collection service for system monitoring
"""
import time
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.db import connection
from django.core.cache import cache
from django.conf import settings
from django.contrib.auth import get_user_model

from monitoring.models import SystemMetric, MetricMeasurement

User = get_user_model()
logger = logging.getLogger(__name__)


class MetricsCollector:
    """Service for collecting and storing system metrics"""

    def __init__(self):
        self.collected_metrics = {}

    def collect_all_metrics(self) -> Dict[str, Any]:
        """Collect all configured metrics"""
        results = {}

        # Get all active metrics
        active_metrics = SystemMetric.objects.filter(is_active=True)

        for metric in active_metrics:
            try:
                value = self._collect_metric_value(metric)
                if value is not None:
                    # Store measurement
                    measurement = MetricMeasurement.objects.create(
                        metric=metric,
                        value=value,
                        tags=self._get_metric_tags(metric),
                        metadata=self._get_metric_metadata(metric)
                    )

                    results[metric.name] = {
                        'value': value,
                        'measurement_id': measurement.id,
                        'timestamp': measurement.timestamp
                    }

            except Exception as e:
                logger.error(f"Failed to collect metric {metric.name}: {e}")
                results[metric.name] = {
                    'error': str(e)
                }

        return results

    def _collect_metric_value(self, metric: SystemMetric) -> Optional[float]:
        """Collect value for a specific metric"""
        category = metric.category

        if category == 'API_RESPONSE_TIME':
            return self._collect_api_response_time(metric)
        elif category == 'THROUGHPUT':
            return self._collect_throughput(metric)
        elif category == 'ERROR_RATE':
            return self._collect_error_rate(metric)
        elif category == 'AVAILABILITY':
            return self._collect_availability(metric)
        elif category == 'INCIDENT_COUNT':
            return self._collect_incident_count(metric)
        elif category == 'MTTR':
            return self._collect_mttr(metric)
        elif category == 'MTTA':
            return self._collect_mtta(metric)
        elif category == 'SLA_COMPLIANCE':
            return self._collect_sla_compliance(metric)
        elif category == 'SECURITY_EVENTS':
            return self._collect_security_events(metric)
        elif category == 'AUTOMATION_SUCCESS':
            return self._collect_automation_success(metric)
        elif category == 'AI_ACCURACY':
            return self._collect_ai_accuracy(metric)
        elif category == 'COST_IMPACT':
            return self._collect_cost_impact(metric)
        elif category == 'USER_ACTIVITY':
            return self._collect_user_activity(metric)
        elif category == 'SYSTEM_RESOURCES':
            return self._collect_system_resources(metric)
        else:
            logger.warning(f"Unknown metric category: {category}")
            return None

    def _collect_api_response_time(self, metric: SystemMetric) -> Optional[float]:
        """Collect API response time metrics"""
        # This would typically come from middleware or APM tools
        # For now, return a mock value
        return 150.5  # milliseconds

    def _collect_throughput(self, metric: SystemMetric) -> Optional[float]:
        """Collect throughput metrics (requests per minute)"""
        # Count requests in the last minute
        # This would typically come from access logs or middleware
        return 120.0  # requests per minute

    def _collect_error_rate(self, metric: SystemMetric) -> Optional[float]:
        """Collect error rate metrics"""
        # Count errors in the last hour
        # This would typically come from logs or error tracking
        return 0.02  # 2% error rate

    def _collect_availability(self, metric: SystemMetric) -> Optional[float]:
        """Collect availability metrics"""
        # Calculate availability percentage
        # This would typically come from uptime monitoring
        return 99.9  # 99.9% availability

    def _collect_incident_count(self, metric: SystemMetric) -> Optional[float]:
        """Collect incident count metrics"""
        from incident_intelligence.models import Incident

        # Count incidents in the last 24 hours
        since = timezone.now() - timedelta(hours=24)
        count = Incident.objects.filter(created_at__gte=since).count()
        return float(count)

    def _collect_mttr(self, metric: SystemMetric) -> Optional[float]:
        """Collect Mean Time to Resolve metrics"""
        from incident_intelligence.models import Incident

        # Calculate MTTR for resolved incidents in the last 7 days
        since = timezone.now() - timedelta(days=7)
        resolved_incidents = Incident.objects.filter(
            status__in=['RESOLVED', 'CLOSED'],
            resolved_at__isnull=False,
            resolved_at__gte=since
        )

        if not resolved_incidents.exists():
            return None

        total_resolution_time = 0
        count = 0

        for incident in resolved_incidents:
            if incident.resolved_at and incident.created_at:
                resolution_time = incident.resolved_at - incident.created_at
                total_resolution_time += resolution_time.total_seconds()
                count += 1

        if count > 0:
            return total_resolution_time / count / 60  # Convert to minutes
        return None

    def _collect_mtta(self, metric: SystemMetric) -> Optional[float]:
        """Collect Mean Time to Acknowledge metrics"""
        # This would require tracking when incidents are first acknowledged
        # For now, return a mock value
        return 15.5  # minutes

    def _collect_sla_compliance(self, metric: SystemMetric) -> Optional[float]:
        """Collect SLA compliance metrics"""
        from sla_oncall.models import SLAInstance

        # Calculate SLA compliance percentage
        total_slas = SLAInstance.objects.count()
        if total_slas == 0:
            return None

        # This would require more complex SLA compliance calculation
        # For now, return a mock value
        return 95.5  # 95.5% SLA compliance

    def _collect_security_events(self, metric: SystemMetric) -> Optional[float]:
        """Collect security events metrics"""
        # Count security events in the last hour
        # This would come from security logs or audit trails
        return 3.0  # 3 security events in the last hour

    def _collect_automation_success(self, metric: SystemMetric) -> Optional[float]:
        """Collect automation success rate metrics"""
        from automation_orchestration.models import RunbookExecution

        # Calculate success rate for runbook executions in the last 24 hours
        since = timezone.now() - timedelta(hours=24)
        executions = RunbookExecution.objects.filter(created_at__gte=since)

        if not executions.exists():
            return None

        successful = executions.filter(status='COMPLETED').count()
        total = executions.count()

        return (successful / total * 100) if total > 0 else None

    def _collect_ai_accuracy(self, metric: SystemMetric) -> Optional[float]:
        """Collect AI model accuracy metrics"""
        from incident_intelligence.models import IncidentClassification

        # Calculate accuracy for AI classifications
        classifications = IncidentClassification.objects.all()

        if not classifications.exists():
            return None

        # This would require comparing predictions with actual outcomes
        # For now, return average confidence score
        total_confidence = sum(c.confidence_score for c in classifications)
        return (total_confidence / classifications.count() * 100) if classifications.count() > 0 else None

    def _collect_cost_impact(self, metric: SystemMetric) -> Optional[float]:
        """Collect cost impact metrics"""
        from analytics_predictive_insights.models import CostImpactAnalysis

        # Calculate total cost impact for the last 30 days
        since = timezone.now() - timedelta(days=30)
        cost_analyses = CostImpactAnalysis.objects.filter(created_at__gte=since)

        total_cost = sum(float(ca.cost_amount) for ca in cost_analyses)
        return total_cost

    def _collect_user_activity(self, metric: SystemMetric) -> Optional[float]:
        """Collect user activity metrics"""
        # Count active users in the last hour
        since = timezone.now() - timedelta(hours=1)
        # This would require user activity tracking
        return 25.0  # 25 active users in the last hour

    def _collect_system_resources(self, metric: SystemMetric) -> Optional[float]:
        """Collect system resource metrics"""
        import psutil

        # Get CPU usage
        cpu_percent = psutil.cpu_percent(interval=1)
        return cpu_percent

    def _get_metric_tags(self, metric: SystemMetric) -> Dict[str, str]:
        """Get tags for a metric measurement"""
        tags = {
            'metric_type': metric.metric_type,
            'category': metric.category,
        }

        if metric.related_module:
            tags['module'] = metric.related_module

        return tags

    def _get_metric_metadata(self, metric: SystemMetric) -> Dict[str, Any]:
        """Get metadata for a metric measurement"""
        return {
            'unit': metric.unit,
            'aggregation_method': metric.aggregation_method,
            'collection_interval': metric.collection_interval_seconds,
        }


class MetricsAggregator:
    """Service for aggregating metrics over time periods"""

    def __init__(self):
        self.collector = MetricsCollector()

    def aggregate_metrics(self, metric: SystemMetric, start_time: datetime, end_time: datetime) -> Dict[str, Any]:
        """Aggregate metrics over a time period"""
        measurements = MetricMeasurement.objects.filter(
            metric=metric,
            timestamp__gte=start_time,
            timestamp__lte=end_time
        ).order_by('timestamp')

        if not measurements.exists():
            return {
                'count': 0,
                'values': [],
                'aggregated_value': None
            }

        values = [float(m.value) for m in measurements]
        aggregated_value = self._aggregate_values(values, metric.aggregation_method)

        return {
            'count': len(values),
            'values': values,
            'aggregated_value': aggregated_value,
            'start_time': start_time,
            'end_time': end_time,
            'unit': metric.unit
        }

    def _aggregate_values(self, values: List[float], method: str) -> Optional[float]:
        """Aggregate a list of values using the specified method"""
        if not values:
            return None

        if method == 'AVERAGE':
            return sum(values) / len(values)
        elif method == 'SUM':
            return sum(values)
        elif method == 'COUNT':
            return len(values)
        elif method == 'MIN':
            return min(values)
        elif method == 'MAX':
            return max(values)
        elif method == 'PERCENTILE_95':
            return self._calculate_percentile(values, 95)
        elif method == 'PERCENTILE_99':
            return self._calculate_percentile(values, 99)
        else:
            return sum(values) / len(values)  # Default to average

    def _calculate_percentile(self, values: List[float], percentile: int) -> float:
        """Calculate percentile of values"""
        sorted_values = sorted(values)
        index = int((percentile / 100) * len(sorted_values))
        return sorted_values[min(index, len(sorted_values) - 1)]

    def get_metric_trends(self, metric: SystemMetric, days: int = 7) -> Dict[str, Any]:
        """Get metric trends over a period"""
        end_time = timezone.now()
        start_time = end_time - timedelta(days=days)

        # Get daily aggregations
        daily_data = []
        for i in range(days):
            day_start = start_time + timedelta(days=i)
            day_end = day_start + timedelta(days=1)

            day_aggregation = self.aggregate_metrics(metric, day_start, day_end)
            daily_data.append({
                'date': day_start.date(),
                'value': day_aggregation['aggregated_value'],
                'count': day_aggregation['count']
            })

        return {
            'metric_name': metric.name,
            'period_days': days,
            'daily_data': daily_data,
            'trend': self._calculate_trend([d['value'] for d in daily_data if d['value'] is not None])
        }

    def _calculate_trend(self, values: List[float]) -> str:
        """Calculate trend direction from values"""
        if len(values) < 2:
            return 'STABLE'

        # Simple linear trend calculation
        first_half = values[:len(values)//2]
        second_half = values[len(values)//2:]

        first_avg = sum(first_half) / len(first_half)
        second_avg = sum(second_half) / len(second_half)

        change_percent = ((second_avg - first_avg) / first_avg) * 100 if first_avg != 0 else 0

        if change_percent > 5:
            return 'INCREASING'
        elif change_percent < -5:
            return 'DECREASING'
        else:
            return 'STABLE'
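A minimal sketch of using the aggregator above to summarise the last 24 hours of a metric and read its 7-day trend (illustrative only; the metric name is an assumption and a configured Django environment is required):

```python
# Illustrative sketch: aggregate a window of measurements and inspect the trend.
from datetime import timedelta
from django.utils import timezone
from monitoring.models import SystemMetric
from monitoring.services.metrics_collector import MetricsAggregator

metric = SystemMetric.objects.get(name="api_response_time")  # assumed metric name
aggregator = MetricsAggregator()

end = timezone.now()
window = aggregator.aggregate_metrics(metric, end - timedelta(hours=24), end)
print(window["count"], window["aggregated_value"], window["unit"])

trend = aggregator.get_metric_trends(metric, days=7)
print(trend["trend"])  # 'INCREASING', 'DECREASING' or 'STABLE'
```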
88
ETB-API/monitoring/signals.py
Normal file
@@ -0,0 +1,88 @@
"""
Signals for monitoring system
"""
import logging
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver
from django.utils import timezone

from monitoring.models import Alert, SystemStatus
from monitoring.services.alerting import AlertingService

logger = logging.getLogger(__name__)


@receiver(post_save, sender=Alert)
def alert_created_handler(sender, instance, created, **kwargs):
    """Handle alert creation"""
    if created:
        logger.info(f"New alert created: {instance.title} ({instance.severity})")

        # Send notifications for new alerts
        try:
            alerting_service = AlertingService()
            alert_data = {
                'rule_id': str(instance.rule.id),
                'title': instance.title,
                'description': instance.description,
                'severity': instance.severity,
                'current_value': float(instance.triggered_value) if instance.triggered_value else None,
                'threshold_value': float(instance.threshold_value) if instance.threshold_value else None
            }

            notification_results = alerting_service.notification_service.send_alert_notifications(alert_data)
            logger.info(f"Alert notifications sent: {notification_results}")

        except Exception as e:
            logger.error(f"Failed to send alert notifications: {e}")


@receiver(post_save, sender=SystemStatus)
def system_status_changed_handler(sender, instance, created, **kwargs):
    """Handle system status changes"""
    if created or instance.tracker.has_changed('status'):
        logger.info(f"System status changed to: {instance.status}")

        # Update system status in cache or external systems
        try:
            # This could trigger notifications to external systems
            # or update status pages
            pass
        except Exception as e:
            logger.error(f"Failed to update system status: {e}")


# Add tracker to SystemStatus model for change detection
from django.db import models


class SystemStatusTracker:
    """Track changes to SystemStatus model"""

    def __init__(self, instance):
        self.instance = instance
        self._initial_data = {}
        if instance.pk:
            self._initial_data = {
                'status': instance.status,
                'message': instance.message
            }

    def has_changed(self, field):
        """Check if a field has changed"""
        if not self.instance.pk:
            return True
        return getattr(self.instance, field) != self._initial_data.get(field)


# Monkey patch the SystemStatus model to add tracker
def add_tracker_to_system_status():
    """Add tracker to SystemStatus instances"""
    original_init = SystemStatus.__init__

    def new_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        self.tracker = SystemStatusTracker(self)

    SystemStatus.__init__ = new_init


# Call the function to add tracker
add_tracker_to_system_status()
319
ETB-API/monitoring/tasks.py
Normal file
319
ETB-API/monitoring/tasks.py
Normal file
@@ -0,0 +1,319 @@
|
||||
"""
|
||||
Celery tasks for automated monitoring
|
||||
"""
|
||||
import logging
|
||||
from celery import shared_task
|
||||
from django.utils import timezone
|
||||
from datetime import timedelta
|
||||
|
||||
from monitoring.services.health_checks import HealthCheckService
|
||||
from monitoring.services.metrics_collector import MetricsCollector
|
||||
from monitoring.services.alerting import AlertingService
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@shared_task(bind=True, max_retries=3)
|
||||
def execute_health_checks(self):
|
||||
"""Execute health checks for all monitoring targets"""
|
||||
try:
|
||||
logger.info("Starting health check execution")
|
||||
|
||||
health_service = HealthCheckService()
|
||||
results = health_service.execute_all_health_checks()
|
||||
|
||||
logger.info(f"Health checks completed. Results: {len(results)} targets checked")
|
||||
|
||||
return {
|
||||
'status': 'success',
|
||||
'targets_checked': len(results),
|
||||
'results': results
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Health check execution failed: {e}")
|
||||
|
||||
# Retry with exponential backoff
|
||||
if self.request.retries < self.max_retries:
|
||||
countdown = 2 ** self.request.retries
|
||||
logger.info(f"Retrying health checks in {countdown} seconds")
|
||||
raise self.retry(countdown=countdown)
|
||||
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': str(e)
|
||||
}
|
||||
|
||||
|
||||
@shared_task(bind=True, max_retries=3)
|
||||
def collect_metrics(self):
|
||||
"""Collect metrics from all configured sources"""
|
||||
try:
|
||||
logger.info("Starting metrics collection")
|
||||
|
||||
collector = MetricsCollector()
|
||||
results = collector.collect_all_metrics()
|
||||
|
||||
successful_metrics = len([r for r in results.values() if 'error' not in r])
|
||||
failed_metrics = len([r for r in results.values() if 'error' in r])
|
||||
|
||||
logger.info(f"Metrics collection completed. Success: {successful_metrics}, Failed: {failed_metrics}")
|
||||
|
||||
return {
|
||||
'status': 'success',
|
||||
'successful_metrics': successful_metrics,
|
||||
'failed_metrics': failed_metrics,
|
||||
'results': results
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Metrics collection failed: {e}")
|
||||
|
||||
# Retry with exponential backoff
|
||||
if self.request.retries < self.max_retries:
|
||||
countdown = 2 ** self.request.retries
|
||||
logger.info(f"Retrying metrics collection in {countdown} seconds")
|
||||
raise self.retry(countdown=countdown)
|
||||
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': str(e)
|
||||
}
|
||||
|
||||
|
||||
@shared_task(bind=True, max_retries=3)
|
||||
def evaluate_alerts(self):
|
||||
"""Evaluate alert rules and send notifications"""
|
||||
try:
|
||||
logger.info("Starting alert evaluation")
|
||||
|
||||
alerting_service = AlertingService()
|
||||
results = alerting_service.run_alert_evaluation()
|
||||
|
||||
logger.info(f"Alert evaluation completed. Triggered: {results['triggered_alerts']}, Notifications: {results['notifications_sent']}")
|
||||
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Alert evaluation failed: {e}")
|
||||
|
||||
# Retry with exponential backoff
|
||||
if self.request.retries < self.max_retries:
|
||||
countdown = 2 ** self.request.retries
|
||||
logger.info(f"Retrying alert evaluation in {countdown} seconds")
|
||||
raise self.retry(countdown=countdown)
|
||||
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': str(e)
|
||||
}
|
||||
|
||||
|
||||
@shared_task(bind=True, max_retries=3)
|
||||
def cleanup_old_data(self):
|
||||
"""Clean up old monitoring data"""
|
||||
try:
|
||||
logger.info("Starting data cleanup")
|
||||
|
||||
from monitoring.models import HealthCheck, MetricMeasurement, Alert
|
||||
|
||||
# Clean up old health checks (keep last 7 days)
|
||||
cutoff_date = timezone.now() - timedelta(days=7)
|
||||
old_health_checks = HealthCheck.objects.filter(checked_at__lt=cutoff_date)
|
||||
health_checks_deleted = old_health_checks.count()
|
||||
old_health_checks.delete()
|
||||
|
||||
# Clean up old metric measurements (keep last 90 days)
|
||||
cutoff_date = timezone.now() - timedelta(days=90)
|
||||
old_measurements = MetricMeasurement.objects.filter(timestamp__lt=cutoff_date)
|
||||
measurements_deleted = old_measurements.count()
|
||||
old_measurements.delete()
|
||||
|
||||
# Clean up resolved alerts older than 30 days
|
||||
cutoff_date = timezone.now() - timedelta(days=30)
|
||||
old_alerts = Alert.objects.filter(
|
            status='RESOLVED',
            resolved_at__lt=cutoff_date
        )
        alerts_deleted = old_alerts.count()
        old_alerts.delete()

        logger.info(f"Data cleanup completed. Health checks: {health_checks_deleted}, Measurements: {measurements_deleted}, Alerts: {alerts_deleted}")

        return {
            'status': 'success',
            'health_checks_deleted': health_checks_deleted,
            'measurements_deleted': measurements_deleted,
            'alerts_deleted': alerts_deleted
        }

    except Exception as e:
        logger.error(f"Data cleanup failed: {e}")

        # Retry with exponential backoff
        if self.request.retries < self.max_retries:
            countdown = 2 ** self.request.retries
            logger.info(f"Retrying data cleanup in {countdown} seconds")
            raise self.retry(countdown=countdown)

        return {
            'status': 'error',
            'error': str(e)
        }


@shared_task(bind=True, max_retries=3)
def generate_system_status_report(self):
    """Generate system status report"""
    try:
        logger.info("Generating system status report")

        from monitoring.models import SystemStatus
        from monitoring.services.health_checks import HealthCheckService

        health_service = HealthCheckService()
        health_summary = health_service.get_system_health_summary()

        # Determine overall system status
        if health_summary['critical_targets'] > 0:
            status = 'MAJOR_OUTAGE'
            message = f"Critical issues detected in {health_summary['critical_targets']} systems"
        elif health_summary['warning_targets'] > 0:
            status = 'DEGRADED'
            message = f"Performance issues detected in {health_summary['warning_targets']} systems"
        else:
            status = 'OPERATIONAL'
            message = "All systems operational"

        # Create system status record
        system_status = SystemStatus.objects.create(
            status=status,
            message=message,
            affected_services=[]  # Would be populated based on actual issues
        )

        logger.info(f"System status report generated: {status}")

        return {
            'status': 'success',
            'system_status': status,
            'message': message,
            'health_summary': health_summary
        }

    except Exception as e:
        logger.error(f"System status report generation failed: {e}")

        # Retry with exponential backoff
        if self.request.retries < self.max_retries:
            countdown = 2 ** self.request.retries
            logger.info(f"Retrying system status report in {countdown} seconds")
            raise self.retry(countdown=countdown)

        return {
            'status': 'error',
            'error': str(e)
        }


@shared_task(bind=True, max_retries=3)
def monitor_external_integrations(self):
    """Monitor external integrations and services"""
    try:
        logger.info("Starting external integrations monitoring")

        from monitoring.models import MonitoringTarget
        from monitoring.services.health_checks import HealthCheckService

        health_service = HealthCheckService()

        # Get external integration targets
        external_targets = MonitoringTarget.objects.filter(
            target_type='EXTERNAL_API',
            status='ACTIVE'
        )

        results = {}
        for target in external_targets:
            try:
                result = health_service.execute_health_check(target, 'HTTP')
                results[target.name] = result

                # Log integration status
                if result['status'] == 'CRITICAL':
                    logger.warning(f"External integration {target.name} is critical")
                elif result['status'] == 'WARNING':
                    logger.info(f"External integration {target.name} has warnings")

            except Exception as e:
                logger.error(f"Failed to check external integration {target.name}: {e}")
                results[target.name] = {'status': 'CRITICAL', 'error': str(e)}

        logger.info(f"External integrations monitoring completed. Checked: {len(results)} integrations")

        return {
            'status': 'success',
            'integrations_checked': len(results),
            'results': results
        }

    except Exception as e:
        logger.error(f"External integrations monitoring failed: {e}")

        # Retry with exponential backoff
        if self.request.retries < self.max_retries:
            countdown = 2 ** self.request.retries
            logger.info(f"Retrying external integrations monitoring in {countdown} seconds")
            raise self.retry(countdown=countdown)

        return {
            'status': 'error',
            'error': str(e)
        }


@shared_task(bind=True, max_retries=3)
def update_monitoring_dashboards(self):
    """Update monitoring dashboards with latest data"""
    try:
        logger.info("Updating monitoring dashboards")

        from monitoring.models import MonitoringDashboard
        from monitoring.services.metrics_collector import MetricsAggregator

        aggregator = MetricsAggregator()

        # Get active dashboards
        active_dashboards = MonitoringDashboard.objects.filter(is_active=True)

        updated_dashboards = 0
        for dashboard in active_dashboards:
            try:
                # Update dashboard data (this would typically involve caching or real-time updates)
                # For now, just log the update
                logger.info(f"Updating dashboard: {dashboard.name}")
                updated_dashboards += 1

            except Exception as e:
                logger.error(f"Failed to update dashboard {dashboard.name}: {e}")

        logger.info(f"Dashboard updates completed. Updated: {updated_dashboards} dashboards")

        return {
            'status': 'success',
            'dashboards_updated': updated_dashboards
        }

    except Exception as e:
        logger.error(f"Dashboard update failed: {e}")

        # Retry with exponential backoff
        if self.request.retries < self.max_retries:
            countdown = 2 ** self.request.retries
            logger.info(f"Retrying dashboard update in {countdown} seconds")
            raise self.retry(countdown=countdown)

        return {
            'status': 'error',
            'error': str(e)
        }
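These tasks are written to be retried with exponential backoff, but this part of the diff does not show how they are scheduled. A minimal sketch of how they might be wired to Celery beat is shown below; the entry names and intervals are assumptions, while the dotted task paths follow from the functions defined in `monitoring/tasks.py`.

```python
# Sketch only: a possible Celery beat schedule for the periodic tasks above.
# Intervals and entry names are assumptions, not values from this commit.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'generate-system-status-report': {
        'task': 'monitoring.tasks.generate_system_status_report',
        'schedule': crontab(minute='*/15'),  # assumed: every 15 minutes
    },
    'monitor-external-integrations': {
        'task': 'monitoring.tasks.monitor_external_integrations',
        'schedule': crontab(minute='*/5'),   # assumed: every 5 minutes
    },
    'update-monitoring-dashboards': {
        'task': 'monitoring.tasks.update_monitoring_dashboards',
        'schedule': crontab(minute='*/10'),  # assumed: every 10 minutes
    },
}
```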
30
ETB-API/monitoring/urls.py
Normal file
@@ -0,0 +1,30 @@
"""
|
||||
URL configuration for monitoring app
|
||||
"""
|
||||
from django.urls import path, include
|
||||
from rest_framework.routers import DefaultRouter
|
||||
|
||||
from monitoring.views import (
|
||||
MonitoringTargetViewSet, HealthCheckViewSet, SystemMetricViewSet,
|
||||
MetricMeasurementViewSet, AlertRuleViewSet, AlertViewSet,
|
||||
MonitoringDashboardViewSet, SystemStatusViewSet, SystemOverviewView,
|
||||
MonitoringTasksView
|
||||
)
|
||||
|
||||
router = DefaultRouter()
|
||||
router.register(r'targets', MonitoringTargetViewSet)
|
||||
router.register(r'health-checks', HealthCheckViewSet)
|
||||
router.register(r'metrics', SystemMetricViewSet)
|
||||
router.register(r'measurements', MetricMeasurementViewSet)
|
||||
router.register(r'alert-rules', AlertRuleViewSet)
|
||||
router.register(r'alerts', AlertViewSet)
|
||||
router.register(r'dashboards', MonitoringDashboardViewSet)
|
||||
router.register(r'status', SystemStatusViewSet)
|
||||
|
||||
app_name = 'monitoring'
|
||||
|
||||
urlpatterns = [
|
||||
path('', include(router.urls)),
|
||||
path('overview/', SystemOverviewView.as_view(), name='system-overview'),
|
||||
path('tasks/', MonitoringTasksView.as_view(), name='monitoring-tasks'),
|
||||
]
|
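The router only defines paths relative to the app; how they end up under a project-wide API prefix depends on the root urlconf, which is not part of this file. One plausible wiring, with the project routes assumed, would be:

```python
# Sketch of mounting the monitoring app in the project urls.py
# (the surrounding project layout is an assumption, not part of this commit).
from django.urls import path, include

urlpatterns = [
    # ... other project routes ...
    path('api/monitoring/', include('monitoring.urls')),
]
```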
480
ETB-API/monitoring/views.py
Normal file
@@ -0,0 +1,480 @@
"""
|
||||
Views for monitoring system
|
||||
"""
|
||||
import logging
|
||||
from rest_framework import viewsets, status, permissions
|
||||
from rest_framework.decorators import action
|
||||
from rest_framework.response import Response
|
||||
from rest_framework.views import APIView
|
||||
from django_filters.rest_framework import DjangoFilterBackend
|
||||
from rest_framework.filters import SearchFilter, OrderingFilter
|
||||
from django.utils import timezone
|
||||
from datetime import timedelta
|
||||
|
||||
from monitoring.models import (
|
||||
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
|
||||
AlertRule, Alert, MonitoringDashboard, SystemStatus
|
||||
)
|
||||
from monitoring.serializers import (
|
||||
MonitoringTargetSerializer, HealthCheckSerializer, SystemMetricSerializer,
|
||||
MetricMeasurementSerializer, AlertRuleSerializer, AlertSerializer,
|
||||
MonitoringDashboardSerializer, SystemStatusSerializer,
|
||||
HealthCheckSummarySerializer, MetricTrendSerializer, AlertSummarySerializer,
|
||||
SystemOverviewSerializer
|
||||
)
|
||||
from monitoring.services.health_checks import HealthCheckService
|
||||
from monitoring.services.metrics_collector import MetricsCollector, MetricsAggregator
|
||||
from monitoring.services.alerting import AlertingService
|
||||
from monitoring.tasks import (
|
||||
execute_health_checks, collect_metrics, evaluate_alerts,
|
||||
generate_system_status_report
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class MonitoringTargetViewSet(viewsets.ModelViewSet):
    """ViewSet for MonitoringTarget model"""

    queryset = MonitoringTarget.objects.all()
    serializer_class = MonitoringTargetSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['target_type', 'status', 'last_status', 'related_module']
    search_fields = ['name', 'description']
    ordering_fields = ['name', 'created_at', 'last_checked']
    ordering = ['name']

    def perform_create(self, serializer):
        """Set the creator when creating a monitoring target"""
        serializer.save(created_by=self.request.user)

    @action(detail=True, methods=['post'])
    def test_connection(self, request, pk=None):
        """Test connection to monitoring target"""
        target = self.get_object()

        try:
            health_service = HealthCheckService()
            result = health_service.execute_health_check(target, 'HTTP')

            return Response({
                'status': 'success',
                'result': result
            })
        except Exception as e:
            return Response({
                'status': 'error',
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)

    @action(detail=True, methods=['post'])
    def enable_monitoring(self, request, pk=None):
        """Enable monitoring for a target"""
        target = self.get_object()
        target.status = 'ACTIVE'
        target.save()

        return Response({
            'status': 'success',
            'message': f'Monitoring enabled for {target.name}'
        })

    @action(detail=True, methods=['post'])
    def disable_monitoring(self, request, pk=None):
        """Disable monitoring for a target"""
        target = self.get_object()
        target.status = 'INACTIVE'
        target.save()

        return Response({
            'status': 'success',
            'message': f'Monitoring disabled for {target.name}'
        })
class HealthCheckViewSet(viewsets.ReadOnlyModelViewSet):
    """ViewSet for HealthCheck model (read-only)"""

    queryset = HealthCheck.objects.all()
    serializer_class = HealthCheckSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['target', 'check_type', 'status']
    ordering_fields = ['checked_at', 'response_time_ms']
    ordering = ['-checked_at']

    @action(detail=False, methods=['get'])
    def summary(self, request):
        """Get health check summary"""
        try:
            health_service = HealthCheckService()
            summary = health_service.get_system_health_summary()

            serializer = HealthCheckSummarySerializer(summary)
            return Response(serializer.data)
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)

    @action(detail=False, methods=['post'])
    def run_all_checks(self, request):
        """Run health checks for all targets"""
        try:
            # Execute health checks asynchronously
            task = execute_health_checks.delay()

            return Response({
                'status': 'success',
                'message': 'Health checks started',
                'task_id': task.id
            })
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class SystemMetricViewSet(viewsets.ModelViewSet):
    """ViewSet for SystemMetric model"""

    queryset = SystemMetric.objects.all()
    serializer_class = SystemMetricSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['metric_type', 'category', 'is_active', 'related_module']
    search_fields = ['name', 'description']
    ordering_fields = ['name', 'created_at']
    ordering = ['name']

    def perform_create(self, serializer):
        """Set the creator when creating a metric"""
        serializer.save(created_by=self.request.user)

    @action(detail=True, methods=['get'])
    def measurements(self, request, pk=None):
        """Get measurements for a metric"""
        metric = self.get_object()

        # Get query parameters
        hours = int(request.query_params.get('hours', 24))
        limit = int(request.query_params.get('limit', 100))

        since = timezone.now() - timedelta(hours=hours)
        measurements = MetricMeasurement.objects.filter(
            metric=metric,
            timestamp__gte=since
        ).order_by('-timestamp')[:limit]

        serializer = MetricMeasurementSerializer(measurements, many=True)
        return Response(serializer.data)

    @action(detail=True, methods=['get'])
    def trends(self, request, pk=None):
        """Get metric trends"""
        metric = self.get_object()
        days = int(request.query_params.get('days', 7))

        try:
            aggregator = MetricsAggregator()
            trends = aggregator.get_metric_trends(metric, days)

            serializer = MetricTrendSerializer(trends)
            return Response(serializer.data)
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MetricMeasurementViewSet(viewsets.ReadOnlyModelViewSet):
    """ViewSet for MetricMeasurement model (read-only)"""

    queryset = MetricMeasurement.objects.all()
    serializer_class = MetricMeasurementSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, OrderingFilter]
    filterset_fields = ['metric']
    ordering_fields = ['timestamp', 'value']
    ordering = ['-timestamp']
class AlertRuleViewSet(viewsets.ModelViewSet):
    """ViewSet for AlertRule model"""

    queryset = AlertRule.objects.all()
    serializer_class = AlertRuleSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['alert_type', 'severity', 'status', 'is_enabled']
    search_fields = ['name', 'description']
    ordering_fields = ['name', 'created_at']
    ordering = ['name']

    def perform_create(self, serializer):
        """Set the creator when creating an alert rule"""
        serializer.save(created_by=self.request.user)

    @action(detail=True, methods=['post'])
    def test_rule(self, request, pk=None):
        """Test an alert rule"""
        rule = self.get_object()

        try:
            alerting_service = AlertingService()
            # This would test the rule without creating an alert
            return Response({
                'status': 'success',
                'message': f'Alert rule {rule.name} test completed'
            })
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)

    @action(detail=True, methods=['post'])
    def enable_rule(self, request, pk=None):
        """Enable an alert rule"""
        rule = self.get_object()
        rule.is_enabled = True
        rule.save()

        return Response({
            'status': 'success',
            'message': f'Alert rule {rule.name} enabled'
        })

    @action(detail=True, methods=['post'])
    def disable_rule(self, request, pk=None):
        """Disable an alert rule"""
        rule = self.get_object()
        rule.is_enabled = False
        rule.save()

        return Response({
            'status': 'success',
            'message': f'Alert rule {rule.name} disabled'
        })
class AlertViewSet(viewsets.ModelViewSet):
    """ViewSet for Alert model"""

    queryset = Alert.objects.all()
    serializer_class = AlertSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['rule', 'severity', 'status']
    search_fields = ['title', 'description']
    ordering_fields = ['triggered_at', 'severity']
    ordering = ['-triggered_at']

    @action(detail=True, methods=['post'])
    def acknowledge(self, request, pk=None):
        """Acknowledge an alert"""
        alert = self.get_object()

        try:
            alerting_service = AlertingService()
            result = alerting_service.acknowledge_alert(str(alert.id), request.user)

            return Response(result)
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)

    @action(detail=True, methods=['post'])
    def resolve(self, request, pk=None):
        """Resolve an alert"""
        alert = self.get_object()

        try:
            alerting_service = AlertingService()
            result = alerting_service.resolve_alert(str(alert.id), request.user)

            return Response(result)
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)

    @action(detail=False, methods=['get'])
    def summary(self, request):
        """Get alert summary"""
        try:
            alerting_service = AlertingService()
            active_alerts = alerting_service.get_active_alerts()

            # Calculate summary
            total_alerts = Alert.objects.count()
            critical_alerts = Alert.objects.filter(severity='CRITICAL', status='TRIGGERED').count()
            high_alerts = Alert.objects.filter(severity='HIGH', status='TRIGGERED').count()
            medium_alerts = Alert.objects.filter(severity='MEDIUM', status='TRIGGERED').count()
            low_alerts = Alert.objects.filter(severity='LOW', status='TRIGGERED').count()
            acknowledged_alerts = Alert.objects.filter(status='ACKNOWLEDGED').count()
            resolved_alerts = Alert.objects.filter(status='RESOLVED').count()

            summary = {
                'total_alerts': total_alerts,
                'critical_alerts': critical_alerts,
                'high_alerts': high_alerts,
                'medium_alerts': medium_alerts,
                'low_alerts': low_alerts,
                'acknowledged_alerts': acknowledged_alerts,
                'resolved_alerts': resolved_alerts
            }

            serializer = AlertSummarySerializer(summary)
            return Response(serializer.data)
        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringDashboardViewSet(viewsets.ModelViewSet):
    """ViewSet for MonitoringDashboard model"""

    queryset = MonitoringDashboard.objects.all()
    serializer_class = MonitoringDashboardSerializer
    permission_classes = [permissions.IsAuthenticated]
    filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
    filterset_fields = ['dashboard_type', 'is_active', 'is_public']
    search_fields = ['name', 'description']
    ordering_fields = ['name', 'created_at']
    ordering = ['name']

    def perform_create(self, serializer):
        """Set the creator when creating a dashboard"""
        serializer.save(created_by=self.request.user)

    def get_queryset(self):
        """Filter dashboards based on user access"""
        queryset = super().get_queryset()

        if not self.request.user.is_staff:
            # Non-staff users can only see public dashboards or dashboards they have access to
            queryset = queryset.filter(
                models.Q(is_public=True) |
                models.Q(allowed_users=self.request.user)
            ).distinct()

        return queryset
class SystemStatusViewSet(viewsets.ReadOnlyModelViewSet):
    """ViewSet for SystemStatus model (read-only)"""

    queryset = SystemStatus.objects.all()
    serializer_class = SystemStatusSerializer
    permission_classes = [permissions.IsAuthenticated]
    ordering = ['-started_at']
class SystemOverviewView(APIView):
    """System overview endpoint"""

    permission_classes = [permissions.IsAuthenticated]

    def get(self, request):
        """Get system overview"""
        try:
            # Get current system status
            current_status = SystemStatus.objects.filter(
                resolved_at__isnull=True
            ).order_by('-started_at').first()

            if not current_status:
                # Create default operational status
                current_status = SystemStatus.objects.create(
                    status='OPERATIONAL',
                    message='All systems operational',
                    created_by=request.user
                )

            # Get health summary
            health_service = HealthCheckService()
            health_summary = health_service.get_system_health_summary()

            # Get alert summary
            alerting_service = AlertingService()
            active_alerts = alerting_service.get_active_alerts()

            alert_summary = {
                'total_alerts': len(active_alerts),
                'critical_alerts': len([a for a in active_alerts if a['severity'] == 'CRITICAL']),
                'high_alerts': len([a for a in active_alerts if a['severity'] == 'HIGH']),
                'medium_alerts': len([a for a in active_alerts if a['severity'] == 'MEDIUM']),
                'low_alerts': len([a for a in active_alerts if a['severity'] == 'LOW']),
                'acknowledged_alerts': 0,  # Would be calculated from database
                'resolved_alerts': 0  # Would be calculated from database
            }

            # Get recent incidents (mock data for now)
            recent_incidents = []

            # Get top metrics (mock data for now)
            top_metrics = []

            # Get system resources
            import psutil
            system_resources = {
                'cpu_percent': psutil.cpu_percent(interval=1),
                'memory_percent': psutil.virtual_memory().percent,
                'disk_percent': psutil.disk_usage('/').percent
            }

            overview = {
                'system_status': current_status,
                'health_summary': health_summary,
                'alert_summary': alert_summary,
                'recent_incidents': recent_incidents,
                'top_metrics': top_metrics,
                'system_resources': system_resources
            }

            serializer = SystemOverviewSerializer(overview)
            return Response(serializer.data)

        except Exception as e:
            logger.error(f"Failed to get system overview: {e}")
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringTasksView(APIView):
    """Monitoring tasks management"""

    permission_classes = [permissions.IsAuthenticated]

    def post(self, request):
        """Execute monitoring tasks"""
        task_type = request.data.get('task_type')

        try:
            if task_type == 'health_checks':
                task = execute_health_checks.delay()
            elif task_type == 'metrics_collection':
                task = collect_metrics.delay()
            elif task_type == 'alert_evaluation':
                task = evaluate_alerts.delay()
            elif task_type == 'system_status_report':
                task = generate_system_status_report.delay()
            else:
                return Response({
                    'error': 'Invalid task type'
                }, status=status.HTTP_400_BAD_REQUEST)

            return Response({
                'status': 'success',
                'message': f'{task_type} task started',
                'task_id': task.id
            })

        except Exception as e:
            return Response({
                'error': str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
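`MonitoringTasksView` accepts a `task_type` of `health_checks`, `metrics_collection`, `alert_evaluation`, or `system_status_report` and dispatches the matching Celery task. A minimal client sketch follows; the host, URL prefix, and token are placeholders assumed for illustration.

```python
# Sketch: triggering a monitoring task through the tasks endpoint.
# Host, API prefix, and token are assumptions, not values from this commit.
import requests

response = requests.post(
    "http://localhost:8000/api/monitoring/tasks/",
    headers={"Authorization": "Token <api-token>"},
    json={"task_type": "metrics_collection"},
    timeout=10,
)
print(response.status_code, response.json())
# Expected success shape: {"status": "success", "message": "metrics_collection task started", "task_id": "..."}
```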