Iliyan Angelov
2025-09-19 11:58:53 +03:00
parent 306b20e24a
commit 6b247e5b9f
11423 changed files with 1500615 additions and 778 deletions


@@ -0,0 +1,459 @@
# ETB-API Monitoring System Documentation
## Overview
The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.
## Features
### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity
### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR (mean time to resolve), MTTA (mean time to acknowledge), SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates
### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, and webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts
### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards
## API Endpoints
### Base URL
```
http://localhost:8000/api/monitoring/
```
### Authentication
All endpoints require authentication using Django REST Framework token authentication.
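For example, with the `requests` library (the token value is a placeholder, and the health summary endpoint used here is documented below):
```python
import requests

BASE_URL = "http://localhost:8000/api/monitoring"
HEADERS = {"Authorization": "Token your-token-here"}

response = requests.get(f"{BASE_URL}/health-checks/summary/", headers=HEADERS, timeout=10)
response.raise_for_status()
print(response.json()["overall_status"])
```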
### Health Checks
#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
"overall_status": "HEALTHY",
"total_targets": 12,
"healthy_targets": 11,
"warning_targets": 1,
"critical_targets": 0,
"health_percentage": 91.67,
"last_updated": "2024-01-15T10:30:00Z"
}
```
#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```
**Response:**
```json
{
"status": "success",
"message": "Health checks started",
"task_id": "celery-task-id"
}
```
#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```
### Metrics
#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```
#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```
**Response:**
```json
{
"metric_name": "API Response Time",
"period_days": 7,
"daily_data": [
{
"date": "2024-01-08",
"value": 150.5,
"count": 1440
}
],
"trend": "STABLE"
}
```
### Alerts
#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
"total_alerts": 25,
"critical_alerts": 2,
"high_alerts": 5,
"medium_alerts": 8,
"low_alerts": 10,
"acknowledged_alerts": 15,
"resolved_alerts": 20
}
```
#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```
#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```
### System Overview
#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```
**Response:**
```json
{
"system_status": {
"status": "OPERATIONAL",
"message": "All systems operational",
"started_at": "2024-01-15T09:00:00Z"
},
"health_summary": {
"overall_status": "HEALTHY",
"total_targets": 12,
"healthy_targets": 12,
"health_percentage": 100.0
},
"alert_summary": {
"total_alerts": 0,
"critical_alerts": 0
},
"system_resources": {
"cpu_percent": 45.2,
"memory_percent": 67.8,
"disk_percent": 34.5
}
}
```
### Monitoring Tasks
#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json
{
"task_type": "health_checks"
}
```
**Available task types** (a Python example follows the list):
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report
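For example, triggering metrics collection from Python (token placeholder as above):
```python
import requests

response = requests.post(
    "http://localhost:8000/api/monitoring/tasks/",
    headers={"Authorization": "Token your-token-here"},
    json={"task_type": "metrics_collection"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```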
## Data Models
### MonitoringTarget
Represents a system, service, or component to monitor.
**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled
### SystemMetric
Defines metrics to collect and monitor.
**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold
### AlertRule
Defines alert conditions and notifications.
**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled
### Alert
Represents triggered alerts. An ORM sketch using these models follows the field list below.
**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded
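A minimal sketch tying these models together via the Django ORM (names, URL, and recipient address are illustrative; choice values must match your configured schema):
```python
from monitoring.models import AlertRule, MonitoringTarget, SystemMetric

target = MonitoringTarget.objects.create(
    name="Payments API",                              # illustrative target
    description="Payments service health",
    target_type="APPLICATION",
    endpoint_url="https://payments.internal/health/",  # hypothetical endpoint
    health_check_enabled=True,
)

metric = SystemMetric.objects.get(name="API Response Time")

AlertRule.objects.create(
    name="Payments latency",
    description="Alert when payment API latency is high",
    alert_type="THRESHOLD",
    severity="HIGH",
    condition={"type": "THRESHOLD", "operator": ">", "threshold": 2000},
    metric=metric,
    target=target,
    notification_channels=[{"type": "EMAIL", "recipients": ["oncall@example.com"]}],
    is_enabled=True,
)
```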
## Configuration
### Environment Variables
```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60
# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts
# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```
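These variables are not read automatically; a minimal sketch of wiring a few of them into Django settings (the parsing choices are assumptions):
```python
# settings.py (sketch) -- parse the monitoring environment variables
import os

MONITORING_ENABLED = os.getenv("MONITORING_ENABLED", "false").lower() == "true"
MONITORING_HEALTH_CHECK_INTERVAL = int(os.getenv("MONITORING_HEALTH_CHECK_INTERVAL", "60"))
PERFORMANCE_API_RESPONSE_THRESHOLD = int(os.getenv("PERFORMANCE_API_RESPONSE_THRESHOLD", "2000"))
ALERTING_SLACK_WEBHOOK_URL = os.getenv("ALERTING_SLACK_WEBHOOK_URL", "")
```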
### Celery Configuration
Add to your Celery configuration:
```python
from celery.schedules import crontab
CELERY_BEAT_SCHEDULE = {
'health-checks': {
'task': 'monitoring.tasks.execute_health_checks',
'schedule': 60.0, # Every minute
},
'metrics-collection': {
'task': 'monitoring.tasks.collect_metrics',
'schedule': 300.0, # Every 5 minutes
},
'alert-evaluation': {
'task': 'monitoring.tasks.evaluate_alerts',
'schedule': 60.0, # Every minute
},
'data-cleanup': {
'task': 'monitoring.tasks.cleanup_old_data',
'schedule': crontab(hour=2, minute=0), # Daily at 2 AM
},
}
```
## Setup Instructions
### 1. Install Dependencies
Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```
### 2. Run Migrations
```bash
python manage.py makemigrations monitoring
python manage.py migrate
```
### 3. Set Up Initial Configuration
```bash
python manage.py setup_monitoring --admin-user admin
```
### 4. Start Celery Workers
```bash
celery -A core worker -l info
celery -A core beat -l info
```
### 5. Access Monitoring
- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`
## Monitoring Best Practices
### 1. Health Checks
- Set check intervals that catch failures quickly without flooding targets
- Use timeouts to prevent hanging checks (see the sketch after this list)
- Monitor dependencies and external services
- Implement graceful degradation
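A minimal sketch of a timeout-and-retry health check (the URL, timeout, and retry count are assumptions):
```python
import requests

def check_endpoint(url: str, timeout: float = 5.0, retries: int = 3) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    for _ in range(retries):
        try:
            # The timeout keeps a dead endpoint from hanging the whole check run.
            response = requests.get(url, timeout=timeout)
            return response.status_code == 200
        except requests.RequestException:
            continue  # transient failure: retry
    return False
```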
### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics
### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds
### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable
## Troubleshooting
### Common Issues
1. **Health Checks Failing**
- Check network connectivity
- Verify endpoint URLs
- Check authentication credentials
- Review timeout settings
2. **Metrics Not Collecting**
- Verify Celery workers are running
- Check metric configuration
- Review collection intervals
- Check for errors in logs
3. **Alerts Not Triggering**
- Verify alert rules are enabled
- Check threshold values
- Review notification channel configuration
- Check alert evaluation task is running
4. **Performance Issues**
- Monitor system resources
- Check database query performance
- Review metric retention settings
- Optimize collection intervals
### Debug Commands
```bash
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()
# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')
# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```
## Integration with Other Modules
### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies
### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes
### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures
### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches
### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections
## Future Enhancements
### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting
### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps


@@ -0,0 +1 @@
# Monitoring module for ETB-API system

Binary file not shown.

ETB-API/monitoring/admin.py Normal file

@@ -0,0 +1,289 @@
"""
Admin configuration for monitoring models
"""
from django.contrib import admin
from django.utils.html import format_html
from django.urls import reverse
from django.utils import timezone
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
@admin.register(MonitoringTarget)
class MonitoringTargetAdmin(admin.ModelAdmin):
"""Admin for MonitoringTarget model"""
list_display = [
'name', 'target_type', 'status', 'last_status', 'last_checked',
'health_check_enabled', 'related_module', 'created_at'
]
list_filter = ['target_type', 'status', 'last_status', 'health_check_enabled', 'related_module']
search_fields = ['name', 'description', 'endpoint_url']
readonly_fields = ['id', 'created_at', 'updated_at', 'last_checked']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'target_type', 'related_module')
}),
('Connection Details', {
'fields': ('endpoint_url', 'connection_config')
}),
('Monitoring Configuration', {
'fields': (
'check_interval_seconds', 'timeout_seconds', 'retry_count',
'health_check_enabled', 'health_check_endpoint', 'expected_status_codes'
)
}),
('Status', {
'fields': ('status', 'last_checked', 'last_status')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(HealthCheck)
class HealthCheckAdmin(admin.ModelAdmin):
"""Admin for HealthCheck model"""
list_display = [
'target_name', 'check_type', 'status', 'response_time_ms',
'status_code', 'checked_at'
]
list_filter = ['check_type', 'status', 'target__target_type']
search_fields = ['target__name', 'error_message']
readonly_fields = ['id', 'checked_at']
date_hierarchy = 'checked_at'
def target_name(self, obj):
return obj.target.name
target_name.short_description = 'Target'
def get_queryset(self, request):
return super().get_queryset(request).select_related('target')
@admin.register(SystemMetric)
class SystemMetricAdmin(admin.ModelAdmin):
"""Admin for SystemMetric model"""
list_display = [
'name', 'metric_type', 'category', 'unit', 'is_active',
'is_system_metric', 'related_module', 'created_at'
]
list_filter = ['metric_type', 'category', 'is_active', 'is_system_metric', 'related_module']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'metric_type', 'category', 'unit')
}),
('Configuration', {
'fields': (
'aggregation_method', 'collection_interval_seconds', 'retention_days',
'warning_threshold', 'critical_threshold'
)
}),
('Status', {
'fields': ('is_active', 'is_system_metric', 'related_module')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(MetricMeasurement)
class MetricMeasurementAdmin(admin.ModelAdmin):
"""Admin for MetricMeasurement model"""
list_display = [
'metric_name', 'value', 'unit', 'timestamp'
]
list_filter = ['metric__metric_type', 'metric__category', 'timestamp']
search_fields = ['metric__name']
readonly_fields = ['id', 'timestamp']
date_hierarchy = 'timestamp'
def metric_name(self, obj):
return obj.metric.name
metric_name.short_description = 'Metric'
def unit(self, obj):
return obj.metric.unit
unit.short_description = 'Unit'
def get_queryset(self, request):
return super().get_queryset(request).select_related('metric')
@admin.register(AlertRule)
class AlertRuleAdmin(admin.ModelAdmin):
"""Admin for AlertRule model"""
list_display = [
'name', 'alert_type', 'severity', 'status', 'is_enabled',
'metric_name', 'target_name', 'created_at'
]
list_filter = ['alert_type', 'severity', 'status', 'is_enabled']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'alert_type', 'severity')
}),
('Rule Configuration', {
'fields': ('condition', 'evaluation_interval_seconds')
}),
('Related Objects', {
'fields': ('metric', 'target')
}),
('Notifications', {
'fields': ('notification_channels', 'notification_template')
}),
('Status', {
'fields': ('status', 'is_enabled')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def metric_name(self, obj):
return obj.metric.name if obj.metric else '-'
metric_name.short_description = 'Metric'
def target_name(self, obj):
return obj.target.name if obj.target else '-'
target_name.short_description = 'Target'
def get_queryset(self, request):
return super().get_queryset(request).select_related('metric', 'target', 'created_by')
@admin.register(Alert)
class AlertAdmin(admin.ModelAdmin):
"""Admin for Alert model"""
list_display = [
'title', 'severity', 'status', 'rule_name', 'triggered_value',
'threshold_value', 'triggered_at', 'acknowledged_by', 'resolved_by'
]
list_filter = ['severity', 'status', 'rule__alert_type', 'triggered_at']
search_fields = ['title', 'description', 'rule__name']
readonly_fields = ['id', 'triggered_at']
date_hierarchy = 'triggered_at'
fieldsets = (
('Alert Information', {
'fields': ('id', 'rule', 'title', 'description', 'severity', 'status')
}),
('Values', {
'fields': ('triggered_value', 'threshold_value', 'context_data')
}),
('Timestamps', {
'fields': ('triggered_at', 'acknowledged_at', 'resolved_at')
}),
('Assignment', {
'fields': ('acknowledged_by', 'resolved_by')
})
)
def rule_name(self, obj):
return obj.rule.name
rule_name.short_description = 'Rule'
def get_queryset(self, request):
return super().get_queryset(request).select_related(
'rule', 'acknowledged_by', 'resolved_by'
)
@admin.register(MonitoringDashboard)
class MonitoringDashboardAdmin(admin.ModelAdmin):
"""Admin for MonitoringDashboard model"""
list_display = [
'name', 'dashboard_type', 'is_active', 'is_public',
'auto_refresh_enabled', 'created_by', 'created_at'
]
list_filter = ['dashboard_type', 'is_active', 'is_public', 'auto_refresh_enabled']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
filter_horizontal = ['allowed_users']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'dashboard_type')
}),
('Configuration', {
'fields': ('layout_config', 'widget_configs')
}),
('Access Control', {
'fields': ('is_public', 'allowed_users', 'allowed_roles')
}),
('Refresh Settings', {
'fields': ('auto_refresh_enabled', 'refresh_interval_seconds')
}),
('Status', {
'fields': ('is_active',)
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(SystemStatus)
class SystemStatusAdmin(admin.ModelAdmin):
"""Admin for SystemStatus model"""
list_display = [
'status', 'message', 'started_at', 'resolved_at', 'is_resolved',
'created_by'
]
list_filter = ['status', 'started_at', 'resolved_at']
search_fields = ['message', 'affected_services']
readonly_fields = ['id', 'started_at', 'updated_at', 'is_resolved']
date_hierarchy = 'started_at'
fieldsets = (
('Status Information', {
'fields': ('id', 'status', 'message', 'affected_services')
}),
('Timeline', {
'fields': ('started_at', 'updated_at', 'resolved_at', 'estimated_resolution')
}),
('Metadata', {
'fields': ('created_by', 'is_resolved'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
# Custom admin site configuration
admin.site.site_header = "ETB-API Monitoring Administration"
admin.site.site_title = "ETB-API Monitoring"
admin.site.index_title = "Monitoring System Administration"


@@ -0,0 +1,12 @@
from django.apps import AppConfig
class MonitoringConfig(AppConfig):
default_auto_field = 'django.db.models.BigAutoField'
name = 'monitoring'
verbose_name = 'System Monitoring'
def ready(self):
"""Initialize monitoring when Django starts"""
import monitoring.signals
import monitoring.tasks


@@ -0,0 +1,795 @@
"""
Enterprise Monitoring System for ETB-API
Advanced monitoring with metrics, alerting, and observability
"""
import functools
import logging
import time
import psutil
import json
import os
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any, Union
from django.http import HttpRequest, HttpResponse, JsonResponse
from django.conf import settings
from django.utils import timezone
from django.core.cache import cache
from django.db import connection
from django.core.management import call_command
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAuthenticated
from django.core.management.base import BaseCommand
import requests
import redis
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from prometheus_client.core import CollectorRegistry
import threading
import queue
logger = logging.getLogger(__name__)
class MetricsCollector:
"""Enterprise metrics collection system"""
def __init__(self):
self.registry = CollectorRegistry()
self.metrics = self._initialize_metrics()
self.collection_interval = 60 # seconds
self.is_running = False
self.collection_thread = None
def _initialize_metrics(self):
"""Initialize Prometheus metrics"""
metrics = {}
# Application metrics
metrics['http_requests_total'] = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status_code'],
registry=self.registry
)
metrics['http_request_duration_seconds'] = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
registry=self.registry
)
metrics['active_users'] = Gauge(
'active_users',
'Number of active users',
registry=self.registry
)
metrics['incident_count'] = Gauge(
'incident_count',
'Total number of incidents',
['status', 'priority'],
registry=self.registry
)
metrics['sla_breach_count'] = Gauge(
'sla_breach_count',
'Number of SLA breaches',
['sla_type'],
registry=self.registry
)
# System metrics
metrics['system_cpu_usage'] = Gauge(
'system_cpu_usage_percent',
'System CPU usage percentage',
registry=self.registry
)
metrics['system_memory_usage'] = Gauge(
'system_memory_usage_percent',
'System memory usage percentage',
registry=self.registry
)
metrics['system_disk_usage'] = Gauge(
'system_disk_usage_percent',
'System disk usage percentage',
registry=self.registry
)
metrics['database_connections'] = Gauge(
'database_connections_active',
'Active database connections',
registry=self.registry
)
metrics['cache_hit_ratio'] = Gauge(
'cache_hit_ratio',
'Cache hit ratio',
registry=self.registry
)
# Business metrics
metrics['incident_resolution_time'] = Histogram(
'incident_resolution_time_seconds',
'Incident resolution time in seconds',
['priority', 'category'],
registry=self.registry
)
metrics['automation_success_rate'] = Gauge(
'automation_success_rate',
'Automation success rate',
['automation_type'],
registry=self.registry
)
metrics['user_satisfaction_score'] = Gauge(
'user_satisfaction_score',
'User satisfaction score',
registry=self.registry
)
return metrics
def start_collection(self):
"""Start metrics collection in background thread"""
if self.is_running:
return
self.is_running = True
self.collection_thread = threading.Thread(target=self._collect_metrics_loop)
self.collection_thread.daemon = True
self.collection_thread.start()
logger.info("Metrics collection started")
def stop_collection(self):
"""Stop metrics collection"""
self.is_running = False
if self.collection_thread:
self.collection_thread.join()
logger.info("Metrics collection stopped")
def _collect_metrics_loop(self):
"""Main metrics collection loop"""
while self.is_running:
try:
self._collect_system_metrics()
self._collect_application_metrics()
self._collect_business_metrics()
time.sleep(self.collection_interval)
except Exception as e:
logger.error(f"Error collecting metrics: {str(e)}")
time.sleep(self.collection_interval)
def _collect_system_metrics(self):
"""Collect system-level metrics"""
try:
# CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
self.metrics['system_cpu_usage'].set(cpu_percent)
# Memory usage
memory = psutil.virtual_memory()
self.metrics['system_memory_usage'].set(memory.percent)
# Disk usage
disk = psutil.disk_usage('/')
disk_percent = (disk.used / disk.total) * 100
self.metrics['system_disk_usage'].set(disk_percent)
# Database connections
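            # NOTE: pg_stat_activity is PostgreSQL-specific; use an equivalent
            # query (or skip this gauge) on other database backends.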
with connection.cursor() as cursor:
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
db_connections = cursor.fetchone()[0]
self.metrics['database_connections'].set(db_connections)
# Cache hit ratio
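            # NOTE: cache._cache is a private handle and get_stats() exists only
            # on some backends (e.g. memcached); failures land in the except below.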
cache_stats = cache._cache.get_stats()
if cache_stats:
hit_ratio = cache_stats.get('hit_ratio', 0)
self.metrics['cache_hit_ratio'].set(hit_ratio)
except Exception as e:
logger.error(f"Error collecting system metrics: {str(e)}")
def _collect_application_metrics(self):
"""Collect application-level metrics"""
try:
# Active users (from cache)
active_users = cache.get('active_users_count', 0)
self.metrics['active_users'].set(active_users)
# Incident counts
from incident_intelligence.models import Incident
from django.db import models
incident_counts = Incident.objects.values('status', 'priority').annotate(
count=models.Count('id')
)
for incident in incident_counts:
self.metrics['incident_count'].labels(
status=incident['status'],
priority=incident['priority']
).set(incident['count'])
# SLA breach counts
from sla_oncall.models import SLAInstance
sla_breaches = SLAInstance.objects.filter(
status='breached'
).values('sla_type').annotate(
count=models.Count('id')
)
for breach in sla_breaches:
self.metrics['sla_breach_count'].labels(
sla_type=breach['sla_type']
).set(breach['count'])
except Exception as e:
logger.error(f"Error collecting application metrics: {str(e)}")
def _collect_business_metrics(self):
"""Collect business-level metrics"""
try:
# Incident resolution times
from incident_intelligence.models import Incident
from django.db import models
            # Include the timestamps needed to compute the duration below.
            resolved_incidents = Incident.objects.filter(
                status='resolved',
                resolved_at__isnull=False
            ).values('priority', 'category', 'resolved_at', 'created_at')
            # NOTE: this re-observes every resolved incident on each collection
            # cycle, which inflates the histogram; tracking only newly resolved
            # incidents would be more accurate.
            for incident in resolved_incidents:
                resolution_time = (incident['resolved_at'] - incident['created_at']).total_seconds()
                self.metrics['incident_resolution_time'].labels(
                    priority=incident['priority'],
                    category=incident['category']
                ).observe(resolution_time)
# Automation success rates
from automation_orchestration.models import AutomationExecution
from django.db import models
automation_stats = AutomationExecution.objects.values('automation_type').annotate(
total=models.Count('id'),
successful=models.Count('id', filter=models.Q(status='success'))
)
for stat in automation_stats:
success_rate = (stat['successful'] / stat['total']) * 100 if stat['total'] > 0 else 0
self.metrics['automation_success_rate'].labels(
automation_type=stat['automation_type']
).set(success_rate)
# User satisfaction score (from feedback)
from knowledge_learning.models import UserFeedback
from django.db import models
feedback_scores = UserFeedback.objects.values('rating').annotate(
count=models.Count('id')
)
total_feedback = sum(f['count'] for f in feedback_scores)
if total_feedback > 0:
weighted_score = sum(f['rating'] * f['count'] for f in feedback_scores) / total_feedback
self.metrics['user_satisfaction_score'].set(weighted_score)
except Exception as e:
logger.error(f"Error collecting business metrics: {str(e)}")
def record_http_request(self, method: str, endpoint: str, status_code: int, duration: float):
"""Record HTTP request metrics"""
self.metrics['http_requests_total'].labels(
method=method,
endpoint=endpoint,
status_code=str(status_code)
).inc()
self.metrics['http_request_duration_seconds'].labels(
method=method,
endpoint=endpoint
).observe(duration)
def get_metrics(self) -> str:
"""Get metrics in Prometheus format"""
return generate_latest(self.registry)
class AlertManager:
"""Enterprise alert management system"""
def __init__(self):
self.alert_rules = self._load_alert_rules()
self.notification_channels = self._load_notification_channels()
self.alert_queue = queue.Queue()
self.is_running = False
self.alert_thread = None
def _load_alert_rules(self) -> List[Dict[str, Any]]:
"""Load alert rules from configuration"""
return [
{
'name': 'high_cpu_usage',
'condition': 'system_cpu_usage > 80',
'severity': 'warning',
'duration': 300, # 5 minutes
'enabled': True,
},
{
'name': 'high_memory_usage',
'condition': 'system_memory_usage > 85',
'severity': 'warning',
'duration': 300,
'enabled': True,
},
{
'name': 'disk_space_low',
'condition': 'system_disk_usage > 90',
'severity': 'critical',
'duration': 60,
'enabled': True,
},
{
'name': 'database_connections_high',
'condition': 'database_connections > 50',
'severity': 'warning',
'duration': 300,
'enabled': True,
},
{
'name': 'incident_volume_high',
'condition': 'incident_count > 100',
'severity': 'warning',
'duration': 600,
'enabled': True,
},
{
'name': 'sla_breach_detected',
'condition': 'sla_breach_count > 0',
'severity': 'critical',
'duration': 0,
'enabled': True,
},
]
def _load_notification_channels(self) -> List[Dict[str, Any]]:
"""Load notification channels"""
return [
{
'name': 'email',
'type': 'email',
'enabled': True,
'config': {
'recipients': ['admin@company.com'],
'template': 'alert_email.html',
}
},
{
'name': 'slack',
'type': 'slack',
'enabled': True,
'config': {
'webhook_url': os.getenv('SLACK_WEBHOOK_URL'),
'channel': '#alerts',
}
},
{
'name': 'webhook',
'type': 'webhook',
'enabled': True,
'config': {
'url': os.getenv('ALERT_WEBHOOK_URL'),
'headers': {'Authorization': f'Bearer {os.getenv("ALERT_WEBHOOK_TOKEN")}'},
}
},
]
def start_monitoring(self):
"""Start alert monitoring"""
if self.is_running:
return
self.is_running = True
self.alert_thread = threading.Thread(target=self._monitor_alerts)
self.alert_thread.daemon = True
self.alert_thread.start()
logger.info("Alert monitoring started")
def stop_monitoring(self):
"""Stop alert monitoring"""
self.is_running = False
if self.alert_thread:
self.alert_thread.join()
logger.info("Alert monitoring stopped")
def _monitor_alerts(self):
"""Main alert monitoring loop"""
while self.is_running:
try:
self._check_alert_rules()
time.sleep(60) # Check every minute
except Exception as e:
logger.error(f"Error monitoring alerts: {str(e)}")
time.sleep(60)
def _check_alert_rules(self):
"""Check all alert rules"""
for rule in self.alert_rules:
if not rule['enabled']:
continue
try:
if self._evaluate_rule(rule):
self._trigger_alert(rule)
except Exception as e:
logger.error(f"Error checking rule {rule['name']}: {str(e)}")
def _evaluate_rule(self, rule: Dict[str, Any]) -> bool:
"""Evaluate alert rule condition"""
condition = rule['condition']
# Parse condition (simplified)
if 'system_cpu_usage' in condition:
cpu_usage = psutil.cpu_percent()
threshold = float(condition.split('>')[1].strip())
return cpu_usage > threshold
elif 'system_memory_usage' in condition:
memory = psutil.virtual_memory()
threshold = float(condition.split('>')[1].strip())
return memory.percent > threshold
elif 'system_disk_usage' in condition:
disk = psutil.disk_usage('/')
disk_percent = (disk.used / disk.total) * 100
threshold = float(condition.split('>')[1].strip())
return disk_percent > threshold
elif 'database_connections' in condition:
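            # PostgreSQL-specific query; see the note in _collect_system_metrics.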
with connection.cursor() as cursor:
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
connections = cursor.fetchone()[0]
threshold = float(condition.split('>')[1].strip())
return connections > threshold
elif 'incident_count' in condition:
from incident_intelligence.models import Incident
count = Incident.objects.count()
threshold = float(condition.split('>')[1].strip())
return count > threshold
elif 'sla_breach_count' in condition:
from sla_oncall.models import SLAInstance
count = SLAInstance.objects.filter(status='breached').count()
threshold = float(condition.split('>')[1].strip())
return count > threshold
return False
def _trigger_alert(self, rule: Dict[str, Any]):
"""Trigger alert for rule violation"""
alert = {
'rule_name': rule['name'],
'severity': rule['severity'],
'message': f"Alert: {rule['name']} - {rule['condition']}",
'timestamp': timezone.now().isoformat(),
'metadata': {
'condition': rule['condition'],
'duration': rule['duration'],
}
}
# Send notifications
self._send_notifications(alert)
# Store alert
self._store_alert(alert)
logger.warning(f"Alert triggered: {rule['name']}")
def _send_notifications(self, alert: Dict[str, Any]):
"""Send alert notifications"""
for channel in self.notification_channels:
if not channel['enabled']:
continue
try:
if channel['type'] == 'email':
self._send_email_notification(alert, channel)
elif channel['type'] == 'slack':
self._send_slack_notification(alert, channel)
elif channel['type'] == 'webhook':
self._send_webhook_notification(alert, channel)
except Exception as e:
logger.error(f"Error sending notification via {channel['name']}: {str(e)}")
def _send_email_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send email notification"""
from django.core.mail import send_mail
subject = f"ETB-API Alert: {alert['rule_name']}"
message = f"""
Alert: {alert['rule_name']}
Severity: {alert['severity']}
Message: {alert['message']}
Time: {alert['timestamp']}
"""
send_mail(
subject=subject,
message=message,
from_email=settings.DEFAULT_FROM_EMAIL,
recipient_list=channel['config']['recipients'],
fail_silently=False,
)
def _send_slack_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send Slack notification"""
webhook_url = channel['config']['webhook_url']
if not webhook_url:
return
payload = {
'channel': channel['config']['channel'],
'text': f"🚨 ETB-API Alert: {alert['rule_name']}",
'attachments': [
{
'color': 'danger' if alert['severity'] == 'critical' else 'warning',
'fields': [
{'title': 'Severity', 'value': alert['severity'], 'short': True},
{'title': 'Message', 'value': alert['message'], 'short': False},
{'title': 'Time', 'value': alert['timestamp'], 'short': True},
]
}
]
}
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
def _send_webhook_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send webhook notification"""
webhook_url = channel['config']['url']
if not webhook_url:
return
headers = channel['config'].get('headers', {})
response = requests.post(webhook_url, json=alert, headers=headers, timeout=10)
response.raise_for_status()
def _store_alert(self, alert: Dict[str, Any]):
"""Store alert in database"""
try:
from monitoring.models import Alert
Alert.objects.create(
rule_name=alert['rule_name'],
severity=alert['severity'],
message=alert['message'],
metadata=alert['metadata'],
timestamp=timezone.now(),
)
except Exception as e:
logger.error(f"Error storing alert: {str(e)}")
class PerformanceProfiler:
"""Enterprise performance profiling system"""
def __init__(self):
self.profiles = {}
self.is_enabled = True
def start_profile(self, name: str) -> str:
"""Start profiling a function or operation"""
if not self.is_enabled:
return None
profile_id = f"{name}_{int(time.time() * 1000)}"
self.profiles[profile_id] = {
'name': name,
'start_time': time.time(),
'start_memory': psutil.Process().memory_info().rss,
'start_cpu': psutil.cpu_percent(),
}
return profile_id
def end_profile(self, profile_id: str) -> Dict[str, Any]:
"""End profiling and return results"""
if not profile_id or profile_id not in self.profiles:
return None
profile = self.profiles.pop(profile_id)
end_time = time.time()
end_memory = psutil.Process().memory_info().rss
end_cpu = psutil.cpu_percent()
result = {
'name': profile['name'],
'duration': end_time - profile['start_time'],
'memory_delta': end_memory - profile['start_memory'],
'cpu_delta': end_cpu - profile['start_cpu'],
'timestamp': timezone.now().isoformat(),
}
# Log slow operations
if result['duration'] > 1.0: # 1 second
logger.warning(f"Slow operation detected: {result['name']} took {result['duration']:.2f}s")
return result
    def profile_function(self, func):
        """Decorator to profile function execution"""
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
profile_id = self.start_profile(func.__name__)
try:
result = func(*args, **kwargs)
return result
finally:
if profile_id:
self.end_profile(profile_id)
return wrapper
# Global instances
metrics_collector = MetricsCollector()
alert_manager = AlertManager()
performance_profiler = PerformanceProfiler()
# API Views for monitoring
@api_view(['GET'])
@permission_classes([IsAuthenticated])
def metrics_endpoint(request):
"""Prometheus metrics endpoint"""
try:
        metrics_data = metrics_collector.get_metrics()
        # Plain-text exposition format; DRF's Response would try to JSON-render it.
        return HttpResponse(metrics_data, content_type=CONTENT_TYPE_LATEST)
except Exception as e:
logger.error(f"Error getting metrics: {str(e)}")
return Response(
{'error': 'Failed to get metrics'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
@api_view(['GET'])
@permission_classes([IsAuthenticated])
def monitoring_dashboard(request):
"""Get monitoring dashboard data"""
try:
# Get system metrics
system_metrics = {
'cpu_usage': psutil.cpu_percent(),
'memory_usage': psutil.virtual_memory().percent,
'disk_usage': (psutil.disk_usage('/').used / psutil.disk_usage('/').total) * 100,
'load_average': psutil.getloadavg() if hasattr(psutil, 'getloadavg') else [0, 0, 0],
}
# Get application metrics
from incident_intelligence.models import Incident
from sla_oncall.models import SLAInstance
application_metrics = {
'total_incidents': Incident.objects.count(),
'active_incidents': Incident.objects.filter(status='active').count(),
'resolved_incidents': Incident.objects.filter(status='resolved').count(),
'sla_breaches': SLAInstance.objects.filter(status='breached').count(),
'active_users': cache.get('active_users_count', 0),
}
# Get recent alerts
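        # NOTE: assumes Alert exposes rule_name/message/timestamp; see the note
        # in AlertManager._store_alert about matching the actual model fields.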
from monitoring.models import Alert
recent_alerts = Alert.objects.filter(
timestamp__gte=timezone.now() - timedelta(hours=24)
).order_by('-timestamp')[:10]
return Response({
'system_metrics': system_metrics,
'application_metrics': application_metrics,
'recent_alerts': [
{
'rule_name': alert.rule_name,
'severity': alert.severity,
'message': alert.message,
'timestamp': alert.timestamp.isoformat(),
}
for alert in recent_alerts
],
})
except Exception as e:
logger.error(f"Monitoring dashboard error: {str(e)}")
return Response(
{'error': 'Failed to load monitoring dashboard'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
@api_view(['POST'])
@permission_classes([IsAuthenticated])
def test_alert(request):
"""Test alert notification"""
try:
test_alert = {
'rule_name': 'test_alert',
'severity': 'info',
'message': 'This is a test alert',
'timestamp': timezone.now().isoformat(),
'metadata': {'test': True},
}
alert_manager._send_notifications(test_alert)
return Response({
'message': 'Test alert sent successfully',
'alert': test_alert,
})
except Exception as e:
logger.error(f"Test alert error: {str(e)}")
return Response(
{'error': 'Failed to send test alert'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
class MonitoringMiddleware:
"""Middleware for request monitoring and metrics collection"""
def __init__(self, get_response):
self.get_response = get_response
def __call__(self, request):
start_time = time.time()
response = self.get_response(request)
# Calculate request duration
duration = time.time() - start_time
# Record metrics
metrics_collector.record_http_request(
method=request.method,
endpoint=request.path,
status_code=response.status_code,
duration=duration
)
# Add performance headers
response['X-Response-Time'] = f"{duration:.3f}s"
response['X-Request-ID'] = request.META.get('HTTP_X_REQUEST_ID', 'unknown')
return response
# Management command for starting monitoring services
class StartMonitoringCommand(BaseCommand):
"""Django management command to start monitoring services"""
help = 'Start monitoring services (metrics collection and alerting)'
def handle(self, *args, **options):
self.stdout.write('Starting monitoring services...')
# Start metrics collection
metrics_collector.start_collection()
self.stdout.write(self.style.SUCCESS('Metrics collection started'))
# Start alert monitoring
alert_manager.start_monitoring()
self.stdout.write(self.style.SUCCESS('Alert monitoring started'))
self.stdout.write(self.style.SUCCESS('All monitoring services started successfully'))
# Keep running
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
self.stdout.write('Stopping monitoring services...')
metrics_collector.stop_collection()
alert_manager.stop_monitoring()
self.stdout.write(self.style.SUCCESS('Monitoring services stopped'))


@@ -0,0 +1 @@
# Management commands for monitoring


@@ -0,0 +1 @@
# Management commands


@@ -0,0 +1,665 @@
"""
Management command to set up initial monitoring configuration
"""
from django.core.management.base import BaseCommand
from django.contrib.auth import get_user_model
from monitoring.models import (
MonitoringTarget, SystemMetric, AlertRule, MonitoringDashboard
)
User = get_user_model()
class Command(BaseCommand):
help = 'Set up initial monitoring configuration'
def add_arguments(self, parser):
parser.add_argument(
'--admin-user',
type=str,
help='Username of admin user to create monitoring objects',
default='admin'
)
def handle(self, *args, **options):
admin_username = options['admin_user']
try:
admin_user = User.objects.get(username=admin_username)
except User.DoesNotExist:
self.stdout.write(
self.style.ERROR(f'Admin user "{admin_username}" not found')
)
return
self.stdout.write('Setting up monitoring configuration...')
# Create default monitoring targets
self.create_default_targets(admin_user)
# Create default metrics
self.create_default_metrics(admin_user)
# Create default alert rules
self.create_default_alert_rules(admin_user)
# Create default dashboards
self.create_default_dashboards(admin_user)
self.stdout.write(
self.style.SUCCESS('Monitoring configuration setup completed!')
)
def create_default_targets(self, admin_user):
"""Create default monitoring targets"""
self.stdout.write('Creating default monitoring targets...')
targets = [
{
'name': 'Django Application',
'description': 'Main Django application health check',
'target_type': 'APPLICATION',
'endpoint_url': 'http://localhost:8000/health/',
'related_module': 'core',
'health_check_enabled': True,
'expected_status_codes': [200]
},
{
'name': 'Database',
'description': 'Database connection health check',
'target_type': 'DATABASE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Cache System',
'description': 'Cache system health check',
'target_type': 'CACHE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Celery Workers',
'description': 'Celery worker health check',
'target_type': 'QUEUE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Security Module',
'description': 'Security module health check',
'target_type': 'MODULE',
'related_module': 'security',
'health_check_enabled': True
},
{
'name': 'Incident Intelligence Module',
'description': 'Incident Intelligence module health check',
'target_type': 'MODULE',
'related_module': 'incident_intelligence',
'health_check_enabled': True
},
{
'name': 'Automation Orchestration Module',
'description': 'Automation Orchestration module health check',
'target_type': 'MODULE',
'related_module': 'automation_orchestration',
'health_check_enabled': True
},
{
'name': 'SLA OnCall Module',
'description': 'SLA OnCall module health check',
'target_type': 'MODULE',
'related_module': 'sla_oncall',
'health_check_enabled': True
},
{
'name': 'Collaboration War Rooms Module',
'description': 'Collaboration War Rooms module health check',
'target_type': 'MODULE',
'related_module': 'collaboration_war_rooms',
'health_check_enabled': True
},
{
'name': 'Compliance Governance Module',
'description': 'Compliance Governance module health check',
'target_type': 'MODULE',
'related_module': 'compliance_governance',
'health_check_enabled': True
},
{
'name': 'Analytics Predictive Insights Module',
'description': 'Analytics Predictive Insights module health check',
'target_type': 'MODULE',
'related_module': 'analytics_predictive_insights',
'health_check_enabled': True
},
{
'name': 'Knowledge Learning Module',
'description': 'Knowledge Learning module health check',
'target_type': 'MODULE',
'related_module': 'knowledge_learning',
'health_check_enabled': True
}
]
for target_data in targets:
target, created = MonitoringTarget.objects.get_or_create(
name=target_data['name'],
defaults={
**target_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created target: {target.name}')
else:
self.stdout.write(f' Target already exists: {target.name}')
def create_default_metrics(self, admin_user):
"""Create default system metrics"""
self.stdout.write('Creating default system metrics...')
metrics = [
{
'name': 'API Response Time',
'description': 'Average API response time in milliseconds',
'metric_type': 'PERFORMANCE',
'category': 'API_RESPONSE_TIME',
'unit': 'milliseconds',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 1000,
'critical_threshold': 2000,
'is_system_metric': True
},
{
'name': 'Request Throughput',
'description': 'Number of requests per minute',
'metric_type': 'PERFORMANCE',
'category': 'THROUGHPUT',
'unit': 'requests/minute',
'aggregation_method': 'SUM',
'collection_interval_seconds': 60,
'warning_threshold': 1000,
'critical_threshold': 2000,
'is_system_metric': True
},
{
'name': 'Error Rate',
'description': 'Percentage of failed requests',
'metric_type': 'PERFORMANCE',
'category': 'ERROR_RATE',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 5.0,
'critical_threshold': 10.0,
'is_system_metric': True
},
{
'name': 'System Availability',
'description': 'System availability percentage',
'metric_type': 'INFRASTRUCTURE',
'category': 'AVAILABILITY',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 99.0,
'critical_threshold': 95.0,
'is_system_metric': True
},
{
'name': 'Incident Count',
'description': 'Number of incidents in the last 24 hours',
'metric_type': 'BUSINESS',
'category': 'INCIDENT_COUNT',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 10,
'critical_threshold': 20,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Mean Time to Resolve',
'description': 'Average time to resolve incidents in minutes',
'metric_type': 'BUSINESS',
'category': 'MTTR',
'unit': 'minutes',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 120,
'critical_threshold': 240,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Mean Time to Acknowledge',
'description': 'Average time to acknowledge incidents in minutes',
'metric_type': 'BUSINESS',
'category': 'MTTA',
'unit': 'minutes',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 15,
'critical_threshold': 30,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'SLA Compliance',
'description': 'SLA compliance percentage',
'metric_type': 'BUSINESS',
'category': 'SLA_COMPLIANCE',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 95.0,
'critical_threshold': 90.0,
'is_system_metric': True,
'related_module': 'sla_oncall'
},
{
'name': 'Security Events',
'description': 'Number of security events in the last hour',
'metric_type': 'SECURITY',
'category': 'SECURITY_EVENTS',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 5,
'critical_threshold': 10,
'is_system_metric': True,
'related_module': 'security'
},
{
'name': 'Automation Success Rate',
'description': 'Percentage of successful automation executions',
'metric_type': 'BUSINESS',
'category': 'AUTOMATION_SUCCESS',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 90.0,
'critical_threshold': 80.0,
'is_system_metric': True,
'related_module': 'automation_orchestration'
},
{
'name': 'AI Model Accuracy',
'description': 'AI model accuracy percentage',
'metric_type': 'BUSINESS',
'category': 'AI_ACCURACY',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 85.0,
'critical_threshold': 75.0,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Cost Impact',
'description': 'Total cost impact in USD for the last 30 days',
'metric_type': 'BUSINESS',
'category': 'COST_IMPACT',
'unit': 'USD',
'aggregation_method': 'SUM',
'collection_interval_seconds': 86400,
'warning_threshold': 10000,
'critical_threshold': 50000,
'is_system_metric': True,
'related_module': 'analytics_predictive_insights'
},
{
'name': 'User Activity',
'description': 'Number of active users in the last hour',
'metric_type': 'BUSINESS',
'category': 'USER_ACTIVITY',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 50,
'critical_threshold': 100,
'is_system_metric': True
},
{
'name': 'CPU Usage',
'description': 'System CPU usage percentage',
'metric_type': 'INFRASTRUCTURE',
'category': 'SYSTEM_RESOURCES',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 80.0,
'critical_threshold': 90.0,
'is_system_metric': True
}
]
for metric_data in metrics:
metric, created = SystemMetric.objects.get_or_create(
name=metric_data['name'],
defaults={
**metric_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created metric: {metric.name}')
else:
self.stdout.write(f' Metric already exists: {metric.name}')
def create_default_alert_rules(self, admin_user):
"""Create default alert rules"""
self.stdout.write('Creating default alert rules...')
# Get metrics for alert rules
api_response_metric = SystemMetric.objects.filter(name='API Response Time').first()
error_rate_metric = SystemMetric.objects.filter(name='Error Rate').first()
availability_metric = SystemMetric.objects.filter(name='System Availability').first()
incident_count_metric = SystemMetric.objects.filter(name='Incident Count').first()
mttr_metric = SystemMetric.objects.filter(name='Mean Time to Resolve').first()
security_events_metric = SystemMetric.objects.filter(name='Security Events').first()
cpu_metric = SystemMetric.objects.filter(name='CPU Usage').first()
alert_rules = [
{
'name': 'High API Response Time',
'description': 'Alert when API response time exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 2000
},
'metric': api_response_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Error Rate',
'description': 'Alert when error rate exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'CRITICAL',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 10.0
},
'metric': error_rate_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'Low System Availability',
'description': 'Alert when system availability drops below threshold',
'alert_type': 'AVAILABILITY',
'severity': 'CRITICAL',
'condition': {
'type': 'THRESHOLD',
'operator': '<',
'threshold': 95.0
},
'metric': availability_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Incident Count',
'description': 'Alert when incident count exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 20
},
'metric': incident_count_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High MTTR',
'description': 'Alert when mean time to resolve exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'MEDIUM',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 240
},
'metric': mttr_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Security Events',
'description': 'Alert when security events exceed threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 10
},
'metric': security_events_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High CPU Usage',
'description': 'Alert when CPU usage exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 90.0
},
'metric': cpu_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
}
]
for rule_data in alert_rules:
if rule_data['metric']: # Only create if metric exists
rule, created = AlertRule.objects.get_or_create(
name=rule_data['name'],
defaults={
**rule_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created alert rule: {rule.name}')
else:
self.stdout.write(f' Alert rule already exists: {rule.name}')
def create_default_dashboards(self, admin_user):
"""Create default monitoring dashboards"""
self.stdout.write('Creating default monitoring dashboards...')
dashboards = [
{
'name': 'System Overview',
'description': 'High-level system overview dashboard',
'dashboard_type': 'SYSTEM_OVERVIEW',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 30,
'layout_config': {
'columns': 3,
'rows': 4
},
'widget_configs': [
{
'type': 'system_status',
'position': {'x': 0, 'y': 0, 'width': 3, 'height': 1}
},
{
'type': 'health_summary',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'alert_summary',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'system_resources',
'position': {'x': 2, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'recent_incidents',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 2}
},
{
'type': 'metric_trends',
'position': {'x': 2, 'y': 2, 'width': 1, 'height': 2}
}
]
},
{
'name': 'Performance Dashboard',
'description': 'System performance metrics dashboard',
'dashboard_type': 'PERFORMANCE',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 60,
'layout_config': {
'columns': 2,
'rows': 3
},
'widget_configs': [
{
'type': 'api_response_time',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'throughput',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'error_rate',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'availability',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'system_resources',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
}
]
},
{
'name': 'Business Metrics Dashboard',
'description': 'Business and operational metrics dashboard',
'dashboard_type': 'BUSINESS_METRICS',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 300,
'layout_config': {
'columns': 2,
'rows': 3
},
'widget_configs': [
{
'type': 'incident_count',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'mttr',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'mtta',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'sla_compliance',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'cost_impact',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
}
]
},
{
'name': 'Security Dashboard',
'description': 'Security monitoring dashboard',
'dashboard_type': 'SECURITY',
'is_public': False,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 60,
'layout_config': {
'columns': 2,
'rows': 2
},
'widget_configs': [
{
'type': 'security_events',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'failed_logins',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'risk_assessments',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'device_posture',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
}
]
}
]
for dashboard_data in dashboards:
dashboard, created = MonitoringDashboard.objects.get_or_create(
name=dashboard_data['name'],
defaults={
**dashboard_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created dashboard: {dashboard.name}')
else:
self.stdout.write(f' Dashboard already exists: {dashboard.name}')


@@ -0,0 +1,252 @@
# Generated by Django 5.2.6 on 2025-09-18 19:44
import django.db.models.deletion
import uuid
from django.conf import settings
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = [
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.CreateModel(
name='MonitoringTarget',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200, unique=True)),
('description', models.TextField()),
('target_type', models.CharField(choices=[('APPLICATION', 'Application'), ('DATABASE', 'Database'), ('CACHE', 'Cache'), ('QUEUE', 'Message Queue'), ('EXTERNAL_API', 'External API'), ('SERVICE', 'Internal Service'), ('INFRASTRUCTURE', 'Infrastructure'), ('MODULE', 'Django Module')], max_length=20)),
('endpoint_url', models.URLField(blank=True, null=True)),
('connection_config', models.JSONField(default=dict, help_text='Connection configuration (credentials, timeouts, etc.)')),
('check_interval_seconds', models.PositiveIntegerField(default=60)),
('timeout_seconds', models.PositiveIntegerField(default=30)),
('retry_count', models.PositiveIntegerField(default=3)),
('health_check_enabled', models.BooleanField(default=True)),
('health_check_endpoint', models.CharField(blank=True, max_length=200, null=True)),
('expected_status_codes', models.JSONField(default=list, help_text='Expected HTTP status codes for health checks')),
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance'), ('ERROR', 'Error')], default='ACTIVE', max_length=20)),
('last_checked', models.DateTimeField(blank=True, null=True)),
('last_status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], default='UNKNOWN', max_length=20)),
('related_module', models.CharField(blank=True, help_text="Related Django module (e.g., 'security', 'incident_intelligence')", max_length=50, null=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='HealthCheck',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('check_type', models.CharField(choices=[('HTTP', 'HTTP Health Check'), ('DATABASE', 'Database Connection'), ('CACHE', 'Cache Connection'), ('QUEUE', 'Message Queue'), ('CUSTOM', 'Custom Check'), ('PING', 'Network Ping'), ('SSL', 'SSL Certificate')], max_length=20)),
('status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], max_length=20)),
('response_time_ms', models.PositiveIntegerField(blank=True, null=True)),
('status_code', models.PositiveIntegerField(blank=True, null=True)),
('response_body', models.TextField(blank=True, null=True)),
('error_message', models.TextField(blank=True, null=True)),
('cpu_usage_percent', models.FloatField(blank=True, null=True)),
('memory_usage_percent', models.FloatField(blank=True, null=True)),
('disk_usage_percent', models.FloatField(blank=True, null=True)),
('checked_at', models.DateTimeField(auto_now_add=True)),
('target', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='health_checks', to='monitoring.monitoringtarget')),
],
options={
'ordering': ['-checked_at'],
},
),
migrations.CreateModel(
name='SystemMetric',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('metric_type', models.CharField(choices=[('PERFORMANCE', 'Performance Metric'), ('BUSINESS', 'Business Metric'), ('SECURITY', 'Security Metric'), ('INFRASTRUCTURE', 'Infrastructure Metric'), ('CUSTOM', 'Custom Metric')], max_length=20)),
('category', models.CharField(choices=[('API_RESPONSE_TIME', 'API Response Time'), ('THROUGHPUT', 'Throughput'), ('ERROR_RATE', 'Error Rate'), ('AVAILABILITY', 'Availability'), ('INCIDENT_COUNT', 'Incident Count'), ('MTTR', 'Mean Time to Resolve'), ('MTTA', 'Mean Time to Acknowledge'), ('SLA_COMPLIANCE', 'SLA Compliance'), ('SECURITY_EVENTS', 'Security Events'), ('AUTOMATION_SUCCESS', 'Automation Success Rate'), ('AI_ACCURACY', 'AI Model Accuracy'), ('COST_IMPACT', 'Cost Impact'), ('USER_ACTIVITY', 'User Activity'), ('SYSTEM_RESOURCES', 'System Resources')], max_length=30)),
('unit', models.CharField(help_text='Unit of measurement', max_length=50)),
('aggregation_method', models.CharField(choices=[('AVERAGE', 'Average'), ('SUM', 'Sum'), ('COUNT', 'Count'), ('MIN', 'Minimum'), ('MAX', 'Maximum'), ('PERCENTILE_95', '95th Percentile'), ('PERCENTILE_99', '99th Percentile')], max_length=20)),
('collection_interval_seconds', models.PositiveIntegerField(default=300)),
('retention_days', models.PositiveIntegerField(default=90)),
('warning_threshold', models.FloatField(blank=True, null=True)),
('critical_threshold', models.FloatField(blank=True, null=True)),
('is_active', models.BooleanField(default=True)),
('is_system_metric', models.BooleanField(default=False)),
('related_module', models.CharField(blank=True, help_text='Related Django module', max_length=50, null=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='MetricMeasurement',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('value', models.DecimalField(decimal_places=4, max_digits=15)),
('timestamp', models.DateTimeField(auto_now_add=True)),
('tags', models.JSONField(default=dict, help_text='Additional tags for this measurement')),
('metadata', models.JSONField(default=dict, help_text='Additional metadata')),
('metric', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='measurements', to='monitoring.systemmetric')),
],
options={
'ordering': ['-timestamp'],
},
),
migrations.CreateModel(
name='AlertRule',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('alert_type', models.CharField(choices=[('THRESHOLD', 'Threshold Alert'), ('ANOMALY', 'Anomaly Alert'), ('PATTERN', 'Pattern Alert'), ('AVAILABILITY', 'Availability Alert'), ('PERFORMANCE', 'Performance Alert')], max_length=20)),
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
('condition', models.JSONField(help_text='Alert condition configuration')),
('evaluation_interval_seconds', models.PositiveIntegerField(default=60)),
('notification_channels', models.JSONField(default=list, help_text='List of notification channels (email, slack, webhook, etc.)')),
('notification_template', models.TextField(blank=True, help_text='Custom notification template', null=True)),
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance')], default='ACTIVE', max_length=20)),
('is_enabled', models.BooleanField(default=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
('target', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.monitoringtarget')),
('metric', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.systemmetric')),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='SystemStatus',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('status', models.CharField(choices=[('OPERATIONAL', 'Operational'), ('DEGRADED', 'Degraded'), ('PARTIAL_OUTAGE', 'Partial Outage'), ('MAJOR_OUTAGE', 'Major Outage'), ('MAINTENANCE', 'Maintenance')], max_length=20)),
('message', models.TextField(help_text='Status message for users')),
('affected_services', models.JSONField(default=list, help_text='List of affected services')),
('estimated_resolution', models.DateTimeField(blank=True, null=True)),
('started_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('resolved_at', models.DateTimeField(blank=True, null=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['-started_at'],
},
),
migrations.CreateModel(
name='Alert',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('title', models.CharField(max_length=200)),
('description', models.TextField()),
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
('status', models.CharField(choices=[('TRIGGERED', 'Triggered'), ('ACKNOWLEDGED', 'Acknowledged'), ('RESOLVED', 'Resolved'), ('SUPPRESSED', 'Suppressed')], default='TRIGGERED', max_length=20)),
('triggered_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
('threshold_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
('context_data', models.JSONField(default=dict, help_text='Additional context data for the alert')),
('triggered_at', models.DateTimeField(auto_now_add=True)),
('acknowledged_at', models.DateTimeField(blank=True, null=True)),
('resolved_at', models.DateTimeField(blank=True, null=True)),
('acknowledged_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='acknowledged_alerts', to=settings.AUTH_USER_MODEL)),
('resolved_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='resolved_alerts', to=settings.AUTH_USER_MODEL)),
('rule', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='alerts', to='monitoring.alertrule')),
],
options={
'ordering': ['-triggered_at'],
'indexes': [models.Index(fields=['rule', 'status'], name='monitoring__rule_id_0ff7d3_idx'), models.Index(fields=['severity', 'status'], name='monitoring__severit_1e6a2c_idx'), models.Index(fields=['triggered_at'], name='monitoring__trigger_743dcf_idx')],
},
),
migrations.CreateModel(
name='MonitoringDashboard',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('dashboard_type', models.CharField(choices=[('SYSTEM_OVERVIEW', 'System Overview'), ('PERFORMANCE', 'Performance'), ('BUSINESS_METRICS', 'Business Metrics'), ('SECURITY', 'Security'), ('INFRASTRUCTURE', 'Infrastructure'), ('CUSTOM', 'Custom')], max_length=20)),
('layout_config', models.JSONField(default=dict, help_text='Dashboard layout configuration')),
('widget_configs', models.JSONField(default=list, help_text='Configuration for dashboard widgets')),
('is_public', models.BooleanField(default=False)),
('allowed_roles', models.JSONField(default=list, help_text='List of roles that can access this dashboard')),
('auto_refresh_enabled', models.BooleanField(default=True)),
('refresh_interval_seconds', models.PositiveIntegerField(default=30)),
('is_active', models.BooleanField(default=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('allowed_users', models.ManyToManyField(blank=True, related_name='accessible_monitoring_dashboards', to=settings.AUTH_USER_MODEL)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
'indexes': [models.Index(fields=['dashboard_type', 'is_active'], name='monitoring__dashboa_2e7a27_idx'), models.Index(fields=['is_public'], name='monitoring__is_publ_811f62_idx')],
},
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['target_type', 'status'], name='monitoring__target__f37347_idx'),
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['related_module'], name='monitoring__related_0c51fc_idx'),
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['last_checked'], name='monitoring__last_ch_83ce18_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['target', 'checked_at'], name='monitoring__target__8d1cd6_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['status', 'checked_at'], name='monitoring__status_636b2b_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['check_type'], name='monitoring__check_t_b442f3_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['metric_type', 'category'], name='monitoring__metric__df4606_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['related_module'], name='monitoring__related_7b383b_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['is_active'], name='monitoring__is_acti_c90676_idx'),
),
migrations.AddIndex(
model_name='metricmeasurement',
index=models.Index(fields=['metric', 'timestamp'], name='monitoring__metric__216cac_idx'),
),
migrations.AddIndex(
model_name='metricmeasurement',
index=models.Index(fields=['timestamp'], name='monitoring__timesta_75a739_idx'),
),
migrations.AddIndex(
model_name='alertrule',
index=models.Index(fields=['alert_type', 'severity'], name='monitoring__alert_t_915b15_idx'),
),
migrations.AddIndex(
model_name='alertrule',
index=models.Index(fields=['status', 'is_enabled'], name='monitoring__status_e905cc_idx'),
),
migrations.AddIndex(
model_name='systemstatus',
index=models.Index(fields=['status', 'started_at'], name='monitoring__status_18966f_idx'),
),
migrations.AddIndex(
model_name='systemstatus',
index=models.Index(fields=['started_at'], name='monitoring__started_d85786_idx'),
),
]
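
Applying the initial schema is standard Django; a minimal sketch:

```python
# Apply the monitoring app's migrations programmatically; equivalent to
# running "python manage.py migrate monitoring" from the shell.
from django.core.management import call_command

call_command("migrate", "monitoring")
```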

View File

@@ -0,0 +1,515 @@
"""
Monitoring models for comprehensive system observability
"""
import uuid
import json
from datetime import datetime, timedelta
from typing import Dict, Any, Optional, List
from decimal import Decimal
from django.db import models
from django.contrib.auth import get_user_model
from django.core.validators import MinValueValidator, MaxValueValidator
from django.utils import timezone
from django.core.exceptions import ValidationError
User = get_user_model()
class MonitoringTarget(models.Model):
"""Target systems, services, or components to monitor"""
TARGET_TYPES = [
('APPLICATION', 'Application'),
('DATABASE', 'Database'),
('CACHE', 'Cache'),
('QUEUE', 'Message Queue'),
('EXTERNAL_API', 'External API'),
('SERVICE', 'Internal Service'),
('INFRASTRUCTURE', 'Infrastructure'),
('MODULE', 'Django Module'),
]
STATUS_CHOICES = [
('ACTIVE', 'Active'),
('INACTIVE', 'Inactive'),
('MAINTENANCE', 'Maintenance'),
('ERROR', 'Error'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200, unique=True)
description = models.TextField()
target_type = models.CharField(max_length=20, choices=TARGET_TYPES)
# Connection details
endpoint_url = models.URLField(blank=True, null=True)
connection_config = models.JSONField(
default=dict,
help_text="Connection configuration (credentials, timeouts, etc.)"
)
# Monitoring configuration
check_interval_seconds = models.PositiveIntegerField(default=60)
timeout_seconds = models.PositiveIntegerField(default=30)
retry_count = models.PositiveIntegerField(default=3)
# Health check configuration
health_check_enabled = models.BooleanField(default=True)
health_check_endpoint = models.CharField(max_length=200, blank=True, null=True)
expected_status_codes = models.JSONField(
default=list,
help_text="Expected HTTP status codes for health checks"
)
# Status and metadata
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
last_checked = models.DateTimeField(null=True, blank=True)
last_status = models.CharField(max_length=20, choices=[
('HEALTHY', 'Healthy'),
('WARNING', 'Warning'),
('CRITICAL', 'Critical'),
('UNKNOWN', 'Unknown'),
], default='UNKNOWN')
# Related module (if applicable)
related_module = models.CharField(
max_length=50,
blank=True,
null=True,
help_text="Related Django module (e.g., 'security', 'incident_intelligence')"
)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['target_type', 'status']),
models.Index(fields=['related_module']),
models.Index(fields=['last_checked']),
]
def __str__(self):
return f"{self.name} ({self.target_type})"
class HealthCheck(models.Model):
"""Individual health check results"""
CHECK_TYPES = [
('HTTP', 'HTTP Health Check'),
('DATABASE', 'Database Connection'),
('CACHE', 'Cache Connection'),
('QUEUE', 'Message Queue'),
('CUSTOM', 'Custom Check'),
('PING', 'Network Ping'),
('SSL', 'SSL Certificate'),
]
STATUS_CHOICES = [
('HEALTHY', 'Healthy'),
('WARNING', 'Warning'),
('CRITICAL', 'Critical'),
('UNKNOWN', 'Unknown'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
target = models.ForeignKey(MonitoringTarget, on_delete=models.CASCADE, related_name='health_checks')
# Check details
check_type = models.CharField(max_length=20, choices=CHECK_TYPES)
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
response_time_ms = models.PositiveIntegerField(null=True, blank=True)
# Response details
status_code = models.PositiveIntegerField(null=True, blank=True)
response_body = models.TextField(blank=True, null=True)
error_message = models.TextField(blank=True, null=True)
# Metrics
cpu_usage_percent = models.FloatField(null=True, blank=True)
memory_usage_percent = models.FloatField(null=True, blank=True)
disk_usage_percent = models.FloatField(null=True, blank=True)
# Timestamps
checked_at = models.DateTimeField(auto_now_add=True)
class Meta:
ordering = ['-checked_at']
indexes = [
models.Index(fields=['target', 'checked_at']),
models.Index(fields=['status', 'checked_at']),
models.Index(fields=['check_type']),
]
def __str__(self):
return f"{self.target.name} - {self.status} ({self.checked_at})"
class SystemMetric(models.Model):
"""System performance and operational metrics"""
METRIC_TYPES = [
('PERFORMANCE', 'Performance Metric'),
('BUSINESS', 'Business Metric'),
('SECURITY', 'Security Metric'),
('INFRASTRUCTURE', 'Infrastructure Metric'),
('CUSTOM', 'Custom Metric'),
]
METRIC_CATEGORIES = [
('API_RESPONSE_TIME', 'API Response Time'),
('THROUGHPUT', 'Throughput'),
('ERROR_RATE', 'Error Rate'),
('AVAILABILITY', 'Availability'),
('INCIDENT_COUNT', 'Incident Count'),
('MTTR', 'Mean Time to Resolve'),
('MTTA', 'Mean Time to Acknowledge'),
('SLA_COMPLIANCE', 'SLA Compliance'),
('SECURITY_EVENTS', 'Security Events'),
('AUTOMATION_SUCCESS', 'Automation Success Rate'),
('AI_ACCURACY', 'AI Model Accuracy'),
('COST_IMPACT', 'Cost Impact'),
('USER_ACTIVITY', 'User Activity'),
('SYSTEM_RESOURCES', 'System Resources'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
metric_type = models.CharField(max_length=20, choices=METRIC_TYPES)
category = models.CharField(max_length=30, choices=METRIC_CATEGORIES)
# Metric configuration
unit = models.CharField(max_length=50, help_text="Unit of measurement")
aggregation_method = models.CharField(
max_length=20,
choices=[
('AVERAGE', 'Average'),
('SUM', 'Sum'),
('COUNT', 'Count'),
('MIN', 'Minimum'),
('MAX', 'Maximum'),
('PERCENTILE_95', '95th Percentile'),
('PERCENTILE_99', '99th Percentile'),
]
)
# Collection configuration
collection_interval_seconds = models.PositiveIntegerField(default=300) # 5 minutes
retention_days = models.PositiveIntegerField(default=90)
# Thresholds
warning_threshold = models.FloatField(null=True, blank=True)
critical_threshold = models.FloatField(null=True, blank=True)
# Status
is_active = models.BooleanField(default=True)
is_system_metric = models.BooleanField(default=False)
# Related module
related_module = models.CharField(
max_length=50,
blank=True,
null=True,
help_text="Related Django module"
)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['metric_type', 'category']),
models.Index(fields=['related_module']),
models.Index(fields=['is_active']),
]
def __str__(self):
return f"{self.name} ({self.category})"
class MetricMeasurement(models.Model):
"""Individual metric measurements"""
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
metric = models.ForeignKey(SystemMetric, on_delete=models.CASCADE, related_name='measurements')
# Measurement details
value = models.DecimalField(max_digits=15, decimal_places=4)
timestamp = models.DateTimeField(auto_now_add=True)
# Context
tags = models.JSONField(
default=dict,
help_text="Additional tags for this measurement"
)
metadata = models.JSONField(
default=dict,
help_text="Additional metadata"
)
class Meta:
ordering = ['-timestamp']
indexes = [
models.Index(fields=['metric', 'timestamp']),
models.Index(fields=['timestamp']),
]
def __str__(self):
return f"{self.metric.name}: {self.value} ({self.timestamp})"
class AlertRule(models.Model):
"""Alert rules for monitoring thresholds"""
ALERT_TYPES = [
('THRESHOLD', 'Threshold Alert'),
('ANOMALY', 'Anomaly Alert'),
('PATTERN', 'Pattern Alert'),
('AVAILABILITY', 'Availability Alert'),
('PERFORMANCE', 'Performance Alert'),
]
SEVERITY_CHOICES = [
('LOW', 'Low'),
('MEDIUM', 'Medium'),
('HIGH', 'High'),
('CRITICAL', 'Critical'),
]
STATUS_CHOICES = [
('ACTIVE', 'Active'),
('INACTIVE', 'Inactive'),
('MAINTENANCE', 'Maintenance'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
alert_type = models.CharField(max_length=20, choices=ALERT_TYPES)
severity = models.CharField(max_length=20, choices=SEVERITY_CHOICES)
# Rule configuration
condition = models.JSONField(
help_text="Alert condition configuration"
)
evaluation_interval_seconds = models.PositiveIntegerField(default=60)
# Related objects
metric = models.ForeignKey(
SystemMetric,
on_delete=models.CASCADE,
null=True,
blank=True,
related_name='alert_rules'
)
target = models.ForeignKey(
MonitoringTarget,
on_delete=models.CASCADE,
null=True,
blank=True,
related_name='alert_rules'
)
# Notification configuration
notification_channels = models.JSONField(
default=list,
help_text="List of notification channels (email, slack, webhook, etc.)"
)
notification_template = models.TextField(
blank=True,
null=True,
help_text="Custom notification template"
)
# Status
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
is_enabled = models.BooleanField(default=True)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['alert_type', 'severity']),
models.Index(fields=['status', 'is_enabled']),
]
def __str__(self):
return f"{self.name} ({self.severity})"
class Alert(models.Model):
"""Alert instances"""
STATUS_CHOICES = [
('TRIGGERED', 'Triggered'),
('ACKNOWLEDGED', 'Acknowledged'),
('RESOLVED', 'Resolved'),
('SUPPRESSED', 'Suppressed'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
rule = models.ForeignKey(AlertRule, on_delete=models.CASCADE, related_name='alerts')
# Alert details
title = models.CharField(max_length=200)
description = models.TextField()
severity = models.CharField(max_length=20, choices=AlertRule.SEVERITY_CHOICES)
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='TRIGGERED')
# Context
triggered_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
threshold_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
context_data = models.JSONField(
default=dict,
help_text="Additional context data for the alert"
)
# Timestamps
triggered_at = models.DateTimeField(auto_now_add=True)
acknowledged_at = models.DateTimeField(null=True, blank=True)
resolved_at = models.DateTimeField(null=True, blank=True)
# Assignment
acknowledged_by = models.ForeignKey(
User,
on_delete=models.SET_NULL,
null=True,
blank=True,
related_name='acknowledged_alerts'
)
resolved_by = models.ForeignKey(
User,
on_delete=models.SET_NULL,
null=True,
blank=True,
related_name='resolved_alerts'
)
class Meta:
ordering = ['-triggered_at']
indexes = [
models.Index(fields=['rule', 'status']),
models.Index(fields=['severity', 'status']),
models.Index(fields=['triggered_at']),
]
def __str__(self):
return f"{self.title} ({self.severity}) - {self.status}"
class MonitoringDashboard(models.Model):
"""Monitoring dashboard configurations"""
DASHBOARD_TYPES = [
('SYSTEM_OVERVIEW', 'System Overview'),
('PERFORMANCE', 'Performance'),
('BUSINESS_METRICS', 'Business Metrics'),
('SECURITY', 'Security'),
('INFRASTRUCTURE', 'Infrastructure'),
('CUSTOM', 'Custom'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
dashboard_type = models.CharField(max_length=20, choices=DASHBOARD_TYPES)
# Dashboard configuration
layout_config = models.JSONField(
default=dict,
help_text="Dashboard layout configuration"
)
widget_configs = models.JSONField(
default=list,
help_text="Configuration for dashboard widgets"
)
# Access control
is_public = models.BooleanField(default=False)
allowed_users = models.ManyToManyField(
User,
blank=True,
related_name='accessible_monitoring_dashboards'
)
allowed_roles = models.JSONField(
default=list,
help_text="List of roles that can access this dashboard"
)
# Refresh configuration
auto_refresh_enabled = models.BooleanField(default=True)
refresh_interval_seconds = models.PositiveIntegerField(default=30)
# Status
is_active = models.BooleanField(default=True)
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['dashboard_type', 'is_active']),
models.Index(fields=['is_public']),
]
def __str__(self):
return f"{self.name} ({self.dashboard_type})"
class SystemStatus(models.Model):
"""Overall system status tracking"""
STATUS_CHOICES = [
('OPERATIONAL', 'Operational'),
('DEGRADED', 'Degraded'),
('PARTIAL_OUTAGE', 'Partial Outage'),
('MAJOR_OUTAGE', 'Major Outage'),
('MAINTENANCE', 'Maintenance'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
message = models.TextField(help_text="Status message for users")
# Impact details
affected_services = models.JSONField(
default=list,
help_text="List of affected services"
)
estimated_resolution = models.DateTimeField(null=True, blank=True)
# Timestamps
started_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
resolved_at = models.DateTimeField(null=True, blank=True)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
class Meta:
ordering = ['-started_at']
indexes = [
models.Index(fields=['status', 'started_at']),
models.Index(fields=['started_at']),
]
def __str__(self):
return f"System Status: {self.status} ({self.started_at})"
@property
def is_resolved(self):
return self.resolved_at is not None
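
A minimal ORM sketch of the models above, assuming the `monitoring` app is installed and migrated; the field values are illustrative only.

```python
# Register a target and record one health-check result against it.
from monitoring.models import MonitoringTarget, HealthCheck

target, _ = MonitoringTarget.objects.get_or_create(
    name="Primary API",
    defaults={
        "description": "Main Django application",
        "target_type": "APPLICATION",
        "endpoint_url": "http://localhost:8000/health/",
        "expected_status_codes": [200],
    },
)
HealthCheck.objects.create(
    target=target,
    check_type="HTTP",
    status="HEALTHY",
    response_time_ms=42,
    status_code=200,
)
```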

View File

@@ -0,0 +1,200 @@
"""
Serializers for monitoring models
"""
from rest_framework import serializers
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
class MonitoringTargetSerializer(serializers.ModelSerializer):
"""Serializer for MonitoringTarget model"""
last_status_display = serializers.CharField(source='get_last_status_display', read_only=True)
target_type_display = serializers.CharField(source='get_target_type_display', read_only=True)
class Meta:
model = MonitoringTarget
fields = [
'id', 'name', 'description', 'target_type', 'target_type_display',
'endpoint_url', 'connection_config', 'check_interval_seconds',
'timeout_seconds', 'retry_count', 'health_check_enabled',
'health_check_endpoint', 'expected_status_codes', 'status',
'last_checked', 'last_status', 'last_status_display',
'related_module', 'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at', 'last_checked']
class HealthCheckSerializer(serializers.ModelSerializer):
"""Serializer for HealthCheck model"""
target_name = serializers.CharField(source='target.name', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
check_type_display = serializers.CharField(source='get_check_type_display', read_only=True)
class Meta:
model = HealthCheck
fields = [
'id', 'target', 'target_name', 'check_type', 'check_type_display',
'status', 'status_display', 'response_time_ms', 'status_code',
'response_body', 'error_message', 'cpu_usage_percent',
'memory_usage_percent', 'disk_usage_percent', 'checked_at'
]
read_only_fields = ['id', 'checked_at']
class SystemMetricSerializer(serializers.ModelSerializer):
"""Serializer for SystemMetric model"""
metric_type_display = serializers.CharField(source='get_metric_type_display', read_only=True)
category_display = serializers.CharField(source='get_category_display', read_only=True)
aggregation_method_display = serializers.CharField(source='get_aggregation_method_display', read_only=True)
class Meta:
model = SystemMetric
fields = [
'id', 'name', 'description', 'metric_type', 'metric_type_display',
'category', 'category_display', 'unit', 'aggregation_method',
'aggregation_method_display', 'collection_interval_seconds',
'retention_days', 'warning_threshold', 'critical_threshold',
'is_active', 'is_system_metric', 'related_module',
'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class MetricMeasurementSerializer(serializers.ModelSerializer):
"""Serializer for MetricMeasurement model"""
metric_name = serializers.CharField(source='metric.name', read_only=True)
metric_unit = serializers.CharField(source='metric.unit', read_only=True)
class Meta:
model = MetricMeasurement
fields = [
'id', 'metric', 'metric_name', 'metric_unit', 'value',
'timestamp', 'tags', 'metadata'
]
read_only_fields = ['id', 'timestamp']
class AlertRuleSerializer(serializers.ModelSerializer):
"""Serializer for AlertRule model"""
alert_type_display = serializers.CharField(source='get_alert_type_display', read_only=True)
severity_display = serializers.CharField(source='get_severity_display', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
metric_name = serializers.CharField(source='metric.name', read_only=True)
target_name = serializers.CharField(source='target.name', read_only=True)
class Meta:
model = AlertRule
fields = [
'id', 'name', 'description', 'alert_type', 'alert_type_display',
'severity', 'severity_display', 'condition', 'evaluation_interval_seconds',
'metric', 'metric_name', 'target', 'target_name',
'notification_channels', 'notification_template', 'status',
'status_display', 'is_enabled', 'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class AlertSerializer(serializers.ModelSerializer):
"""Serializer for Alert model"""
rule_name = serializers.CharField(source='rule.name', read_only=True)
severity_display = serializers.CharField(source='get_severity_display', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
acknowledged_by_username = serializers.CharField(source='acknowledged_by.username', read_only=True)
resolved_by_username = serializers.CharField(source='resolved_by.username', read_only=True)
class Meta:
model = Alert
fields = [
'id', 'rule', 'rule_name', 'title', 'description', 'severity',
'severity_display', 'status', 'status_display', 'triggered_value',
'threshold_value', 'context_data', 'triggered_at', 'acknowledged_at',
'resolved_at', 'acknowledged_by', 'acknowledged_by_username',
'resolved_by', 'resolved_by_username'
]
read_only_fields = ['id', 'triggered_at']
class MonitoringDashboardSerializer(serializers.ModelSerializer):
"""Serializer for MonitoringDashboard model"""
dashboard_type_display = serializers.CharField(source='get_dashboard_type_display', read_only=True)
created_by_username = serializers.CharField(source='created_by.username', read_only=True)
class Meta:
model = MonitoringDashboard
fields = [
'id', 'name', 'description', 'dashboard_type', 'dashboard_type_display',
'layout_config', 'widget_configs', 'is_public', 'allowed_users',
'allowed_roles', 'auto_refresh_enabled', 'refresh_interval_seconds',
'is_active', 'created_by', 'created_by_username', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class SystemStatusSerializer(serializers.ModelSerializer):
"""Serializer for SystemStatus model"""
status_display = serializers.CharField(source='get_status_display', read_only=True)
created_by_username = serializers.CharField(source='created_by.username', read_only=True)
is_resolved = serializers.BooleanField(read_only=True)
class Meta:
model = SystemStatus
fields = [
'id', 'status', 'status_display', 'message', 'affected_services',
'estimated_resolution', 'started_at', 'updated_at', 'resolved_at',
'created_by', 'created_by_username', 'is_resolved'
]
read_only_fields = ['id', 'started_at', 'updated_at']
class HealthCheckSummarySerializer(serializers.Serializer):
"""Serializer for health check summary"""
overall_status = serializers.CharField()
total_targets = serializers.IntegerField()
healthy_targets = serializers.IntegerField()
warning_targets = serializers.IntegerField()
critical_targets = serializers.IntegerField()
health_percentage = serializers.FloatField()
last_updated = serializers.DateTimeField()
class MetricTrendSerializer(serializers.Serializer):
"""Serializer for metric trends"""
metric_name = serializers.CharField()
period_days = serializers.IntegerField()
daily_data = serializers.ListField()
trend = serializers.CharField()
class AlertSummarySerializer(serializers.Serializer):
"""Serializer for alert summary"""
total_alerts = serializers.IntegerField()
critical_alerts = serializers.IntegerField()
high_alerts = serializers.IntegerField()
medium_alerts = serializers.IntegerField()
low_alerts = serializers.IntegerField()
acknowledged_alerts = serializers.IntegerField()
resolved_alerts = serializers.IntegerField()
class SystemOverviewSerializer(serializers.Serializer):
"""Serializer for system overview"""
system_status = SystemStatusSerializer()
health_summary = HealthCheckSummarySerializer()
alert_summary = AlertSummarySerializer()
recent_incidents = serializers.ListField()
top_metrics = serializers.ListField()
system_resources = serializers.DictField()
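
Serializer usage follows standard DRF conventions; a short sketch (the `monitoring.serializers` import path is assumed from this file's location).

```python
# Serialize the most recent health check into the JSON shape returned
# by the health-check API endpoints documented earlier.
from monitoring.models import HealthCheck
from monitoring.serializers import HealthCheckSerializer  # path assumed

latest = HealthCheck.objects.order_by("-checked_at").first()
if latest is not None:
    print(HealthCheckSerializer(latest).data)
```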

View File

@@ -0,0 +1 @@
# Monitoring services

View File

@@ -0,0 +1,449 @@
"""
Alerting service for monitoring system
"""
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.core.mail import send_mail
from django.conf import settings
from django.contrib.auth import get_user_model
from monitoring.models import AlertRule, Alert, SystemMetric, MetricMeasurement, MonitoringTarget
User = get_user_model()
logger = logging.getLogger(__name__)
class AlertEvaluator:
"""Service for evaluating alert conditions"""
def __init__(self):
self.aggregator = None  # resolved lazily to avoid a circular import with the metrics service
def evaluate_alert_rules(self) -> List[Dict[str, Any]]:
"""Evaluate all active alert rules"""
triggered_alerts = []
active_rules = AlertRule.objects.filter(
status='ACTIVE',
is_enabled=True
)
for rule in active_rules:
try:
if self._evaluate_rule(rule):
alert_data = self._create_alert(rule)
triggered_alerts.append(alert_data)
except Exception as e:
logger.error(f"Failed to evaluate alert rule {rule.name}: {e}")
return triggered_alerts
def _evaluate_rule(self, rule: AlertRule) -> bool:
"""Evaluate if an alert rule condition is met"""
condition = rule.condition
condition_type = condition.get('type')
if condition_type == 'THRESHOLD':
return self._evaluate_threshold_condition(rule, condition)
elif condition_type == 'ANOMALY':
return self._evaluate_anomaly_condition(rule, condition)
elif condition_type == 'AVAILABILITY':
return self._evaluate_availability_condition(rule, condition)
elif condition_type == 'PATTERN':
return self._evaluate_pattern_condition(rule, condition)
else:
logger.warning(f"Unknown condition type: {condition_type}")
return False
def _evaluate_threshold_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate threshold-based alert conditions"""
if not rule.metric:
return False
# Get latest metric value
latest_measurement = MetricMeasurement.objects.filter(
metric=rule.metric
).order_by('-timestamp').first()
if not latest_measurement:
return False
current_value = float(latest_measurement.value)
threshold_value = condition.get('threshold')
if threshold_value is None:
logger.warning(f"Alert rule '{rule.name}' has no threshold configured")
return False
operator = condition.get('operator', '>')
if operator == '>':
return current_value > threshold_value
elif operator == '>=':
return current_value >= threshold_value
elif operator == '<':
return current_value < threshold_value
elif operator == '<=':
return current_value <= threshold_value
elif operator == '==':
return current_value == threshold_value
elif operator == '!=':
return current_value != threshold_value
else:
logger.warning(f"Unknown operator: {operator}")
return False
def _evaluate_anomaly_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate anomaly-based alert conditions"""
# This would integrate with anomaly detection models
# For now, implement a simple statistical anomaly detection
if not rule.metric:
return False
# Get recent measurements
since = timezone.now() - timedelta(hours=24)
measurements = MetricMeasurement.objects.filter(
metric=rule.metric,
timestamp__gte=since
).order_by('-timestamp')[:100] # Last 100 measurements
if len(measurements) < 10: # Need minimum data points
return False
values = [float(m.value) for m in measurements]
# Calculate mean and standard deviation
mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)
std_dev = variance ** 0.5
# Check if latest value is an anomaly (more than 2 standard deviations)
latest_value = values[0]
anomaly_threshold = condition.get('threshold', 2.0) # Default 2 sigma
return abs(latest_value - mean) > (anomaly_threshold * std_dev)
def _evaluate_availability_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate availability-based alert conditions"""
if not rule.target:
return False
# Check if target is in critical state
return rule.target.last_status == 'CRITICAL'
def _evaluate_pattern_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate pattern-based alert conditions"""
# This would integrate with pattern detection algorithms
# For now, return False as placeholder
return False
def _create_alert(self, rule: AlertRule) -> Dict[str, Any]:
"""Create an alert instance"""
# Get current value for context
current_value = None
threshold_value = None
if rule.metric:
latest_measurement = MetricMeasurement.objects.filter(
metric=rule.metric
).order_by('-timestamp').first()
if latest_measurement:
current_value = float(latest_measurement.value)
threshold_value = rule.metric.critical_threshold
# Create alert
alert = Alert.objects.create(
rule=rule,
title=f"{rule.name} - {rule.severity}",
description=self._generate_alert_description(rule, current_value, threshold_value),
severity=rule.severity,
triggered_value=current_value,
threshold_value=threshold_value,
context_data={
'rule_id': str(rule.id),
'metric_name': rule.metric.name if rule.metric else None,
'target_name': rule.target.name if rule.target else None,
'condition': rule.condition
}
)
return {
'alert_id': str(alert.id),
'rule_id': str(rule.id),  # required by NotificationService.send_alert_notifications
'rule_name': rule.name,
'severity': rule.severity,
'title': alert.title,
'description': alert.description,
'current_value': current_value,
'threshold_value': threshold_value
}
def _generate_alert_description(self, rule: AlertRule, current_value: Optional[float], threshold_value: Optional[float]) -> str:
"""Generate alert description"""
description = f"Alert rule '{rule.name}' has been triggered.\n"
if rule.metric and current_value is not None:
description += f"Current value: {current_value} {rule.metric.unit}\n"
if threshold_value is not None:
description += f"Threshold: {threshold_value} {rule.metric.unit if rule.metric else ''}\n"
if rule.target:
description += f"Target: {rule.target.name}\n"
description += f"Severity: {rule.severity}\n"
description += f"Time: {timezone.now().strftime('%Y-%m-%d %H:%M:%S')}"
return description
class NotificationService:
"""Service for sending alert notifications"""
def __init__(self):
self.evaluator = AlertEvaluator()
def send_alert_notifications(self, alert_data: Dict[str, Any]) -> Dict[str, Any]:
"""Send notifications for an alert"""
results = {}
# Get alert rule to determine notification channels
rule_id = alert_data.get('rule_id')
if not rule_id:
return {'error': 'No rule ID provided'}
try:
rule = AlertRule.objects.get(id=rule_id)
except AlertRule.DoesNotExist:
return {'error': 'Alert rule not found'}
notification_channels = rule.notification_channels or []
for channel in notification_channels:
try:
if channel['type'] == 'EMAIL':
result = self._send_email_notification(alert_data, channel)
elif channel['type'] == 'SLACK':
result = self._send_slack_notification(alert_data, channel)
elif channel['type'] == 'WEBHOOK':
result = self._send_webhook_notification(alert_data, channel)
else:
result = {'error': f'Unknown notification channel type: {channel["type"]}'}
results[channel['type']] = result
except Exception as e:
logger.error(f"Failed to send {channel['type']} notification: {e}")
results[channel['type']] = {'error': str(e)}
return results
def _send_email_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send email notification"""
try:
recipients = channel.get('recipients', [])
if not recipients:
return {'error': 'No email recipients configured'}
subject = f"[{alert_data.get('severity', 'ALERT')}] {alert_data.get('title', 'System Alert')}"
message = alert_data.get('description', '')
send_mail(
subject=subject,
message=message,
from_email=settings.DEFAULT_FROM_EMAIL,
recipient_list=recipients,
fail_silently=False
)
return {'status': 'sent', 'recipients': recipients}
except Exception as e:
return {'error': str(e)}
def _send_slack_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send Slack notification"""
try:
webhook_url = channel.get('webhook_url')
if not webhook_url:
return {'error': 'No Slack webhook URL configured'}
# Create Slack message
color = self._get_slack_color(alert_data.get('severity', 'MEDIUM'))
slack_message = {
"text": alert_data.get('title', 'System Alert'),
"attachments": [
{
"color": color,
"fields": [
{
"title": "Description",
"value": alert_data.get('description', ''),
"short": False
},
{
"title": "Severity",
"value": alert_data.get('severity', 'UNKNOWN'),
"short": True
},
{
"title": "Time",
"value": timezone.now().strftime('%Y-%m-%d %H:%M:%S'),
"short": True
}
]
}
]
}
# Post the message to the Slack incoming-webhook URL
import requests
response = requests.post(webhook_url, json=slack_message, timeout=10)
response.raise_for_status()
return {'status': 'sent', 'channel': channel.get('channel', '#alerts')}
except Exception as e:
return {'error': str(e)}
def _send_webhook_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send webhook notification"""
try:
webhook_url = channel.get('url')
if not webhook_url:
return {'error': 'No webhook URL configured'}
# Prepare webhook payload
payload = {
'alert': alert_data,
'timestamp': timezone.now().isoformat(),
'source': 'ETB-API-Monitoring'
}
# POST the payload to the configured webhook endpoint
import requests
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
return {'status': 'sent', 'url': webhook_url}
except Exception as e:
return {'error': str(e)}
def _get_slack_color(self, severity: str) -> str:
"""Get Slack color based on severity"""
color_map = {
'LOW': 'good',
'MEDIUM': 'warning',
'HIGH': 'danger',
'CRITICAL': 'danger'
}
return color_map.get(severity, 'warning')
class AlertingService:
"""Main alerting service that coordinates alert evaluation and notification"""
def __init__(self):
self.evaluator = AlertEvaluator()
self.notification_service = NotificationService()
def run_alert_evaluation(self) -> Dict[str, Any]:
"""Run alert evaluation and send notifications"""
results = {
'evaluated_rules': 0,
'triggered_alerts': 0,
'notifications_sent': 0,
'errors': []
}
try:
# Evaluate all alert rules
triggered_alerts = self.evaluator.evaluate_alert_rules()
results['triggered_alerts'] = len(triggered_alerts)
# Send notifications for triggered alerts
for alert_data in triggered_alerts:
try:
notification_results = self.notification_service.send_alert_notifications(alert_data)
if 'error' not in notification_results:
results['notifications_sent'] += 1
except Exception as e:
logger.error(f"Failed to send notifications for alert {alert_data.get('alert_id')}: {e}")
results['errors'].append(str(e))
# Count evaluated rules
results['evaluated_rules'] = AlertRule.objects.filter(
status='ACTIVE',
is_enabled=True
).count()
except Exception as e:
logger.error(f"Alert evaluation failed: {e}")
results['errors'].append(str(e))
return results
def acknowledge_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
"""Acknowledge an alert"""
try:
alert = Alert.objects.get(id=alert_id)
alert.status = 'ACKNOWLEDGED'
alert.acknowledged_by = user
alert.acknowledged_at = timezone.now()
alert.save()
return {
'status': 'success',
'message': f'Alert {alert_id} acknowledged by {user.username}'
}
except Alert.DoesNotExist:
return {
'status': 'error',
'message': f'Alert {alert_id} not found'
}
except Exception as e:
return {
'status': 'error',
'message': str(e)
}
def resolve_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
"""Resolve an alert"""
try:
alert = Alert.objects.get(id=alert_id)
alert.status = 'RESOLVED'
alert.resolved_by = user
alert.resolved_at = timezone.now()
alert.save()
return {
'status': 'success',
'message': f'Alert {alert_id} resolved by {user.username}'
}
except Alert.DoesNotExist:
return {
'status': 'error',
'message': f'Alert {alert_id} not found'
}
except Exception as e:
return {
'status': 'error',
'message': str(e)
}
def get_active_alerts(self, severity: Optional[str] = None) -> List[Dict[str, Any]]:
"""Get active alerts"""
alerts = Alert.objects.filter(status='TRIGGERED')
if severity:
alerts = alerts.filter(severity=severity)
return [
{
'id': str(alert.id),
'title': alert.title,
'description': alert.description,
'severity': alert.severity,
'triggered_at': alert.triggered_at,
'rule_name': alert.rule.name,
'current_value': float(alert.triggered_value) if alert.triggered_value else None,
'threshold_value': float(alert.threshold_value) if alert.threshold_value else None
}
for alert in alerts.order_by('-triggered_at')
]
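
The JSON shapes consumed by the evaluator and notifier above are implicit in the code; the sketch below makes them explicit (the referenced metric is assumed to already exist).

```python
# A THRESHOLD rule wired the way AlertEvaluator and NotificationService
# read it: condition carries "type"/"operator"/"threshold", and each
# notification-channel dict carries a "type" plus channel-specific keys.
from monitoring.models import AlertRule, SystemMetric

metric = SystemMetric.objects.get(name="API Response Time")  # assumed to exist

AlertRule.objects.create(
    name="API latency above 500ms",
    description="Fires when the latest measurement exceeds 500 ms.",
    alert_type="THRESHOLD",
    severity="HIGH",
    condition={"type": "THRESHOLD", "operator": ">", "threshold": 500},
    metric=metric,
    notification_channels=[
        {"type": "EMAIL", "recipients": ["oncall@example.com"]},
        {"type": "WEBHOOK", "url": "https://example.com/hooks/alerts"},
    ],
)
```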

View File

@@ -0,0 +1,372 @@
"""
Health check services for monitoring system components
"""
import time
import requests
import psutil
import logging
from typing import Dict, Any, Optional, Tuple
from django.conf import settings
from django.db import connection
from django.core.cache import cache
from django.utils import timezone
from celery import current_app as celery_app
logger = logging.getLogger(__name__)
class BaseHealthCheck:
"""Base class for health checks"""
def __init__(self, target):
self.target = target
self.start_time = None
self.end_time = None
def execute(self) -> Dict[str, Any]:
"""Execute the health check and return results"""
self.start_time = time.time()
try:
result = self._perform_check()
self.end_time = time.time()
# Preserve any error_message the concrete check already set
result.setdefault('error_message', None)
result.update({
'response_time_ms': int((self.end_time - self.start_time) * 1000),
'checked_at': timezone.now()
})
return result
except Exception as e:
self.end_time = time.time()
logger.error(f"Health check failed for {self.target.name}: {e}")
return {
'status': 'CRITICAL',
'response_time_ms': int((self.end_time - self.start_time) * 1000),
'checked_at': timezone.now(),
'error_message': str(e)
}
def _perform_check(self) -> Dict[str, Any]:
"""Override in subclasses to implement specific checks"""
raise NotImplementedError
class HTTPHealthCheck(BaseHealthCheck):
"""HTTP-based health check"""
def _perform_check(self) -> Dict[str, Any]:
url = self.target.endpoint_url
if not url:
raise ValueError("No endpoint URL configured")
timeout = self.target.timeout_seconds
expected_codes = self.target.expected_status_codes or [200]
response = requests.get(url, timeout=timeout)
if response.status_code in expected_codes:
status = 'HEALTHY'
elif response.status_code >= 500:
status = 'CRITICAL'
else:
status = 'WARNING'
return {
'status': status,
'status_code': response.status_code,
'response_body': response.text[:1000] # Limit response body size
}
class DatabaseHealthCheck(BaseHealthCheck):
"""Database connection health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
with connection.cursor() as cursor:
cursor.execute("SELECT 1")
result = cursor.fetchone()
if result and result[0] == 1:
return {
'status': 'HEALTHY',
'status_code': 200
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'Database query returned unexpected result'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Database connection failed: {str(e)}'
}
class CacheHealthCheck(BaseHealthCheck):
"""Cache system health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Test cache write/read
test_key = f"health_check_{int(time.time())}"
test_value = "health_check_value"
cache.set(test_key, test_value, timeout=10)
retrieved_value = cache.get(test_key)
if retrieved_value == test_value:
cache.delete(test_key) # Clean up
return {
'status': 'HEALTHY',
'status_code': 200
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'Cache read/write test failed'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Cache operation failed: {str(e)}'
}
class CeleryHealthCheck(BaseHealthCheck):
"""Celery worker health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Check if Celery workers are active
inspect = celery_app.control.inspect()
active_workers = inspect.active()
if active_workers:
worker_count = len(active_workers)
return {
'status': 'HEALTHY',
'status_code': 200,
'response_body': f'Active workers: {worker_count}'
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'No active Celery workers found'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Celery health check failed: {str(e)}'
}
class SystemResourceHealthCheck(BaseHealthCheck):
"""System resource health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Get system metrics
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
disk = psutil.disk_usage('/')
# Determine status based on thresholds
status = 'HEALTHY'
if cpu_percent > 90 or memory.percent > 90 or disk.percent > 90:
status = 'CRITICAL'
elif cpu_percent > 80 or memory.percent > 80 or disk.percent > 80:
status = 'WARNING'
return {
'status': status,
'status_code': 200,
'cpu_usage_percent': cpu_percent,
'memory_usage_percent': memory.percent,
'disk_usage_percent': disk.percent,
'response_body': f'CPU: {cpu_percent}%, Memory: {memory.percent}%, Disk: {disk.percent}%'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'System resource check failed: {str(e)}'
}
class ModuleHealthCheck(BaseHealthCheck):
"""Django module health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
module_name = self.target.related_module
if not module_name:
raise ValueError("No module specified for module health check")
# Import the module to check if it's accessible
__import__(module_name)
# Check that the module is registered as a Django app; get_app_config
# raises LookupError rather than returning None when the app is missing
from django.apps import apps
try:
apps.get_app_config(module_name)
except LookupError:
return {
'status': 'WARNING',
'status_code': 200,
'error_message': f'Module {module_name} not found in Django apps'
}
return {
'status': 'HEALTHY',
'status_code': 200,
'response_body': f'Module {module_name} is accessible'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Module health check failed: {str(e)}'
}
class HealthCheckFactory:
"""Factory for creating health check instances"""
CHECK_CLASSES = {
'HTTP': HTTPHealthCheck,
'DATABASE': DatabaseHealthCheck,
'CACHE': CacheHealthCheck,
'QUEUE': CeleryHealthCheck,
'CUSTOM': ModuleHealthCheck,  # MODULE targets are routed through CUSTOM (see _get_check_type_for_target)
'PING': HTTPHealthCheck,  # Use HTTP for ping
'SSL': HTTPHealthCheck,  # Use HTTP for SSL
}
@classmethod
def create_health_check(cls, target, check_type: str) -> BaseHealthCheck:
"""Create a health check instance based on type"""
check_class = cls.CHECK_CLASSES.get(check_type, BaseHealthCheck)
return check_class(target)
@classmethod
def get_available_check_types(cls) -> list:
"""Get list of available health check types"""
return list(cls.CHECK_CLASSES.keys())
class HealthCheckService:
"""Service for managing health checks"""
def __init__(self):
self.factory = HealthCheckFactory()
def execute_health_check(self, target, check_type: str) -> Dict[str, Any]:
"""Execute a health check for a target"""
health_check = self.factory.create_health_check(target, check_type)
return health_check.execute()
def execute_all_health_checks(self) -> Dict[str, Any]:
"""Execute health checks for all active targets"""
from monitoring.models import MonitoringTarget, HealthCheck
results = {}
active_targets = MonitoringTarget.objects.filter(
status='ACTIVE',
health_check_enabled=True
)
for target in active_targets:
try:
# Determine check type based on target type
check_type = self._get_check_type_for_target(target)
# Execute health check
result = self.execute_health_check(target, check_type)
# Save result to database
HealthCheck.objects.create(
target=target,
check_type=check_type,
status=result['status'],
response_time_ms=result.get('response_time_ms'),
status_code=result.get('status_code'),
response_body=result.get('response_body'),
error_message=result.get('error_message'),
cpu_usage_percent=result.get('cpu_usage_percent'),
memory_usage_percent=result.get('memory_usage_percent'),
disk_usage_percent=result.get('disk_usage_percent')
)
# Update target status
target.last_checked = timezone.now()
target.last_status = result['status']
target.save(update_fields=['last_checked', 'last_status'])
results[target.name] = result
except Exception as e:
logger.error(f"Failed to execute health check for {target.name}: {e}")
results[target.name] = {
'status': 'CRITICAL',
'error_message': str(e)
}
return results
def _get_check_type_for_target(self, target) -> str:
"""Determine the appropriate check type for a target"""
target_type_mapping = {
'APPLICATION': 'HTTP',
'DATABASE': 'DATABASE',
'CACHE': 'CACHE',
'QUEUE': 'QUEUE',
'EXTERNAL_API': 'HTTP',
'SERVICE': 'HTTP',
'INFRASTRUCTURE': 'HTTP',
'MODULE': 'CUSTOM',
}
return target_type_mapping.get(target.target_type, 'HTTP')
def get_system_health_summary(self) -> Dict[str, Any]:
"""Get overall system health summary"""
from monitoring.models import HealthCheck, MonitoringTarget
# Get the latest health check per target (DISTINCT ON is PostgreSQL-specific).
# Count statuses in Python: filtering after distinct('target') would apply
# the WHERE clause before DISTINCT ON and count targets with *any* matching
# check instead of targets whose *latest* check matches.
latest_checks = HealthCheck.objects.filter(
target__status='ACTIVE'
).order_by('target', '-checked_at').distinct('target')
latest_statuses = [check.status for check in latest_checks]
total_targets = MonitoringTarget.objects.filter(status='ACTIVE').count()
healthy_targets = latest_statuses.count('HEALTHY')
warning_targets = latest_statuses.count('WARNING')
critical_targets = latest_statuses.count('CRITICAL')
# Calculate overall status
if critical_targets > 0:
overall_status = 'CRITICAL'
elif warning_targets > 0:
overall_status = 'WARNING'
elif healthy_targets == total_targets:
overall_status = 'HEALTHY'
else:
overall_status = 'UNKNOWN'
return {
'overall_status': overall_status,
'total_targets': total_targets,
'healthy_targets': healthy_targets,
'warning_targets': warning_targets,
'critical_targets': critical_targets,
'health_percentage': (healthy_targets / total_targets * 100) if total_targets > 0 else 0,
'last_updated': timezone.now()
}
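# Usage sketch (illustrative):
#
#   service = HealthCheckService()
#   results = service.execute_all_health_checks()   # {target_name: result}
#   summary = service.get_system_health_summary()   # overall status dict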

ETB-API/monitoring/services/metrics_collector.py Normal file

@@ -0,0 +1,364 @@
"""
Metrics collection service for system monitoring
"""
import time
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.db import connection
from django.core.cache import cache
from django.conf import settings
from django.contrib.auth import get_user_model
from monitoring.models import SystemMetric, MetricMeasurement
User = get_user_model()
logger = logging.getLogger(__name__)
class MetricsCollector:
"""Service for collecting and storing system metrics"""
def __init__(self):
self.collected_metrics = {}
def collect_all_metrics(self) -> Dict[str, Any]:
"""Collect all configured metrics"""
results = {}
# Get all active metrics
active_metrics = SystemMetric.objects.filter(is_active=True)
for metric in active_metrics:
try:
value = self._collect_metric_value(metric)
if value is not None:
# Store measurement
measurement = MetricMeasurement.objects.create(
metric=metric,
value=value,
tags=self._get_metric_tags(metric),
metadata=self._get_metric_metadata(metric)
)
results[metric.name] = {
'value': value,
'measurement_id': measurement.id,
'timestamp': measurement.timestamp
}
except Exception as e:
logger.error(f"Failed to collect metric {metric.name}: {e}")
results[metric.name] = {
'error': str(e)
}
return results
def _collect_metric_value(self, metric: SystemMetric) -> Optional[float]:
"""Collect value for a specific metric"""
category = metric.category
if category == 'API_RESPONSE_TIME':
return self._collect_api_response_time(metric)
elif category == 'THROUGHPUT':
return self._collect_throughput(metric)
elif category == 'ERROR_RATE':
return self._collect_error_rate(metric)
elif category == 'AVAILABILITY':
return self._collect_availability(metric)
elif category == 'INCIDENT_COUNT':
return self._collect_incident_count(metric)
elif category == 'MTTR':
return self._collect_mttr(metric)
elif category == 'MTTA':
return self._collect_mtta(metric)
elif category == 'SLA_COMPLIANCE':
return self._collect_sla_compliance(metric)
elif category == 'SECURITY_EVENTS':
return self._collect_security_events(metric)
elif category == 'AUTOMATION_SUCCESS':
return self._collect_automation_success(metric)
elif category == 'AI_ACCURACY':
return self._collect_ai_accuracy(metric)
elif category == 'COST_IMPACT':
return self._collect_cost_impact(metric)
elif category == 'USER_ACTIVITY':
return self._collect_user_activity(metric)
elif category == 'SYSTEM_RESOURCES':
return self._collect_system_resources(metric)
else:
logger.warning(f"Unknown metric category: {category}")
return None
def _collect_api_response_time(self, metric: SystemMetric) -> Optional[float]:
"""Collect API response time metrics"""
# This would typically come from middleware or APM tools
# For now, return a mock value
return 150.5 # milliseconds
def _collect_throughput(self, metric: SystemMetric) -> Optional[float]:
"""Collect throughput metrics (requests per minute)"""
# Count requests in the last minute
# This would typically come from access logs or middleware
return 120.0 # requests per minute
def _collect_error_rate(self, metric: SystemMetric) -> Optional[float]:
"""Collect error rate metrics"""
# Count errors in the last hour
# This would typically come from logs or error tracking
return 0.02 # 2% error rate
def _collect_availability(self, metric: SystemMetric) -> Optional[float]:
"""Collect availability metrics"""
# Calculate availability percentage
# This would typically come from uptime monitoring
return 99.9 # 99.9% availability
def _collect_incident_count(self, metric: SystemMetric) -> Optional[float]:
"""Collect incident count metrics"""
from incident_intelligence.models import Incident
# Count incidents in the last 24 hours
since = timezone.now() - timedelta(hours=24)
count = Incident.objects.filter(created_at__gte=since).count()
return float(count)
def _collect_mttr(self, metric: SystemMetric) -> Optional[float]:
"""Collect Mean Time to Resolve metrics"""
from incident_intelligence.models import Incident
# Calculate MTTR for resolved incidents in the last 7 days
since = timezone.now() - timedelta(days=7)
resolved_incidents = Incident.objects.filter(
status__in=['RESOLVED', 'CLOSED'],
resolved_at__isnull=False,
resolved_at__gte=since
)
if not resolved_incidents.exists():
return None
total_resolution_time = 0
count = 0
for incident in resolved_incidents:
if incident.resolved_at and incident.created_at:
resolution_time = incident.resolved_at - incident.created_at
total_resolution_time += resolution_time.total_seconds()
count += 1
if count > 0:
return total_resolution_time / count / 60 # Convert to minutes
return None
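    # A database-side equivalent of the loop above (sketch; assumes a
    # backend where datetime subtraction yields an interval, e.g. PostgreSQL):
    #
    #   from django.db.models import Avg, DurationField, ExpressionWrapper, F
    #   delta = resolved_incidents.aggregate(
    #       avg=Avg(ExpressionWrapper(F('resolved_at') - F('created_at'),
    #                                 output_field=DurationField())))['avg']
    #   return delta.total_seconds() / 60 if delta else None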
def _collect_mtta(self, metric: SystemMetric) -> Optional[float]:
"""Collect Mean Time to Acknowledge metrics"""
# This would require tracking when incidents are first acknowledged
# For now, return a mock value
return 15.5 # minutes
def _collect_sla_compliance(self, metric: SystemMetric) -> Optional[float]:
"""Collect SLA compliance metrics"""
from sla_oncall.models import SLAInstance
# Calculate SLA compliance percentage
total_slas = SLAInstance.objects.count()
if total_slas == 0:
return None
# This would require more complex SLA compliance calculation
# For now, return a mock value
return 95.5 # 95.5% SLA compliance
def _collect_security_events(self, metric: SystemMetric) -> Optional[float]:
"""Collect security events metrics"""
# Count security events in the last hour
# This would come from security logs or audit trails
return 3.0 # 3 security events in the last hour
def _collect_automation_success(self, metric: SystemMetric) -> Optional[float]:
"""Collect automation success rate metrics"""
from automation_orchestration.models import RunbookExecution
# Calculate success rate for runbook executions in the last 24 hours
since = timezone.now() - timedelta(hours=24)
executions = RunbookExecution.objects.filter(created_at__gte=since)
if not executions.exists():
return None
successful = executions.filter(status='COMPLETED').count()
total = executions.count()
return (successful / total * 100) if total > 0 else None
def _collect_ai_accuracy(self, metric: SystemMetric) -> Optional[float]:
"""Collect AI model accuracy metrics"""
from incident_intelligence.models import IncidentClassification
# Calculate accuracy for AI classifications
classifications = IncidentClassification.objects.all()
if not classifications.exists():
return None
# This would require comparing predictions with actual outcomes
# For now, return average confidence score
        count = classifications.count()
        total_confidence = sum(c.confidence_score for c in classifications)
        return (total_confidence / count * 100) if count > 0 else None
def _collect_cost_impact(self, metric: SystemMetric) -> Optional[float]:
"""Collect cost impact metrics"""
from analytics_predictive_insights.models import CostImpactAnalysis
# Calculate total cost impact for the last 30 days
since = timezone.now() - timedelta(days=30)
cost_analyses = CostImpactAnalysis.objects.filter(created_at__gte=since)
total_cost = sum(float(ca.cost_amount) for ca in cost_analyses)
return total_cost
def _collect_user_activity(self, metric: SystemMetric) -> Optional[float]:
"""Collect user activity metrics"""
# Count active users in the last hour
since = timezone.now() - timedelta(hours=1)
# This would require user activity tracking
return 25.0 # 25 active users in the last hour
def _collect_system_resources(self, metric: SystemMetric) -> Optional[float]:
"""Collect system resource metrics"""
import psutil
# Get CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
return cpu_percent
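    # Note: psutil is a third-party dependency and must be installed
    # separately (e.g. `pip install psutil`); cpu_percent(interval=1)
    # blocks for roughly one second while sampling.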
def _get_metric_tags(self, metric: SystemMetric) -> Dict[str, str]:
"""Get tags for a metric measurement"""
tags = {
'metric_type': metric.metric_type,
'category': metric.category,
}
if metric.related_module:
tags['module'] = metric.related_module
return tags
def _get_metric_metadata(self, metric: SystemMetric) -> Dict[str, Any]:
"""Get metadata for a metric measurement"""
return {
'unit': metric.unit,
'aggregation_method': metric.aggregation_method,
'collection_interval': metric.collection_interval_seconds,
}
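# Usage sketch (illustrative):
#
#   collector = MetricsCollector()
#   results = collector.collect_all_metrics()
#   # -> {'API Response Time': {'value': 150.5, 'measurement_id': ...}, ...}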
class MetricsAggregator:
"""Service for aggregating metrics over time periods"""
def __init__(self):
self.collector = MetricsCollector()
def aggregate_metrics(self, metric: SystemMetric, start_time: datetime, end_time: datetime) -> Dict[str, Any]:
"""Aggregate metrics over a time period"""
measurements = MetricMeasurement.objects.filter(
metric=metric,
timestamp__gte=start_time,
timestamp__lte=end_time
).order_by('timestamp')
if not measurements.exists():
return {
'count': 0,
'values': [],
'aggregated_value': None
}
values = [float(m.value) for m in measurements]
aggregated_value = self._aggregate_values(values, metric.aggregation_method)
return {
'count': len(values),
'values': values,
'aggregated_value': aggregated_value,
'start_time': start_time,
'end_time': end_time,
'unit': metric.unit
}
def _aggregate_values(self, values: List[float], method: str) -> Optional[float]:
"""Aggregate a list of values using the specified method"""
if not values:
return None
if method == 'AVERAGE':
return sum(values) / len(values)
elif method == 'SUM':
return sum(values)
elif method == 'COUNT':
return len(values)
elif method == 'MIN':
return min(values)
elif method == 'MAX':
return max(values)
elif method == 'PERCENTILE_95':
return self._calculate_percentile(values, 95)
elif method == 'PERCENTILE_99':
return self._calculate_percentile(values, 99)
else:
return sum(values) / len(values) # Default to average
def _calculate_percentile(self, values: List[float], percentile: int) -> float:
"""Calculate percentile of values"""
sorted_values = sorted(values)
index = int((percentile / 100) * len(sorted_values))
return sorted_values[min(index, len(sorted_values) - 1)]
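    # Worked example: _calculate_percentile([10, 20, 30, 40], 95)
    # -> index = int(0.95 * 4) = 3 -> returns 40 (nearest-rank style,
    # clamped to the last element).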
def get_metric_trends(self, metric: SystemMetric, days: int = 7) -> Dict[str, Any]:
"""Get metric trends over a period"""
end_time = timezone.now()
start_time = end_time - timedelta(days=days)
# Get daily aggregations
daily_data = []
for i in range(days):
day_start = start_time + timedelta(days=i)
day_end = day_start + timedelta(days=1)
day_aggregation = self.aggregate_metrics(metric, day_start, day_end)
daily_data.append({
'date': day_start.date(),
'value': day_aggregation['aggregated_value'],
'count': day_aggregation['count']
})
return {
'metric_name': metric.name,
'period_days': days,
'daily_data': daily_data,
'trend': self._calculate_trend([d['value'] for d in daily_data if d['value'] is not None])
}
def _calculate_trend(self, values: List[float]) -> str:
"""Calculate trend direction from values"""
if len(values) < 2:
return 'STABLE'
# Simple linear trend calculation
first_half = values[:len(values)//2]
second_half = values[len(values)//2:]
first_avg = sum(first_half) / len(first_half)
second_avg = sum(second_half) / len(second_half)
change_percent = ((second_avg - first_avg) / first_avg) * 100 if first_avg != 0 else 0
if change_percent > 5:
return 'INCREASING'
elif change_percent < -5:
return 'DECREASING'
else:
return 'STABLE'
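# Usage sketch (assumes an existing SystemMetric instance named `metric`):
#
#   aggregator = MetricsAggregator()
#   last_day = aggregator.aggregate_metrics(
#       metric, timezone.now() - timedelta(days=1), timezone.now())
#   trends = aggregator.get_metric_trends(metric, days=7)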

ETB-API/monitoring/signals.py Normal file

@@ -0,0 +1,88 @@
"""
Signals for monitoring system
"""
import logging
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver
from django.utils import timezone
from monitoring.models import Alert, SystemStatus
from monitoring.services.alerting import AlertingService
logger = logging.getLogger(__name__)
@receiver(post_save, sender=Alert)
def alert_created_handler(sender, instance, created, **kwargs):
"""Handle alert creation"""
if created:
logger.info(f"New alert created: {instance.title} ({instance.severity})")
# Send notifications for new alerts
try:
alerting_service = AlertingService()
alert_data = {
'rule_id': str(instance.rule.id),
'title': instance.title,
'description': instance.description,
'severity': instance.severity,
'current_value': float(instance.triggered_value) if instance.triggered_value else None,
'threshold_value': float(instance.threshold_value) if instance.threshold_value else None
}
notification_results = alerting_service.notification_service.send_alert_notifications(alert_data)
logger.info(f"Alert notifications sent: {notification_results}")
except Exception as e:
logger.error(f"Failed to send alert notifications: {e}")
@receiver(post_save, sender=SystemStatus)
def system_status_changed_handler(sender, instance, created, **kwargs):
"""Handle system status changes"""
if created or instance.tracker.has_changed('status'):
logger.info(f"System status changed to: {instance.status}")
# Update system status in cache or external systems
try:
# This could trigger notifications to external systems
# or update status pages
pass
except Exception as e:
logger.error(f"Failed to update system status: {e}")
# Change-detection helper attached to SystemStatus instances
class SystemStatusTracker:
"""Track changes to SystemStatus model"""
def __init__(self, instance):
self.instance = instance
self._initial_data = {}
if instance.pk:
self._initial_data = {
'status': instance.status,
'message': instance.message
}
def has_changed(self, field):
"""Check if a field has changed"""
if not self.instance.pk:
return True
return getattr(self.instance, field) != self._initial_data.get(field)
# Monkey-patch SystemStatus so every instance carries a tracker.
# Caveat: the snapshot is taken when the instance is constructed and is
# not refreshed after save(), so repeated saves compare against the
# originally loaded state.
def add_tracker_to_system_status():
"""Add tracker to SystemStatus instances"""
original_init = SystemStatus.__init__
def new_init(self, *args, **kwargs):
original_init(self, *args, **kwargs)
self.tracker = SystemStatusTracker(self)
SystemStatus.__init__ = new_init
# Call the function to add tracker
add_tracker_to_system_status()
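# For these receivers to fire, this module must be imported at startup.
# The standard Django pattern is an AppConfig.ready() hook (sketch; the
# project's actual apps.py is not shown here):
#
#   from django.apps import AppConfig
#
#   class MonitoringConfig(AppConfig):
#       name = 'monitoring'
#
#       def ready(self):
#           from monitoring import signals  # noqa: F401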

ETB-API/monitoring/tasks.py Normal file

@@ -0,0 +1,319 @@
"""
Celery tasks for automated monitoring
"""
import logging
from celery import shared_task
from django.utils import timezone
from datetime import timedelta
from monitoring.services.health_checks import HealthCheckService
from monitoring.services.metrics_collector import MetricsCollector
from monitoring.services.alerting import AlertingService
logger = logging.getLogger(__name__)
@shared_task(bind=True, max_retries=3)
def execute_health_checks(self):
"""Execute health checks for all monitoring targets"""
try:
logger.info("Starting health check execution")
health_service = HealthCheckService()
results = health_service.execute_all_health_checks()
logger.info(f"Health checks completed. Results: {len(results)} targets checked")
return {
'status': 'success',
'targets_checked': len(results),
'results': results
}
except Exception as e:
logger.error(f"Health check execution failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying health checks in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def collect_metrics(self):
"""Collect metrics from all configured sources"""
try:
logger.info("Starting metrics collection")
collector = MetricsCollector()
results = collector.collect_all_metrics()
successful_metrics = len([r for r in results.values() if 'error' not in r])
failed_metrics = len([r for r in results.values() if 'error' in r])
logger.info(f"Metrics collection completed. Success: {successful_metrics}, Failed: {failed_metrics}")
return {
'status': 'success',
'successful_metrics': successful_metrics,
'failed_metrics': failed_metrics,
'results': results
}
except Exception as e:
logger.error(f"Metrics collection failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying metrics collection in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def evaluate_alerts(self):
"""Evaluate alert rules and send notifications"""
try:
logger.info("Starting alert evaluation")
alerting_service = AlertingService()
results = alerting_service.run_alert_evaluation()
logger.info(f"Alert evaluation completed. Triggered: {results['triggered_alerts']}, Notifications: {results['notifications_sent']}")
return results
except Exception as e:
logger.error(f"Alert evaluation failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying alert evaluation in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def cleanup_old_data(self):
"""Clean up old monitoring data"""
try:
logger.info("Starting data cleanup")
from monitoring.models import HealthCheck, MetricMeasurement, Alert
# Clean up old health checks (keep last 7 days)
cutoff_date = timezone.now() - timedelta(days=7)
old_health_checks = HealthCheck.objects.filter(checked_at__lt=cutoff_date)
health_checks_deleted = old_health_checks.count()
old_health_checks.delete()
# Clean up old metric measurements (keep last 90 days)
cutoff_date = timezone.now() - timedelta(days=90)
old_measurements = MetricMeasurement.objects.filter(timestamp__lt=cutoff_date)
measurements_deleted = old_measurements.count()
old_measurements.delete()
# Clean up resolved alerts older than 30 days
cutoff_date = timezone.now() - timedelta(days=30)
old_alerts = Alert.objects.filter(
status='RESOLVED',
resolved_at__lt=cutoff_date
)
alerts_deleted = old_alerts.count()
old_alerts.delete()
logger.info(f"Data cleanup completed. Health checks: {health_checks_deleted}, Measurements: {measurements_deleted}, Alerts: {alerts_deleted}")
return {
'status': 'success',
'health_checks_deleted': health_checks_deleted,
'measurements_deleted': measurements_deleted,
'alerts_deleted': alerts_deleted
}
except Exception as e:
logger.error(f"Data cleanup failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying data cleanup in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def generate_system_status_report(self):
"""Generate system status report"""
try:
logger.info("Generating system status report")
from monitoring.models import SystemStatus
from monitoring.services.health_checks import HealthCheckService
health_service = HealthCheckService()
health_summary = health_service.get_system_health_summary()
# Determine overall system status
if health_summary['critical_targets'] > 0:
status = 'MAJOR_OUTAGE'
message = f"Critical issues detected in {health_summary['critical_targets']} systems"
elif health_summary['warning_targets'] > 0:
status = 'DEGRADED'
message = f"Performance issues detected in {health_summary['warning_targets']} systems"
else:
status = 'OPERATIONAL'
message = "All systems operational"
# Create system status record
system_status = SystemStatus.objects.create(
status=status,
message=message,
affected_services=[] # Would be populated based on actual issues
)
logger.info(f"System status report generated: {status}")
return {
'status': 'success',
'system_status': status,
'message': message,
'health_summary': health_summary
}
except Exception as e:
logger.error(f"System status report generation failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying system status report in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def monitor_external_integrations(self):
"""Monitor external integrations and services"""
try:
logger.info("Starting external integrations monitoring")
from monitoring.models import MonitoringTarget
from monitoring.services.health_checks import HealthCheckService
health_service = HealthCheckService()
# Get external integration targets
external_targets = MonitoringTarget.objects.filter(
target_type='EXTERNAL_API',
status='ACTIVE'
)
results = {}
for target in external_targets:
try:
result = health_service.execute_health_check(target, 'HTTP')
results[target.name] = result
# Log integration status
if result['status'] == 'CRITICAL':
logger.warning(f"External integration {target.name} is critical")
elif result['status'] == 'WARNING':
logger.info(f"External integration {target.name} has warnings")
except Exception as e:
logger.error(f"Failed to check external integration {target.name}: {e}")
results[target.name] = {'status': 'CRITICAL', 'error': str(e)}
logger.info(f"External integrations monitoring completed. Checked: {len(results)} integrations")
return {
'status': 'success',
'integrations_checked': len(results),
'results': results
}
except Exception as e:
logger.error(f"External integrations monitoring failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying external integrations monitoring in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def update_monitoring_dashboards(self):
"""Update monitoring dashboards with latest data"""
try:
logger.info("Updating monitoring dashboards")
from monitoring.models import MonitoringDashboard
from monitoring.services.metrics_collector import MetricsAggregator
aggregator = MetricsAggregator()
# Get active dashboards
active_dashboards = MonitoringDashboard.objects.filter(is_active=True)
updated_dashboards = 0
for dashboard in active_dashboards:
try:
# Update dashboard data (this would typically involve caching or real-time updates)
# For now, just log the update
logger.info(f"Updating dashboard: {dashboard.name}")
updated_dashboards += 1
except Exception as e:
logger.error(f"Failed to update dashboard {dashboard.name}: {e}")
logger.info(f"Dashboard updates completed. Updated: {updated_dashboards} dashboards")
return {
'status': 'success',
'dashboards_updated': updated_dashboards
}
except Exception as e:
logger.error(f"Dashboard update failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying dashboard update in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
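# A possible Celery beat schedule wiring these tasks together (sketch;
# the interval values are assumptions, not shipped defaults — this
# configuration would live in the project's settings/celery module):
#
#   CELERY_BEAT_SCHEDULE = {
#       'monitoring-health-checks': {
#           'task': 'monitoring.tasks.execute_health_checks',
#           'schedule': 60.0,   # every minute
#       },
#       'monitoring-collect-metrics': {
#           'task': 'monitoring.tasks.collect_metrics',
#           'schedule': 300.0,  # every 5 minutes
#       },
#       'monitoring-evaluate-alerts': {
#           'task': 'monitoring.tasks.evaluate_alerts',
#           'schedule': 120.0,  # every 2 minutes
#       },
#       'monitoring-cleanup-old-data': {
#           'task': 'monitoring.tasks.cleanup_old_data',
#           'schedule': 86400.0,  # daily
#       },
#   }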

ETB-API/monitoring/urls.py Normal file

@@ -0,0 +1,30 @@
"""
URL configuration for monitoring app
"""
from django.urls import path, include
from rest_framework.routers import DefaultRouter
from monitoring.views import (
MonitoringTargetViewSet, HealthCheckViewSet, SystemMetricViewSet,
MetricMeasurementViewSet, AlertRuleViewSet, AlertViewSet,
MonitoringDashboardViewSet, SystemStatusViewSet, SystemOverviewView,
MonitoringTasksView
)
router = DefaultRouter()
router.register(r'targets', MonitoringTargetViewSet)
router.register(r'health-checks', HealthCheckViewSet)
router.register(r'metrics', SystemMetricViewSet)
router.register(r'measurements', MetricMeasurementViewSet)
router.register(r'alert-rules', AlertRuleViewSet)
router.register(r'alerts', AlertViewSet)
router.register(r'dashboards', MonitoringDashboardViewSet)
router.register(r'status', SystemStatusViewSet)
app_name = 'monitoring'
urlpatterns = [
path('', include(router.urls)),
path('overview/', SystemOverviewView.as_view(), name='system-overview'),
path('tasks/', MonitoringTasksView.as_view(), name='monitoring-tasks'),
]
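# To expose these routes under the base URL documented above, the
# project-level urls.py would include (sketch):
#
#   path('api/monitoring/', include('monitoring.urls')),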

ETB-API/monitoring/views.py Normal file

@@ -0,0 +1,480 @@
"""
Views for monitoring system
"""
import logging
from rest_framework import viewsets, status, permissions
from rest_framework.decorators import action
from rest_framework.response import Response
from rest_framework.views import APIView
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import SearchFilter, OrderingFilter
from django.db import models  # needed for models.Q in MonitoringDashboardViewSet
from django.utils import timezone
from datetime import timedelta
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
from monitoring.serializers import (
MonitoringTargetSerializer, HealthCheckSerializer, SystemMetricSerializer,
MetricMeasurementSerializer, AlertRuleSerializer, AlertSerializer,
MonitoringDashboardSerializer, SystemStatusSerializer,
HealthCheckSummarySerializer, MetricTrendSerializer, AlertSummarySerializer,
SystemOverviewSerializer
)
from monitoring.services.health_checks import HealthCheckService
from monitoring.services.metrics_collector import MetricsCollector, MetricsAggregator
from monitoring.services.alerting import AlertingService
from monitoring.tasks import (
execute_health_checks, collect_metrics, evaluate_alerts,
generate_system_status_report
)
logger = logging.getLogger(__name__)
class MonitoringTargetViewSet(viewsets.ModelViewSet):
"""ViewSet for MonitoringTarget model"""
queryset = MonitoringTarget.objects.all()
serializer_class = MonitoringTargetSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['target_type', 'status', 'last_status', 'related_module']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at', 'last_checked']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a monitoring target"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['post'])
def test_connection(self, request, pk=None):
"""Test connection to monitoring target"""
target = self.get_object()
try:
health_service = HealthCheckService()
result = health_service.execute_health_check(target, 'HTTP')
return Response({
'status': 'success',
'result': result
})
except Exception as e:
return Response({
'status': 'error',
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def enable_monitoring(self, request, pk=None):
"""Enable monitoring for a target"""
target = self.get_object()
target.status = 'ACTIVE'
target.save()
return Response({
'status': 'success',
'message': f'Monitoring enabled for {target.name}'
})
@action(detail=True, methods=['post'])
def disable_monitoring(self, request, pk=None):
"""Disable monitoring for a target"""
target = self.get_object()
target.status = 'INACTIVE'
target.save()
return Response({
'status': 'success',
'message': f'Monitoring disabled for {target.name}'
})
class HealthCheckViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for HealthCheck model (read-only)"""
queryset = HealthCheck.objects.all()
serializer_class = HealthCheckSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['target', 'check_type', 'status']
ordering_fields = ['checked_at', 'response_time_ms']
ordering = ['-checked_at']
@action(detail=False, methods=['get'])
def summary(self, request):
"""Get health check summary"""
try:
health_service = HealthCheckService()
summary = health_service.get_system_health_summary()
serializer = HealthCheckSummarySerializer(summary)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=False, methods=['post'])
def run_all_checks(self, request):
"""Run health checks for all targets"""
try:
# Execute health checks asynchronously
task = execute_health_checks.delay()
return Response({
'status': 'success',
'message': 'Health checks started',
'task_id': task.id
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class SystemMetricViewSet(viewsets.ModelViewSet):
"""ViewSet for SystemMetric model"""
queryset = SystemMetric.objects.all()
serializer_class = SystemMetricSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['metric_type', 'category', 'is_active', 'related_module']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a metric"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['get'])
def measurements(self, request, pk=None):
"""Get measurements for a metric"""
metric = self.get_object()
        # Get query parameters, rejecting non-integer input
        try:
            hours = int(request.query_params.get('hours', 24))
            limit = int(request.query_params.get('limit', 100))
        except (TypeError, ValueError):
            return Response({
                'error': 'hours and limit must be integers'
            }, status=status.HTTP_400_BAD_REQUEST)
        since = timezone.now() - timedelta(hours=hours)
measurements = MetricMeasurement.objects.filter(
metric=metric,
timestamp__gte=since
).order_by('-timestamp')[:limit]
serializer = MetricMeasurementSerializer(measurements, many=True)
return Response(serializer.data)
@action(detail=True, methods=['get'])
def trends(self, request, pk=None):
"""Get metric trends"""
metric = self.get_object()
days = int(request.query_params.get('days', 7))
try:
aggregator = MetricsAggregator()
trends = aggregator.get_metric_trends(metric, days)
serializer = MetricTrendSerializer(trends)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MetricMeasurementViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for MetricMeasurement model (read-only)"""
queryset = MetricMeasurement.objects.all()
serializer_class = MetricMeasurementSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, OrderingFilter]
filterset_fields = ['metric']
ordering_fields = ['timestamp', 'value']
ordering = ['-timestamp']
class AlertRuleViewSet(viewsets.ModelViewSet):
"""ViewSet for AlertRule model"""
queryset = AlertRule.objects.all()
serializer_class = AlertRuleSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['alert_type', 'severity', 'status', 'is_enabled']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating an alert rule"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['post'])
def test_rule(self, request, pk=None):
"""Test an alert rule"""
rule = self.get_object()
try:
alerting_service = AlertingService()
# This would test the rule without creating an alert
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} test completed'
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def enable_rule(self, request, pk=None):
"""Enable an alert rule"""
rule = self.get_object()
rule.is_enabled = True
rule.save()
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} enabled'
})
@action(detail=True, methods=['post'])
def disable_rule(self, request, pk=None):
"""Disable an alert rule"""
rule = self.get_object()
rule.is_enabled = False
rule.save()
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} disabled'
})
class AlertViewSet(viewsets.ModelViewSet):
"""ViewSet for Alert model"""
queryset = Alert.objects.all()
serializer_class = AlertSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['rule', 'severity', 'status']
search_fields = ['title', 'description']
ordering_fields = ['triggered_at', 'severity']
ordering = ['-triggered_at']
@action(detail=True, methods=['post'])
def acknowledge(self, request, pk=None):
"""Acknowledge an alert"""
alert = self.get_object()
try:
alerting_service = AlertingService()
result = alerting_service.acknowledge_alert(str(alert.id), request.user)
return Response(result)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def resolve(self, request, pk=None):
"""Resolve an alert"""
alert = self.get_object()
try:
alerting_service = AlertingService()
result = alerting_service.resolve_alert(str(alert.id), request.user)
return Response(result)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=False, methods=['get'])
def summary(self, request):
"""Get alert summary"""
try:
            # Calculate summary directly from the database
total_alerts = Alert.objects.count()
critical_alerts = Alert.objects.filter(severity='CRITICAL', status='TRIGGERED').count()
high_alerts = Alert.objects.filter(severity='HIGH', status='TRIGGERED').count()
medium_alerts = Alert.objects.filter(severity='MEDIUM', status='TRIGGERED').count()
low_alerts = Alert.objects.filter(severity='LOW', status='TRIGGERED').count()
acknowledged_alerts = Alert.objects.filter(status='ACKNOWLEDGED').count()
resolved_alerts = Alert.objects.filter(status='RESOLVED').count()
summary = {
'total_alerts': total_alerts,
'critical_alerts': critical_alerts,
'high_alerts': high_alerts,
'medium_alerts': medium_alerts,
'low_alerts': low_alerts,
'acknowledged_alerts': acknowledged_alerts,
'resolved_alerts': resolved_alerts
}
serializer = AlertSummarySerializer(summary)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringDashboardViewSet(viewsets.ModelViewSet):
"""ViewSet for MonitoringDashboard model"""
queryset = MonitoringDashboard.objects.all()
serializer_class = MonitoringDashboardSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['dashboard_type', 'is_active', 'is_public']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a dashboard"""
serializer.save(created_by=self.request.user)
def get_queryset(self):
"""Filter dashboards based on user access"""
queryset = super().get_queryset()
if not self.request.user.is_staff:
# Non-staff users can only see public dashboards or dashboards they have access to
queryset = queryset.filter(
models.Q(is_public=True) |
models.Q(allowed_users=self.request.user)
).distinct()
return queryset
class SystemStatusViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for SystemStatus model (read-only)"""
queryset = SystemStatus.objects.all()
serializer_class = SystemStatusSerializer
permission_classes = [permissions.IsAuthenticated]
ordering = ['-started_at']
class SystemOverviewView(APIView):
"""System overview endpoint"""
permission_classes = [permissions.IsAuthenticated]
def get(self, request):
"""Get system overview"""
try:
# Get current system status
current_status = SystemStatus.objects.filter(
resolved_at__isnull=True
).order_by('-started_at').first()
if not current_status:
# Create default operational status
current_status = SystemStatus.objects.create(
status='OPERATIONAL',
message='All systems operational',
created_by=request.user
)
# Get health summary
health_service = HealthCheckService()
health_summary = health_service.get_system_health_summary()
# Get alert summary
alerting_service = AlertingService()
active_alerts = alerting_service.get_active_alerts()
alert_summary = {
'total_alerts': len(active_alerts),
'critical_alerts': len([a for a in active_alerts if a['severity'] == 'CRITICAL']),
'high_alerts': len([a for a in active_alerts if a['severity'] == 'HIGH']),
'medium_alerts': len([a for a in active_alerts if a['severity'] == 'MEDIUM']),
'low_alerts': len([a for a in active_alerts if a['severity'] == 'LOW']),
'acknowledged_alerts': 0, # Would be calculated from database
'resolved_alerts': 0 # Would be calculated from database
}
# Get recent incidents (mock data for now)
recent_incidents = []
# Get top metrics (mock data for now)
top_metrics = []
# Get system resources
import psutil
system_resources = {
'cpu_percent': psutil.cpu_percent(interval=1),
'memory_percent': psutil.virtual_memory().percent,
'disk_percent': psutil.disk_usage('/').percent
}
overview = {
'system_status': current_status,
'health_summary': health_summary,
'alert_summary': alert_summary,
'recent_incidents': recent_incidents,
'top_metrics': top_metrics,
'system_resources': system_resources
}
serializer = SystemOverviewSerializer(overview)
return Response(serializer.data)
except Exception as e:
logger.error(f"Failed to get system overview: {e}")
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringTasksView(APIView):
"""Monitoring tasks management"""
permission_classes = [permissions.IsAuthenticated]
def post(self, request):
"""Execute monitoring tasks"""
task_type = request.data.get('task_type')
try:
if task_type == 'health_checks':
task = execute_health_checks.delay()
elif task_type == 'metrics_collection':
task = collect_metrics.delay()
elif task_type == 'alert_evaluation':
task = evaluate_alerts.delay()
elif task_type == 'system_status_report':
task = generate_system_status_report.delay()
else:
return Response({
'error': 'Invalid task type'
}, status=status.HTTP_400_BAD_REQUEST)
return Response({
'status': 'success',
'message': f'{task_type} task started',
'task_id': task.id
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
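
For completeness, the `MonitoringTasksView` endpoint can be exercised like the other documented endpoints. Valid `task_type` values are `health_checks`, `metrics_collection`, `alert_evaluation`, and `system_status_report`; an illustrative request and response:

```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```

```json
{
  "status": "success",
  "message": "health_checks task started",
  "task_id": "celery-task-id"
}
```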