Iliyan Angelov
2025-09-19 11:58:53 +03:00
parent 306b20e24a
commit 6b247e5b9f
11423 changed files with 1500615 additions and 778 deletions


@@ -0,0 +1,459 @@
# ETB-API Monitoring System Documentation
## Overview
The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.
## Features
### 1. Health Monitoring
- **System Health Checks**: Monitor application, database, cache, and queue health
- **Module Health**: Individual module status and dependency tracking
- **External Integrations**: Third-party service health monitoring
- **Infrastructure Monitoring**: Server resources and network connectivity
### 2. Metrics Collection
- **Performance Metrics**: API response times, throughput, error rates
- **Business Metrics**: Incident counts, MTTR (mean time to resolve), MTTA (mean time to acknowledge), SLA compliance
- **Security Metrics**: Security events, failed logins, risk assessments
- **Infrastructure Metrics**: CPU, memory, disk usage
- **AI/ML Metrics**: Model accuracy, automation success rates
### 3. Intelligent Alerting
- **Threshold Alerts**: Configurable thresholds for all metrics
- **Anomaly Detection**: Statistical anomaly detection
- **Pattern Alerts**: Pattern-based alerting
- **Multi-Channel Notifications**: Email, Slack, and webhook support
- **Alert Management**: Acknowledge, resolve, and track alerts
### 4. Monitoring Dashboards
- **System Overview**: High-level system status
- **Performance Dashboard**: Performance metrics visualization
- **Business Metrics**: Operational metrics dashboard
- **Security Dashboard**: Security monitoring dashboard
- **Custom Dashboards**: User-configurable dashboards
## API Endpoints
### Base URL
```
http://localhost:8000/api/monitoring/
```
### Authentication
All endpoints require authentication using Django REST Framework token authentication.
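For example, with the `requests` library (the token value is a placeholder, and the health summary endpoint used here is documented below):
```python
import requests

BASE_URL = "http://localhost:8000/api/monitoring"
HEADERS = {"Authorization": "Token your-token-here"}

response = requests.get(f"{BASE_URL}/health-checks/summary/", headers=HEADERS, timeout=10)
response.raise_for_status()
print(response.json()["overall_status"])
```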
### Health Checks
#### Get Health Check Summary
```http
GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
"overall_status": "HEALTHY",
"total_targets": 12,
"healthy_targets": 11,
"warning_targets": 1,
"critical_targets": 0,
"health_percentage": 91.67,
"last_updated": "2024-01-15T10:30:00Z"
}
```
#### Run All Health Checks
```http
POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here
```
**Response:**
```json
{
"status": "success",
"message": "Health checks started",
"task_id": "celery-task-id"
}
```
#### Test Target Connection
```http
POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here
```
### Metrics
#### Get Metric Measurements
```http
GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here
```
#### Get Metric Trends
```http
GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here
```
**Response:**
```json
{
"metric_name": "API Response Time",
"period_days": 7,
"daily_data": [
{
"date": "2024-01-08",
"value": 150.5,
"count": 1440
}
],
"trend": "STABLE"
}
```
### Alerts
#### Get Alert Summary
```http
GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here
```
**Response:**
```json
{
"total_alerts": 25,
"critical_alerts": 2,
"high_alerts": 5,
"medium_alerts": 8,
"low_alerts": 10,
"acknowledged_alerts": 15,
"resolved_alerts": 20
}
```
#### Acknowledge Alert
```http
POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here
```
#### Resolve Alert
```http
POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here
```
### System Overview
#### Get System Overview
```http
GET /api/monitoring/overview/
Authorization: Token your-token-here
```
**Response:**
```json
{
"system_status": {
"status": "OPERATIONAL",
"message": "All systems operational",
"started_at": "2024-01-15T09:00:00Z"
},
"health_summary": {
"overall_status": "HEALTHY",
"total_targets": 12,
"healthy_targets": 12,
"health_percentage": 100.0
},
"alert_summary": {
"total_alerts": 0,
"critical_alerts": 0
},
"system_resources": {
"cpu_percent": 45.2,
"memory_percent": 67.8,
"disk_percent": 34.5
}
}
```
### Monitoring Tasks
#### Execute Monitoring Tasks
```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json
{
"task_type": "health_checks"
}
```
**Available task types** (a Python example follows the list):
- `health_checks`: Execute health checks for all targets
- `metrics_collection`: Collect metrics from all sources
- `alert_evaluation`: Evaluate alert rules and send notifications
- `system_status_report`: Generate system status report
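For example, triggering metrics collection from Python (token placeholder as above):
```python
import requests

response = requests.post(
    "http://localhost:8000/api/monitoring/tasks/",
    headers={"Authorization": "Token your-token-here"},
    json={"task_type": "metrics_collection"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```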
## Data Models
### MonitoringTarget
Represents a system, service, or component to monitor.
**Fields:**
- `name`: Target name
- `target_type`: Type (APPLICATION, DATABASE, CACHE, etc.)
- `endpoint_url`: Health check endpoint
- `status`: Current status (ACTIVE, INACTIVE, etc.)
- `last_status`: Last health check result
- `health_check_enabled`: Whether health checks are enabled
### SystemMetric
Defines metrics to collect and monitor.
**Fields:**
- `name`: Metric name
- `metric_type`: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
- `category`: Category (API_RESPONSE_TIME, MTTR, etc.)
- `unit`: Unit of measurement
- `aggregation_method`: How to aggregate values
- `warning_threshold`: Warning threshold
- `critical_threshold`: Critical threshold
### AlertRule
Defines alert conditions and notifications.
**Fields:**
- `name`: Rule name
- `alert_type`: Type (THRESHOLD, ANOMALY, etc.)
- `severity`: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
- `condition`: Alert condition configuration
- `notification_channels`: Notification channels
- `is_enabled`: Whether rule is enabled
### Alert
Represents triggered alerts. An ORM sketch using these models follows the field list below.
**Fields:**
- `title`: Alert title
- `description`: Alert description
- `severity`: Alert severity
- `status`: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
- `triggered_value`: Value that triggered the alert
- `threshold_value`: Threshold that was exceeded
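A minimal sketch tying these models together via the Django ORM (names, URL, and recipient address are illustrative; choice values must match your configured schema):
```python
from monitoring.models import AlertRule, MonitoringTarget, SystemMetric

target = MonitoringTarget.objects.create(
    name="Payments API",                              # illustrative target
    description="Payments service health",
    target_type="APPLICATION",
    endpoint_url="https://payments.internal/health/",  # hypothetical endpoint
    health_check_enabled=True,
)

metric = SystemMetric.objects.get(name="API Response Time")

AlertRule.objects.create(
    name="Payments latency",
    description="Alert when payment API latency is high",
    alert_type="THRESHOLD",
    severity="HIGH",
    condition={"type": "THRESHOLD", "operator": ">", "threshold": 2000},
    metric=metric,
    target=target,
    notification_channels=[{"type": "EMAIL", "recipients": ["oncall@example.com"]}],
    is_enabled=True,
)
```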
## Configuration
### Environment Variables
```bash
# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60
# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts
# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
```
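These variables are not read automatically; a minimal sketch of wiring a few of them into Django settings (the parsing choices are assumptions):
```python
# settings.py (sketch) -- parse the monitoring environment variables
import os

MONITORING_ENABLED = os.getenv("MONITORING_ENABLED", "false").lower() == "true"
MONITORING_HEALTH_CHECK_INTERVAL = int(os.getenv("MONITORING_HEALTH_CHECK_INTERVAL", "60"))
PERFORMANCE_API_RESPONSE_THRESHOLD = int(os.getenv("PERFORMANCE_API_RESPONSE_THRESHOLD", "2000"))
ALERTING_SLACK_WEBHOOK_URL = os.getenv("ALERTING_SLACK_WEBHOOK_URL", "")
```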
### Celery Configuration
Add to your Celery configuration:
```python
from celery.schedules import crontab
CELERY_BEAT_SCHEDULE = {
'health-checks': {
'task': 'monitoring.tasks.execute_health_checks',
'schedule': 60.0, # Every minute
},
'metrics-collection': {
'task': 'monitoring.tasks.collect_metrics',
'schedule': 300.0, # Every 5 minutes
},
'alert-evaluation': {
'task': 'monitoring.tasks.evaluate_alerts',
'schedule': 60.0, # Every minute
},
'data-cleanup': {
'task': 'monitoring.tasks.cleanup_old_data',
'schedule': crontab(hour=2, minute=0), # Daily at 2 AM
},
}
```
## Setup Instructions
### 1. Install Dependencies
Add to `requirements.txt`:
```
psutil>=5.9.0
requests>=2.31.0
```
### 2. Run Migrations
```bash
python manage.py makemigrations monitoring
python manage.py migrate
```
### 3. Set Up Initial Configuration
```bash
python manage.py setup_monitoring --admin-user admin
```
### 4. Start Celery Workers
```bash
celery -A core worker -l info
celery -A core beat -l info
```
### 5. Access Monitoring
- **Admin Interface**: `http://localhost:8000/admin/monitoring/`
- **API Documentation**: `http://localhost:8000/api/monitoring/`
- **System Overview**: `http://localhost:8000/api/monitoring/overview/`
## Monitoring Best Practices
### 1. Health Checks
- Set check intervals that catch failures quickly without flooding targets
- Use timeouts to prevent hanging checks (see the sketch after this list)
- Monitor dependencies and external services
- Implement graceful degradation
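A minimal sketch of a timeout-and-retry health check (the URL, timeout, and retry count are assumptions):
```python
import requests

def check_endpoint(url: str, timeout: float = 5.0, retries: int = 3) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    for _ in range(retries):
        try:
            # The timeout keeps a dead endpoint from hanging the whole check run.
            response = requests.get(url, timeout=timeout)
            return response.status_code == 200
        except requests.RequestException:
            continue  # transient failure: retry
    return False
```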
### 2. Metrics Collection
- Collect metrics at appropriate intervals
- Use proper aggregation methods
- Set meaningful thresholds
- Monitor both technical and business metrics
### 3. Alerting
- Set up alert rules with appropriate severity levels
- Use multiple notification channels
- Implement alert fatigue prevention
- Regularly review and tune alert thresholds
### 4. Dashboards
- Create role-based dashboards
- Use appropriate refresh intervals
- Include both real-time and historical data
- Make dashboards actionable
## Troubleshooting
### Common Issues
1. **Health Checks Failing**
- Check network connectivity
- Verify endpoint URLs
- Check authentication credentials
- Review timeout settings
2. **Metrics Not Collecting**
- Verify Celery workers are running
- Check metric configuration
- Review collection intervals
- Check for errors in logs
3. **Alerts Not Triggering**
- Verify alert rules are enabled
- Check threshold values
- Review notification channel configuration
- Check alert evaluation task is running
4. **Performance Issues**
- Monitor system resources
- Check database query performance
- Review metric retention settings
- Optimize collection intervals
### Debug Commands
```bash
# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()
# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')
# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()
```
## Integration with Other Modules
### Security Module
- Monitor authentication failures
- Track security events
- Monitor device posture assessments
- Alert on risk assessment anomalies
### Incident Intelligence
- Monitor incident processing times
- Track AI model performance
- Monitor correlation engine health
- Alert on incident volume spikes
### Automation & Orchestration
- Monitor runbook execution success
- Track integration health
- Monitor ChatOps command usage
- Alert on automation failures
### SLA & On-Call
- Monitor SLA compliance
- Track escalation times
- Monitor on-call assignments
- Alert on SLA breaches
### Analytics & Predictive Insights
- Monitor ML model accuracy
- Track prediction performance
- Monitor cost impact calculations
- Alert on anomaly detections
## Future Enhancements
### Planned Features
1. **Advanced Anomaly Detection**: Machine learning-based anomaly detection
2. **Predictive Alerting**: Predict and prevent issues before they occur
3. **Custom Metrics**: User-defined custom metrics
4. **Advanced Dashboards**: Interactive dashboards with drill-down capabilities
5. **Mobile App**: Mobile monitoring application
6. **Integration APIs**: APIs for external monitoring tools
7. **Cost Optimization**: Resource usage optimization recommendations
8. **Compliance Reporting**: Automated compliance reporting
### Integration Roadmap
1. **APM Tools**: New Relic, DataDog, AppDynamics
2. **Log Aggregation**: ELK Stack, Splunk, Fluentd
3. **Infrastructure Monitoring**: Prometheus, Grafana, InfluxDB
4. **Cloud Platforms**: AWS CloudWatch, Azure Monitor, GCP Monitoring
5. **Communication Platforms**: PagerDuty, OpsGenie, VictorOps


@@ -0,0 +1 @@
# Monitoring module for ETB-API system

Binary file not shown.

ETB-API/monitoring/admin.py Normal file

@@ -0,0 +1,289 @@
"""
Admin configuration for monitoring models
"""
from django.contrib import admin
from django.utils.html import format_html
from django.urls import reverse
from django.utils import timezone
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
@admin.register(MonitoringTarget)
class MonitoringTargetAdmin(admin.ModelAdmin):
"""Admin for MonitoringTarget model"""
list_display = [
'name', 'target_type', 'status', 'last_status', 'last_checked',
'health_check_enabled', 'related_module', 'created_at'
]
list_filter = ['target_type', 'status', 'last_status', 'health_check_enabled', 'related_module']
search_fields = ['name', 'description', 'endpoint_url']
readonly_fields = ['id', 'created_at', 'updated_at', 'last_checked']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'target_type', 'related_module')
}),
('Connection Details', {
'fields': ('endpoint_url', 'connection_config')
}),
('Monitoring Configuration', {
'fields': (
'check_interval_seconds', 'timeout_seconds', 'retry_count',
'health_check_enabled', 'health_check_endpoint', 'expected_status_codes'
)
}),
('Status', {
'fields': ('status', 'last_checked', 'last_status')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(HealthCheck)
class HealthCheckAdmin(admin.ModelAdmin):
"""Admin for HealthCheck model"""
list_display = [
'target_name', 'check_type', 'status', 'response_time_ms',
'status_code', 'checked_at'
]
list_filter = ['check_type', 'status', 'target__target_type']
search_fields = ['target__name', 'error_message']
readonly_fields = ['id', 'checked_at']
date_hierarchy = 'checked_at'
def target_name(self, obj):
return obj.target.name
target_name.short_description = 'Target'
def get_queryset(self, request):
return super().get_queryset(request).select_related('target')
@admin.register(SystemMetric)
class SystemMetricAdmin(admin.ModelAdmin):
"""Admin for SystemMetric model"""
list_display = [
'name', 'metric_type', 'category', 'unit', 'is_active',
'is_system_metric', 'related_module', 'created_at'
]
list_filter = ['metric_type', 'category', 'is_active', 'is_system_metric', 'related_module']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'metric_type', 'category', 'unit')
}),
('Configuration', {
'fields': (
'aggregation_method', 'collection_interval_seconds', 'retention_days',
'warning_threshold', 'critical_threshold'
)
}),
('Status', {
'fields': ('is_active', 'is_system_metric', 'related_module')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(MetricMeasurement)
class MetricMeasurementAdmin(admin.ModelAdmin):
"""Admin for MetricMeasurement model"""
list_display = [
'metric_name', 'value', 'unit', 'timestamp'
]
list_filter = ['metric__metric_type', 'metric__category', 'timestamp']
search_fields = ['metric__name']
readonly_fields = ['id', 'timestamp']
date_hierarchy = 'timestamp'
def metric_name(self, obj):
return obj.metric.name
metric_name.short_description = 'Metric'
def unit(self, obj):
return obj.metric.unit
unit.short_description = 'Unit'
def get_queryset(self, request):
return super().get_queryset(request).select_related('metric')
@admin.register(AlertRule)
class AlertRuleAdmin(admin.ModelAdmin):
"""Admin for AlertRule model"""
list_display = [
'name', 'alert_type', 'severity', 'status', 'is_enabled',
'metric_name', 'target_name', 'created_at'
]
list_filter = ['alert_type', 'severity', 'status', 'is_enabled']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'alert_type', 'severity')
}),
('Rule Configuration', {
'fields': ('condition', 'evaluation_interval_seconds')
}),
('Related Objects', {
'fields': ('metric', 'target')
}),
('Notifications', {
'fields': ('notification_channels', 'notification_template')
}),
('Status', {
'fields': ('status', 'is_enabled')
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def metric_name(self, obj):
return obj.metric.name if obj.metric else '-'
metric_name.short_description = 'Metric'
def target_name(self, obj):
return obj.target.name if obj.target else '-'
target_name.short_description = 'Target'
def get_queryset(self, request):
return super().get_queryset(request).select_related('metric', 'target', 'created_by')
@admin.register(Alert)
class AlertAdmin(admin.ModelAdmin):
"""Admin for Alert model"""
list_display = [
'title', 'severity', 'status', 'rule_name', 'triggered_value',
'threshold_value', 'triggered_at', 'acknowledged_by', 'resolved_by'
]
list_filter = ['severity', 'status', 'rule__alert_type', 'triggered_at']
search_fields = ['title', 'description', 'rule__name']
readonly_fields = ['id', 'triggered_at']
date_hierarchy = 'triggered_at'
fieldsets = (
('Alert Information', {
'fields': ('id', 'rule', 'title', 'description', 'severity', 'status')
}),
('Values', {
'fields': ('triggered_value', 'threshold_value', 'context_data')
}),
('Timestamps', {
'fields': ('triggered_at', 'acknowledged_at', 'resolved_at')
}),
('Assignment', {
'fields': ('acknowledged_by', 'resolved_by')
})
)
def rule_name(self, obj):
return obj.rule.name
rule_name.short_description = 'Rule'
def get_queryset(self, request):
return super().get_queryset(request).select_related(
'rule', 'acknowledged_by', 'resolved_by'
)
@admin.register(MonitoringDashboard)
class MonitoringDashboardAdmin(admin.ModelAdmin):
"""Admin for MonitoringDashboard model"""
list_display = [
'name', 'dashboard_type', 'is_active', 'is_public',
'auto_refresh_enabled', 'created_by', 'created_at'
]
list_filter = ['dashboard_type', 'is_active', 'is_public', 'auto_refresh_enabled']
search_fields = ['name', 'description']
readonly_fields = ['id', 'created_at', 'updated_at']
filter_horizontal = ['allowed_users']
fieldsets = (
('Basic Information', {
'fields': ('id', 'name', 'description', 'dashboard_type')
}),
('Configuration', {
'fields': ('layout_config', 'widget_configs')
}),
('Access Control', {
'fields': ('is_public', 'allowed_users', 'allowed_roles')
}),
('Refresh Settings', {
'fields': ('auto_refresh_enabled', 'refresh_interval_seconds')
}),
('Status', {
'fields': ('is_active',)
}),
('Metadata', {
'fields': ('created_by', 'created_at', 'updated_at'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
@admin.register(SystemStatus)
class SystemStatusAdmin(admin.ModelAdmin):
"""Admin for SystemStatus model"""
list_display = [
'status', 'message', 'started_at', 'resolved_at', 'is_resolved',
'created_by'
]
list_filter = ['status', 'started_at', 'resolved_at']
search_fields = ['message', 'affected_services']
readonly_fields = ['id', 'started_at', 'updated_at', 'is_resolved']
date_hierarchy = 'started_at'
fieldsets = (
('Status Information', {
'fields': ('id', 'status', 'message', 'affected_services')
}),
('Timeline', {
'fields': ('started_at', 'updated_at', 'resolved_at', 'estimated_resolution')
}),
('Metadata', {
'fields': ('created_by', 'is_resolved'),
'classes': ('collapse',)
})
)
def get_queryset(self, request):
return super().get_queryset(request).select_related('created_by')
# Custom admin site configuration
admin.site.site_header = "ETB-API Monitoring Administration"
admin.site.site_title = "ETB-API Monitoring"
admin.site.index_title = "Monitoring System Administration"


@@ -0,0 +1,12 @@
from django.apps import AppConfig
class MonitoringConfig(AppConfig):
default_auto_field = 'django.db.models.BigAutoField'
name = 'monitoring'
verbose_name = 'System Monitoring'
def ready(self):
"""Initialize monitoring when Django starts"""
import monitoring.signals
import monitoring.tasks


@@ -0,0 +1,795 @@
"""
Enterprise Monitoring System for ETB-API
Advanced monitoring with metrics, alerting, and observability
"""
import functools
import logging
import time
import psutil
import json
import os
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any, Union
from django.http import HttpRequest, HttpResponse, JsonResponse
from django.conf import settings
from django.utils import timezone
from django.core.cache import cache
from django.db import connection
from django.core.management import call_command
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAuthenticated
from django.core.management.base import BaseCommand
import requests
import redis
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from prometheus_client.core import CollectorRegistry
import threading
import queue
logger = logging.getLogger(__name__)
class MetricsCollector:
"""Enterprise metrics collection system"""
def __init__(self):
self.registry = CollectorRegistry()
self.metrics = self._initialize_metrics()
self.collection_interval = 60 # seconds
self.is_running = False
self.collection_thread = None
def _initialize_metrics(self):
"""Initialize Prometheus metrics"""
metrics = {}
# Application metrics
metrics['http_requests_total'] = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status_code'],
registry=self.registry
)
metrics['http_request_duration_seconds'] = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
registry=self.registry
)
metrics['active_users'] = Gauge(
'active_users',
'Number of active users',
registry=self.registry
)
metrics['incident_count'] = Gauge(
'incident_count',
'Total number of incidents',
['status', 'priority'],
registry=self.registry
)
metrics['sla_breach_count'] = Gauge(
'sla_breach_count',
'Number of SLA breaches',
['sla_type'],
registry=self.registry
)
# System metrics
metrics['system_cpu_usage'] = Gauge(
'system_cpu_usage_percent',
'System CPU usage percentage',
registry=self.registry
)
metrics['system_memory_usage'] = Gauge(
'system_memory_usage_percent',
'System memory usage percentage',
registry=self.registry
)
metrics['system_disk_usage'] = Gauge(
'system_disk_usage_percent',
'System disk usage percentage',
registry=self.registry
)
metrics['database_connections'] = Gauge(
'database_connections_active',
'Active database connections',
registry=self.registry
)
metrics['cache_hit_ratio'] = Gauge(
'cache_hit_ratio',
'Cache hit ratio',
registry=self.registry
)
# Business metrics
metrics['incident_resolution_time'] = Histogram(
'incident_resolution_time_seconds',
'Incident resolution time in seconds',
['priority', 'category'],
registry=self.registry
)
metrics['automation_success_rate'] = Gauge(
'automation_success_rate',
'Automation success rate',
['automation_type'],
registry=self.registry
)
metrics['user_satisfaction_score'] = Gauge(
'user_satisfaction_score',
'User satisfaction score',
registry=self.registry
)
return metrics
def start_collection(self):
"""Start metrics collection in background thread"""
if self.is_running:
return
self.is_running = True
self.collection_thread = threading.Thread(target=self._collect_metrics_loop)
self.collection_thread.daemon = True
self.collection_thread.start()
logger.info("Metrics collection started")
def stop_collection(self):
"""Stop metrics collection"""
self.is_running = False
if self.collection_thread:
self.collection_thread.join()
logger.info("Metrics collection stopped")
def _collect_metrics_loop(self):
"""Main metrics collection loop"""
while self.is_running:
try:
self._collect_system_metrics()
self._collect_application_metrics()
self._collect_business_metrics()
time.sleep(self.collection_interval)
except Exception as e:
logger.error(f"Error collecting metrics: {str(e)}")
time.sleep(self.collection_interval)
def _collect_system_metrics(self):
"""Collect system-level metrics"""
try:
# CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
self.metrics['system_cpu_usage'].set(cpu_percent)
# Memory usage
memory = psutil.virtual_memory()
self.metrics['system_memory_usage'].set(memory.percent)
# Disk usage
disk = psutil.disk_usage('/')
disk_percent = (disk.used / disk.total) * 100
self.metrics['system_disk_usage'].set(disk_percent)
# Database connections
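            # NOTE: pg_stat_activity is PostgreSQL-specific; use an equivalent
            # query (or skip this gauge) on other database backends.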
with connection.cursor() as cursor:
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
db_connections = cursor.fetchone()[0]
self.metrics['database_connections'].set(db_connections)
# Cache hit ratio
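            # NOTE: cache._cache is a private handle and get_stats() exists only
            # on some backends (e.g. memcached); failures land in the except below.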
cache_stats = cache._cache.get_stats()
if cache_stats:
hit_ratio = cache_stats.get('hit_ratio', 0)
self.metrics['cache_hit_ratio'].set(hit_ratio)
except Exception as e:
logger.error(f"Error collecting system metrics: {str(e)}")
def _collect_application_metrics(self):
"""Collect application-level metrics"""
try:
# Active users (from cache)
active_users = cache.get('active_users_count', 0)
self.metrics['active_users'].set(active_users)
# Incident counts
from incident_intelligence.models import Incident
from django.db import models
incident_counts = Incident.objects.values('status', 'priority').annotate(
count=models.Count('id')
)
for incident in incident_counts:
self.metrics['incident_count'].labels(
status=incident['status'],
priority=incident['priority']
).set(incident['count'])
# SLA breach counts
from sla_oncall.models import SLAInstance
sla_breaches = SLAInstance.objects.filter(
status='breached'
).values('sla_type').annotate(
count=models.Count('id')
)
for breach in sla_breaches:
self.metrics['sla_breach_count'].labels(
sla_type=breach['sla_type']
).set(breach['count'])
except Exception as e:
logger.error(f"Error collecting application metrics: {str(e)}")
def _collect_business_metrics(self):
"""Collect business-level metrics"""
try:
# Incident resolution times
from incident_intelligence.models import Incident
from django.db import models
            # Include the timestamps needed to compute the duration below.
            resolved_incidents = Incident.objects.filter(
                status='resolved',
                resolved_at__isnull=False
            ).values('priority', 'category', 'resolved_at', 'created_at')
            # NOTE: this re-observes every resolved incident on each collection
            # cycle, which inflates the histogram; tracking only newly resolved
            # incidents would be more accurate.
            for incident in resolved_incidents:
                resolution_time = (incident['resolved_at'] - incident['created_at']).total_seconds()
                self.metrics['incident_resolution_time'].labels(
                    priority=incident['priority'],
                    category=incident['category']
                ).observe(resolution_time)
# Automation success rates
from automation_orchestration.models import AutomationExecution
from django.db import models
automation_stats = AutomationExecution.objects.values('automation_type').annotate(
total=models.Count('id'),
successful=models.Count('id', filter=models.Q(status='success'))
)
for stat in automation_stats:
success_rate = (stat['successful'] / stat['total']) * 100 if stat['total'] > 0 else 0
self.metrics['automation_success_rate'].labels(
automation_type=stat['automation_type']
).set(success_rate)
# User satisfaction score (from feedback)
from knowledge_learning.models import UserFeedback
from django.db import models
feedback_scores = UserFeedback.objects.values('rating').annotate(
count=models.Count('id')
)
total_feedback = sum(f['count'] for f in feedback_scores)
if total_feedback > 0:
weighted_score = sum(f['rating'] * f['count'] for f in feedback_scores) / total_feedback
self.metrics['user_satisfaction_score'].set(weighted_score)
except Exception as e:
logger.error(f"Error collecting business metrics: {str(e)}")
def record_http_request(self, method: str, endpoint: str, status_code: int, duration: float):
"""Record HTTP request metrics"""
self.metrics['http_requests_total'].labels(
method=method,
endpoint=endpoint,
status_code=str(status_code)
).inc()
self.metrics['http_request_duration_seconds'].labels(
method=method,
endpoint=endpoint
).observe(duration)
def get_metrics(self) -> str:
"""Get metrics in Prometheus format"""
return generate_latest(self.registry)
class AlertManager:
"""Enterprise alert management system"""
def __init__(self):
self.alert_rules = self._load_alert_rules()
self.notification_channels = self._load_notification_channels()
self.alert_queue = queue.Queue()
self.is_running = False
self.alert_thread = None
def _load_alert_rules(self) -> List[Dict[str, Any]]:
"""Load alert rules from configuration"""
return [
{
'name': 'high_cpu_usage',
'condition': 'system_cpu_usage > 80',
'severity': 'warning',
'duration': 300, # 5 minutes
'enabled': True,
},
{
'name': 'high_memory_usage',
'condition': 'system_memory_usage > 85',
'severity': 'warning',
'duration': 300,
'enabled': True,
},
{
'name': 'disk_space_low',
'condition': 'system_disk_usage > 90',
'severity': 'critical',
'duration': 60,
'enabled': True,
},
{
'name': 'database_connections_high',
'condition': 'database_connections > 50',
'severity': 'warning',
'duration': 300,
'enabled': True,
},
{
'name': 'incident_volume_high',
'condition': 'incident_count > 100',
'severity': 'warning',
'duration': 600,
'enabled': True,
},
{
'name': 'sla_breach_detected',
'condition': 'sla_breach_count > 0',
'severity': 'critical',
'duration': 0,
'enabled': True,
},
]
def _load_notification_channels(self) -> List[Dict[str, Any]]:
"""Load notification channels"""
return [
{
'name': 'email',
'type': 'email',
'enabled': True,
'config': {
'recipients': ['admin@company.com'],
'template': 'alert_email.html',
}
},
{
'name': 'slack',
'type': 'slack',
'enabled': True,
'config': {
'webhook_url': os.getenv('SLACK_WEBHOOK_URL'),
'channel': '#alerts',
}
},
{
'name': 'webhook',
'type': 'webhook',
'enabled': True,
'config': {
'url': os.getenv('ALERT_WEBHOOK_URL'),
'headers': {'Authorization': f'Bearer {os.getenv("ALERT_WEBHOOK_TOKEN")}'},
}
},
]
def start_monitoring(self):
"""Start alert monitoring"""
if self.is_running:
return
self.is_running = True
self.alert_thread = threading.Thread(target=self._monitor_alerts)
self.alert_thread.daemon = True
self.alert_thread.start()
logger.info("Alert monitoring started")
def stop_monitoring(self):
"""Stop alert monitoring"""
self.is_running = False
if self.alert_thread:
self.alert_thread.join()
logger.info("Alert monitoring stopped")
def _monitor_alerts(self):
"""Main alert monitoring loop"""
while self.is_running:
try:
self._check_alert_rules()
time.sleep(60) # Check every minute
except Exception as e:
logger.error(f"Error monitoring alerts: {str(e)}")
time.sleep(60)
def _check_alert_rules(self):
"""Check all alert rules"""
for rule in self.alert_rules:
if not rule['enabled']:
continue
try:
if self._evaluate_rule(rule):
self._trigger_alert(rule)
except Exception as e:
logger.error(f"Error checking rule {rule['name']}: {str(e)}")
def _evaluate_rule(self, rule: Dict[str, Any]) -> bool:
"""Evaluate alert rule condition"""
condition = rule['condition']
# Parse condition (simplified)
if 'system_cpu_usage' in condition:
cpu_usage = psutil.cpu_percent()
threshold = float(condition.split('>')[1].strip())
return cpu_usage > threshold
elif 'system_memory_usage' in condition:
memory = psutil.virtual_memory()
threshold = float(condition.split('>')[1].strip())
return memory.percent > threshold
elif 'system_disk_usage' in condition:
disk = psutil.disk_usage('/')
disk_percent = (disk.used / disk.total) * 100
threshold = float(condition.split('>')[1].strip())
return disk_percent > threshold
elif 'database_connections' in condition:
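            # PostgreSQL-specific query; see the note in _collect_system_metrics.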
with connection.cursor() as cursor:
cursor.execute("SELECT COUNT(*) FROM pg_stat_activity")
connections = cursor.fetchone()[0]
threshold = float(condition.split('>')[1].strip())
return connections > threshold
elif 'incident_count' in condition:
from incident_intelligence.models import Incident
count = Incident.objects.count()
threshold = float(condition.split('>')[1].strip())
return count > threshold
elif 'sla_breach_count' in condition:
from sla_oncall.models import SLAInstance
count = SLAInstance.objects.filter(status='breached').count()
threshold = float(condition.split('>')[1].strip())
return count > threshold
return False
def _trigger_alert(self, rule: Dict[str, Any]):
"""Trigger alert for rule violation"""
alert = {
'rule_name': rule['name'],
'severity': rule['severity'],
'message': f"Alert: {rule['name']} - {rule['condition']}",
'timestamp': timezone.now().isoformat(),
'metadata': {
'condition': rule['condition'],
'duration': rule['duration'],
}
}
# Send notifications
self._send_notifications(alert)
# Store alert
self._store_alert(alert)
logger.warning(f"Alert triggered: {rule['name']}")
def _send_notifications(self, alert: Dict[str, Any]):
"""Send alert notifications"""
for channel in self.notification_channels:
if not channel['enabled']:
continue
try:
if channel['type'] == 'email':
self._send_email_notification(alert, channel)
elif channel['type'] == 'slack':
self._send_slack_notification(alert, channel)
elif channel['type'] == 'webhook':
self._send_webhook_notification(alert, channel)
except Exception as e:
logger.error(f"Error sending notification via {channel['name']}: {str(e)}")
def _send_email_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send email notification"""
from django.core.mail import send_mail
subject = f"ETB-API Alert: {alert['rule_name']}"
message = f"""
Alert: {alert['rule_name']}
Severity: {alert['severity']}
Message: {alert['message']}
Time: {alert['timestamp']}
"""
send_mail(
subject=subject,
message=message,
from_email=settings.DEFAULT_FROM_EMAIL,
recipient_list=channel['config']['recipients'],
fail_silently=False,
)
def _send_slack_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send Slack notification"""
webhook_url = channel['config']['webhook_url']
if not webhook_url:
return
payload = {
'channel': channel['config']['channel'],
'text': f"🚨 ETB-API Alert: {alert['rule_name']}",
'attachments': [
{
'color': 'danger' if alert['severity'] == 'critical' else 'warning',
'fields': [
{'title': 'Severity', 'value': alert['severity'], 'short': True},
{'title': 'Message', 'value': alert['message'], 'short': False},
{'title': 'Time', 'value': alert['timestamp'], 'short': True},
]
}
]
}
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
def _send_webhook_notification(self, alert: Dict[str, Any], channel: Dict[str, Any]):
"""Send webhook notification"""
webhook_url = channel['config']['url']
if not webhook_url:
return
headers = channel['config'].get('headers', {})
response = requests.post(webhook_url, json=alert, headers=headers, timeout=10)
response.raise_for_status()
def _store_alert(self, alert: Dict[str, Any]):
"""Store alert in database"""
try:
from monitoring.models import Alert
Alert.objects.create(
rule_name=alert['rule_name'],
severity=alert['severity'],
message=alert['message'],
metadata=alert['metadata'],
timestamp=timezone.now(),
)
except Exception as e:
logger.error(f"Error storing alert: {str(e)}")
class PerformanceProfiler:
"""Enterprise performance profiling system"""
def __init__(self):
self.profiles = {}
self.is_enabled = True
def start_profile(self, name: str) -> str:
"""Start profiling a function or operation"""
if not self.is_enabled:
return None
profile_id = f"{name}_{int(time.time() * 1000)}"
self.profiles[profile_id] = {
'name': name,
'start_time': time.time(),
'start_memory': psutil.Process().memory_info().rss,
'start_cpu': psutil.cpu_percent(),
}
return profile_id
def end_profile(self, profile_id: str) -> Dict[str, Any]:
"""End profiling and return results"""
if not profile_id or profile_id not in self.profiles:
return None
profile = self.profiles.pop(profile_id)
end_time = time.time()
end_memory = psutil.Process().memory_info().rss
end_cpu = psutil.cpu_percent()
result = {
'name': profile['name'],
'duration': end_time - profile['start_time'],
'memory_delta': end_memory - profile['start_memory'],
'cpu_delta': end_cpu - profile['start_cpu'],
'timestamp': timezone.now().isoformat(),
}
# Log slow operations
if result['duration'] > 1.0: # 1 second
logger.warning(f"Slow operation detected: {result['name']} took {result['duration']:.2f}s")
return result
    def profile_function(self, func):
        """Decorator to profile function execution"""
        @functools.wraps(func)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
profile_id = self.start_profile(func.__name__)
try:
result = func(*args, **kwargs)
return result
finally:
if profile_id:
self.end_profile(profile_id)
return wrapper
# Global instances
metrics_collector = MetricsCollector()
alert_manager = AlertManager()
performance_profiler = PerformanceProfiler()
# API Views for monitoring
@api_view(['GET'])
@permission_classes([IsAuthenticated])
def metrics_endpoint(request):
"""Prometheus metrics endpoint"""
try:
        metrics_data = metrics_collector.get_metrics()
        # Plain-text exposition format; DRF's Response would try to JSON-render it.
        return HttpResponse(metrics_data, content_type=CONTENT_TYPE_LATEST)
except Exception as e:
logger.error(f"Error getting metrics: {str(e)}")
return Response(
{'error': 'Failed to get metrics'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
@api_view(['GET'])
@permission_classes([IsAuthenticated])
def monitoring_dashboard(request):
"""Get monitoring dashboard data"""
try:
# Get system metrics
system_metrics = {
'cpu_usage': psutil.cpu_percent(),
'memory_usage': psutil.virtual_memory().percent,
'disk_usage': (psutil.disk_usage('/').used / psutil.disk_usage('/').total) * 100,
'load_average': psutil.getloadavg() if hasattr(psutil, 'getloadavg') else [0, 0, 0],
}
# Get application metrics
from incident_intelligence.models import Incident
from sla_oncall.models import SLAInstance
application_metrics = {
'total_incidents': Incident.objects.count(),
'active_incidents': Incident.objects.filter(status='active').count(),
'resolved_incidents': Incident.objects.filter(status='resolved').count(),
'sla_breaches': SLAInstance.objects.filter(status='breached').count(),
'active_users': cache.get('active_users_count', 0),
}
# Get recent alerts
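        # NOTE: assumes Alert exposes rule_name/message/timestamp; see the note
        # in AlertManager._store_alert about matching the actual model fields.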
from monitoring.models import Alert
recent_alerts = Alert.objects.filter(
timestamp__gte=timezone.now() - timedelta(hours=24)
).order_by('-timestamp')[:10]
return Response({
'system_metrics': system_metrics,
'application_metrics': application_metrics,
'recent_alerts': [
{
'rule_name': alert.rule_name,
'severity': alert.severity,
'message': alert.message,
'timestamp': alert.timestamp.isoformat(),
}
for alert in recent_alerts
],
})
except Exception as e:
logger.error(f"Monitoring dashboard error: {str(e)}")
return Response(
{'error': 'Failed to load monitoring dashboard'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
@api_view(['POST'])
@permission_classes([IsAuthenticated])
def test_alert(request):
"""Test alert notification"""
try:
test_alert = {
'rule_name': 'test_alert',
'severity': 'info',
'message': 'This is a test alert',
'timestamp': timezone.now().isoformat(),
'metadata': {'test': True},
}
alert_manager._send_notifications(test_alert)
return Response({
'message': 'Test alert sent successfully',
'alert': test_alert,
})
except Exception as e:
logger.error(f"Test alert error: {str(e)}")
return Response(
{'error': 'Failed to send test alert'},
status=status.HTTP_500_INTERNAL_SERVER_ERROR
)
class MonitoringMiddleware:
"""Middleware for request monitoring and metrics collection"""
def __init__(self, get_response):
self.get_response = get_response
def __call__(self, request):
start_time = time.time()
response = self.get_response(request)
# Calculate request duration
duration = time.time() - start_time
# Record metrics
metrics_collector.record_http_request(
method=request.method,
endpoint=request.path,
status_code=response.status_code,
duration=duration
)
# Add performance headers
response['X-Response-Time'] = f"{duration:.3f}s"
response['X-Request-ID'] = request.META.get('HTTP_X_REQUEST_ID', 'unknown')
return response
# Management command for starting monitoring services
class StartMonitoringCommand(BaseCommand):
"""Django management command to start monitoring services"""
help = 'Start monitoring services (metrics collection and alerting)'
def handle(self, *args, **options):
self.stdout.write('Starting monitoring services...')
# Start metrics collection
metrics_collector.start_collection()
self.stdout.write(self.style.SUCCESS('Metrics collection started'))
# Start alert monitoring
alert_manager.start_monitoring()
self.stdout.write(self.style.SUCCESS('Alert monitoring started'))
self.stdout.write(self.style.SUCCESS('All monitoring services started successfully'))
# Keep running
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
self.stdout.write('Stopping monitoring services...')
metrics_collector.stop_collection()
alert_manager.stop_monitoring()
self.stdout.write(self.style.SUCCESS('Monitoring services stopped'))


@@ -0,0 +1 @@
# Management commands for monitoring


@@ -0,0 +1 @@
# Management commands


@@ -0,0 +1,665 @@
"""
Management command to set up initial monitoring configuration
"""
from django.core.management.base import BaseCommand
from django.contrib.auth import get_user_model
from monitoring.models import (
MonitoringTarget, SystemMetric, AlertRule, MonitoringDashboard
)
User = get_user_model()
class Command(BaseCommand):
help = 'Set up initial monitoring configuration'
def add_arguments(self, parser):
parser.add_argument(
'--admin-user',
type=str,
help='Username of admin user to create monitoring objects',
default='admin'
)
def handle(self, *args, **options):
admin_username = options['admin_user']
try:
admin_user = User.objects.get(username=admin_username)
except User.DoesNotExist:
self.stdout.write(
self.style.ERROR(f'Admin user "{admin_username}" not found')
)
return
self.stdout.write('Setting up monitoring configuration...')
# Create default monitoring targets
self.create_default_targets(admin_user)
# Create default metrics
self.create_default_metrics(admin_user)
# Create default alert rules
self.create_default_alert_rules(admin_user)
# Create default dashboards
self.create_default_dashboards(admin_user)
self.stdout.write(
self.style.SUCCESS('Monitoring configuration setup completed!')
)
def create_default_targets(self, admin_user):
"""Create default monitoring targets"""
self.stdout.write('Creating default monitoring targets...')
targets = [
{
'name': 'Django Application',
'description': 'Main Django application health check',
'target_type': 'APPLICATION',
'endpoint_url': 'http://localhost:8000/health/',
'related_module': 'core',
'health_check_enabled': True,
'expected_status_codes': [200]
},
{
'name': 'Database',
'description': 'Database connection health check',
'target_type': 'DATABASE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Cache System',
'description': 'Cache system health check',
'target_type': 'CACHE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Celery Workers',
'description': 'Celery worker health check',
'target_type': 'QUEUE',
'related_module': 'core',
'health_check_enabled': True
},
{
'name': 'Security Module',
'description': 'Security module health check',
'target_type': 'MODULE',
'related_module': 'security',
'health_check_enabled': True
},
{
'name': 'Incident Intelligence Module',
'description': 'Incident Intelligence module health check',
'target_type': 'MODULE',
'related_module': 'incident_intelligence',
'health_check_enabled': True
},
{
'name': 'Automation Orchestration Module',
'description': 'Automation Orchestration module health check',
'target_type': 'MODULE',
'related_module': 'automation_orchestration',
'health_check_enabled': True
},
{
'name': 'SLA OnCall Module',
'description': 'SLA OnCall module health check',
'target_type': 'MODULE',
'related_module': 'sla_oncall',
'health_check_enabled': True
},
{
'name': 'Collaboration War Rooms Module',
'description': 'Collaboration War Rooms module health check',
'target_type': 'MODULE',
'related_module': 'collaboration_war_rooms',
'health_check_enabled': True
},
{
'name': 'Compliance Governance Module',
'description': 'Compliance Governance module health check',
'target_type': 'MODULE',
'related_module': 'compliance_governance',
'health_check_enabled': True
},
{
'name': 'Analytics Predictive Insights Module',
'description': 'Analytics Predictive Insights module health check',
'target_type': 'MODULE',
'related_module': 'analytics_predictive_insights',
'health_check_enabled': True
},
{
'name': 'Knowledge Learning Module',
'description': 'Knowledge Learning module health check',
'target_type': 'MODULE',
'related_module': 'knowledge_learning',
'health_check_enabled': True
}
]
for target_data in targets:
target, created = MonitoringTarget.objects.get_or_create(
name=target_data['name'],
defaults={
**target_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created target: {target.name}')
else:
self.stdout.write(f' Target already exists: {target.name}')
def create_default_metrics(self, admin_user):
"""Create default system metrics"""
self.stdout.write('Creating default system metrics...')
metrics = [
{
'name': 'API Response Time',
'description': 'Average API response time in milliseconds',
'metric_type': 'PERFORMANCE',
'category': 'API_RESPONSE_TIME',
'unit': 'milliseconds',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 1000,
'critical_threshold': 2000,
'is_system_metric': True
},
{
'name': 'Request Throughput',
'description': 'Number of requests per minute',
'metric_type': 'PERFORMANCE',
'category': 'THROUGHPUT',
'unit': 'requests/minute',
'aggregation_method': 'SUM',
'collection_interval_seconds': 60,
'warning_threshold': 1000,
'critical_threshold': 2000,
'is_system_metric': True
},
{
'name': 'Error Rate',
'description': 'Percentage of failed requests',
'metric_type': 'PERFORMANCE',
'category': 'ERROR_RATE',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 5.0,
'critical_threshold': 10.0,
'is_system_metric': True
},
{
'name': 'System Availability',
'description': 'System availability percentage',
'metric_type': 'INFRASTRUCTURE',
'category': 'AVAILABILITY',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 99.0,
'critical_threshold': 95.0,
'is_system_metric': True
},
{
'name': 'Incident Count',
'description': 'Number of incidents in the last 24 hours',
'metric_type': 'BUSINESS',
'category': 'INCIDENT_COUNT',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 10,
'critical_threshold': 20,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Mean Time to Resolve',
'description': 'Average time to resolve incidents in minutes',
'metric_type': 'BUSINESS',
'category': 'MTTR',
'unit': 'minutes',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 120,
'critical_threshold': 240,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Mean Time to Acknowledge',
'description': 'Average time to acknowledge incidents in minutes',
'metric_type': 'BUSINESS',
'category': 'MTTA',
'unit': 'minutes',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 15,
'critical_threshold': 30,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'SLA Compliance',
'description': 'SLA compliance percentage',
'metric_type': 'BUSINESS',
'category': 'SLA_COMPLIANCE',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 95.0,
'critical_threshold': 90.0,
'is_system_metric': True,
'related_module': 'sla_oncall'
},
{
'name': 'Security Events',
'description': 'Number of security events in the last hour',
'metric_type': 'SECURITY',
'category': 'SECURITY_EVENTS',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 5,
'critical_threshold': 10,
'is_system_metric': True,
'related_module': 'security'
},
{
'name': 'Automation Success Rate',
'description': 'Percentage of successful automation executions',
'metric_type': 'BUSINESS',
'category': 'AUTOMATION_SUCCESS',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 90.0,
'critical_threshold': 80.0,
'is_system_metric': True,
'related_module': 'automation_orchestration'
},
{
'name': 'AI Model Accuracy',
'description': 'AI model accuracy percentage',
'metric_type': 'BUSINESS',
'category': 'AI_ACCURACY',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 3600,
'warning_threshold': 85.0,
'critical_threshold': 75.0,
'is_system_metric': True,
'related_module': 'incident_intelligence'
},
{
'name': 'Cost Impact',
'description': 'Total cost impact in USD for the last 30 days',
'metric_type': 'BUSINESS',
'category': 'COST_IMPACT',
'unit': 'USD',
'aggregation_method': 'SUM',
'collection_interval_seconds': 86400,
'warning_threshold': 10000,
'critical_threshold': 50000,
'is_system_metric': True,
'related_module': 'analytics_predictive_insights'
},
{
'name': 'User Activity',
'description': 'Number of active users in the last hour',
'metric_type': 'BUSINESS',
'category': 'USER_ACTIVITY',
'unit': 'count',
'aggregation_method': 'COUNT',
'collection_interval_seconds': 3600,
'warning_threshold': 50,
'critical_threshold': 100,
'is_system_metric': True
},
{
'name': 'CPU Usage',
'description': 'System CPU usage percentage',
'metric_type': 'INFRASTRUCTURE',
'category': 'SYSTEM_RESOURCES',
'unit': 'percentage',
'aggregation_method': 'AVERAGE',
'collection_interval_seconds': 300,
'warning_threshold': 80.0,
'critical_threshold': 90.0,
'is_system_metric': True
}
]
for metric_data in metrics:
metric, created = SystemMetric.objects.get_or_create(
name=metric_data['name'],
defaults={
**metric_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created metric: {metric.name}')
else:
self.stdout.write(f' Metric already exists: {metric.name}')
def create_default_alert_rules(self, admin_user):
"""Create default alert rules"""
self.stdout.write('Creating default alert rules...')
# Get metrics for alert rules
api_response_metric = SystemMetric.objects.filter(name='API Response Time').first()
error_rate_metric = SystemMetric.objects.filter(name='Error Rate').first()
availability_metric = SystemMetric.objects.filter(name='System Availability').first()
incident_count_metric = SystemMetric.objects.filter(name='Incident Count').first()
mttr_metric = SystemMetric.objects.filter(name='Mean Time to Resolve').first()
security_events_metric = SystemMetric.objects.filter(name='Security Events').first()
cpu_metric = SystemMetric.objects.filter(name='CPU Usage').first()
alert_rules = [
{
'name': 'High API Response Time',
'description': 'Alert when API response time exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 2000
},
'metric': api_response_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Error Rate',
'description': 'Alert when error rate exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'CRITICAL',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 10.0
},
'metric': error_rate_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'Low System Availability',
'description': 'Alert when system availability drops below threshold',
'alert_type': 'AVAILABILITY',
'severity': 'CRITICAL',
'condition': {
'type': 'THRESHOLD',
'operator': '<',
'threshold': 95.0
},
'metric': availability_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Incident Count',
'description': 'Alert when incident count exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 20
},
'metric': incident_count_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High MTTR',
'description': 'Alert when mean time to resolve exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'MEDIUM',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 240
},
'metric': mttr_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High Security Events',
'description': 'Alert when security events exceed threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 10
},
'metric': security_events_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
},
{
'name': 'High CPU Usage',
'description': 'Alert when CPU usage exceeds threshold',
'alert_type': 'THRESHOLD',
'severity': 'HIGH',
'condition': {
'type': 'THRESHOLD',
'operator': '>',
'threshold': 90.0
},
'metric': cpu_metric,
'notification_channels': [
{
'type': 'EMAIL',
'recipients': ['admin@example.com']
}
]
}
]
for rule_data in alert_rules:
if rule_data['metric']: # Only create if metric exists
rule, created = AlertRule.objects.get_or_create(
name=rule_data['name'],
defaults={
**rule_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created alert rule: {rule.name}')
else:
self.stdout.write(f' Alert rule already exists: {rule.name}')
def create_default_dashboards(self, admin_user):
"""Create default monitoring dashboards"""
self.stdout.write('Creating default monitoring dashboards...')
dashboards = [
{
'name': 'System Overview',
'description': 'High-level system overview dashboard',
'dashboard_type': 'SYSTEM_OVERVIEW',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 30,
'layout_config': {
'columns': 3,
'rows': 4
},
'widget_configs': [
{
'type': 'system_status',
'position': {'x': 0, 'y': 0, 'width': 3, 'height': 1}
},
{
'type': 'health_summary',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'alert_summary',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'system_resources',
'position': {'x': 2, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'recent_incidents',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 2}
},
{
'type': 'metric_trends',
'position': {'x': 2, 'y': 2, 'width': 1, 'height': 2}
}
]
},
{
'name': 'Performance Dashboard',
'description': 'System performance metrics dashboard',
'dashboard_type': 'PERFORMANCE',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 60,
'layout_config': {
'columns': 2,
'rows': 3
},
'widget_configs': [
{
'type': 'api_response_time',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'throughput',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'error_rate',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'availability',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'system_resources',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
}
]
},
{
'name': 'Business Metrics Dashboard',
'description': 'Business and operational metrics dashboard',
'dashboard_type': 'BUSINESS_METRICS',
'is_public': True,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 300,
'layout_config': {
'columns': 2,
'rows': 3
},
'widget_configs': [
{
'type': 'incident_count',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'mttr',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'mtta',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'sla_compliance',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'cost_impact',
'position': {'x': 0, 'y': 2, 'width': 2, 'height': 1}
}
]
},
{
'name': 'Security Dashboard',
'description': 'Security monitoring dashboard',
'dashboard_type': 'SECURITY',
'is_public': False,
'auto_refresh_enabled': True,
'refresh_interval_seconds': 60,
'layout_config': {
'columns': 2,
'rows': 2
},
'widget_configs': [
{
'type': 'security_events',
'position': {'x': 0, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'failed_logins',
'position': {'x': 1, 'y': 0, 'width': 1, 'height': 1}
},
{
'type': 'risk_assessments',
'position': {'x': 0, 'y': 1, 'width': 1, 'height': 1}
},
{
'type': 'device_posture',
'position': {'x': 1, 'y': 1, 'width': 1, 'height': 1}
}
]
}
]
for dashboard_data in dashboards:
dashboard, created = MonitoringDashboard.objects.get_or_create(
name=dashboard_data['name'],
defaults={
**dashboard_data,
'created_by': admin_user
}
)
if created:
self.stdout.write(f' Created dashboard: {dashboard.name}')
else:
self.stdout.write(f' Dashboard already exists: {dashboard.name}')


@@ -0,0 +1,252 @@
# Generated by Django 5.2.6 on 2025-09-18 19:44
import django.db.models.deletion
import uuid
from django.conf import settings
from django.db import migrations, models
class Migration(migrations.Migration):
initial = True
dependencies = [
migrations.swappable_dependency(settings.AUTH_USER_MODEL),
]
operations = [
migrations.CreateModel(
name='MonitoringTarget',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200, unique=True)),
('description', models.TextField()),
('target_type', models.CharField(choices=[('APPLICATION', 'Application'), ('DATABASE', 'Database'), ('CACHE', 'Cache'), ('QUEUE', 'Message Queue'), ('EXTERNAL_API', 'External API'), ('SERVICE', 'Internal Service'), ('INFRASTRUCTURE', 'Infrastructure'), ('MODULE', 'Django Module')], max_length=20)),
('endpoint_url', models.URLField(blank=True, null=True)),
('connection_config', models.JSONField(default=dict, help_text='Connection configuration (credentials, timeouts, etc.)')),
('check_interval_seconds', models.PositiveIntegerField(default=60)),
('timeout_seconds', models.PositiveIntegerField(default=30)),
('retry_count', models.PositiveIntegerField(default=3)),
('health_check_enabled', models.BooleanField(default=True)),
('health_check_endpoint', models.CharField(blank=True, max_length=200, null=True)),
('expected_status_codes', models.JSONField(default=list, help_text='Expected HTTP status codes for health checks')),
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance'), ('ERROR', 'Error')], default='ACTIVE', max_length=20)),
('last_checked', models.DateTimeField(blank=True, null=True)),
('last_status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], default='UNKNOWN', max_length=20)),
('related_module', models.CharField(blank=True, help_text="Related Django module (e.g., 'security', 'incident_intelligence')", max_length=50, null=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='HealthCheck',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('check_type', models.CharField(choices=[('HTTP', 'HTTP Health Check'), ('DATABASE', 'Database Connection'), ('CACHE', 'Cache Connection'), ('QUEUE', 'Message Queue'), ('CUSTOM', 'Custom Check'), ('PING', 'Network Ping'), ('SSL', 'SSL Certificate')], max_length=20)),
('status', models.CharField(choices=[('HEALTHY', 'Healthy'), ('WARNING', 'Warning'), ('CRITICAL', 'Critical'), ('UNKNOWN', 'Unknown')], max_length=20)),
('response_time_ms', models.PositiveIntegerField(blank=True, null=True)),
('status_code', models.PositiveIntegerField(blank=True, null=True)),
('response_body', models.TextField(blank=True, null=True)),
('error_message', models.TextField(blank=True, null=True)),
('cpu_usage_percent', models.FloatField(blank=True, null=True)),
('memory_usage_percent', models.FloatField(blank=True, null=True)),
('disk_usage_percent', models.FloatField(blank=True, null=True)),
('checked_at', models.DateTimeField(auto_now_add=True)),
('target', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='health_checks', to='monitoring.monitoringtarget')),
],
options={
'ordering': ['-checked_at'],
},
),
migrations.CreateModel(
name='SystemMetric',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('metric_type', models.CharField(choices=[('PERFORMANCE', 'Performance Metric'), ('BUSINESS', 'Business Metric'), ('SECURITY', 'Security Metric'), ('INFRASTRUCTURE', 'Infrastructure Metric'), ('CUSTOM', 'Custom Metric')], max_length=20)),
('category', models.CharField(choices=[('API_RESPONSE_TIME', 'API Response Time'), ('THROUGHPUT', 'Throughput'), ('ERROR_RATE', 'Error Rate'), ('AVAILABILITY', 'Availability'), ('INCIDENT_COUNT', 'Incident Count'), ('MTTR', 'Mean Time to Resolve'), ('MTTA', 'Mean Time to Acknowledge'), ('SLA_COMPLIANCE', 'SLA Compliance'), ('SECURITY_EVENTS', 'Security Events'), ('AUTOMATION_SUCCESS', 'Automation Success Rate'), ('AI_ACCURACY', 'AI Model Accuracy'), ('COST_IMPACT', 'Cost Impact'), ('USER_ACTIVITY', 'User Activity'), ('SYSTEM_RESOURCES', 'System Resources')], max_length=30)),
('unit', models.CharField(help_text='Unit of measurement', max_length=50)),
('aggregation_method', models.CharField(choices=[('AVERAGE', 'Average'), ('SUM', 'Sum'), ('COUNT', 'Count'), ('MIN', 'Minimum'), ('MAX', 'Maximum'), ('PERCENTILE_95', '95th Percentile'), ('PERCENTILE_99', '99th Percentile')], max_length=20)),
('collection_interval_seconds', models.PositiveIntegerField(default=300)),
('retention_days', models.PositiveIntegerField(default=90)),
('warning_threshold', models.FloatField(blank=True, null=True)),
('critical_threshold', models.FloatField(blank=True, null=True)),
('is_active', models.BooleanField(default=True)),
('is_system_metric', models.BooleanField(default=False)),
('related_module', models.CharField(blank=True, help_text='Related Django module', max_length=50, null=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='MetricMeasurement',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('value', models.DecimalField(decimal_places=4, max_digits=15)),
('timestamp', models.DateTimeField(auto_now_add=True)),
('tags', models.JSONField(default=dict, help_text='Additional tags for this measurement')),
('metadata', models.JSONField(default=dict, help_text='Additional metadata')),
('metric', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='measurements', to='monitoring.systemmetric')),
],
options={
'ordering': ['-timestamp'],
},
),
migrations.CreateModel(
name='AlertRule',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('alert_type', models.CharField(choices=[('THRESHOLD', 'Threshold Alert'), ('ANOMALY', 'Anomaly Alert'), ('PATTERN', 'Pattern Alert'), ('AVAILABILITY', 'Availability Alert'), ('PERFORMANCE', 'Performance Alert')], max_length=20)),
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
('condition', models.JSONField(help_text='Alert condition configuration')),
('evaluation_interval_seconds', models.PositiveIntegerField(default=60)),
('notification_channels', models.JSONField(default=list, help_text='List of notification channels (email, slack, webhook, etc.)')),
('notification_template', models.TextField(blank=True, help_text='Custom notification template', null=True)),
('status', models.CharField(choices=[('ACTIVE', 'Active'), ('INACTIVE', 'Inactive'), ('MAINTENANCE', 'Maintenance')], default='ACTIVE', max_length=20)),
('is_enabled', models.BooleanField(default=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
('target', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.monitoringtarget')),
('metric', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.CASCADE, related_name='alert_rules', to='monitoring.systemmetric')),
],
options={
'ordering': ['name'],
},
),
migrations.CreateModel(
name='SystemStatus',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('status', models.CharField(choices=[('OPERATIONAL', 'Operational'), ('DEGRADED', 'Degraded'), ('PARTIAL_OUTAGE', 'Partial Outage'), ('MAJOR_OUTAGE', 'Major Outage'), ('MAINTENANCE', 'Maintenance')], max_length=20)),
('message', models.TextField(help_text='Status message for users')),
('affected_services', models.JSONField(default=list, help_text='List of affected services')),
('estimated_resolution', models.DateTimeField(blank=True, null=True)),
('started_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('resolved_at', models.DateTimeField(blank=True, null=True)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['-started_at'],
},
),
migrations.CreateModel(
name='Alert',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('title', models.CharField(max_length=200)),
('description', models.TextField()),
('severity', models.CharField(choices=[('LOW', 'Low'), ('MEDIUM', 'Medium'), ('HIGH', 'High'), ('CRITICAL', 'Critical')], max_length=20)),
('status', models.CharField(choices=[('TRIGGERED', 'Triggered'), ('ACKNOWLEDGED', 'Acknowledged'), ('RESOLVED', 'Resolved'), ('SUPPRESSED', 'Suppressed')], default='TRIGGERED', max_length=20)),
('triggered_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
('threshold_value', models.DecimalField(blank=True, decimal_places=4, max_digits=15, null=True)),
('context_data', models.JSONField(default=dict, help_text='Additional context data for the alert')),
('triggered_at', models.DateTimeField(auto_now_add=True)),
('acknowledged_at', models.DateTimeField(blank=True, null=True)),
('resolved_at', models.DateTimeField(blank=True, null=True)),
('acknowledged_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='acknowledged_alerts', to=settings.AUTH_USER_MODEL)),
('resolved_by', models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='resolved_alerts', to=settings.AUTH_USER_MODEL)),
('rule', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='alerts', to='monitoring.alertrule')),
],
options={
'ordering': ['-triggered_at'],
'indexes': [models.Index(fields=['rule', 'status'], name='monitoring__rule_id_0ff7d3_idx'), models.Index(fields=['severity', 'status'], name='monitoring__severit_1e6a2c_idx'), models.Index(fields=['triggered_at'], name='monitoring__trigger_743dcf_idx')],
},
),
migrations.CreateModel(
name='MonitoringDashboard',
fields=[
('id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
('name', models.CharField(max_length=200)),
('description', models.TextField()),
('dashboard_type', models.CharField(choices=[('SYSTEM_OVERVIEW', 'System Overview'), ('PERFORMANCE', 'Performance'), ('BUSINESS_METRICS', 'Business Metrics'), ('SECURITY', 'Security'), ('INFRASTRUCTURE', 'Infrastructure'), ('CUSTOM', 'Custom')], max_length=20)),
('layout_config', models.JSONField(default=dict, help_text='Dashboard layout configuration')),
('widget_configs', models.JSONField(default=list, help_text='Configuration for dashboard widgets')),
('is_public', models.BooleanField(default=False)),
('allowed_roles', models.JSONField(default=list, help_text='List of roles that can access this dashboard')),
('auto_refresh_enabled', models.BooleanField(default=True)),
('refresh_interval_seconds', models.PositiveIntegerField(default=30)),
('is_active', models.BooleanField(default=True)),
('created_at', models.DateTimeField(auto_now_add=True)),
('updated_at', models.DateTimeField(auto_now=True)),
('allowed_users', models.ManyToManyField(blank=True, related_name='accessible_monitoring_dashboards', to=settings.AUTH_USER_MODEL)),
('created_by', models.ForeignKey(null=True, on_delete=django.db.models.deletion.SET_NULL, to=settings.AUTH_USER_MODEL)),
],
options={
'ordering': ['name'],
'indexes': [models.Index(fields=['dashboard_type', 'is_active'], name='monitoring__dashboa_2e7a27_idx'), models.Index(fields=['is_public'], name='monitoring__is_publ_811f62_idx')],
},
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['target_type', 'status'], name='monitoring__target__f37347_idx'),
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['related_module'], name='monitoring__related_0c51fc_idx'),
),
migrations.AddIndex(
model_name='monitoringtarget',
index=models.Index(fields=['last_checked'], name='monitoring__last_ch_83ce18_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['target', 'checked_at'], name='monitoring__target__8d1cd6_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['status', 'checked_at'], name='monitoring__status_636b2b_idx'),
),
migrations.AddIndex(
model_name='healthcheck',
index=models.Index(fields=['check_type'], name='monitoring__check_t_b442f3_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['metric_type', 'category'], name='monitoring__metric__df4606_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['related_module'], name='monitoring__related_7b383b_idx'),
),
migrations.AddIndex(
model_name='systemmetric',
index=models.Index(fields=['is_active'], name='monitoring__is_acti_c90676_idx'),
),
migrations.AddIndex(
model_name='metricmeasurement',
index=models.Index(fields=['metric', 'timestamp'], name='monitoring__metric__216cac_idx'),
),
migrations.AddIndex(
model_name='metricmeasurement',
index=models.Index(fields=['timestamp'], name='monitoring__timesta_75a739_idx'),
),
migrations.AddIndex(
model_name='alertrule',
index=models.Index(fields=['alert_type', 'severity'], name='monitoring__alert_t_915b15_idx'),
),
migrations.AddIndex(
model_name='alertrule',
index=models.Index(fields=['status', 'is_enabled'], name='monitoring__status_e905cc_idx'),
),
migrations.AddIndex(
model_name='systemstatus',
index=models.Index(fields=['status', 'started_at'], name='monitoring__status_18966f_idx'),
),
migrations.AddIndex(
model_name='systemstatus',
index=models.Index(fields=['started_at'], name='monitoring__started_d85786_idx'),
),
]
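
Applying the initial schema is standard Django; a minimal sketch:

```python
# Apply the monitoring app's migrations programmatically; equivalent to
# running "python manage.py migrate monitoring" from the shell.
from django.core.management import call_command

call_command("migrate", "monitoring")
```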

View File

@@ -0,0 +1,515 @@
"""
Monitoring models for comprehensive system observability
"""
import uuid
import json
from datetime import datetime, timedelta
from typing import Dict, Any, Optional, List
from decimal import Decimal
from django.db import models
from django.contrib.auth import get_user_model
from django.core.validators import MinValueValidator, MaxValueValidator
from django.utils import timezone
from django.core.exceptions import ValidationError
User = get_user_model()
class MonitoringTarget(models.Model):
"""Target systems, services, or components to monitor"""
TARGET_TYPES = [
('APPLICATION', 'Application'),
('DATABASE', 'Database'),
('CACHE', 'Cache'),
('QUEUE', 'Message Queue'),
('EXTERNAL_API', 'External API'),
('SERVICE', 'Internal Service'),
('INFRASTRUCTURE', 'Infrastructure'),
('MODULE', 'Django Module'),
]
STATUS_CHOICES = [
('ACTIVE', 'Active'),
('INACTIVE', 'Inactive'),
('MAINTENANCE', 'Maintenance'),
('ERROR', 'Error'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200, unique=True)
description = models.TextField()
target_type = models.CharField(max_length=20, choices=TARGET_TYPES)
# Connection details
endpoint_url = models.URLField(blank=True, null=True)
connection_config = models.JSONField(
default=dict,
help_text="Connection configuration (credentials, timeouts, etc.)"
)
# Monitoring configuration
check_interval_seconds = models.PositiveIntegerField(default=60)
timeout_seconds = models.PositiveIntegerField(default=30)
retry_count = models.PositiveIntegerField(default=3)
# Health check configuration
health_check_enabled = models.BooleanField(default=True)
health_check_endpoint = models.CharField(max_length=200, blank=True, null=True)
expected_status_codes = models.JSONField(
default=list,
help_text="Expected HTTP status codes for health checks"
)
# Status and metadata
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
last_checked = models.DateTimeField(null=True, blank=True)
last_status = models.CharField(max_length=20, choices=[
('HEALTHY', 'Healthy'),
('WARNING', 'Warning'),
('CRITICAL', 'Critical'),
('UNKNOWN', 'Unknown'),
], default='UNKNOWN')
# Related module (if applicable)
related_module = models.CharField(
max_length=50,
blank=True,
null=True,
help_text="Related Django module (e.g., 'security', 'incident_intelligence')"
)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['target_type', 'status']),
models.Index(fields=['related_module']),
models.Index(fields=['last_checked']),
]
def __str__(self):
return f"{self.name} ({self.target_type})"
class HealthCheck(models.Model):
"""Individual health check results"""
CHECK_TYPES = [
('HTTP', 'HTTP Health Check'),
('DATABASE', 'Database Connection'),
('CACHE', 'Cache Connection'),
('QUEUE', 'Message Queue'),
('CUSTOM', 'Custom Check'),
('PING', 'Network Ping'),
('SSL', 'SSL Certificate'),
]
STATUS_CHOICES = [
('HEALTHY', 'Healthy'),
('WARNING', 'Warning'),
('CRITICAL', 'Critical'),
('UNKNOWN', 'Unknown'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
target = models.ForeignKey(MonitoringTarget, on_delete=models.CASCADE, related_name='health_checks')
# Check details
check_type = models.CharField(max_length=20, choices=CHECK_TYPES)
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
response_time_ms = models.PositiveIntegerField(null=True, blank=True)
# Response details
status_code = models.PositiveIntegerField(null=True, blank=True)
response_body = models.TextField(blank=True, null=True)
error_message = models.TextField(blank=True, null=True)
# Metrics
cpu_usage_percent = models.FloatField(null=True, blank=True)
memory_usage_percent = models.FloatField(null=True, blank=True)
disk_usage_percent = models.FloatField(null=True, blank=True)
# Timestamps
checked_at = models.DateTimeField(auto_now_add=True)
class Meta:
ordering = ['-checked_at']
indexes = [
models.Index(fields=['target', 'checked_at']),
models.Index(fields=['status', 'checked_at']),
models.Index(fields=['check_type']),
]
def __str__(self):
return f"{self.target.name} - {self.status} ({self.checked_at})"
class SystemMetric(models.Model):
"""System performance and operational metrics"""
METRIC_TYPES = [
('PERFORMANCE', 'Performance Metric'),
('BUSINESS', 'Business Metric'),
('SECURITY', 'Security Metric'),
('INFRASTRUCTURE', 'Infrastructure Metric'),
('CUSTOM', 'Custom Metric'),
]
METRIC_CATEGORIES = [
('API_RESPONSE_TIME', 'API Response Time'),
('THROUGHPUT', 'Throughput'),
('ERROR_RATE', 'Error Rate'),
('AVAILABILITY', 'Availability'),
('INCIDENT_COUNT', 'Incident Count'),
('MTTR', 'Mean Time to Resolve'),
('MTTA', 'Mean Time to Acknowledge'),
('SLA_COMPLIANCE', 'SLA Compliance'),
('SECURITY_EVENTS', 'Security Events'),
('AUTOMATION_SUCCESS', 'Automation Success Rate'),
('AI_ACCURACY', 'AI Model Accuracy'),
('COST_IMPACT', 'Cost Impact'),
('USER_ACTIVITY', 'User Activity'),
('SYSTEM_RESOURCES', 'System Resources'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
metric_type = models.CharField(max_length=20, choices=METRIC_TYPES)
category = models.CharField(max_length=30, choices=METRIC_CATEGORIES)
# Metric configuration
unit = models.CharField(max_length=50, help_text="Unit of measurement")
aggregation_method = models.CharField(
max_length=20,
choices=[
('AVERAGE', 'Average'),
('SUM', 'Sum'),
('COUNT', 'Count'),
('MIN', 'Minimum'),
('MAX', 'Maximum'),
('PERCENTILE_95', '95th Percentile'),
('PERCENTILE_99', '99th Percentile'),
]
)
# Collection configuration
collection_interval_seconds = models.PositiveIntegerField(default=300) # 5 minutes
retention_days = models.PositiveIntegerField(default=90)
# Thresholds
warning_threshold = models.FloatField(null=True, blank=True)
critical_threshold = models.FloatField(null=True, blank=True)
# Status
is_active = models.BooleanField(default=True)
is_system_metric = models.BooleanField(default=False)
# Related module
related_module = models.CharField(
max_length=50,
blank=True,
null=True,
help_text="Related Django module"
)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['metric_type', 'category']),
models.Index(fields=['related_module']),
models.Index(fields=['is_active']),
]
def __str__(self):
return f"{self.name} ({self.category})"
class MetricMeasurement(models.Model):
"""Individual metric measurements"""
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
metric = models.ForeignKey(SystemMetric, on_delete=models.CASCADE, related_name='measurements')
# Measurement details
value = models.DecimalField(max_digits=15, decimal_places=4)
timestamp = models.DateTimeField(auto_now_add=True)
# Context
tags = models.JSONField(
default=dict,
help_text="Additional tags for this measurement"
)
metadata = models.JSONField(
default=dict,
help_text="Additional metadata"
)
class Meta:
ordering = ['-timestamp']
indexes = [
models.Index(fields=['metric', 'timestamp']),
models.Index(fields=['timestamp']),
]
def __str__(self):
return f"{self.metric.name}: {self.value} ({self.timestamp})"
class AlertRule(models.Model):
"""Alert rules for monitoring thresholds"""
ALERT_TYPES = [
('THRESHOLD', 'Threshold Alert'),
('ANOMALY', 'Anomaly Alert'),
('PATTERN', 'Pattern Alert'),
('AVAILABILITY', 'Availability Alert'),
('PERFORMANCE', 'Performance Alert'),
]
SEVERITY_CHOICES = [
('LOW', 'Low'),
('MEDIUM', 'Medium'),
('HIGH', 'High'),
('CRITICAL', 'Critical'),
]
STATUS_CHOICES = [
('ACTIVE', 'Active'),
('INACTIVE', 'Inactive'),
('MAINTENANCE', 'Maintenance'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
alert_type = models.CharField(max_length=20, choices=ALERT_TYPES)
severity = models.CharField(max_length=20, choices=SEVERITY_CHOICES)
# Rule configuration
condition = models.JSONField(
help_text="Alert condition configuration"
)
evaluation_interval_seconds = models.PositiveIntegerField(default=60)
# Related objects
metric = models.ForeignKey(
SystemMetric,
on_delete=models.CASCADE,
null=True,
blank=True,
related_name='alert_rules'
)
target = models.ForeignKey(
MonitoringTarget,
on_delete=models.CASCADE,
null=True,
blank=True,
related_name='alert_rules'
)
# Notification configuration
notification_channels = models.JSONField(
default=list,
help_text="List of notification channels (email, slack, webhook, etc.)"
)
notification_template = models.TextField(
blank=True,
null=True,
help_text="Custom notification template"
)
# Status
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='ACTIVE')
is_enabled = models.BooleanField(default=True)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['alert_type', 'severity']),
models.Index(fields=['status', 'is_enabled']),
]
def __str__(self):
return f"{self.name} ({self.severity})"
class Alert(models.Model):
"""Alert instances"""
STATUS_CHOICES = [
('TRIGGERED', 'Triggered'),
('ACKNOWLEDGED', 'Acknowledged'),
('RESOLVED', 'Resolved'),
('SUPPRESSED', 'Suppressed'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
rule = models.ForeignKey(AlertRule, on_delete=models.CASCADE, related_name='alerts')
# Alert details
title = models.CharField(max_length=200)
description = models.TextField()
severity = models.CharField(max_length=20, choices=AlertRule.SEVERITY_CHOICES)
status = models.CharField(max_length=20, choices=STATUS_CHOICES, default='TRIGGERED')
# Context
triggered_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
threshold_value = models.DecimalField(max_digits=15, decimal_places=4, null=True, blank=True)
context_data = models.JSONField(
default=dict,
help_text="Additional context data for the alert"
)
# Timestamps
triggered_at = models.DateTimeField(auto_now_add=True)
acknowledged_at = models.DateTimeField(null=True, blank=True)
resolved_at = models.DateTimeField(null=True, blank=True)
# Assignment
acknowledged_by = models.ForeignKey(
User,
on_delete=models.SET_NULL,
null=True,
blank=True,
related_name='acknowledged_alerts'
)
resolved_by = models.ForeignKey(
User,
on_delete=models.SET_NULL,
null=True,
blank=True,
related_name='resolved_alerts'
)
class Meta:
ordering = ['-triggered_at']
indexes = [
models.Index(fields=['rule', 'status']),
models.Index(fields=['severity', 'status']),
models.Index(fields=['triggered_at']),
]
def __str__(self):
return f"{self.title} ({self.severity}) - {self.status}"
class MonitoringDashboard(models.Model):
"""Monitoring dashboard configurations"""
DASHBOARD_TYPES = [
('SYSTEM_OVERVIEW', 'System Overview'),
('PERFORMANCE', 'Performance'),
('BUSINESS_METRICS', 'Business Metrics'),
('SECURITY', 'Security'),
('INFRASTRUCTURE', 'Infrastructure'),
('CUSTOM', 'Custom'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
name = models.CharField(max_length=200)
description = models.TextField()
dashboard_type = models.CharField(max_length=20, choices=DASHBOARD_TYPES)
# Dashboard configuration
layout_config = models.JSONField(
default=dict,
help_text="Dashboard layout configuration"
)
widget_configs = models.JSONField(
default=list,
help_text="Configuration for dashboard widgets"
)
# Access control
is_public = models.BooleanField(default=False)
allowed_users = models.ManyToManyField(
User,
blank=True,
related_name='accessible_monitoring_dashboards'
)
allowed_roles = models.JSONField(
default=list,
help_text="List of roles that can access this dashboard"
)
# Refresh configuration
auto_refresh_enabled = models.BooleanField(default=True)
refresh_interval_seconds = models.PositiveIntegerField(default=30)
# Status
is_active = models.BooleanField(default=True)
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
created_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['name']
indexes = [
models.Index(fields=['dashboard_type', 'is_active']),
models.Index(fields=['is_public']),
]
def __str__(self):
return f"{self.name} ({self.dashboard_type})"
class SystemStatus(models.Model):
"""Overall system status tracking"""
STATUS_CHOICES = [
('OPERATIONAL', 'Operational'),
('DEGRADED', 'Degraded'),
('PARTIAL_OUTAGE', 'Partial Outage'),
('MAJOR_OUTAGE', 'Major Outage'),
('MAINTENANCE', 'Maintenance'),
]
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
status = models.CharField(max_length=20, choices=STATUS_CHOICES)
message = models.TextField(help_text="Status message for users")
# Impact details
affected_services = models.JSONField(
default=list,
help_text="List of affected services"
)
estimated_resolution = models.DateTimeField(null=True, blank=True)
# Timestamps
started_at = models.DateTimeField(auto_now_add=True)
updated_at = models.DateTimeField(auto_now=True)
resolved_at = models.DateTimeField(null=True, blank=True)
# Metadata
created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
class Meta:
ordering = ['-started_at']
indexes = [
models.Index(fields=['status', 'started_at']),
models.Index(fields=['started_at']),
]
def __str__(self):
return f"System Status: {self.status} ({self.started_at})"
@property
def is_resolved(self):
return self.resolved_at is not None
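
A minimal ORM sketch of the models above, assuming the `monitoring` app is installed and migrated; the field values are illustrative only.

```python
# Register a target and record one health-check result against it.
from monitoring.models import MonitoringTarget, HealthCheck

target, _ = MonitoringTarget.objects.get_or_create(
    name="Primary API",
    defaults={
        "description": "Main Django application",
        "target_type": "APPLICATION",
        "endpoint_url": "http://localhost:8000/health/",
        "expected_status_codes": [200],
    },
)
HealthCheck.objects.create(
    target=target,
    check_type="HTTP",
    status="HEALTHY",
    response_time_ms=42,
    status_code=200,
)
```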

View File

@@ -0,0 +1,200 @@
"""
Serializers for monitoring models
"""
from rest_framework import serializers
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
class MonitoringTargetSerializer(serializers.ModelSerializer):
"""Serializer for MonitoringTarget model"""
last_status_display = serializers.CharField(source='get_last_status_display', read_only=True)
target_type_display = serializers.CharField(source='get_target_type_display', read_only=True)
class Meta:
model = MonitoringTarget
fields = [
'id', 'name', 'description', 'target_type', 'target_type_display',
'endpoint_url', 'connection_config', 'check_interval_seconds',
'timeout_seconds', 'retry_count', 'health_check_enabled',
'health_check_endpoint', 'expected_status_codes', 'status',
'last_checked', 'last_status', 'last_status_display',
'related_module', 'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at', 'last_checked']
class HealthCheckSerializer(serializers.ModelSerializer):
"""Serializer for HealthCheck model"""
target_name = serializers.CharField(source='target.name', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
check_type_display = serializers.CharField(source='get_check_type_display', read_only=True)
class Meta:
model = HealthCheck
fields = [
'id', 'target', 'target_name', 'check_type', 'check_type_display',
'status', 'status_display', 'response_time_ms', 'status_code',
'response_body', 'error_message', 'cpu_usage_percent',
'memory_usage_percent', 'disk_usage_percent', 'checked_at'
]
read_only_fields = ['id', 'checked_at']
class SystemMetricSerializer(serializers.ModelSerializer):
"""Serializer for SystemMetric model"""
metric_type_display = serializers.CharField(source='get_metric_type_display', read_only=True)
category_display = serializers.CharField(source='get_category_display', read_only=True)
aggregation_method_display = serializers.CharField(source='get_aggregation_method_display', read_only=True)
class Meta:
model = SystemMetric
fields = [
'id', 'name', 'description', 'metric_type', 'metric_type_display',
'category', 'category_display', 'unit', 'aggregation_method',
'aggregation_method_display', 'collection_interval_seconds',
'retention_days', 'warning_threshold', 'critical_threshold',
'is_active', 'is_system_metric', 'related_module',
'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class MetricMeasurementSerializer(serializers.ModelSerializer):
"""Serializer for MetricMeasurement model"""
metric_name = serializers.CharField(source='metric.name', read_only=True)
metric_unit = serializers.CharField(source='metric.unit', read_only=True)
class Meta:
model = MetricMeasurement
fields = [
'id', 'metric', 'metric_name', 'metric_unit', 'value',
'timestamp', 'tags', 'metadata'
]
read_only_fields = ['id', 'timestamp']
class AlertRuleSerializer(serializers.ModelSerializer):
"""Serializer for AlertRule model"""
alert_type_display = serializers.CharField(source='get_alert_type_display', read_only=True)
severity_display = serializers.CharField(source='get_severity_display', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
metric_name = serializers.CharField(source='metric.name', read_only=True)
target_name = serializers.CharField(source='target.name', read_only=True)
class Meta:
model = AlertRule
fields = [
'id', 'name', 'description', 'alert_type', 'alert_type_display',
'severity', 'severity_display', 'condition', 'evaluation_interval_seconds',
'metric', 'metric_name', 'target', 'target_name',
'notification_channels', 'notification_template', 'status',
'status_display', 'is_enabled', 'created_by', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class AlertSerializer(serializers.ModelSerializer):
"""Serializer for Alert model"""
rule_name = serializers.CharField(source='rule.name', read_only=True)
severity_display = serializers.CharField(source='get_severity_display', read_only=True)
status_display = serializers.CharField(source='get_status_display', read_only=True)
acknowledged_by_username = serializers.CharField(source='acknowledged_by.username', read_only=True)
resolved_by_username = serializers.CharField(source='resolved_by.username', read_only=True)
class Meta:
model = Alert
fields = [
'id', 'rule', 'rule_name', 'title', 'description', 'severity',
'severity_display', 'status', 'status_display', 'triggered_value',
'threshold_value', 'context_data', 'triggered_at', 'acknowledged_at',
'resolved_at', 'acknowledged_by', 'acknowledged_by_username',
'resolved_by', 'resolved_by_username'
]
read_only_fields = ['id', 'triggered_at']
class MonitoringDashboardSerializer(serializers.ModelSerializer):
"""Serializer for MonitoringDashboard model"""
dashboard_type_display = serializers.CharField(source='get_dashboard_type_display', read_only=True)
created_by_username = serializers.CharField(source='created_by.username', read_only=True)
class Meta:
model = MonitoringDashboard
fields = [
'id', 'name', 'description', 'dashboard_type', 'dashboard_type_display',
'layout_config', 'widget_configs', 'is_public', 'allowed_users',
'allowed_roles', 'auto_refresh_enabled', 'refresh_interval_seconds',
'is_active', 'created_by', 'created_by_username', 'created_at', 'updated_at'
]
read_only_fields = ['id', 'created_at', 'updated_at']
class SystemStatusSerializer(serializers.ModelSerializer):
"""Serializer for SystemStatus model"""
status_display = serializers.CharField(source='get_status_display', read_only=True)
created_by_username = serializers.CharField(source='created_by.username', read_only=True)
is_resolved = serializers.BooleanField(read_only=True)
class Meta:
model = SystemStatus
fields = [
'id', 'status', 'status_display', 'message', 'affected_services',
'estimated_resolution', 'started_at', 'updated_at', 'resolved_at',
'created_by', 'created_by_username', 'is_resolved'
]
read_only_fields = ['id', 'started_at', 'updated_at']
class HealthCheckSummarySerializer(serializers.Serializer):
"""Serializer for health check summary"""
overall_status = serializers.CharField()
total_targets = serializers.IntegerField()
healthy_targets = serializers.IntegerField()
warning_targets = serializers.IntegerField()
critical_targets = serializers.IntegerField()
health_percentage = serializers.FloatField()
last_updated = serializers.DateTimeField()
class MetricTrendSerializer(serializers.Serializer):
"""Serializer for metric trends"""
metric_name = serializers.CharField()
period_days = serializers.IntegerField()
daily_data = serializers.ListField()
trend = serializers.CharField()
class AlertSummarySerializer(serializers.Serializer):
"""Serializer for alert summary"""
total_alerts = serializers.IntegerField()
critical_alerts = serializers.IntegerField()
high_alerts = serializers.IntegerField()
medium_alerts = serializers.IntegerField()
low_alerts = serializers.IntegerField()
acknowledged_alerts = serializers.IntegerField()
resolved_alerts = serializers.IntegerField()
class SystemOverviewSerializer(serializers.Serializer):
"""Serializer for system overview"""
system_status = SystemStatusSerializer()
health_summary = HealthCheckSummarySerializer()
alert_summary = AlertSummarySerializer()
recent_incidents = serializers.ListField()
top_metrics = serializers.ListField()
system_resources = serializers.DictField()
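
Serializer usage follows standard DRF conventions; a short sketch (the `monitoring.serializers` import path is assumed from this file's location).

```python
# Serialize the most recent health check into the JSON shape returned
# by the health-check API endpoints documented earlier.
from monitoring.models import HealthCheck
from monitoring.serializers import HealthCheckSerializer  # path assumed

latest = HealthCheck.objects.order_by("-checked_at").first()
if latest is not None:
    print(HealthCheckSerializer(latest).data)
```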

View File

@@ -0,0 +1 @@
# Monitoring services

View File

@@ -0,0 +1,449 @@
"""
Alerting service for monitoring system
"""
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.core.mail import send_mail
from django.conf import settings
from django.contrib.auth import get_user_model
from monitoring.models import AlertRule, Alert, SystemMetric, MetricMeasurement, MonitoringTarget
User = get_user_model()
logger = logging.getLogger(__name__)
class AlertEvaluator:
"""Service for evaluating alert conditions"""
def __init__(self):
self.aggregator = None  # resolved lazily to avoid a circular import with the metrics service
def evaluate_alert_rules(self) -> List[Dict[str, Any]]:
"""Evaluate all active alert rules"""
triggered_alerts = []
active_rules = AlertRule.objects.filter(
status='ACTIVE',
is_enabled=True
)
for rule in active_rules:
try:
if self._evaluate_rule(rule):
alert_data = self._create_alert(rule)
triggered_alerts.append(alert_data)
except Exception as e:
logger.error(f"Failed to evaluate alert rule {rule.name}: {e}")
return triggered_alerts
def _evaluate_rule(self, rule: AlertRule) -> bool:
"""Evaluate if an alert rule condition is met"""
condition = rule.condition
condition_type = condition.get('type')
if condition_type == 'THRESHOLD':
return self._evaluate_threshold_condition(rule, condition)
elif condition_type == 'ANOMALY':
return self._evaluate_anomaly_condition(rule, condition)
elif condition_type == 'AVAILABILITY':
return self._evaluate_availability_condition(rule, condition)
elif condition_type == 'PATTERN':
return self._evaluate_pattern_condition(rule, condition)
else:
logger.warning(f"Unknown condition type: {condition_type}")
return False
def _evaluate_threshold_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate threshold-based alert conditions"""
if not rule.metric:
return False
# Get latest metric value
latest_measurement = MetricMeasurement.objects.filter(
metric=rule.metric
).order_by('-timestamp').first()
if not latest_measurement:
return False
current_value = float(latest_measurement.value)
threshold_value = condition.get('threshold')
if threshold_value is None:
logger.warning(f"Alert rule '{rule.name}' has no threshold configured")
return False
operator = condition.get('operator', '>')
if operator == '>':
return current_value > threshold_value
elif operator == '>=':
return current_value >= threshold_value
elif operator == '<':
return current_value < threshold_value
elif operator == '<=':
return current_value <= threshold_value
elif operator == '==':
return current_value == threshold_value
elif operator == '!=':
return current_value != threshold_value
else:
logger.warning(f"Unknown operator: {operator}")
return False
def _evaluate_anomaly_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate anomaly-based alert conditions"""
# This would integrate with anomaly detection models
# For now, implement a simple statistical anomaly detection
if not rule.metric:
return False
# Get recent measurements
since = timezone.now() - timedelta(hours=24)
measurements = MetricMeasurement.objects.filter(
metric=rule.metric,
timestamp__gte=since
).order_by('-timestamp')[:100] # Last 100 measurements
if len(measurements) < 10: # Need minimum data points
return False
values = [float(m.value) for m in measurements]
# Calculate mean and standard deviation
mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)
std_dev = variance ** 0.5
# Check if latest value is an anomaly (more than 2 standard deviations)
latest_value = values[0]
anomaly_threshold = condition.get('threshold', 2.0) # Default 2 sigma
return abs(latest_value - mean) > (anomaly_threshold * std_dev)
def _evaluate_availability_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate availability-based alert conditions"""
if not rule.target:
return False
# Check if target is in critical state
return rule.target.last_status == 'CRITICAL'
def _evaluate_pattern_condition(self, rule: AlertRule, condition: Dict[str, Any]) -> bool:
"""Evaluate pattern-based alert conditions"""
# This would integrate with pattern detection algorithms
# For now, return False as placeholder
return False
def _create_alert(self, rule: AlertRule) -> Dict[str, Any]:
"""Create an alert instance"""
# Get current value for context
current_value = None
threshold_value = None
if rule.metric:
latest_measurement = MetricMeasurement.objects.filter(
metric=rule.metric
).order_by('-timestamp').first()
if latest_measurement:
current_value = float(latest_measurement.value)
threshold_value = rule.metric.critical_threshold
# Create alert
alert = Alert.objects.create(
rule=rule,
title=f"{rule.name} - {rule.severity}",
description=self._generate_alert_description(rule, current_value, threshold_value),
severity=rule.severity,
triggered_value=current_value,
threshold_value=threshold_value,
context_data={
'rule_id': str(rule.id),
'metric_name': rule.metric.name if rule.metric else None,
'target_name': rule.target.name if rule.target else None,
'condition': rule.condition
}
)
return {
'alert_id': str(alert.id),
'rule_id': str(rule.id),  # required by NotificationService.send_alert_notifications
'rule_name': rule.name,
'severity': rule.severity,
'title': alert.title,
'description': alert.description,
'current_value': current_value,
'threshold_value': threshold_value
}
def _generate_alert_description(self, rule: AlertRule, current_value: Optional[float], threshold_value: Optional[float]) -> str:
"""Generate alert description"""
description = f"Alert rule '{rule.name}' has been triggered.\n"
if rule.metric and current_value is not None:
description += f"Current value: {current_value} {rule.metric.unit}\n"
if threshold_value is not None:
description += f"Threshold: {threshold_value} {rule.metric.unit if rule.metric else ''}\n"
if rule.target:
description += f"Target: {rule.target.name}\n"
description += f"Severity: {rule.severity}\n"
description += f"Time: {timezone.now().strftime('%Y-%m-%d %H:%M:%S')}"
return description
class NotificationService:
"""Service for sending alert notifications"""
def __init__(self):
self.evaluator = AlertEvaluator()
def send_alert_notifications(self, alert_data: Dict[str, Any]) -> Dict[str, Any]:
"""Send notifications for an alert"""
results = {}
# Get alert rule to determine notification channels
rule_id = alert_data.get('rule_id')
if not rule_id:
return {'error': 'No rule ID provided'}
try:
rule = AlertRule.objects.get(id=rule_id)
except AlertRule.DoesNotExist:
return {'error': 'Alert rule not found'}
notification_channels = rule.notification_channels or []
for channel in notification_channels:
try:
if channel['type'] == 'EMAIL':
result = self._send_email_notification(alert_data, channel)
elif channel['type'] == 'SLACK':
result = self._send_slack_notification(alert_data, channel)
elif channel['type'] == 'WEBHOOK':
result = self._send_webhook_notification(alert_data, channel)
else:
result = {'error': f'Unknown notification channel type: {channel["type"]}'}
results[channel['type']] = result
except Exception as e:
logger.error(f"Failed to send {channel['type']} notification: {e}")
results[channel['type']] = {'error': str(e)}
return results
def _send_email_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send email notification"""
try:
recipients = channel.get('recipients', [])
if not recipients:
return {'error': 'No email recipients configured'}
subject = f"[{alert_data.get('severity', 'ALERT')}] {alert_data.get('title', 'System Alert')}"
message = alert_data.get('description', '')
send_mail(
subject=subject,
message=message,
from_email=settings.DEFAULT_FROM_EMAIL,
recipient_list=recipients,
fail_silently=False
)
return {'status': 'sent', 'recipients': recipients}
except Exception as e:
return {'error': str(e)}
def _send_slack_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send Slack notification"""
try:
webhook_url = channel.get('webhook_url')
if not webhook_url:
return {'error': 'No Slack webhook URL configured'}
# Create Slack message
color = self._get_slack_color(alert_data.get('severity', 'MEDIUM'))
slack_message = {
"text": alert_data.get('title', 'System Alert'),
"attachments": [
{
"color": color,
"fields": [
{
"title": "Description",
"value": alert_data.get('description', ''),
"short": False
},
{
"title": "Severity",
"value": alert_data.get('severity', 'UNKNOWN'),
"short": True
},
{
"title": "Time",
"value": timezone.now().strftime('%Y-%m-%d %H:%M:%S'),
"short": True
}
]
}
]
}
# Post the message to the Slack incoming-webhook URL
import requests
response = requests.post(webhook_url, json=slack_message, timeout=10)
response.raise_for_status()
return {'status': 'sent', 'channel': channel.get('channel', '#alerts')}
except Exception as e:
return {'error': str(e)}
def _send_webhook_notification(self, alert_data: Dict[str, Any], channel: Dict[str, Any]) -> Dict[str, Any]:
"""Send webhook notification"""
try:
webhook_url = channel.get('url')
if not webhook_url:
return {'error': 'No webhook URL configured'}
# Prepare webhook payload
payload = {
'alert': alert_data,
'timestamp': timezone.now().isoformat(),
'source': 'ETB-API-Monitoring'
}
# POST the payload to the configured webhook endpoint
import requests
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
return {'status': 'sent', 'url': webhook_url}
except Exception as e:
return {'error': str(e)}
def _get_slack_color(self, severity: str) -> str:
"""Get Slack color based on severity"""
color_map = {
'LOW': 'good',
'MEDIUM': 'warning',
'HIGH': 'danger',
'CRITICAL': 'danger'
}
return color_map.get(severity, 'warning')
class AlertingService:
"""Main alerting service that coordinates alert evaluation and notification"""
def __init__(self):
self.evaluator = AlertEvaluator()
self.notification_service = NotificationService()
def run_alert_evaluation(self) -> Dict[str, Any]:
"""Run alert evaluation and send notifications"""
results = {
'evaluated_rules': 0,
'triggered_alerts': 0,
'notifications_sent': 0,
'errors': []
}
try:
# Evaluate all alert rules
triggered_alerts = self.evaluator.evaluate_alert_rules()
results['triggered_alerts'] = len(triggered_alerts)
# Send notifications for triggered alerts
for alert_data in triggered_alerts:
try:
notification_results = self.notification_service.send_alert_notifications(alert_data)
if 'error' not in notification_results:
results['notifications_sent'] += 1
except Exception as e:
logger.error(f"Failed to send notifications for alert {alert_data.get('alert_id')}: {e}")
results['errors'].append(str(e))
# Count evaluated rules
results['evaluated_rules'] = AlertRule.objects.filter(
status='ACTIVE',
is_enabled=True
).count()
except Exception as e:
logger.error(f"Alert evaluation failed: {e}")
results['errors'].append(str(e))
return results
def acknowledge_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
"""Acknowledge an alert"""
try:
alert = Alert.objects.get(id=alert_id)
alert.status = 'ACKNOWLEDGED'
alert.acknowledged_by = user
alert.acknowledged_at = timezone.now()
alert.save()
return {
'status': 'success',
'message': f'Alert {alert_id} acknowledged by {user.username}'
}
except Alert.DoesNotExist:
return {
'status': 'error',
'message': f'Alert {alert_id} not found'
}
except Exception as e:
return {
'status': 'error',
'message': str(e)
}
def resolve_alert(self, alert_id: str, user: User) -> Dict[str, Any]:
"""Resolve an alert"""
try:
alert = Alert.objects.get(id=alert_id)
alert.status = 'RESOLVED'
alert.resolved_by = user
alert.resolved_at = timezone.now()
alert.save()
return {
'status': 'success',
'message': f'Alert {alert_id} resolved by {user.username}'
}
except Alert.DoesNotExist:
return {
'status': 'error',
'message': f'Alert {alert_id} not found'
}
except Exception as e:
return {
'status': 'error',
'message': str(e)
}
def get_active_alerts(self, severity: Optional[str] = None) -> List[Dict[str, Any]]:
"""Get active alerts"""
alerts = Alert.objects.filter(status='TRIGGERED')
if severity:
alerts = alerts.filter(severity=severity)
return [
{
'id': str(alert.id),
'title': alert.title,
'description': alert.description,
'severity': alert.severity,
'triggered_at': alert.triggered_at,
'rule_name': alert.rule.name,
'current_value': float(alert.triggered_value) if alert.triggered_value else None,
'threshold_value': float(alert.threshold_value) if alert.threshold_value else None
}
for alert in alerts.order_by('-triggered_at')
]
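
The JSON shapes consumed by the evaluator and notifier above are implicit in the code; the sketch below makes them explicit (the referenced metric is assumed to already exist).

```python
# A THRESHOLD rule wired the way AlertEvaluator and NotificationService
# read it: condition carries "type"/"operator"/"threshold", and each
# notification-channel dict carries a "type" plus channel-specific keys.
from monitoring.models import AlertRule, SystemMetric

metric = SystemMetric.objects.get(name="API Response Time")  # assumed to exist

AlertRule.objects.create(
    name="API latency above 500ms",
    description="Fires when the latest measurement exceeds 500 ms.",
    alert_type="THRESHOLD",
    severity="HIGH",
    condition={"type": "THRESHOLD", "operator": ">", "threshold": 500},
    metric=metric,
    notification_channels=[
        {"type": "EMAIL", "recipients": ["oncall@example.com"]},
        {"type": "WEBHOOK", "url": "https://example.com/hooks/alerts"},
    ],
)
```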

View File

@@ -0,0 +1,372 @@
"""
Health check services for monitoring system components
"""
import time
import requests
import psutil
import logging
from typing import Dict, Any, Optional, Tuple
from django.conf import settings
from django.db import connection
from django.core.cache import cache
from django.utils import timezone
from celery import current_app as celery_app
logger = logging.getLogger(__name__)
class BaseHealthCheck:
"""Base class for health checks"""
def __init__(self, target):
self.target = target
self.start_time = None
self.end_time = None
def execute(self) -> Dict[str, Any]:
"""Execute the health check and return results"""
self.start_time = time.time()
try:
result = self._perform_check()
self.end_time = time.time()
# Preserve any error_message the concrete check already set
result.setdefault('error_message', None)
result.update({
'response_time_ms': int((self.end_time - self.start_time) * 1000),
'checked_at': timezone.now()
})
return result
except Exception as e:
self.end_time = time.time()
logger.error(f"Health check failed for {self.target.name}: {e}")
return {
'status': 'CRITICAL',
'response_time_ms': int((self.end_time - self.start_time) * 1000),
'checked_at': timezone.now(),
'error_message': str(e)
}
def _perform_check(self) -> Dict[str, Any]:
"""Override in subclasses to implement specific checks"""
raise NotImplementedError
class HTTPHealthCheck(BaseHealthCheck):
"""HTTP-based health check"""
def _perform_check(self) -> Dict[str, Any]:
url = self.target.endpoint_url
if not url:
raise ValueError("No endpoint URL configured")
timeout = self.target.timeout_seconds
expected_codes = self.target.expected_status_codes or [200]
response = requests.get(url, timeout=timeout)
if response.status_code in expected_codes:
status = 'HEALTHY'
elif response.status_code >= 500:
status = 'CRITICAL'
else:
status = 'WARNING'
return {
'status': status,
'status_code': response.status_code,
'response_body': response.text[:1000] # Limit response body size
}
class DatabaseHealthCheck(BaseHealthCheck):
"""Database connection health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
with connection.cursor() as cursor:
cursor.execute("SELECT 1")
result = cursor.fetchone()
if result and result[0] == 1:
return {
'status': 'HEALTHY',
'status_code': 200
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'Database query returned unexpected result'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Database connection failed: {str(e)}'
}
class CacheHealthCheck(BaseHealthCheck):
"""Cache system health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Test cache write/read
test_key = f"health_check_{int(time.time())}"
test_value = "health_check_value"
cache.set(test_key, test_value, timeout=10)
retrieved_value = cache.get(test_key)
if retrieved_value == test_value:
cache.delete(test_key) # Clean up
return {
'status': 'HEALTHY',
'status_code': 200
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'Cache read/write test failed'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Cache operation failed: {str(e)}'
}
class CeleryHealthCheck(BaseHealthCheck):
"""Celery worker health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Check if Celery workers are active
inspect = celery_app.control.inspect()
active_workers = inspect.active()
if active_workers:
worker_count = len(active_workers)
return {
'status': 'HEALTHY',
'status_code': 200,
'response_body': f'Active workers: {worker_count}'
}
else:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': 'No active Celery workers found'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Celery health check failed: {str(e)}'
}
class SystemResourceHealthCheck(BaseHealthCheck):
"""System resource health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
# Get system metrics
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
disk = psutil.disk_usage('/')
# Determine status based on thresholds
status = 'HEALTHY'
if cpu_percent > 90 or memory.percent > 90 or disk.percent > 90:
status = 'CRITICAL'
elif cpu_percent > 80 or memory.percent > 80 or disk.percent > 80:
status = 'WARNING'
return {
'status': status,
'status_code': 200,
'cpu_usage_percent': cpu_percent,
'memory_usage_percent': memory.percent,
'disk_usage_percent': disk.percent,
'response_body': f'CPU: {cpu_percent}%, Memory: {memory.percent}%, Disk: {disk.percent}%'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'System resource check failed: {str(e)}'
}
class ModuleHealthCheck(BaseHealthCheck):
"""Django module health check"""
def _perform_check(self) -> Dict[str, Any]:
try:
module_name = self.target.related_module
if not module_name:
raise ValueError("No module specified for module health check")
# Import the module to check if it's accessible
__import__(module_name)
# Check that the module is registered as a Django app; get_app_config
# raises LookupError rather than returning None when the app is missing
from django.apps import apps
try:
apps.get_app_config(module_name)
except LookupError:
return {
'status': 'WARNING',
'status_code': 200,
'error_message': f'Module {module_name} not found in Django apps'
}
return {
'status': 'HEALTHY',
'status_code': 200,
'response_body': f'Module {module_name} is accessible'
}
except Exception as e:
return {
'status': 'CRITICAL',
'status_code': 500,
'error_message': f'Module health check failed: {str(e)}'
}
class HealthCheckFactory:
"""Factory for creating health check instances"""
CHECK_CLASSES = {
'HTTP': HTTPHealthCheck,
'DATABASE': DatabaseHealthCheck,
'CACHE': CacheHealthCheck,
'QUEUE': CeleryHealthCheck,
'CUSTOM': ModuleHealthCheck,  # MODULE targets are routed through CUSTOM (see _get_check_type_for_target)
'PING': HTTPHealthCheck,  # Use HTTP for ping
'SSL': HTTPHealthCheck,  # Use HTTP for SSL
}
@classmethod
def create_health_check(cls, target, check_type: str) -> BaseHealthCheck:
"""Create a health check instance based on type"""
check_class = cls.CHECK_CLASSES.get(check_type, BaseHealthCheck)
return check_class(target)
@classmethod
def get_available_check_types(cls) -> list:
"""Get list of available health check types"""
return list(cls.CHECK_CLASSES.keys())
class HealthCheckService:
"""Service for managing health checks"""
def __init__(self):
self.factory = HealthCheckFactory()
def execute_health_check(self, target, check_type: str) -> Dict[str, Any]:
"""Execute a health check for a target"""
health_check = self.factory.create_health_check(target, check_type)
return health_check.execute()
def execute_all_health_checks(self) -> Dict[str, Any]:
"""Execute health checks for all active targets"""
from monitoring.models import MonitoringTarget, HealthCheck
results = {}
active_targets = MonitoringTarget.objects.filter(
status='ACTIVE',
health_check_enabled=True
)
for target in active_targets:
try:
# Determine check type based on target type
check_type = self._get_check_type_for_target(target)
# Execute health check
result = self.execute_health_check(target, check_type)
# Save result to database
HealthCheck.objects.create(
target=target,
check_type=check_type,
status=result['status'],
response_time_ms=result.get('response_time_ms'),
status_code=result.get('status_code'),
response_body=result.get('response_body'),
error_message=result.get('error_message'),
cpu_usage_percent=result.get('cpu_usage_percent'),
memory_usage_percent=result.get('memory_usage_percent'),
disk_usage_percent=result.get('disk_usage_percent')
)
# Update target status
target.last_checked = timezone.now()
target.last_status = result['status']
target.save(update_fields=['last_checked', 'last_status'])
results[target.name] = result
except Exception as e:
logger.error(f"Failed to execute health check for {target.name}: {e}")
results[target.name] = {
'status': 'CRITICAL',
'error_message': str(e)
}
return results
def _get_check_type_for_target(self, target) -> str:
"""Determine the appropriate check type for a target"""
target_type_mapping = {
'APPLICATION': 'HTTP',
'DATABASE': 'DATABASE',
'CACHE': 'CACHE',
'QUEUE': 'QUEUE',
'EXTERNAL_API': 'HTTP',
'SERVICE': 'HTTP',
'INFRASTRUCTURE': 'HTTP',
'MODULE': 'CUSTOM',
}
return target_type_mapping.get(target.target_type, 'HTTP')
def get_system_health_summary(self) -> Dict[str, Any]:
"""Get overall system health summary"""
from monitoring.models import HealthCheck, MonitoringTarget
# Get the latest health check per target (DISTINCT ON is PostgreSQL-specific).
# Count statuses in Python: filtering after distinct('target') would apply
# the WHERE clause before DISTINCT ON and count targets with *any* matching
# check instead of targets whose *latest* check matches.
latest_checks = HealthCheck.objects.filter(
target__status='ACTIVE'
).order_by('target', '-checked_at').distinct('target')
latest_statuses = [check.status for check in latest_checks]
total_targets = MonitoringTarget.objects.filter(status='ACTIVE').count()
healthy_targets = latest_statuses.count('HEALTHY')
warning_targets = latest_statuses.count('WARNING')
critical_targets = latest_statuses.count('CRITICAL')
# Calculate overall status
if critical_targets > 0:
overall_status = 'CRITICAL'
elif warning_targets > 0:
overall_status = 'WARNING'
elif healthy_targets == total_targets:
overall_status = 'HEALTHY'
else:
overall_status = 'UNKNOWN'
return {
'overall_status': overall_status,
'total_targets': total_targets,
'healthy_targets': healthy_targets,
'warning_targets': warning_targets,
'critical_targets': critical_targets,
'health_percentage': (healthy_targets / total_targets * 100) if total_targets > 0 else 0,
'last_updated': timezone.now()
}
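# Usage sketch (illustrative):
#
#   service = HealthCheckService()
#   results = service.execute_all_health_checks()   # {target_name: result}
#   summary = service.get_system_health_summary()   # overall status dict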

ETB-API/monitoring/services/metrics_collector.py Normal file

@@ -0,0 +1,364 @@
"""
Metrics collection service for system monitoring
"""
import time
import logging
from typing import Dict, Any, List, Optional
from datetime import datetime, timedelta
from django.utils import timezone
from django.db import connection
from django.core.cache import cache
from django.conf import settings
from django.contrib.auth import get_user_model
from monitoring.models import SystemMetric, MetricMeasurement
User = get_user_model()
logger = logging.getLogger(__name__)
class MetricsCollector:
"""Service for collecting and storing system metrics"""
def __init__(self):
self.collected_metrics = {}
def collect_all_metrics(self) -> Dict[str, Any]:
"""Collect all configured metrics"""
results = {}
# Get all active metrics
active_metrics = SystemMetric.objects.filter(is_active=True)
for metric in active_metrics:
try:
value = self._collect_metric_value(metric)
if value is not None:
# Store measurement
measurement = MetricMeasurement.objects.create(
metric=metric,
value=value,
tags=self._get_metric_tags(metric),
metadata=self._get_metric_metadata(metric)
)
results[metric.name] = {
'value': value,
'measurement_id': measurement.id,
'timestamp': measurement.timestamp
}
except Exception as e:
logger.error(f"Failed to collect metric {metric.name}: {e}")
results[metric.name] = {
'error': str(e)
}
return results
def _collect_metric_value(self, metric: SystemMetric) -> Optional[float]:
"""Collect value for a specific metric"""
category = metric.category
if category == 'API_RESPONSE_TIME':
return self._collect_api_response_time(metric)
elif category == 'THROUGHPUT':
return self._collect_throughput(metric)
elif category == 'ERROR_RATE':
return self._collect_error_rate(metric)
elif category == 'AVAILABILITY':
return self._collect_availability(metric)
elif category == 'INCIDENT_COUNT':
return self._collect_incident_count(metric)
elif category == 'MTTR':
return self._collect_mttr(metric)
elif category == 'MTTA':
return self._collect_mtta(metric)
elif category == 'SLA_COMPLIANCE':
return self._collect_sla_compliance(metric)
elif category == 'SECURITY_EVENTS':
return self._collect_security_events(metric)
elif category == 'AUTOMATION_SUCCESS':
return self._collect_automation_success(metric)
elif category == 'AI_ACCURACY':
return self._collect_ai_accuracy(metric)
elif category == 'COST_IMPACT':
return self._collect_cost_impact(metric)
elif category == 'USER_ACTIVITY':
return self._collect_user_activity(metric)
elif category == 'SYSTEM_RESOURCES':
return self._collect_system_resources(metric)
else:
logger.warning(f"Unknown metric category: {category}")
return None
def _collect_api_response_time(self, metric: SystemMetric) -> Optional[float]:
"""Collect API response time metrics"""
# This would typically come from middleware or APM tools
# For now, return a mock value
return 150.5 # milliseconds
def _collect_throughput(self, metric: SystemMetric) -> Optional[float]:
"""Collect throughput metrics (requests per minute)"""
# Count requests in the last minute
# This would typically come from access logs or middleware
return 120.0 # requests per minute
def _collect_error_rate(self, metric: SystemMetric) -> Optional[float]:
"""Collect error rate metrics"""
# Count errors in the last hour
# This would typically come from logs or error tracking
return 0.02 # 2% error rate
def _collect_availability(self, metric: SystemMetric) -> Optional[float]:
"""Collect availability metrics"""
# Calculate availability percentage
# This would typically come from uptime monitoring
return 99.9 # 99.9% availability
def _collect_incident_count(self, metric: SystemMetric) -> Optional[float]:
"""Collect incident count metrics"""
from incident_intelligence.models import Incident
# Count incidents in the last 24 hours
since = timezone.now() - timedelta(hours=24)
count = Incident.objects.filter(created_at__gte=since).count()
return float(count)
def _collect_mttr(self, metric: SystemMetric) -> Optional[float]:
"""Collect Mean Time to Resolve metrics"""
from incident_intelligence.models import Incident
# Calculate MTTR for resolved incidents in the last 7 days
since = timezone.now() - timedelta(days=7)
resolved_incidents = Incident.objects.filter(
status__in=['RESOLVED', 'CLOSED'],
resolved_at__isnull=False,
resolved_at__gte=since
)
if not resolved_incidents.exists():
return None
total_resolution_time = 0
count = 0
for incident in resolved_incidents:
if incident.resolved_at and incident.created_at:
resolution_time = incident.resolved_at - incident.created_at
total_resolution_time += resolution_time.total_seconds()
count += 1
if count > 0:
return total_resolution_time / count / 60 # Convert to minutes
return None
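    # A database-side equivalent of the loop above (sketch; assumes a
    # backend where datetime subtraction yields an interval, e.g. PostgreSQL):
    #
    #   from django.db.models import Avg, DurationField, ExpressionWrapper, F
    #   delta = resolved_incidents.aggregate(
    #       avg=Avg(ExpressionWrapper(F('resolved_at') - F('created_at'),
    #                                 output_field=DurationField())))['avg']
    #   return delta.total_seconds() / 60 if delta else None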
def _collect_mtta(self, metric: SystemMetric) -> Optional[float]:
"""Collect Mean Time to Acknowledge metrics"""
# This would require tracking when incidents are first acknowledged
# For now, return a mock value
return 15.5 # minutes
def _collect_sla_compliance(self, metric: SystemMetric) -> Optional[float]:
"""Collect SLA compliance metrics"""
from sla_oncall.models import SLAInstance
# Calculate SLA compliance percentage
total_slas = SLAInstance.objects.count()
if total_slas == 0:
return None
# This would require more complex SLA compliance calculation
# For now, return a mock value
return 95.5 # 95.5% SLA compliance
def _collect_security_events(self, metric: SystemMetric) -> Optional[float]:
"""Collect security events metrics"""
# Count security events in the last hour
# This would come from security logs or audit trails
return 3.0 # 3 security events in the last hour
def _collect_automation_success(self, metric: SystemMetric) -> Optional[float]:
"""Collect automation success rate metrics"""
from automation_orchestration.models import RunbookExecution
# Calculate success rate for runbook executions in the last 24 hours
since = timezone.now() - timedelta(hours=24)
executions = RunbookExecution.objects.filter(created_at__gte=since)
if not executions.exists():
return None
successful = executions.filter(status='COMPLETED').count()
total = executions.count()
return (successful / total * 100) if total > 0 else None
def _collect_ai_accuracy(self, metric: SystemMetric) -> Optional[float]:
"""Collect AI model accuracy metrics"""
from incident_intelligence.models import IncidentClassification
# Calculate accuracy for AI classifications
classifications = IncidentClassification.objects.all()
if not classifications.exists():
return None
# This would require comparing predictions with actual outcomes
# For now, return average confidence score
        count = classifications.count()
        total_confidence = sum(c.confidence_score for c in classifications)
        return (total_confidence / count * 100) if count > 0 else None
def _collect_cost_impact(self, metric: SystemMetric) -> Optional[float]:
"""Collect cost impact metrics"""
from analytics_predictive_insights.models import CostImpactAnalysis
# Calculate total cost impact for the last 30 days
since = timezone.now() - timedelta(days=30)
cost_analyses = CostImpactAnalysis.objects.filter(created_at__gte=since)
total_cost = sum(float(ca.cost_amount) for ca in cost_analyses)
return total_cost
def _collect_user_activity(self, metric: SystemMetric) -> Optional[float]:
"""Collect user activity metrics"""
# Count active users in the last hour
since = timezone.now() - timedelta(hours=1)
# This would require user activity tracking
return 25.0 # 25 active users in the last hour
def _collect_system_resources(self, metric: SystemMetric) -> Optional[float]:
"""Collect system resource metrics"""
import psutil
# Get CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
return cpu_percent
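    # Note: psutil is a third-party dependency and must be installed
    # separately (e.g. `pip install psutil`); cpu_percent(interval=1)
    # blocks for roughly one second while sampling.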
def _get_metric_tags(self, metric: SystemMetric) -> Dict[str, str]:
"""Get tags for a metric measurement"""
tags = {
'metric_type': metric.metric_type,
'category': metric.category,
}
if metric.related_module:
tags['module'] = metric.related_module
return tags
def _get_metric_metadata(self, metric: SystemMetric) -> Dict[str, Any]:
"""Get metadata for a metric measurement"""
return {
'unit': metric.unit,
'aggregation_method': metric.aggregation_method,
'collection_interval': metric.collection_interval_seconds,
}
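# Usage sketch (illustrative):
#
#   collector = MetricsCollector()
#   results = collector.collect_all_metrics()
#   # -> {'API Response Time': {'value': 150.5, 'measurement_id': ...}, ...}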
class MetricsAggregator:
"""Service for aggregating metrics over time periods"""
def __init__(self):
self.collector = MetricsCollector()
def aggregate_metrics(self, metric: SystemMetric, start_time: datetime, end_time: datetime) -> Dict[str, Any]:
"""Aggregate metrics over a time period"""
measurements = MetricMeasurement.objects.filter(
metric=metric,
timestamp__gte=start_time,
timestamp__lte=end_time
).order_by('timestamp')
if not measurements.exists():
return {
'count': 0,
'values': [],
'aggregated_value': None
}
values = [float(m.value) for m in measurements]
aggregated_value = self._aggregate_values(values, metric.aggregation_method)
return {
'count': len(values),
'values': values,
'aggregated_value': aggregated_value,
'start_time': start_time,
'end_time': end_time,
'unit': metric.unit
}
def _aggregate_values(self, values: List[float], method: str) -> Optional[float]:
"""Aggregate a list of values using the specified method"""
if not values:
return None
if method == 'AVERAGE':
return sum(values) / len(values)
elif method == 'SUM':
return sum(values)
elif method == 'COUNT':
return len(values)
elif method == 'MIN':
return min(values)
elif method == 'MAX':
return max(values)
elif method == 'PERCENTILE_95':
return self._calculate_percentile(values, 95)
elif method == 'PERCENTILE_99':
return self._calculate_percentile(values, 99)
else:
return sum(values) / len(values) # Default to average
def _calculate_percentile(self, values: List[float], percentile: int) -> float:
"""Calculate percentile of values"""
sorted_values = sorted(values)
index = int((percentile / 100) * len(sorted_values))
return sorted_values[min(index, len(sorted_values) - 1)]
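    # Worked example: _calculate_percentile([10, 20, 30, 40], 95)
    # -> index = int(0.95 * 4) = 3 -> returns 40 (nearest-rank style,
    # clamped to the last element).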
def get_metric_trends(self, metric: SystemMetric, days: int = 7) -> Dict[str, Any]:
"""Get metric trends over a period"""
end_time = timezone.now()
start_time = end_time - timedelta(days=days)
# Get daily aggregations
daily_data = []
for i in range(days):
day_start = start_time + timedelta(days=i)
day_end = day_start + timedelta(days=1)
day_aggregation = self.aggregate_metrics(metric, day_start, day_end)
daily_data.append({
'date': day_start.date(),
'value': day_aggregation['aggregated_value'],
'count': day_aggregation['count']
})
return {
'metric_name': metric.name,
'period_days': days,
'daily_data': daily_data,
'trend': self._calculate_trend([d['value'] for d in daily_data if d['value'] is not None])
}
def _calculate_trend(self, values: List[float]) -> str:
"""Calculate trend direction from values"""
if len(values) < 2:
return 'STABLE'
# Simple linear trend calculation
first_half = values[:len(values)//2]
second_half = values[len(values)//2:]
first_avg = sum(first_half) / len(first_half)
second_avg = sum(second_half) / len(second_half)
change_percent = ((second_avg - first_avg) / first_avg) * 100 if first_avg != 0 else 0
if change_percent > 5:
return 'INCREASING'
elif change_percent < -5:
return 'DECREASING'
else:
return 'STABLE'
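# Usage sketch (assumes an existing SystemMetric instance named `metric`):
#
#   aggregator = MetricsAggregator()
#   last_day = aggregator.aggregate_metrics(
#       metric, timezone.now() - timedelta(days=1), timezone.now())
#   trends = aggregator.get_metric_trends(metric, days=7)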

ETB-API/monitoring/signals.py Normal file

@@ -0,0 +1,88 @@
"""
Signals for monitoring system
"""
import logging
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver
from django.utils import timezone
from monitoring.models import Alert, SystemStatus
from monitoring.services.alerting import AlertingService
logger = logging.getLogger(__name__)
@receiver(post_save, sender=Alert)
def alert_created_handler(sender, instance, created, **kwargs):
"""Handle alert creation"""
if created:
logger.info(f"New alert created: {instance.title} ({instance.severity})")
# Send notifications for new alerts
try:
alerting_service = AlertingService()
alert_data = {
'rule_id': str(instance.rule.id),
'title': instance.title,
'description': instance.description,
'severity': instance.severity,
'current_value': float(instance.triggered_value) if instance.triggered_value else None,
'threshold_value': float(instance.threshold_value) if instance.threshold_value else None
}
notification_results = alerting_service.notification_service.send_alert_notifications(alert_data)
logger.info(f"Alert notifications sent: {notification_results}")
except Exception as e:
logger.error(f"Failed to send alert notifications: {e}")
@receiver(post_save, sender=SystemStatus)
def system_status_changed_handler(sender, instance, created, **kwargs):
"""Handle system status changes"""
if created or instance.tracker.has_changed('status'):
logger.info(f"System status changed to: {instance.status}")
# Update system status in cache or external systems
try:
# This could trigger notifications to external systems
# or update status pages
pass
except Exception as e:
logger.error(f"Failed to update system status: {e}")
# Change-detection helper attached to SystemStatus instances
class SystemStatusTracker:
"""Track changes to SystemStatus model"""
def __init__(self, instance):
self.instance = instance
self._initial_data = {}
if instance.pk:
self._initial_data = {
'status': instance.status,
'message': instance.message
}
def has_changed(self, field):
"""Check if a field has changed"""
if not self.instance.pk:
return True
return getattr(self.instance, field) != self._initial_data.get(field)
# Monkey-patch SystemStatus so every instance carries a tracker.
# Caveat: the snapshot is taken when the instance is constructed and is
# not refreshed after save(), so repeated saves compare against the
# originally loaded state.
def add_tracker_to_system_status():
"""Add tracker to SystemStatus instances"""
original_init = SystemStatus.__init__
def new_init(self, *args, **kwargs):
original_init(self, *args, **kwargs)
self.tracker = SystemStatusTracker(self)
SystemStatus.__init__ = new_init
# Call the function to add tracker
add_tracker_to_system_status()
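# For these receivers to fire, this module must be imported at startup.
# The standard Django pattern is an AppConfig.ready() hook (sketch; the
# project's actual apps.py is not shown here):
#
#   from django.apps import AppConfig
#
#   class MonitoringConfig(AppConfig):
#       name = 'monitoring'
#
#       def ready(self):
#           from monitoring import signals  # noqa: F401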

ETB-API/monitoring/tasks.py Normal file

@@ -0,0 +1,319 @@
"""
Celery tasks for automated monitoring
"""
import logging
from celery import shared_task
from django.utils import timezone
from datetime import timedelta
from monitoring.services.health_checks import HealthCheckService
from monitoring.services.metrics_collector import MetricsCollector
from monitoring.services.alerting import AlertingService
logger = logging.getLogger(__name__)
@shared_task(bind=True, max_retries=3)
def execute_health_checks(self):
"""Execute health checks for all monitoring targets"""
try:
logger.info("Starting health check execution")
health_service = HealthCheckService()
results = health_service.execute_all_health_checks()
logger.info(f"Health checks completed. Results: {len(results)} targets checked")
return {
'status': 'success',
'targets_checked': len(results),
'results': results
}
except Exception as e:
logger.error(f"Health check execution failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying health checks in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def collect_metrics(self):
"""Collect metrics from all configured sources"""
try:
logger.info("Starting metrics collection")
collector = MetricsCollector()
results = collector.collect_all_metrics()
successful_metrics = len([r for r in results.values() if 'error' not in r])
failed_metrics = len([r for r in results.values() if 'error' in r])
logger.info(f"Metrics collection completed. Success: {successful_metrics}, Failed: {failed_metrics}")
return {
'status': 'success',
'successful_metrics': successful_metrics,
'failed_metrics': failed_metrics,
'results': results
}
except Exception as e:
logger.error(f"Metrics collection failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying metrics collection in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def evaluate_alerts(self):
"""Evaluate alert rules and send notifications"""
try:
logger.info("Starting alert evaluation")
alerting_service = AlertingService()
results = alerting_service.run_alert_evaluation()
logger.info(f"Alert evaluation completed. Triggered: {results['triggered_alerts']}, Notifications: {results['notifications_sent']}")
return results
except Exception as e:
logger.error(f"Alert evaluation failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying alert evaluation in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def cleanup_old_data(self):
"""Clean up old monitoring data"""
try:
logger.info("Starting data cleanup")
from monitoring.models import HealthCheck, MetricMeasurement, Alert
# Clean up old health checks (keep last 7 days)
cutoff_date = timezone.now() - timedelta(days=7)
old_health_checks = HealthCheck.objects.filter(checked_at__lt=cutoff_date)
health_checks_deleted = old_health_checks.count()
old_health_checks.delete()
# Clean up old metric measurements (keep last 90 days)
cutoff_date = timezone.now() - timedelta(days=90)
old_measurements = MetricMeasurement.objects.filter(timestamp__lt=cutoff_date)
measurements_deleted = old_measurements.count()
old_measurements.delete()
# Clean up resolved alerts older than 30 days
cutoff_date = timezone.now() - timedelta(days=30)
old_alerts = Alert.objects.filter(
status='RESOLVED',
resolved_at__lt=cutoff_date
)
alerts_deleted = old_alerts.count()
old_alerts.delete()
logger.info(f"Data cleanup completed. Health checks: {health_checks_deleted}, Measurements: {measurements_deleted}, Alerts: {alerts_deleted}")
return {
'status': 'success',
'health_checks_deleted': health_checks_deleted,
'measurements_deleted': measurements_deleted,
'alerts_deleted': alerts_deleted
}
except Exception as e:
logger.error(f"Data cleanup failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying data cleanup in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def generate_system_status_report(self):
"""Generate system status report"""
try:
logger.info("Generating system status report")
from monitoring.models import SystemStatus
from monitoring.services.health_checks import HealthCheckService
health_service = HealthCheckService()
health_summary = health_service.get_system_health_summary()
# Determine overall system status
if health_summary['critical_targets'] > 0:
status = 'MAJOR_OUTAGE'
message = f"Critical issues detected in {health_summary['critical_targets']} systems"
elif health_summary['warning_targets'] > 0:
status = 'DEGRADED'
message = f"Performance issues detected in {health_summary['warning_targets']} systems"
else:
status = 'OPERATIONAL'
message = "All systems operational"
# Create system status record
system_status = SystemStatus.objects.create(
status=status,
message=message,
affected_services=[] # Would be populated based on actual issues
)
logger.info(f"System status report generated: {status}")
return {
'status': 'success',
'system_status': status,
'message': message,
'health_summary': health_summary
}
except Exception as e:
logger.error(f"System status report generation failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying system status report in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def monitor_external_integrations(self):
"""Monitor external integrations and services"""
try:
logger.info("Starting external integrations monitoring")
from monitoring.models import MonitoringTarget
from monitoring.services.health_checks import HealthCheckService
health_service = HealthCheckService()
# Get external integration targets
external_targets = MonitoringTarget.objects.filter(
target_type='EXTERNAL_API',
status='ACTIVE'
)
results = {}
for target in external_targets:
try:
result = health_service.execute_health_check(target, 'HTTP')
results[target.name] = result
# Log integration status
if result['status'] == 'CRITICAL':
logger.warning(f"External integration {target.name} is critical")
elif result['status'] == 'WARNING':
logger.info(f"External integration {target.name} has warnings")
except Exception as e:
logger.error(f"Failed to check external integration {target.name}: {e}")
results[target.name] = {'status': 'CRITICAL', 'error': str(e)}
logger.info(f"External integrations monitoring completed. Checked: {len(results)} integrations")
return {
'status': 'success',
'integrations_checked': len(results),
'results': results
}
except Exception as e:
logger.error(f"External integrations monitoring failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying external integrations monitoring in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
@shared_task(bind=True, max_retries=3)
def update_monitoring_dashboards(self):
"""Update monitoring dashboards with latest data"""
try:
logger.info("Updating monitoring dashboards")
from monitoring.models import MonitoringDashboard
from monitoring.services.metrics_collector import MetricsAggregator
aggregator = MetricsAggregator()
# Get active dashboards
active_dashboards = MonitoringDashboard.objects.filter(is_active=True)
updated_dashboards = 0
for dashboard in active_dashboards:
try:
# Update dashboard data (this would typically involve caching or real-time updates)
# For now, just log the update
logger.info(f"Updating dashboard: {dashboard.name}")
updated_dashboards += 1
except Exception as e:
logger.error(f"Failed to update dashboard {dashboard.name}: {e}")
logger.info(f"Dashboard updates completed. Updated: {updated_dashboards} dashboards")
return {
'status': 'success',
'dashboards_updated': updated_dashboards
}
except Exception as e:
logger.error(f"Dashboard update failed: {e}")
# Retry with exponential backoff
if self.request.retries < self.max_retries:
countdown = 2 ** self.request.retries
logger.info(f"Retrying dashboard update in {countdown} seconds")
raise self.retry(countdown=countdown)
return {
'status': 'error',
'error': str(e)
}
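# A possible Celery beat schedule wiring these tasks together (sketch;
# the interval values are assumptions, not shipped defaults — this
# configuration would live in the project's settings/celery module):
#
#   CELERY_BEAT_SCHEDULE = {
#       'monitoring-health-checks': {
#           'task': 'monitoring.tasks.execute_health_checks',
#           'schedule': 60.0,   # every minute
#       },
#       'monitoring-collect-metrics': {
#           'task': 'monitoring.tasks.collect_metrics',
#           'schedule': 300.0,  # every 5 minutes
#       },
#       'monitoring-evaluate-alerts': {
#           'task': 'monitoring.tasks.evaluate_alerts',
#           'schedule': 120.0,  # every 2 minutes
#       },
#       'monitoring-cleanup-old-data': {
#           'task': 'monitoring.tasks.cleanup_old_data',
#           'schedule': 86400.0,  # daily
#       },
#   }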

ETB-API/monitoring/urls.py Normal file

@@ -0,0 +1,30 @@
"""
URL configuration for monitoring app
"""
from django.urls import path, include
from rest_framework.routers import DefaultRouter
from monitoring.views import (
MonitoringTargetViewSet, HealthCheckViewSet, SystemMetricViewSet,
MetricMeasurementViewSet, AlertRuleViewSet, AlertViewSet,
MonitoringDashboardViewSet, SystemStatusViewSet, SystemOverviewView,
MonitoringTasksView
)
router = DefaultRouter()
router.register(r'targets', MonitoringTargetViewSet)
router.register(r'health-checks', HealthCheckViewSet)
router.register(r'metrics', SystemMetricViewSet)
router.register(r'measurements', MetricMeasurementViewSet)
router.register(r'alert-rules', AlertRuleViewSet)
router.register(r'alerts', AlertViewSet)
router.register(r'dashboards', MonitoringDashboardViewSet)
router.register(r'status', SystemStatusViewSet)
app_name = 'monitoring'
urlpatterns = [
path('', include(router.urls)),
path('overview/', SystemOverviewView.as_view(), name='system-overview'),
path('tasks/', MonitoringTasksView.as_view(), name='monitoring-tasks'),
]
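# To expose these routes under the base URL documented above, the
# project-level urls.py would include (sketch):
#
#   path('api/monitoring/', include('monitoring.urls')),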

ETB-API/monitoring/views.py Normal file

@@ -0,0 +1,480 @@
"""
Views for monitoring system
"""
import logging
from rest_framework import viewsets, status, permissions
from rest_framework.decorators import action
from rest_framework.response import Response
from rest_framework.views import APIView
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import SearchFilter, OrderingFilter
from django.db import models  # needed for models.Q in MonitoringDashboardViewSet
from django.utils import timezone
from datetime import timedelta
from monitoring.models import (
MonitoringTarget, HealthCheck, SystemMetric, MetricMeasurement,
AlertRule, Alert, MonitoringDashboard, SystemStatus
)
from monitoring.serializers import (
MonitoringTargetSerializer, HealthCheckSerializer, SystemMetricSerializer,
MetricMeasurementSerializer, AlertRuleSerializer, AlertSerializer,
MonitoringDashboardSerializer, SystemStatusSerializer,
HealthCheckSummarySerializer, MetricTrendSerializer, AlertSummarySerializer,
SystemOverviewSerializer
)
from monitoring.services.health_checks import HealthCheckService
from monitoring.services.metrics_collector import MetricsCollector, MetricsAggregator
from monitoring.services.alerting import AlertingService
from monitoring.tasks import (
execute_health_checks, collect_metrics, evaluate_alerts,
generate_system_status_report
)
logger = logging.getLogger(__name__)
class MonitoringTargetViewSet(viewsets.ModelViewSet):
"""ViewSet for MonitoringTarget model"""
queryset = MonitoringTarget.objects.all()
serializer_class = MonitoringTargetSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['target_type', 'status', 'last_status', 'related_module']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at', 'last_checked']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a monitoring target"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['post'])
def test_connection(self, request, pk=None):
"""Test connection to monitoring target"""
target = self.get_object()
try:
health_service = HealthCheckService()
result = health_service.execute_health_check(target, 'HTTP')
return Response({
'status': 'success',
'result': result
})
except Exception as e:
return Response({
'status': 'error',
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def enable_monitoring(self, request, pk=None):
"""Enable monitoring for a target"""
target = self.get_object()
target.status = 'ACTIVE'
target.save()
return Response({
'status': 'success',
'message': f'Monitoring enabled for {target.name}'
})
@action(detail=True, methods=['post'])
def disable_monitoring(self, request, pk=None):
"""Disable monitoring for a target"""
target = self.get_object()
target.status = 'INACTIVE'
target.save()
return Response({
'status': 'success',
'message': f'Monitoring disabled for {target.name}'
})
class HealthCheckViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for HealthCheck model (read-only)"""
queryset = HealthCheck.objects.all()
serializer_class = HealthCheckSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['target', 'check_type', 'status']
ordering_fields = ['checked_at', 'response_time_ms']
ordering = ['-checked_at']
@action(detail=False, methods=['get'])
def summary(self, request):
"""Get health check summary"""
try:
health_service = HealthCheckService()
summary = health_service.get_system_health_summary()
serializer = HealthCheckSummarySerializer(summary)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=False, methods=['post'])
def run_all_checks(self, request):
"""Run health checks for all targets"""
try:
# Execute health checks asynchronously
task = execute_health_checks.delay()
return Response({
'status': 'success',
'message': 'Health checks started',
'task_id': task.id
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class SystemMetricViewSet(viewsets.ModelViewSet):
"""ViewSet for SystemMetric model"""
queryset = SystemMetric.objects.all()
serializer_class = SystemMetricSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['metric_type', 'category', 'is_active', 'related_module']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a metric"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['get'])
def measurements(self, request, pk=None):
"""Get measurements for a metric"""
metric = self.get_object()
        # Get query parameters, rejecting non-integer input
        try:
            hours = int(request.query_params.get('hours', 24))
            limit = int(request.query_params.get('limit', 100))
        except (TypeError, ValueError):
            return Response({
                'error': 'hours and limit must be integers'
            }, status=status.HTTP_400_BAD_REQUEST)
        since = timezone.now() - timedelta(hours=hours)
measurements = MetricMeasurement.objects.filter(
metric=metric,
timestamp__gte=since
).order_by('-timestamp')[:limit]
serializer = MetricMeasurementSerializer(measurements, many=True)
return Response(serializer.data)
@action(detail=True, methods=['get'])
def trends(self, request, pk=None):
"""Get metric trends"""
metric = self.get_object()
days = int(request.query_params.get('days', 7))
try:
aggregator = MetricsAggregator()
trends = aggregator.get_metric_trends(metric, days)
serializer = MetricTrendSerializer(trends)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MetricMeasurementViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for MetricMeasurement model (read-only)"""
queryset = MetricMeasurement.objects.all()
serializer_class = MetricMeasurementSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, OrderingFilter]
filterset_fields = ['metric']
ordering_fields = ['timestamp', 'value']
ordering = ['-timestamp']
class AlertRuleViewSet(viewsets.ModelViewSet):
"""ViewSet for AlertRule model"""
queryset = AlertRule.objects.all()
serializer_class = AlertRuleSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['alert_type', 'severity', 'status', 'is_enabled']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating an alert rule"""
serializer.save(created_by=self.request.user)
@action(detail=True, methods=['post'])
def test_rule(self, request, pk=None):
"""Test an alert rule"""
rule = self.get_object()
try:
alerting_service = AlertingService()
# This would test the rule without creating an alert
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} test completed'
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def enable_rule(self, request, pk=None):
"""Enable an alert rule"""
rule = self.get_object()
rule.is_enabled = True
rule.save()
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} enabled'
})
@action(detail=True, methods=['post'])
def disable_rule(self, request, pk=None):
"""Disable an alert rule"""
rule = self.get_object()
rule.is_enabled = False
rule.save()
return Response({
'status': 'success',
'message': f'Alert rule {rule.name} disabled'
})
class AlertViewSet(viewsets.ModelViewSet):
"""ViewSet for Alert model"""
queryset = Alert.objects.all()
serializer_class = AlertSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['rule', 'severity', 'status']
search_fields = ['title', 'description']
ordering_fields = ['triggered_at', 'severity']
ordering = ['-triggered_at']
@action(detail=True, methods=['post'])
def acknowledge(self, request, pk=None):
"""Acknowledge an alert"""
alert = self.get_object()
try:
alerting_service = AlertingService()
result = alerting_service.acknowledge_alert(str(alert.id), request.user)
return Response(result)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=True, methods=['post'])
def resolve(self, request, pk=None):
"""Resolve an alert"""
alert = self.get_object()
try:
alerting_service = AlertingService()
result = alerting_service.resolve_alert(str(alert.id), request.user)
return Response(result)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
@action(detail=False, methods=['get'])
def summary(self, request):
"""Get alert summary"""
try:
            # Calculate summary directly from the database
total_alerts = Alert.objects.count()
critical_alerts = Alert.objects.filter(severity='CRITICAL', status='TRIGGERED').count()
high_alerts = Alert.objects.filter(severity='HIGH', status='TRIGGERED').count()
medium_alerts = Alert.objects.filter(severity='MEDIUM', status='TRIGGERED').count()
low_alerts = Alert.objects.filter(severity='LOW', status='TRIGGERED').count()
acknowledged_alerts = Alert.objects.filter(status='ACKNOWLEDGED').count()
resolved_alerts = Alert.objects.filter(status='RESOLVED').count()
summary = {
'total_alerts': total_alerts,
'critical_alerts': critical_alerts,
'high_alerts': high_alerts,
'medium_alerts': medium_alerts,
'low_alerts': low_alerts,
'acknowledged_alerts': acknowledged_alerts,
'resolved_alerts': resolved_alerts
}
serializer = AlertSummarySerializer(summary)
return Response(serializer.data)
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringDashboardViewSet(viewsets.ModelViewSet):
"""ViewSet for MonitoringDashboard model"""
queryset = MonitoringDashboard.objects.all()
serializer_class = MonitoringDashboardSerializer
permission_classes = [permissions.IsAuthenticated]
filter_backends = [DjangoFilterBackend, SearchFilter, OrderingFilter]
filterset_fields = ['dashboard_type', 'is_active', 'is_public']
search_fields = ['name', 'description']
ordering_fields = ['name', 'created_at']
ordering = ['name']
def perform_create(self, serializer):
"""Set the creator when creating a dashboard"""
serializer.save(created_by=self.request.user)
def get_queryset(self):
"""Filter dashboards based on user access"""
queryset = super().get_queryset()
if not self.request.user.is_staff:
# Non-staff users can only see public dashboards or dashboards they have access to
queryset = queryset.filter(
models.Q(is_public=True) |
models.Q(allowed_users=self.request.user)
).distinct()
return queryset
class SystemStatusViewSet(viewsets.ReadOnlyModelViewSet):
"""ViewSet for SystemStatus model (read-only)"""
queryset = SystemStatus.objects.all()
serializer_class = SystemStatusSerializer
permission_classes = [permissions.IsAuthenticated]
ordering = ['-started_at']
class SystemOverviewView(APIView):
"""System overview endpoint"""
permission_classes = [permissions.IsAuthenticated]
def get(self, request):
"""Get system overview"""
try:
# Get current system status
current_status = SystemStatus.objects.filter(
resolved_at__isnull=True
).order_by('-started_at').first()
if not current_status:
# Create default operational status
current_status = SystemStatus.objects.create(
status='OPERATIONAL',
message='All systems operational',
created_by=request.user
)
# Get health summary
health_service = HealthCheckService()
health_summary = health_service.get_system_health_summary()
# Get alert summary
alerting_service = AlertingService()
active_alerts = alerting_service.get_active_alerts()
alert_summary = {
'total_alerts': len(active_alerts),
'critical_alerts': len([a for a in active_alerts if a['severity'] == 'CRITICAL']),
'high_alerts': len([a for a in active_alerts if a['severity'] == 'HIGH']),
'medium_alerts': len([a for a in active_alerts if a['severity'] == 'MEDIUM']),
'low_alerts': len([a for a in active_alerts if a['severity'] == 'LOW']),
'acknowledged_alerts': 0, # Would be calculated from database
'resolved_alerts': 0 # Would be calculated from database
}
# Get recent incidents (mock data for now)
recent_incidents = []
# Get top metrics (mock data for now)
top_metrics = []
# Get system resources
import psutil
system_resources = {
'cpu_percent': psutil.cpu_percent(interval=1),
'memory_percent': psutil.virtual_memory().percent,
'disk_percent': psutil.disk_usage('/').percent
}
overview = {
'system_status': current_status,
'health_summary': health_summary,
'alert_summary': alert_summary,
'recent_incidents': recent_incidents,
'top_metrics': top_metrics,
'system_resources': system_resources
}
serializer = SystemOverviewSerializer(overview)
return Response(serializer.data)
except Exception as e:
logger.error(f"Failed to get system overview: {e}")
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
class MonitoringTasksView(APIView):
"""Monitoring tasks management"""
permission_classes = [permissions.IsAuthenticated]
def post(self, request):
"""Execute monitoring tasks"""
task_type = request.data.get('task_type')
try:
if task_type == 'health_checks':
task = execute_health_checks.delay()
elif task_type == 'metrics_collection':
task = collect_metrics.delay()
elif task_type == 'alert_evaluation':
task = evaluate_alerts.delay()
elif task_type == 'system_status_report':
task = generate_system_status_report.delay()
else:
return Response({
'error': 'Invalid task type'
}, status=status.HTTP_400_BAD_REQUEST)
return Response({
'status': 'success',
'message': f'{task_type} task started',
'task_id': task.id
})
except Exception as e:
return Response({
'error': str(e)
}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
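
For completeness, the `MonitoringTasksView` endpoint can be exercised like the other documented endpoints. Valid `task_type` values are `health_checks`, `metrics_collection`, `alert_evaluation`, and `system_status_report`; an illustrative request and response:

```http
POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
  "task_type": "health_checks"
}
```

```json
{
  "status": "success",
  "message": "health_checks task started",
  "task_id": "celery-task-id"
}
```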