ETB/ETB-API/monitoring/Documentations/MONITORING_SYSTEM_API.md
Iliyan Angelov 6b247e5b9f Updates
2025-09-19 11:58:53 +03:00

ETB-API Monitoring System Documentation

Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

Features

1. Health Monitoring

  • System Health Checks: Monitor application, database, cache, and queue health
  • Module Health: Individual module status and dependency tracking
  • External Integrations: Third-party service health monitoring
  • Infrastructure Monitoring: Server resources and network connectivity

2. Metrics Collection

  • Performance Metrics: API response times, throughput, error rates
  • Business Metrics: Incident counts, MTTR, MTTA, SLA compliance
  • Security Metrics: Security events, failed logins, risk assessments
  • Infrastructure Metrics: CPU, memory, disk usage
  • AI/ML Metrics: Model accuracy, automation success rates
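
As an illustration of the business metrics above, MTTA (mean time to acknowledge) and MTTR (mean time to resolve) can be derived from incident timestamps. The sketch below uses hypothetical incident records and a hypothetical helper named mean_minutes; it does not reflect how ETB-API stores or computes these values.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(pairs):
    """Average gap in minutes between (start, end) datetime pairs."""
    gaps = [(end - start).total_seconds() / 60 for start, end in pairs]
    return mean(gaps) if gaps else 0.0


# Hypothetical incident records: (created, acknowledged, resolved).
t0 = datetime(2024, 1, 15, 10, 0)
incidents = [
    (t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=45)),
    (t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=75)),
]

# MTTA measures creation -> acknowledgement, MTTR measures creation -> resolution.
mtta = mean_minutes([(created, acked) for created, acked, _ in incidents])
mttr = mean_minutes([(created, resolved) for created, _, resolved in incidents])
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```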

3. Intelligent Alerting

  • Threshold Alerts: Configurable thresholds for all metrics
  • Anomaly Detection: Statistical anomaly detection
  • Pattern Alerts: Pattern-based alerting
  • Multi-Channel Notifications: Email, Slack, Webhook support
  • Alert Management: Acknowledge, resolve, and track alerts
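
The document does not say which statistical method the anomaly detection uses. One common approach is a z-score test against recent history; the sketch below assumes that method, and the function name is_anomaly and the default threshold of 3.0 are illustrative choices rather than ETB-API settings.

```python
from statistics import mean, stdev


def is_anomaly(history, value, z_threshold=3.0):
    """Flag value when it lies more than z_threshold sample standard
    deviations away from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != history[0]
    return abs(value - mean(history)) / spread > z_threshold


recent_response_times = [100, 102, 98, 101, 99]  # hypothetical samples
print(is_anomaly(recent_response_times, 150))
print(is_anomaly(recent_response_times, 101))
```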

4. Monitoring Dashboards

  • System Overview: High-level system status
  • Performance Dashboard: Performance metrics visualization
  • Business Metrics: Operational metrics dashboard
  • Security Dashboard: Security monitoring dashboard
  • Custom Dashboards: User-configurable dashboards

API Endpoints

Base URL

http://localhost:8000/api/monitoring/

Authentication

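All endpoints require Django REST Framework token authentication. Send the token in the Authorization header of every request, in the form shown in the examples below (Token followed by the token value).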
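
A minimal sketch of how a client might attach the token, using only the Python standard library. The helper name build_request is hypothetical; the base URL and the placeholder token come from this document. The request is built but not sent.

```python
import urllib.request

BASE_URL = "http://localhost:8000/api/monitoring/"


def build_request(path, token, method="GET"):
    """Build (but do not send) an authenticated request for a monitoring endpoint."""
    url = BASE_URL + path.lstrip("/")
    request = urllib.request.Request(url, method=method)
    # DRF token authentication expects "Token <key>" in the Authorization header.
    request.add_header("Authorization", f"Token {token}")
    return request


request = build_request("health-checks/summary/", "your-token-here")
print(request.get_method(), request.full_url)
```

To send it, pass the object to urllib.request.urlopen(request, timeout=10) against a running server.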

Health Checks

Get Health Check Summary

GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here

Response:

{
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 11,
    "warning_targets": 1,
    "critical_targets": 0,
    "health_percentage": 91.67,
    "last_updated": "2024-01-15T10:30:00Z"
}
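
The health_percentage in this response is consistent with healthy_targets divided by total_targets (11 of 12). The sketch below shows that arithmetic. The per-target status labels and the summarize helper are assumptions inferred from the field names, not the service's actual code.

```python
from collections import Counter


def summarize(statuses):
    """Count targets per status and compute the share that is healthy."""
    counts = Counter(statuses)
    total = len(statuses)
    healthy = counts["HEALTHY"]
    return {
        "total_targets": total,
        "healthy_targets": healthy,
        "warning_targets": counts["WARNING"],
        "critical_targets": counts["CRITICAL"],
        # Guard against an empty target list to avoid dividing by zero.
        "health_percentage": round(healthy / total * 100, 2) if total else 0.0,
    }


summary = summarize(["HEALTHY"] * 11 + ["WARNING"])
print(summary["health_percentage"])  # 91.67
```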

Run All Health Checks

POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here

Response:

{
    "status": "success",
    "message": "Health checks started",
    "task_id": "celery-task-id"
}

Test Target Connection

POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here

Metrics

Get Metric Measurements

GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here

Get Metric Trends

GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here

Response:

{
    "metric_name": "API Response Time",
    "period_days": 7,
    "daily_data": [
        {
            "date": "2024-01-08",
            "value": 150.5,
            "count": 1440
        }
    ],
    "trend": "STABLE"
}
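
A client might reduce daily_data to a single figure for the period. The sketch below weights each day by its count, assuming that value is a per-day average and count is the number of measurements behind it. The second daily entry is made-up sample data, and weighted_average is a hypothetical helper.

```python
import json

payload = json.loads("""
{
    "metric_name": "API Response Time",
    "period_days": 7,
    "daily_data": [
        {"date": "2024-01-08", "value": 150.5, "count": 1440},
        {"date": "2024-01-09", "value": 160.5, "count": 720}
    ],
    "trend": "STABLE"
}
""")


def weighted_average(daily_data):
    """Average the daily values, weighting each day by its measurement count."""
    total = sum(day["count"] for day in daily_data)
    if total == 0:
        return None  # no measurements in the period
    return sum(day["value"] * day["count"] for day in daily_data) / total


print(weighted_average(payload["daily_data"]))
```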

Alerts

Get Alert Summary

GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here

Response:

{
    "total_alerts": 25,
    "critical_alerts": 2,
    "high_alerts": 5,
    "medium_alerts": 8,
    "low_alerts": 10,
    "acknowledged_alerts": 15,
    "resolved_alerts": 20
}

Acknowledge Alert

POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here

Resolve Alert

POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here

System Overview

Get System Overview

GET /api/monitoring/overview/
Authorization: Token your-token-here

Response:

{
    "system_status": {
        "status": "OPERATIONAL",
        "message": "All systems operational",
        "started_at": "2024-01-15T09:00:00Z"
    },
    "health_summary": {
        "overall_status": "HEALTHY",
        "total_targets": 12,
        "healthy_targets": 12,
        "health_percentage": 100.0
    },
    "alert_summary": {
        "total_alerts": 0,
        "critical_alerts": 0
    },
    "system_resources": {
        "cpu_percent": 45.2,
        "memory_percent": 67.8,
        "disk_percent": 34.5
    }
}

Monitoring Tasks

Execute Monitoring Tasks

POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
    "task_type": "health_checks"
}

Available task types:

  • health_checks: Execute health checks for all targets
  • metrics_collection: Collect metrics from all sources
  • alert_evaluation: Evaluate alert rules and send notifications
  • system_status_report: Generate system status report
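
A client can validate the task type before posting. The sketch below builds the JSON body for the request shown above. The task_payload helper is hypothetical; the four task types are the ones listed in this section.

```python
import json

TASK_TYPES = {
    "health_checks",
    "metrics_collection",
    "alert_evaluation",
    "system_status_report",
}


def task_payload(task_type):
    """Return the JSON body for POST /api/monitoring/tasks/, rejecting unknown types."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type!r}")
    return json.dumps({"task_type": task_type})


print(task_payload("health_checks"))
```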

Data Models

MonitoringTarget

Represents a system, service, or component to monitor.

Fields:

  • name: Target name
  • target_type: Type (APPLICATION, DATABASE, CACHE, etc.)
  • endpoint_url: Health check endpoint
  • status: Current status (ACTIVE, INACTIVE, etc.)
  • last_status: Last health check result
  • health_check_enabled: Whether health checks are enabled

SystemMetric

Defines metrics to collect and monitor.

Fields:

  • name: Metric name
  • metric_type: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
  • category: Category (API_RESPONSE_TIME, MTTR, etc.)
  • unit: Unit of measurement
  • aggregation_method: How to aggregate values
  • warning_threshold: Warning threshold
  • critical_threshold: Critical threshold
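
One way to read the two threshold fields is as escalating limits on a measurement. The sketch below assumes that larger values are worse and that either threshold may be unset. The classify function and the OK label are illustrative, not part of the SystemMetric model.

```python
def classify(value, warning_threshold=None, critical_threshold=None):
    """Map a measurement to a level, assuming larger values are worse."""
    # Check the critical limit first so it wins when both are exceeded.
    if critical_threshold is not None and value >= critical_threshold:
        return "CRITICAL"
    if warning_threshold is not None and value >= warning_threshold:
        return "WARNING"
    return "OK"


# Hypothetical limits for an API response time metric.
print(classify(150.5, warning_threshold=1000, critical_threshold=2000))
print(classify(2400, warning_threshold=1000, critical_threshold=2000))
```

Metrics where lower is worse (for example SLA compliance) would need the comparisons reversed.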

AlertRule

Defines alert conditions and notifications.

Fields:

  • name: Rule name
  • alert_type: Type (THRESHOLD, ANOMALY, etc.)
  • severity: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
  • condition: Alert condition configuration
  • notification_channels: Notification channels
  • is_enabled: Whether rule is enabled

Alert

Represents triggered alerts.

Fields:

  • title: Alert title
  • description: Alert description
  • severity: Alert severity
  • status: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
  • triggered_value: Value that triggered the alert
  • threshold_value: Threshold that was exceeded
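
The three status values suggest a simple lifecycle. The sketch below assumes an alert may be acknowledged and then resolved, or resolved directly, and that a resolved alert is final. The transition table is an assumption for illustration; the real model may allow other moves.

```python
from enum import Enum


class AlertStatus(Enum):
    TRIGGERED = "TRIGGERED"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Assumed lifecycle: which statuses each status may move to.
ALLOWED = {
    AlertStatus.TRIGGERED: {AlertStatus.ACKNOWLEDGED, AlertStatus.RESOLVED},
    AlertStatus.ACKNOWLEDGED: {AlertStatus.RESOLVED},
    AlertStatus.RESOLVED: set(),
}


def transition(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = transition(AlertStatus.TRIGGERED, AlertStatus.ACKNOWLEDGED)
status = transition(status, AlertStatus.RESOLVED)
print(status.value)
```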

Configuration

Environment Variables

# Monitoring Settings (intervals in seconds)
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
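
A settings module could read these variables along the following lines. This is a sketch only: the function names are hypothetical, the fallback defaults simply mirror the sample values above, and the document does not show how ETB-API actually loads them.

```python
import os


def env_bool(env, name, default):
    """Interpret common truthy strings; fall back to default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_monitoring_settings(env=os.environ):
    """Collect the monitoring intervals from a mapping of environment variables."""
    return {
        "enabled": env_bool(env, "MONITORING_ENABLED", True),
        "health_check_interval": int(env.get("MONITORING_HEALTH_CHECK_INTERVAL", "60")),
        "metrics_collection_interval": int(
            env.get("MONITORING_METRICS_COLLECTION_INTERVAL", "300")
        ),
        "alert_evaluation_interval": int(
            env.get("MONITORING_ALERT_EVALUATION_INTERVAL", "60")
        ),
    }


settings = load_monitoring_settings({"MONITORING_ENABLED": "false"})
print(settings)
```

Passing a plain dictionary instead of os.environ, as in the last lines, keeps the loader easy to test.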

Celery Configuration

Add to your Celery configuration:

from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}

Setup Instructions

1. Install Dependencies

Add to requirements.txt:

psutil>=5.9.0
requests>=2.31.0

2. Run Migrations

python manage.py makemigrations monitoring
python manage.py migrate

3. Set Up Initial Configuration

python manage.py setup_monitoring --admin-user admin

4. Start Celery Workers

celery -A core worker -l info
celery -A core beat -l info

5. Access Monitoring

  • Admin Interface: http://localhost:8000/admin/monitoring/
  • API Documentation: http://localhost:8000/api/monitoring/
  • System Overview: http://localhost:8000/api/monitoring/overview/

Monitoring Best Practices

1. Health Checks

  • Choose check intervals that detect failures promptly without overloading the targets (the sample Celery schedule runs health checks every 60 seconds)
  • Use timeouts to prevent hanging checks
  • Monitor dependencies and external services
  • Implement graceful degradation
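
To illustrate the timeout advice above, the following sketch probes an HTTP endpoint with a bounded wait and reports a failure instead of hanging or raising on network errors. The function name, the returned keys, and the mapping of outcomes to status labels are assumptions and are not taken from HealthCheckService.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0):
    """Probe an HTTP URL once, bounding each socket operation by timeout seconds."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = "HEALTHY" if 200 <= response.status < 300 else "WARNING"
            error = None
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status, error = "CRITICAL", f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        status, error = "CRITICAL", str(exc)
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"status": status, "response_time_ms": round(elapsed_ms, 1), "error": error}


# Port 1 on localhost is normally closed, so this reports a failure quickly.
result = check_endpoint("http://127.0.0.1:1/", timeout=1.0)
print(result["status"], result["error"])
```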

2. Metrics Collection

  • Collect metrics at appropriate intervals
  • Use proper aggregation methods
  • Set meaningful thresholds
  • Monitor both technical and business metrics

3. Alerting

  • Set up alert rules with appropriate severity levels
  • Use multiple notification channels
  • Prevent alert fatigue, for example by suppressing repeat notifications for an alert that is already firing
  • Regularly review and tune alert thresholds
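
One simple guard against alert fatigue is a per-alert cooldown: after a notification is sent, repeats for the same key are dropped until the window has passed. The class below is a hypothetical sketch of that idea, with an assumed five-minute default, and is not part of the monitoring module.

```python
class NotificationThrottle:
    """Suppress repeat notifications for the same alert key within a cooldown window."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of the last notification

    def should_notify(self, key, now):
        """Return True and record the send time if key is outside its cooldown."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True


throttle = NotificationThrottle(cooldown_seconds=300)
decisions = [throttle.should_notify("cpu-high", t) for t in (0, 100, 299, 300)]
print(decisions)  # [True, False, False, True]
```

Timestamps are passed in rather than read from the clock so the behavior can be tested deterministically; a caller would typically supply time.monotonic().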

4. Dashboards

  • Create role-based dashboards
  • Use appropriate refresh intervals
  • Include both real-time and historical data
  • Make dashboards actionable

Troubleshooting

Common Issues

  1. Health Checks Failing

    • Check network connectivity
    • Verify endpoint URLs
    • Check authentication credentials
    • Review timeout settings
  2. Metrics Not Collecting

    • Verify Celery workers are running
    • Check metric configuration
    • Review collection intervals
    • Check for errors in logs
  3. Alerts Not Triggering

    • Verify alert rules are enabled
    • Check threshold values
    • Review notification channel configuration
    • Check alert evaluation task is running
  4. Performance Issues

    • Monitor system resources
    • Check database query performance
    • Review metric retention settings
    • Optimize collection intervals

Debug Commands

# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()

Integration with Other Modules

Security Module

  • Monitor authentication failures
  • Track security events
  • Monitor device posture assessments
  • Alert on risk assessment anomalies

Incident Intelligence

  • Monitor incident processing times
  • Track AI model performance
  • Monitor correlation engine health
  • Alert on incident volume spikes

Automation & Orchestration

  • Monitor runbook execution success
  • Track integration health
  • Monitor ChatOps command usage
  • Alert on automation failures

SLA & On-Call

  • Monitor SLA compliance
  • Track escalation times
  • Monitor on-call assignments
  • Alert on SLA breaches

Analytics & Predictive Insights

  • Monitor ML model accuracy
  • Track prediction performance
  • Monitor cost impact calculations
  • Alert on anomaly detections

Future Enhancements

Planned Features

  1. Advanced Anomaly Detection: Machine learning-based anomaly detection
  2. Predictive Alerting: Predict and prevent issues before they occur
  3. Custom Metrics: User-defined custom metrics
  4. Advanced Dashboards: Interactive dashboards with drill-down capabilities
  5. Mobile App: Mobile monitoring application
  6. Integration APIs: APIs for external monitoring tools
  7. Cost Optimization: Resource usage optimization recommendations
  8. Compliance Reporting: Automated compliance reporting

Integration Roadmap

  1. APM Tools: New Relic, DataDog, AppDynamics
  2. Log Aggregation: ELK Stack, Splunk, Fluentd
  3. Infrastructure Monitoring: Prometheus, Grafana, InfluxDB
  4. Cloud Platforms: AWS CloudWatch, Azure Monitor, GCP Monitoring
  5. Communication Platforms: PagerDuty, OpsGenie, VictorOps