Iliyan Angelov 6b247e5b9f Updates

2025-09-19 11:58:53 +03:00

ETB-API Monitoring System Documentation

Overview

The ETB-API Monitoring System provides comprehensive observability for all modules and services within the Enterprise Incident Management platform. It includes health checks, metrics collection, alerting, and dashboard capabilities.

Features

1. Health Monitoring

System Health Checks: Monitor application, database, cache, and queue health
Module Health: Individual module status and dependency tracking
External Integrations: Third-party service health monitoring
Infrastructure Monitoring: Server resources and network connectivity

2. Metrics Collection

Performance Metrics: API response times, throughput, error rates
Business Metrics: Incident counts, MTTR, MTTA, SLA compliance
Security Metrics: Security events, failed logins, risk assessments
Infrastructure Metrics: CPU, memory, disk usage
AI/ML Metrics: Model accuracy, automation success rates
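
As a rough illustration of the business metrics listed above, MTTA and MTTR can be computed as mean durations over incident timestamps. The Incident record and its field names below are hypothetical, and the sketch assumes MTTA runs from creation to acknowledgement and MTTR from creation to resolution; the platform's actual definitions may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    created_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def mean_minutes(durations: List[timedelta]) -> Optional[float]:
    """Return the mean duration in minutes, or None when there is no data."""
    if not durations:
        return None
    return mean(d.total_seconds() for d in durations) / 60.0


def mtta_minutes(incidents: List[Incident]) -> Optional[float]:
    """Mean time to acknowledge, ignoring incidents not yet acknowledged."""
    return mean_minutes(
        [i.acknowledged_at - i.created_at for i in incidents if i.acknowledged_at]
    )


def mttr_minutes(incidents: List[Incident]) -> Optional[float]:
    """Mean time to resolve, ignoring incidents that are still open."""
    return mean_minutes(
        [i.resolved_at - i.created_at for i in incidents if i.resolved_at]
    )
```

Open incidents are skipped rather than counted as zero, so a backlog of unresolved incidents does not artificially lower either figure.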

3. Intelligent Alerting

Threshold Alerts: Configurable thresholds for all metrics
Anomaly Detection: Statistical anomaly detection
Pattern Alerts: Pattern-based alerting
Multi-Channel Notifications: Email, Slack, Webhook support
Alert Management: Acknowledge, resolve, and track alerts
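
The statistical method behind anomaly detection is not specified here. One common approach, shown below purely as an assumption, is a z-score test: a new measurement is flagged when it lies more than a chosen number of standard deviations from the mean of recent history. The function name and the default threshold of 3.0 are illustrative.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomaly(history: Sequence[float], value: float, z_threshold: float = 3.0) -> bool:
    """Flag `value` if it deviates from `history` by more than `z_threshold` sigmas."""
    if len(history) < 2:
        # Not enough data to estimate spread; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

A z-score assumes roughly symmetric, stable data; metrics with strong daily cycles usually need a seasonal baseline instead.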

4. Monitoring Dashboards

System Overview: High-level system status
Performance Dashboard: Performance metrics visualization
Business Metrics: Operational metrics dashboard
Security Dashboard: Security monitoring dashboard
Custom Dashboards: User-configurable dashboards

API Endpoints

Base URL

http://localhost:8000/api/monitoring/

Authentication

All endpoints require Django REST Framework token authentication. Send the token in the Authorization header of every request, as in the examples below.
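
A minimal sketch of attaching the token from Python, using only the standard library so it runs without extra packages (the requests package listed under Setup Instructions would work equally well). The helper name build_request is hypothetical; the base URL and header format are taken from this document.

```python
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000/api/monitoring/"


def build_request(path: str, token: str, method: str = "GET",
                  data: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) a request carrying the token header."""
    url = BASE_URL + path.lstrip("/")
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Token {token}")
    return request


req = build_request("health-checks/summary/", "your-token-here")
# To send it against a running server:
#     with urllib.request.urlopen(req, timeout=10) as response:
#         body = response.read()
```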

Health Checks

Get Health Check Summary

GET /api/monitoring/health-checks/summary/
Authorization: Token your-token-here

Response:

{
    "overall_status": "HEALTHY",
    "total_targets": 12,
    "healthy_targets": 11,
    "warning_targets": 1,
    "critical_targets": 0,
    "health_percentage": 91.67,
    "last_updated": "2024-01-15T10:30:00Z"
}
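
The health_percentage value in the response above is consistent with the share of healthy targets among all targets, rounded to two decimals (11 of 12 gives 91.67). The sketch below reproduces that arithmetic; the function name and the choice to return 0.0 when no targets exist are assumptions.

```python
def health_percentage(healthy_targets: int, total_targets: int) -> float:
    """Percentage of healthy targets, rounded to two decimal places."""
    if total_targets == 0:
        # Assumption: report 0.0 rather than raising when nothing is monitored.
        return 0.0
    return round(100.0 * healthy_targets / total_targets, 2)
```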

Run All Health Checks

POST /api/monitoring/health-checks/run_all_checks/
Authorization: Token your-token-here

Response:

{
    "status": "success",
    "message": "Health checks started",
    "task_id": "celery-task-id"
}

Test Target Connection

POST /api/monitoring/targets/{target_id}/test_connection/
Authorization: Token your-token-here

Metrics

Get Metric Measurements

GET /api/monitoring/metrics/{metric_id}/measurements/?hours=24&limit=100
Authorization: Token your-token-here

Get Metric Trends

GET /api/monitoring/metrics/{metric_id}/trends/?days=7
Authorization: Token your-token-here

Response:

{
    "metric_name": "API Response Time",
    "period_days": 7,
    "daily_data": [
        {
            "date": "2024-01-08",
            "value": 150.5,
            "count": 1440
        }
    ],
    "trend": "STABLE"
}
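
How the trend field is derived is not documented. One plausible sketch, offered only as an assumption, compares the mean of the earlier half of the daily values with the mean of the later half and reports STABLE when the relative change stays within a tolerance. The labels INCREASING and DECREASING and the 10% tolerance are hypothetical; only STABLE appears in the response above.

```python
from statistics import mean
from typing import Sequence


def classify_trend(values: Sequence[float], tolerance: float = 0.10) -> str:
    """Classify a series by comparing its earlier and later halves."""
    if len(values) < 2:
        return "STABLE"
    half = len(values) // 2
    earlier = mean(values[:half])
    later = mean(values[-half:])
    if earlier == 0:
        # Relative change is undefined; fall back to the sign of the later mean.
        if later == 0:
            return "STABLE"
        return "INCREASING" if later > 0 else "DECREASING"
    change = (later - earlier) / abs(earlier)
    if change > tolerance:
        return "INCREASING"
    if change < -tolerance:
        return "DECREASING"
    return "STABLE"
```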

Alerts

Get Alert Summary

GET /api/monitoring/alerts/summary/
Authorization: Token your-token-here

Response:

{
    "total_alerts": 25,
    "critical_alerts": 2,
    "high_alerts": 5,
    "medium_alerts": 8,
    "low_alerts": 10,
    "acknowledged_alerts": 15,
    "resolved_alerts": 20
}

Acknowledge Alert

POST /api/monitoring/alerts/{alert_id}/acknowledge/
Authorization: Token your-token-here

Resolve Alert

POST /api/monitoring/alerts/{alert_id}/resolve/
Authorization: Token your-token-here

System Overview

Get System Overview

GET /api/monitoring/overview/
Authorization: Token your-token-here

Response:

{
    "system_status": {
        "status": "OPERATIONAL",
        "message": "All systems operational",
        "started_at": "2024-01-15T09:00:00Z"
    },
    "health_summary": {
        "overall_status": "HEALTHY",
        "total_targets": 12,
        "healthy_targets": 12,
        "health_percentage": 100.0
    },
    "alert_summary": {
        "total_alerts": 0,
        "critical_alerts": 0
    },
    "system_resources": {
        "cpu_percent": 45.2,
        "memory_percent": 67.8,
        "disk_percent": 34.5
    }
}

Monitoring Tasks

Execute Monitoring Tasks

POST /api/monitoring/tasks/
Authorization: Token your-token-here
Content-Type: application/json

{
    "task_type": "health_checks"
}

Available task types:

health_checks: Execute health checks for all targets
metrics_collection: Collect metrics from all sources
alert_evaluation: Evaluate alert rules and send notifications
system_status_report: Generate system status report
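
A client may want to reject unknown task types before calling the endpoint. The sketch below builds the JSON body for the request shown above; the helper name build_task_body is hypothetical, and the set of task types is copied from the list above.

```python
import json

TASK_TYPES = {
    "health_checks",
    "metrics_collection",
    "alert_evaluation",
    "system_status_report",
}


def build_task_body(task_type: str) -> bytes:
    """Return the UTF-8 JSON body for POST /api/monitoring/tasks/."""
    if task_type not in TASK_TYPES:
        raise ValueError(
            f"unknown task_type {task_type!r}; expected one of {sorted(TASK_TYPES)}"
        )
    return json.dumps({"task_type": task_type}).encode("utf-8")
```

The returned bytes can be sent as the request body together with the Content-Type: application/json header.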

Data Models

MonitoringTarget

Represents a system, service, or component to monitor.

Fields:

name: Target name
target_type: Type (APPLICATION, DATABASE, CACHE, etc.)
endpoint_url: Health check endpoint
status: Current status (ACTIVE, INACTIVE, etc.)
last_status: Last health check result
health_check_enabled: Whether health checks are enabled

SystemMetric

Defines metrics to collect and monitor.

Fields:

name: Metric name
metric_type: Type (PERFORMANCE, BUSINESS, SECURITY, etc.)
category: Category (API_RESPONSE_TIME, MTTR, etc.)
unit: Unit of measurement
aggregation_method: How to aggregate values
warning_threshold: Warning threshold
critical_threshold: Critical threshold
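
To illustrate how warning_threshold and critical_threshold might be applied to a measurement, the sketch below assumes that higher values are worse and that either threshold may be unset. The function name and the returned labels are illustrative, not part of the module.

```python
from typing import Optional


def evaluate_threshold(value: float,
                       warning_threshold: Optional[float],
                       critical_threshold: Optional[float]) -> str:
    """Return 'CRITICAL', 'WARNING' or 'OK' for a single measurement."""
    # Check the critical threshold first so it wins when both are exceeded.
    if critical_threshold is not None and value >= critical_threshold:
        return "CRITICAL"
    if warning_threshold is not None and value >= warning_threshold:
        return "WARNING"
    return "OK"
```

For example, with a hypothetical warning threshold of 1000 and a critical threshold of 2000, a response time of 150.5 evaluates to OK and 1500 to WARNING. Metrics where lower values are worse would need the comparisons inverted.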

AlertRule

Defines alert conditions and notifications.

Fields:

name: Rule name
alert_type: Type (THRESHOLD, ANOMALY, etc.)
severity: Alert severity (LOW, MEDIUM, HIGH, CRITICAL)
condition: Alert condition configuration
notification_channels: Notification channels
is_enabled: Whether rule is enabled

Alert

Represents triggered alerts.

Fields:

title: Alert title
description: Alert description
severity: Alert severity
status: Alert status (TRIGGERED, ACKNOWLEDGED, RESOLVED)
triggered_value: Value that triggered the alert
threshold_value: Threshold that was exceeded
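
The three status values suggest a simple lifecycle. The sketch below models it as a small state machine; it assumes that a triggered alert may be resolved without being acknowledged first and that RESOLVED is final. Both rules, and the names used, are assumptions rather than documented behaviour.

```python
from enum import Enum


class AlertStatus(Enum):
    TRIGGERED = "TRIGGERED"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Assumed transition table: which statuses each status may move to.
ALLOWED_TRANSITIONS = {
    AlertStatus.TRIGGERED: {AlertStatus.ACKNOWLEDGED, AlertStatus.RESOLVED},
    AlertStatus.ACKNOWLEDGED: {AlertStatus.RESOLVED},
    AlertStatus.RESOLVED: set(),
}


def transition(current: AlertStatus, new: AlertStatus) -> AlertStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move alert from {current.value} to {new.value}")
    return new
```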

Configuration

Environment Variables

# Monitoring Settings
MONITORING_ENABLED=true
MONITORING_HEALTH_CHECK_INTERVAL=60
MONITORING_METRICS_COLLECTION_INTERVAL=300
MONITORING_ALERT_EVALUATION_INTERVAL=60

# Alerting Settings
ALERTING_EMAIL_FROM=monitoring@etb-api.com
ALERTING_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERTING_WEBHOOK_URL=https://your-webhook-url.com/alerts

# Performance Thresholds
PERFORMANCE_API_RESPONSE_THRESHOLD=2000
PERFORMANCE_CPU_THRESHOLD=80
PERFORMANCE_MEMORY_THRESHOLD=80
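
One way to read these variables in Python is sketched below. The MonitoringConfig class and helper functions are hypothetical; the defaults mirror the values shown above. The units are inferred, not stated: the intervals appear to be seconds (they match the Celery schedule), and the CPU and memory thresholds appear to be percentages.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean-like environment variable."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _env_int(name: str, default: int) -> int:
    """Parse an integer environment variable, falling back to `default`."""
    return int(os.environ.get(name, default))


@dataclass(frozen=True)
class MonitoringConfig:
    enabled: bool
    health_check_interval: int
    metrics_collection_interval: int
    alert_evaluation_interval: int
    cpu_threshold: int
    memory_threshold: int


def load_config() -> MonitoringConfig:
    """Read monitoring settings from the environment at call time."""
    return MonitoringConfig(
        enabled=_env_bool("MONITORING_ENABLED", True),
        health_check_interval=_env_int("MONITORING_HEALTH_CHECK_INTERVAL", 60),
        metrics_collection_interval=_env_int("MONITORING_METRICS_COLLECTION_INTERVAL", 300),
        alert_evaluation_interval=_env_int("MONITORING_ALERT_EVALUATION_INTERVAL", 60),
        cpu_threshold=_env_int("PERFORMANCE_CPU_THRESHOLD", 80),
        memory_threshold=_env_int("PERFORMANCE_MEMORY_THRESHOLD", 80),
    )
```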

Celery Configuration

Add to your Celery configuration:

from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'health-checks': {
        'task': 'monitoring.tasks.execute_health_checks',
        'schedule': 60.0,  # Every minute
    },
    'metrics-collection': {
        'task': 'monitoring.tasks.collect_metrics',
        'schedule': 300.0,  # Every 5 minutes
    },
    'alert-evaluation': {
        'task': 'monitoring.tasks.evaluate_alerts',
        'schedule': 60.0,  # Every minute
    },
    'data-cleanup': {
        'task': 'monitoring.tasks.cleanup_old_data',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
}

Setup Instructions

1. Install Dependencies

Add to requirements.txt:

psutil>=5.9.0
requests>=2.31.0

2. Run Migrations

python manage.py makemigrations monitoring
python manage.py migrate

3. Set Up Initial Configuration

python manage.py setup_monitoring --admin-user admin

4. Start Celery Workers

# Run each command in its own terminal; both stay in the foreground
celery -A core worker -l info
celery -A core beat -l info

5. Access Monitoring

Admin Interface: http://localhost:8000/admin/monitoring/
API Documentation: http://localhost:8000/api/monitoring/
System Overview: http://localhost:8000/api/monitoring/overview/

Monitoring Best Practices

1. Health Checks

Set check intervals that match each target's criticality; overly frequent checks add load (the example configuration runs them every 60 seconds)
Use timeouts to prevent hanging checks
Monitor dependencies and external services
Implement graceful degradation
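
The timeout advice above can be sketched as follows. This is not the module's HealthCheckService; it is a standalone, hypothetical check_http helper built on the standard library, with an injectable opener so it can be exercised without a network. The result keys and the HEALTHY/CRITICAL mapping are assumptions.

```python
import time
import urllib.request


def check_http(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Probe `url` once and report status, latency and any error message."""
    started = time.monotonic()
    try:
        # The timeout bounds each blocking socket operation, so a dead
        # endpoint cannot hang the check indefinitely.
        with opener(url, timeout=timeout) as response:
            healthy = 200 <= response.status < 300
            error = None
    except OSError as exc:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        healthy, error = False, str(exc)
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {
        "status": "HEALTHY" if healthy else "CRITICAL",
        "response_time_ms": round(elapsed_ms, 2),
        "error": error,
    }
```

Catching the failure and returning a result, instead of letting the exception propagate, lets one unreachable target be recorded without aborting the checks for the remaining targets.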

2. Metrics Collection

Collect metrics at appropriate intervals
Use proper aggregation methods
Set meaningful thresholds
Monitor both technical and business metrics

3. Alerting

Set up alert rules with appropriate severity levels
Use multiple notification channels
Prevent alert fatigue, for example by suppressing repeated notifications for the same condition
Regularly review and tune alert thresholds
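
One simple way to limit notification noise is a per-rule cooldown: after a notification is sent for a rule, further notifications for that rule are suppressed until the cooldown has elapsed. The AlertThrottle class below is a hypothetical sketch of that idea and is not part of the monitoring module; the 300-second default is arbitrary.

```python
import time


class AlertThrottle:
    """Suppress repeat notifications for the same rule within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_sent = {}  # rule name -> time of the last notification

    def should_notify(self, rule_name: str) -> bool:
        """Return True and record the send time if a notification is due."""
        now = self._clock()
        last = self._last_sent.get(rule_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[rule_name] = now
        return True
```

The state here lives in one process; with several Celery workers evaluating alerts, the last-sent times would need to be kept in shared storage such as the cache or database.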

4. Dashboards

Create role-based dashboards
Use appropriate refresh intervals
Include both real-time and historical data
Make dashboards actionable

Troubleshooting

Common Issues

Health Checks Failing

- Check network connectivity
- Verify endpoint URLs
- Check authentication credentials
- Review timeout settings

Metrics Not Collecting

- Verify Celery workers are running
- Check metric configuration
- Review collection intervals
- Check for errors in logs

Alerts Not Triggering

- Verify alert rules are enabled
- Check threshold values
- Review notification channel configuration
- Check alert evaluation task is running

Performance Issues

- Monitor system resources
- Check database query performance
- Review metric retention settings
- Optimize collection intervals

Debug Commands

# Check monitoring status
python manage.py shell
>>> from monitoring.services.health_checks import HealthCheckService
>>> service = HealthCheckService()
>>> service.get_system_health_summary()

# Test health checks
>>> from monitoring.models import MonitoringTarget
>>> target = MonitoringTarget.objects.first()
>>> service.execute_health_check(target, 'HTTP')

# Check metrics collection
>>> from monitoring.services.metrics_collector import MetricsCollector
>>> collector = MetricsCollector()
>>> collector.collect_all_metrics()

Integration with Other Modules

Security Module

Monitor authentication failures
Track security events
Monitor device posture assessments
Alert on risk assessment anomalies

Incident Intelligence

Monitor incident processing times
Track AI model performance
Monitor correlation engine health
Alert on incident volume spikes

Automation & Orchestration

Monitor runbook execution success
Track integration health
Monitor ChatOps command usage
Alert on automation failures

SLA & On-Call

Monitor SLA compliance
Track escalation times
Monitor on-call assignments
Alert on SLA breaches

Analytics & Predictive Insights

Monitor ML model accuracy
Track prediction performance
Monitor cost impact calculations
Alert on anomaly detections

Future Enhancements

Planned Features

Advanced Anomaly Detection: Machine learning-based anomaly detection
Predictive Alerting: Predict and prevent issues before they occur
Custom Metrics: User-defined custom metrics
Advanced Dashboards: Interactive dashboards with drill-down capabilities
Mobile App: Mobile monitoring application
Integration APIs: APIs for external monitoring tools
Cost Optimization: Resource usage optimization recommendations
Compliance Reporting: Automated compliance reporting

Integration Roadmap

APM Tools: New Relic, DataDog, AppDynamics
Log Aggregation: ELK Stack, Splunk, Fluentd
Infrastructure Monitoring: Prometheus, Grafana, InfluxDB
Cloud Platforms: AWS CloudWatch, Azure Monitor, GCP Monitoring
Communication Platforms: PagerDuty, OpsGenie, VictorOps
