Iliyan Angelov 6b247e5b9f Updates

2025-09-19 11:58:53 +03:00

SLA & On-Call Management API Documentation

Overview

The SLA & On-Call Management module tracks Service Level Agreements (SLAs), runs escalation policies, and manages on-call rotations for enterprise incident management systems.

Features

Dynamic SLAs

Incident Type-Based SLAs: Different SLA targets based on incident category, severity, and priority
Business Hours Support: SLA calculations that respect business hours and timezones
Multiple SLA Types: Response time, resolution time, acknowledgment time, and first response time
Automatic SLA Instance Creation: SLAs are automatically created when incidents are reported

Escalation Policies

Multi-Level Escalation: Configurable escalation steps with different actions and timing
Condition-Based Triggering: Escalations triggered by SLA breaches, thresholds, or custom conditions
Multi-Channel Notifications: Email, SMS, Slack, Teams, and webhook notifications
Integration with On-Call: Automatic escalation to current on-call personnel

On-Call Rotation Management

Flexible Scheduling: Weekly, daily, monthly, and custom rotation schedules
External System Integration: Built-in support for PagerDuty and OpsGenie
Handoff Management: Structured handoff processes with notes and tracking
Performance Metrics: Track incident handling and response times

Business Hours Management

Timezone Support: Multiple timezone configurations
Holiday Calendar: Holiday and special day handling
Day Overrides: Custom hours for specific dates
Weekend Configuration: Separate weekend business hours

API Endpoints

Business Hours Management

GET /api/sla-oncall/api/v1/business-hours/

List all business hours configurations.

Query Parameters:

is_active: Filter by active status
is_default: Filter by default status
timezone: Filter by timezone
search: Search by name or description

Response:

{
    "count": 2,
    "next": null,
    "previous": null,
    "results": [
        {
            "id": "uuid",
            "name": "Standard Business Hours",
            "description": "Standard 9-5 business hours",
            "timezone": "UTC",
            "weekday_start": "09:00:00",
            "weekday_end": "17:00:00",
            "weekend_start": "10:00:00",
            "weekend_end": "16:00:00",
            "day_overrides": {},
            "holiday_calendar": [],
            "is_active": true,
            "is_default": true,
            "created_at": "2024-01-01T00:00:00Z",
            "updated_at": "2024-01-01T00:00:00Z"
        }
    ]
}
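
The list response above uses a count/next/previous/results envelope. The following is a minimal sketch of walking every page, assuming `next` holds the URL of the following page and is null on the last one. `fetch_page` is a hypothetical callable standing in for whatever HTTP client you use, and the `?page=2` URL in the fake data is an assumption for illustration only.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_results(fetch_page: Callable[[str], Page], first_url: str) -> Iterator[Dict[str, Any]]:
    """Yield every item from a paginated list endpoint.

    fetch_page is a hypothetical callable: it takes a URL and returns the
    decoded JSON body (a dict with "results" and "next" keys).
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("results", [])
        url = page.get("next")  # None on the last page ends the loop


# Fake pages stand in for real HTTP responses so the sketch runs offline.
# The "?page=2" URL is an assumed pagination scheme, not taken from the docs.
BASE = "/api/sla-oncall/api/v1/business-hours/"
FAKE_PAGES = {
    BASE: {
        "count": 2,
        "next": BASE + "?page=2",
        "previous": None,
        "results": [{"name": "Standard Business Hours"}],
    },
    BASE + "?page=2": {
        "count": 2,
        "next": None,
        "previous": BASE,
        "results": [{"name": "Custom Business Hours"}],
    },
}

names = [item["name"] for item in iter_results(FAKE_PAGES.__getitem__, BASE)]
print(names)
```

Passing the fetch function in keeps the loop independent of any particular HTTP library and makes it easy to exercise without a running server.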

POST /api/sla-oncall/api/v1/business-hours/

Create a new business hours configuration.

Request Body:

{
    "name": "Custom Business Hours",
    "description": "Custom business hours for special team",
    "timezone": "America/New_York",
    "weekday_start": "08:00:00",
    "weekday_end": "18:00:00",
    "weekend_start": "10:00:00",
    "weekend_end": "16:00:00",
    "holiday_calendar": ["2024-12-25", "2024-01-01"],
    "is_active": true
}

POST /api/sla-oncall/api/v1/business-hours/{id}/test_business_hours/

Test if a given time is within business hours.

Request Body:

{
    "test_time": "2024-01-08T14:30:00Z"
}

Response:

{
    "is_business_hours": true,
    "test_time": "2024-01-08T14:30:00Z"
}
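
To make the check concrete, here is a rough sketch of how such a test could be evaluated from the configuration fields shown earlier. The semantics are assumptions, not confirmed behaviour of the module: holidays never count as business hours, Saturday and Sunday use the `weekend_*` window, the end time is exclusive, and `day_overrides` is ignored. The helper names `resolve_tz` and `is_business_hours` are hypothetical.

```python
from datetime import datetime, time, timezone, tzinfo
from zoneinfo import ZoneInfo


def resolve_tz(name: str) -> tzinfo:
    """Map a timezone name to a tzinfo; "UTC" avoids a tz database lookup."""
    return timezone.utc if name == "UTC" else ZoneInfo(name)


def is_business_hours(config: dict, moment: datetime) -> bool:
    """Return True if an aware datetime falls inside the configured hours.

    Assumed semantics: holidays are never business hours, weekends use the
    weekend_* window, and the end of the window is exclusive.
    """
    local = moment.astimezone(resolve_tz(config["timezone"]))
    if local.date().isoformat() in config.get("holiday_calendar", []):
        return False
    prefix = "weekend" if local.weekday() >= 5 else "weekday"
    start = time.fromisoformat(config[f"{prefix}_start"])
    end = time.fromisoformat(config[f"{prefix}_end"])
    return start <= local.time() < end


standard = {
    "timezone": "UTC",
    "weekday_start": "09:00:00",
    "weekday_end": "17:00:00",
    "weekend_start": "10:00:00",
    "weekend_end": "16:00:00",
    "holiday_calendar": ["2024-12-25", "2024-01-01"],
}

monday_afternoon = datetime(2024, 1, 8, 14, 30, tzinfo=timezone.utc)
new_years_day = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
saturday_morning = datetime(2024, 1, 13, 9, 30, tzinfo=timezone.utc)

print(is_business_hours(standard, monday_afternoon))
print(is_business_hours(standard, new_years_day))
print(is_business_hours(standard, saturday_morning))
```

Converting to the configured timezone before comparing is the important step: the same instant can be inside business hours in one region and outside them in another.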

SLA Definitions

GET /api/sla-oncall/api/v1/sla-definitions/

List all SLA definitions.

Query Parameters:

sla_type: Filter by SLA type (RESPONSE_TIME, RESOLUTION_TIME, etc.)
is_active: Filter by active status
business_hours_only: Filter by business hours requirement

Response:

{
    "count": 3,
    "results": [
        {
            "id": "uuid",
            "name": "Critical Incident Response",
            "description": "SLA for critical incidents",
            "sla_type": "RESPONSE_TIME",
            "incident_categories": ["SYSTEM", "NETWORK"],
            "incident_severities": ["CRITICAL", "EMERGENCY"],
            "incident_priorities": ["P1"],
            "target_duration_minutes": 15,
            "business_hours_only": false,
            "business_hours": null,
            "business_hours_name": null,
            "escalation_enabled": true,
            "escalation_threshold_percent": 75.0,
            "is_active": true,
            "is_default": false,
            "created_at": "2024-01-01T00:00:00Z"
        }
    ]
}

POST /api/sla-oncall/api/v1/sla-definitions/

Create a new SLA definition.

Request Body:

{
    "name": "High Priority Response",
    "description": "SLA for high priority incidents",
    "sla_type": "RESPONSE_TIME",
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
    "target_duration_minutes": 30,
    "business_hours_only": false,
    "escalation_enabled": true,
    "escalation_threshold_percent": 80.0,
    "is_active": true
}

POST /api/sla-oncall/api/v1/sla-definitions/{id}/test_applicability/

Test whether an SLA definition applies to a given incident.

Request Body:

{
    "category": "SYSTEM",
    "severity": "HIGH",
    "priority": "P2"
}

Response:

{
    "applies": true,
    "incident_data": {
        "category": "SYSTEM",
        "severity": "HIGH",
        "priority": "P2"
    }
}
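
One plausible reading of the matching rule is sketched below. It assumes that an empty or missing criteria list places no restriction on that attribute, while a non-empty list must contain the incident's value. That rule and the function name `sla_applies` are assumptions for illustration; the module's actual logic may differ.

```python
def sla_applies(definition: dict, incident: dict) -> bool:
    """Check an incident against an SLA definition's criteria lists.

    Assumption: an empty or missing list matches any value; a non-empty
    list must contain the incident's value for that attribute.
    """
    criteria = (
        ("incident_categories", "category"),
        ("incident_severities", "severity"),
        ("incident_priorities", "priority"),
    )
    for list_field, incident_field in criteria:
        allowed = definition.get(list_field) or []
        if allowed and incident.get(incident_field) not in allowed:
            return False
    return True


high_priority = {
    "name": "High Priority Response",
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
}
incident = {"category": "SYSTEM", "severity": "HIGH", "priority": "P2"}

print(sla_applies(high_priority, incident))
print(sla_applies(high_priority, {**incident, "severity": "LOW"}))
```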

On-Call Rotations

GET /api/sla-oncall/api/v1/oncall-rotations/

List all on-call rotations.

Query Parameters:

rotation_type: Filter by rotation type (WEEKLY, DAILY, etc.)
status: Filter by status (ACTIVE, PAUSED, INACTIVE)
external_system: Filter by external system integration

Response:

{
    "count": 1,
    "results": [
        {
            "id": "uuid",
            "name": "Primary On-Call Rotation",
            "description": "Primary rotation for incident response",
            "rotation_type": "WEEKLY",
            "status": "ACTIVE",
            "team_name": "Incident Response Team",
            "team_description": "Primary team responsible for incidents",
            "schedule_config": {
                "rotation_length_days": 7,
                "handoff_time": "09:00"
            },
            "timezone": "UTC",
            "external_system": "INTERNAL",
            "external_system_id": null,
            "integration_config": {},
            "current_oncall": {
                "user_id": "uuid",
                "username": "john.doe",
                "start_time": "2024-01-08T09:00:00Z",
                "end_time": "2024-01-15T09:00:00Z"
            },
            "created_at": "2024-01-01T00:00:00Z"
        }
    ]
}

GET /api/sla-oncall/api/v1/oncall-rotations/{id}/current_oncall/

Get the current on-call person for a rotation.

Response:

{
    "id": "uuid",
    "rotation": "uuid",
    "rotation_name": "Primary On-Call Rotation",
    "user": "uuid",
    "user_name": "john.doe",
    "user_email": "john.doe@company.com",
    "start_time": "2024-01-08T09:00:00Z",
    "end_time": "2024-01-15T09:00:00Z",
    "status": "ACTIVE",
    "incidents_handled": 5,
    "response_time_avg": "00:15:30"
}

GET /api/sla-oncall/api/v1/oncall-rotations/{id}/upcoming_assignments/

Get upcoming on-call assignments.

Query Parameters:

days: Number of days ahead to look (default: 30)
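
As an illustration of what a look-ahead over `days` might produce, the sketch below generates round-robin assignments for a fixed-length rotation. It assumes a simple cycle through the team in order, with each shift starting at the previous shift's end. The function name `upcoming_assignments` and the user names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from itertools import cycle


def upcoming_assignments(users, first_handoff, rotation_length_days=7, days=30):
    """Generate round-robin assignments that start within the next `days` days."""
    if rotation_length_days <= 0:
        raise ValueError("rotation_length_days must be positive")
    horizon = first_handoff + timedelta(days=days)
    shift = timedelta(days=rotation_length_days)
    assignments = []
    start = first_handoff
    for user in cycle(users):  # an empty user list yields no assignments
        if start >= horizon:
            break
        assignments.append(
            {"user": user, "start_time": start, "end_time": start + shift}
        )
        start += shift
    return assignments


first = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
schedule = upcoming_assignments(["alice", "bob", "carol"], first)
for entry in schedule:
    print(entry["user"], entry["start_time"].isoformat())
```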

SLA Instances

GET /api/sla-oncall/api/v1/sla-instances/

List all SLA instances.

Query Parameters:

status: Filter by status (ACTIVE, MET, BREACHED, CANCELLED)
escalation_triggered: Filter by escalation status
sla_definition: Filter by SLA definition

Response:

{
    "count": 10,
    "results": [
        {
            "id": "uuid",
            "sla_definition": "uuid",
            "sla_definition_name": "Critical Incident Response",
            "incident": "uuid",
            "incident_title": "Database Connection Failure",
            "status": "ACTIVE",
            "target_time": "2024-01-08T15:15:00Z",
            "started_at": "2024-01-08T15:00:00Z",
            "met_at": null,
            "breached_at": null,
            "escalation_policy": "uuid",
            "escalation_triggered": false,
            "escalation_triggered_at": null,
            "escalation_level": 0,
            "response_time": null,
            "resolution_time": null,
            "is_breached": false,
            "time_remaining": "00:12:30",
            "breach_time": "00:00:00",
            "created_at": "2024-01-08T15:00:00Z"
        }
    ]
}
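
The timing fields in this response can be reproduced with simple arithmetic. The sketch below assumes a wall-clock SLA (`business_hours_only` is false) and assumes that `escalation_threshold_percent` is measured as a fraction of the target duration elapsed since `started_at`. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def sla_timeline(started_at, target_duration_minutes, escalation_threshold_percent):
    """Return (target_time, escalation_time) for a wall-clock SLA."""
    target = timedelta(minutes=target_duration_minutes)
    escalate_after = target * (escalation_threshold_percent / 100)
    return started_at + target, started_at + escalate_after


def time_remaining(target_time, now):
    """Time left before the target, floored at zero once it has passed."""
    return max(target_time - now, timedelta(0))


started = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
target_time, escalation_time = sla_timeline(started, 15, 75.0)
now = datetime(2024, 1, 8, 15, 2, 30, tzinfo=timezone.utc)

print(target_time.isoformat())       # 2024-01-08T15:15:00+00:00
print(escalation_time.isoformat())   # 2024-01-08T15:11:15+00:00
print(time_remaining(target_time, now))  # 0:12:30
```

With a 15-minute target and a 75% threshold, escalation would fire 11 minutes 15 seconds after the SLA starts, leaving 3 minutes 45 seconds before the breach.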

GET /api/sla-oncall/api/v1/sla-instances/breached/

Get all breached SLA instances.

GET /api/sla-oncall/api/v1/sla-instances/at_risk/

Get active SLA instances at risk of breaching (target time within the next 15 minutes).

POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_met/

Mark an SLA instance as met.

Response:

{
    "message": "SLA marked as met"
}

POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_breached/

Mark an SLA instance as breached.

On-Call Assignments

GET /api/sla-oncall/api/v1/oncall-assignments/

List all on-call assignments.

Query Parameters:

rotation: Filter by rotation
user: Filter by user
status: Filter by status (SCHEDULED, ACTIVE, COMPLETED, CANCELLED)

POST /api/sla-oncall/api/v1/oncall-assignments/

Create a new on-call assignment.

Request Body:

{
    "rotation": "uuid",
    "user": "uuid",
    "start_time": "2024-01-15T09:00:00Z",
    "end_time": "2024-01-22T09:00:00Z",
    "handoff_notes": "All systems stable, no pending incidents"
}

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/handoff/

Perform on-call handoff.

Request Body:

{
    "handoff_notes": "Handing off to next person. 3 active incidents."
}

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/activate/

Activate a scheduled assignment.

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/complete/

Complete an active assignment.
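
The activate and complete actions imply a small status lifecycle. The sketch below encodes only the two transitions the endpoint descriptions state (SCHEDULED to ACTIVE, ACTIVE to COMPLETED). Rejecting every other combination is an assumption, as are the names `TRANSITIONS` and `apply_action`; how handoff and cancellation change the status is not covered here.

```python
# action -> (status required before the action, status after the action)
TRANSITIONS = {
    "activate": ("SCHEDULED", "ACTIVE"),
    "complete": ("ACTIVE", "COMPLETED"),
}


def apply_action(status: str, action: str) -> str:
    """Return the new status, or raise ValueError for an invalid transition."""
    if action not in TRANSITIONS:
        raise ValueError(f"unknown action: {action}")
    required, result = TRANSITIONS[action]
    if status != required:
        raise ValueError(f"cannot {action} an assignment in status {status}")
    return result


status = apply_action("SCHEDULED", "activate")
print(status)
status = apply_action(status, "complete")
print(status)
```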

Escalation Policies

GET /api/sla-oncall/api/v1/escalation-policies/

List all escalation policies.

Query Parameters:

escalation_type: Filter by escalation type
trigger_condition: Filter by trigger condition
is_active: Filter by active status

POST /api/sla-oncall/api/v1/escalation-policies/

Create a new escalation policy.

Request Body:

{
    "name": "Critical Escalation",
    "description": "Escalation for critical incidents",
    "escalation_type": "TIME_BASED",
    "trigger_condition": "SLA_THRESHOLD",
    "incident_severities": ["CRITICAL", "EMERGENCY"],
    "trigger_delay_minutes": 0,
    "escalation_steps": [
        {
            "level": 1,
            "delay_minutes": 5,
            "actions": ["notify_oncall", "notify_manager"],
            "channels": ["email", "sms"]
        },
        {
            "level": 2,
            "delay_minutes": 15,
            "actions": ["notify_director", "page_oncall"],
            "channels": ["email", "sms", "phone"]
        }
    ],
    "notification_channels": ["email", "sms", "phone"],
    "is_active": true
}
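
To see how the step delays in this policy translate into wall-clock times, the sketch below computes when each level would fire. It assumes each step's `delay_minutes` is measured from the moment the policy starts (trigger time plus `trigger_delay_minutes`) rather than from the previous step; the documentation does not say which, so treat this as one possible interpretation. The function name `escalation_schedule` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def escalation_schedule(triggered_at, trigger_delay_minutes, steps):
    """Return (level, fire_time) pairs ordered by level.

    Assumption: delays are relative to the policy start, not cumulative.
    """
    base = triggered_at + timedelta(minutes=trigger_delay_minutes)
    ordered = sorted(steps, key=lambda step: step["level"])
    return [
        (step["level"], base + timedelta(minutes=step["delay_minutes"]))
        for step in ordered
    ]


steps = [
    {"level": 1, "delay_minutes": 5, "channels": ["email", "sms"]},
    {"level": 2, "delay_minutes": 15, "channels": ["email", "sms", "phone"]},
]
triggered = datetime(2024, 1, 8, 15, 11, 15, tzinfo=timezone.utc)

for level, fire_time in escalation_schedule(triggered, 0, steps):
    print(level, fire_time.isoformat())
```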

Escalation Instances

GET /api/sla-oncall/api/v1/escalation-instances/

List all escalation instances.

Query Parameters:

status: Filter by status (PENDING, TRIGGERED, ACKNOWLEDGED, RESOLVED, CANCELLED)
escalation_level: Filter by escalation level
escalation_policy: Filter by escalation policy

POST /api/sla-oncall/api/v1/escalation-instances/{id}/acknowledge/

Acknowledge an escalation.

POST /api/sla-oncall/api/v1/escalation-instances/{id}/resolve/

Resolve an escalation.

Notification Templates

GET /api/sla-oncall/api/v1/notification-templates/

List all notification templates.

Query Parameters:

template_type: Filter by template type (ESCALATION, ONCALL_HANDOFF, etc.)
channel_type: Filter by channel type (EMAIL, SMS, SLACK, etc.)
is_active: Filter by active status

POST /api/sla-oncall/api/v1/notification-templates/

Create a new notification template.

Request Body:

{
    "name": "Email Escalation Alert",
    "template_type": "ESCALATION",
    "channel_type": "EMAIL",
    "subject_template": "URGENT: Incident #{incident_id} Escalated",
    "body_template": "Incident #{incident_id} has been escalated to Level {escalation_level}. Please respond immediately.",
    "variables": ["incident_id", "incident_title", "escalation_level"],
    "is_active": true,
    "is_default": true
}
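
The placeholders in this template look like Python `str.format` fields, with the `#` being literal text. Assuming that is the substitution style (the documentation does not confirm it), a rendering step could be sketched as follows. `render_template` is a hypothetical helper that also checks the template only uses variables listed in `variables`.

```python
from string import Formatter


def render_template(template: str, declared: list, context: dict) -> str:
    """Fill {name} placeholders, checking them against the declared variables."""
    used = {field for _, field, _, _ in Formatter().parse(template) if field}
    undeclared = used - set(declared)
    if undeclared:
        raise ValueError(f"template uses undeclared variables: {sorted(undeclared)}")
    missing = used - set(context)
    if missing:
        raise KeyError(f"missing values for: {sorted(missing)}")
    return template.format(**{name: context[name] for name in used})


variables = ["incident_id", "incident_title", "escalation_level"]
context = {"incident_id": 1042, "escalation_level": 2}

subject = render_template(
    "URGENT: Incident #{incident_id} Escalated", variables, context
)
body = render_template(
    "Incident #{incident_id} has been escalated to Level {escalation_level}. "
    "Please respond immediately.",
    variables,
    context,
)
print(subject)
print(body)
```

Validating against the declared list before rendering catches typos in placeholder names when a template is saved, instead of when a notification is sent.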

Setup and Configuration

Initial Setup

Run the setup command to create default configurations:

python manage.py setup_sla_oncall

This command creates:

Default business hours configurations
Standard SLA definitions for different incident types
Default escalation policies
Notification templates
Sample on-call rotation (if users exist)

Configuration Examples

Business Hours for Different Teams

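from datetime import time

# Model imports (BusinessHours, SLADefinition, ...) are omitted here;
# the module path depends on your project layout.

# 24/7 Operations
# Note: an end time of 23:59 may leave the last minute of the day uncovered,
# depending on how the end bound is evaluated.
business_hours = BusinessHours.objects.create(
    name='24/7 Operations',
    description='Always business hours',
    timezone='UTC',
    weekday_start=time(0, 0),
    weekday_end=time(23, 59),
    weekend_start=time(0, 0),
    weekend_end=time(23, 59),
)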

# EMEA Business Hours
business_hours = BusinessHours.objects.create(
    name='EMEA Business Hours',
    description='EMEA timezone business hours',
    timezone='Europe/London',
    weekday_start=time(9, 0),
    weekday_end=time(17, 0),
    weekend_start=time(10, 0),
    weekend_end=time(16, 0),
    holiday_calendar=['2024-12-25', '2024-01-01', '2024-04-19'],
)

SLA Definitions

# Critical incidents - 15 minute response
critical_sla = SLADefinition.objects.create(
    name='Critical Incident Response',
    description='SLA for critical and emergency incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    incident_priorities=['P1'],
    target_duration_minutes=15,
    business_hours_only=False,
    escalation_enabled=True,
    escalation_threshold_percent=75.0,
)

# Medium incidents - 2 hour response during business hours
medium_sla = SLADefinition.objects.create(
    name='Medium Priority Response',
    description='SLA for medium priority incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['MEDIUM'],
    incident_priorities=['P3'],
    target_duration_minutes=120,
    business_hours_only=True,
    business_hours=business_hours,  # a BusinessHours instance created earlier, e.g. the EMEA configuration
    escalation_enabled=True,
    escalation_threshold_percent=85.0,
)

Escalation Policies

# Critical escalation policy
escalation_policy = EscalationPolicy.objects.create(
    name='Critical Incident Escalation',
    description='Escalation for critical incidents',
    escalation_type='TIME_BASED',
    trigger_condition='SLA_THRESHOLD',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    trigger_delay_minutes=0,
    escalation_steps=[
        {
            'level': 1,
            'delay_minutes': 5,
            'actions': ['notify_oncall', 'notify_manager'],
            'channels': ['email', 'sms']
        },
        {
            'level': 2,
            'delay_minutes': 15,
            'actions': ['notify_director', 'page_oncall'],
            'channels': ['email', 'sms', 'phone']
        },
        {
            'level': 3,
            'delay_minutes': 30,
            'actions': ['notify_executive', 'escalate_to_vendor'],
            'channels': ['email', 'phone', 'webhook']
        }
    ],
    notification_channels=['email', 'sms', 'phone'],
)

On-Call Rotations

# Weekly rotation
rotation = OnCallRotation.objects.create(
    name='Primary On-Call Rotation',
    description='Primary rotation for incident response',
    rotation_type='WEEKLY',
    team_name='Incident Response Team',
    schedule_config={
        'rotation_length_days': 7,
        'handoff_time': '09:00',
        'timezone': 'UTC'
    },
    timezone='UTC',
)

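from datetime import timedelta
from django.utils import timezone

# Create an assignment covering the next seven days.
# `user` is assumed to be an existing user instance.
now = timezone.now()  # read the clock once so start and end are exactly 7 days apart
assignment = OnCallAssignment.objects.create(
    rotation=rotation,
    user=user,
    start_time=now,
    end_time=now + timedelta(days=7),
    status='ACTIVE',
)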

Integration with Other Modules

Incident Intelligence Integration

The SLA module automatically creates SLA instances when incidents are created:

# When an incident is created, applicable SLA definitions are found
# and SLA instances are automatically created
incident = Incident.objects.create(
    title='Database Connection Failure',
    description='Unable to connect to primary database',
    severity='CRITICAL',
    category='DATABASE',
    reporter=user,
)

# This automatically triggers SLA instance creation via signals

Automation Orchestration Integration

SLA breaches can trigger automation workflows:

# In automation_orchestration models, you can reference SLA instances
class RunbookExecution(models.Model):
    # ... existing fields ...
    sla_instance = models.ForeignKey(
        'sla_oncall.SLAInstance',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='runbook_executions'
    )

Security Integration

On-call assignments respect security clearances:

# Users with appropriate clearance levels can be assigned to sensitive incidents
if user.clearance_level.level >= incident.get_required_clearance_level():
    # User can be assigned to this incident
    assignment = OnCallAssignment.objects.create(...)

Monitoring and Alerting

SLA Breach Monitoring

Monitor SLA instances for breaches:

# Get all breached SLAs
breached_slas = SLAInstance.objects.filter(status='BREACHED')

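# Get SLAs at risk (target time within the next 15 minutes)
# Note: this also matches ACTIVE instances whose target time has already passed
# but which have not yet been marked as breached.
now = timezone.now()
at_risk_slas = SLAInstance.objects.filter(
    status='ACTIVE',
    target_time__lte=now + timedelta(minutes=15),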
)
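
The same at-risk window can be expressed without the ORM, which makes the boundary cases easier to see. The sketch below separates ACTIVE instances whose target has already passed from those that will reach it within the window. The dict-based instances and the function name `classify_active` are stand-ins for illustration, not part of the module.

```python
from datetime import datetime, timedelta, timezone


def classify_active(instances, now, window=timedelta(minutes=15)):
    """Split ACTIVE instances into (at_risk, overdue) lists by target_time."""
    at_risk, overdue = [], []
    for instance in instances:
        if instance["status"] != "ACTIVE":
            continue  # MET, BREACHED and CANCELLED instances are ignored
        if instance["target_time"] <= now:
            overdue.append(instance)
        elif instance["target_time"] <= now + window:
            at_risk.append(instance)
    return at_risk, overdue


now = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
instances = [
    {"id": "a", "status": "ACTIVE", "target_time": now + timedelta(minutes=10)},
    {"id": "b", "status": "ACTIVE", "target_time": now + timedelta(minutes=40)},
    {"id": "c", "status": "ACTIVE", "target_time": now - timedelta(minutes=5)},
    {"id": "d", "status": "MET", "target_time": now + timedelta(minutes=5)},
]

at_risk, overdue = classify_active(instances, now)
print([i["id"] for i in at_risk], [i["id"] for i in overdue])
```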

Escalation Monitoring

Monitor active escalations:

# Get all active escalations
active_escalations = EscalationInstance.objects.filter(
    status__in=['PENDING', 'TRIGGERED']
)

# Get escalations by level
level_2_escalations = EscalationInstance.objects.filter(
    escalation_level=2,
    status='TRIGGERED'
)

Performance Metrics

Track on-call performance:

from django.db.models import Avg
from django.utils import timezone

# Get current on-call assignments
now = timezone.now()
current_assignments = OnCallAssignment.objects.filter(
    status='ACTIVE',
    start_time__lte=now,
    end_time__gte=now,
)

# Average the per-assignment response_time_avg values
# (the result is None when no assignments match)
avg_response_time = current_assignments.aggregate(
    avg_response=Avg('response_time_avg')
)['avg_response']

Best Practices

SLA Definition Best Practices

Start Simple: Begin with basic SLAs and add complexity as needed
Business Hours Consideration: Use business hours for non-critical incidents
Escalation Thresholds: Set escalation thresholds at 75-85% of SLA time
Regular Review: Review and adjust SLAs based on performance data

On-Call Management Best Practices

Clear Handoffs: Use structured handoff processes with notes
Rotation Length: Keep each rotation one to two weeks long for optimal coverage
Backup Coverage: Always have backup on-call personnel
Training: Ensure on-call personnel are properly trained

Escalation Best Practices

Progressive Escalation: Use multiple levels with increasing urgency
Clear Actions: Define specific actions for each escalation level
Multiple Channels: Use multiple notification channels for critical escalations
Documentation: Document all escalation actions and outcomes

Troubleshooting

Common Issues

SLA Not Created: Check if SLA definition criteria match incident attributes
Escalation Not Triggered: Verify escalation policy is active and criteria match
On-Call Not Found: Ensure active assignments exist for the rotation
Business Hours Issues: Verify timezone configuration and business hours setup

Debugging Commands

# Check SLA instances for a specific incident
python manage.py shell
>>> # Import the models you need first; the module paths depend on your project layout.
>>> incident = Incident.objects.get(id='incident-id')
>>> sla_instances = incident.sla_instances.all()
>>> for sla in sla_instances:
...     print(f"SLA: {sla.sla_definition.name}, Status: {sla.status}")

# Check current on-call for a rotation
>>> rotation = OnCallRotation.objects.get(id='rotation-id')
>>> current = rotation.get_current_oncall()
>>> print(f"Current on-call: {current.user.username if current else 'None'}")

# Check business hours
>>> from django.utils import timezone
>>> business_hours = BusinessHours.objects.get(id='business-hours-id')
>>> now = timezone.now()
>>> print(f"Is business hours: {business_hours.is_business_hours(now)}")
