ETB/ETB-API/sla_oncall/Documentations/SLA_ONCALL_API.md
Iliyan Angelov 6b247e5b9f Updates
2025-09-19 11:58:53 +03:00

SLA & On-Call Management API Documentation

Overview

The SLA & On-Call Management module tracks Service Level Agreements (SLAs), applies escalation policies, and manages on-call rotations for enterprise incident management systems.

Features

Dynamic SLAs

  • Incident Type-Based SLAs: Different SLA targets based on incident category, severity, and priority
  • Business Hours Support: SLA calculations that respect business hours and timezones
  • Multiple SLA Types: Response time, resolution time, acknowledgment time, and first response time
  • Automatic SLA Instance Creation: SLA instances are created automatically when an incident is reported

Escalation Policies

  • Multi-Level Escalation: Configurable escalation steps with different actions and timing
  • Condition-Based Triggering: Escalations triggered by SLA breaches, thresholds, or custom conditions
  • Multi-Channel Notifications: Email, SMS, Slack, Teams, and webhook notifications
  • Integration with On-Call: Automatic escalation to current on-call personnel

On-Call Rotation Management

  • Flexible Scheduling: Weekly, daily, monthly, and custom rotation schedules
  • External System Integration: Built-in support for PagerDuty and OpsGenie
  • Handoff Management: Structured handoff processes with notes and tracking
  • Performance Metrics: Track incident handling and response times

Business Hours Management

  • Timezone Support: Multiple timezone configurations
  • Holiday Calendar: Holiday and special day handling
  • Day Overrides: Custom hours for specific dates
  • Weekend Configuration: Separate weekend business hours

API Endpoints

Business Hours Management

GET /api/sla-oncall/api/v1/business-hours/

List all business hours configurations.

Query Parameters:

  • is_active: Filter by active status
  • is_default: Filter by default status
  • timezone: Filter by timezone
  • search: Search by name or description

Response:

{
    "count": 2,
    "next": null,
    "previous": null,
    "results": [
        {
            "id": "uuid",
            "name": "Standard Business Hours",
            "description": "Standard 9-5 business hours",
            "timezone": "UTC",
            "weekday_start": "09:00:00",
            "weekday_end": "17:00:00",
            "weekend_start": "10:00:00",
            "weekend_end": "16:00:00",
            "day_overrides": {},
            "holiday_calendar": [],
            "is_active": true,
            "is_default": true,
            "created_at": "2024-01-01T00:00:00Z",
            "updated_at": "2024-01-01T00:00:00Z"
        }
    ]
}
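
The list endpoints in this document all take their filters as query parameters. The following is a minimal client-side sketch of building such a URL with the standard library. The host name and the `build_list_url` helper are hypothetical, and sending booleans as lower-case `true`/`false` is an assumption about what the API accepts.

```python
from urllib.parse import urlencode

BASE_PATH = "/api/sla-oncall/api/v1"  # path prefix used by the endpoints in this document


def build_list_url(resource, host="https://etb.example.com", **filters):
    """Return the list URL for `resource` with `filters` as a query string.

    `host` is a placeholder; replace it with your own deployment.
    """
    params = {}
    for key, value in filters.items():
        if value is None:
            continue  # skip filters that were not set
        # Assumption: booleans are sent as lower-case "true"/"false".
        params[key] = str(value).lower() if isinstance(value, bool) else value
    query = urlencode(params)
    url = f"{host}{BASE_PATH}/{resource}/"
    return f"{url}?{query}" if query else url


print(build_list_url("business-hours", is_active=True, timezone="UTC"))
```

The same helper can be used for the other list endpoints, for example `build_list_url("sla-instances", status="ACTIVE")`.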

POST /api/sla-oncall/api/v1/business-hours/

Create a new business hours configuration.

Request Body:

{
    "name": "Custom Business Hours",
    "description": "Custom business hours for special team",
    "timezone": "America/New_York",
    "weekday_start": "08:00:00",
    "weekday_end": "18:00:00",
    "weekend_start": "10:00:00",
    "weekend_end": "16:00:00",
    "holiday_calendar": ["2024-12-25", "2024-01-01"],
    "is_active": true
}

POST /api/sla-oncall/api/v1/business-hours/{id}/test_business_hours/

Test if a given time is within business hours.

Request Body:

{
    "test_time": "2024-01-08T14:30:00Z"
}

Response:

{
    "is_business_hours": true,
    "test_time": "2024-01-08T14:30:00Z"
}
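
To make the check above concrete, here is a rough sketch of how a business hours test could be evaluated from the configuration fields shown earlier. It is not the module's actual implementation. It assumes that holidays count as non-business time, that the end bound is exclusive, and that `day_overrides` is ignored. The `tz` argument stands in for the configured timezone; in practice a `zoneinfo.ZoneInfo` object could be passed instead of the fixed offsets used here.

```python
from datetime import datetime, time, timedelta, timezone


def is_business_hours(moment, config, tz=timezone.utc):
    """Return True if the aware datetime `moment` falls inside the configured hours."""
    local = moment.astimezone(tz)
    # Assumption: a date listed in holiday_calendar has no business hours at all.
    if local.date().isoformat() in config.get("holiday_calendar", []):
        return False
    # weekday() returns 0 for Monday and 6 for Sunday, so 5 and 6 are the weekend.
    prefix = "weekend" if local.weekday() >= 5 else "weekday"
    start = time.fromisoformat(config[f"{prefix}_start"])
    end = time.fromisoformat(config[f"{prefix}_end"])
    # Assumption: the start bound is inclusive and the end bound is exclusive.
    return start <= local.time() < end


standard = {
    "weekday_start": "09:00:00",
    "weekday_end": "17:00:00",
    "weekend_start": "10:00:00",
    "weekend_end": "16:00:00",
    "holiday_calendar": [],
}
# 2024-01-08 is a Monday, so 14:30 UTC falls inside 09:00-17:00.
print(is_business_hours(datetime(2024, 1, 8, 14, 30, tzinfo=timezone.utc), standard))
```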

SLA Definitions

GET /api/sla-oncall/api/v1/sla-definitions/

List all SLA definitions.

Query Parameters:

  • sla_type: Filter by SLA type (RESPONSE_TIME, RESOLUTION_TIME, etc.)
  • is_active: Filter by active status
  • business_hours_only: Filter by business hours requirement

Response:

{
    "count": 3,
    "results": [
        {
            "id": "uuid",
            "name": "Critical Incident Response",
            "description": "SLA for critical incidents",
            "sla_type": "RESPONSE_TIME",
            "incident_categories": ["SYSTEM", "NETWORK"],
            "incident_severities": ["CRITICAL", "EMERGENCY"],
            "incident_priorities": ["P1"],
            "target_duration_minutes": 15,
            "business_hours_only": false,
            "business_hours": null,
            "business_hours_name": null,
            "escalation_enabled": true,
            "escalation_threshold_percent": 75.0,
            "is_active": true,
            "is_default": false,
            "created_at": "2024-01-01T00:00:00Z"
        }
    ]
}

POST /api/sla-oncall/api/v1/sla-definitions/

Create a new SLA definition.

Request Body:

{
    "name": "High Priority Response",
    "description": "SLA for high priority incidents",
    "sla_type": "RESPONSE_TIME",
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
    "target_duration_minutes": 30,
    "business_hours_only": false,
    "escalation_enabled": true,
    "escalation_threshold_percent": 80.0,
    "is_active": true
}

POST /api/sla-oncall/api/v1/sla-definitions/{id}/test_applicability/

Test if SLA definition applies to a given incident.

Request Body:

{
    "category": "SYSTEM",
    "severity": "HIGH",
    "priority": "P2"
}

Response:

{
    "applies": true,
    "incident_data": {
        "category": "SYSTEM",
        "severity": "HIGH",
        "priority": "P2"
    }
}
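
The matching rule behind this endpoint is not spelled out here. The sketch below shows one plausible interpretation: each criteria list on the definition must contain the incident's value, and an empty or missing list is treated as "no restriction". Both the rule and the `sla_applies` helper are assumptions for illustration.

```python
def sla_applies(definition, incident):
    """Return True when every non-empty criteria list contains the incident's value."""
    criteria = (
        ("incident_categories", "category"),
        ("incident_severities", "severity"),
        ("incident_priorities", "priority"),
    )
    for list_field, incident_field in criteria:
        allowed = definition.get(list_field) or []
        # Assumption: an empty or missing list places no restriction on that attribute.
        if allowed and incident.get(incident_field) not in allowed:
            return False
    return True


high_priority = {
    "name": "High Priority Response",
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
}
incident = {"category": "SYSTEM", "severity": "HIGH", "priority": "P2"}
print(sla_applies(high_priority, incident))
```

Under this interpretation, the "High Priority Response" definition from the request example applies to the incident because it restricts severity and priority but not category.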

On-Call Rotations

GET /api/sla-oncall/api/v1/oncall-rotations/

List all on-call rotations.

Query Parameters:

  • rotation_type: Filter by rotation type (WEEKLY, DAILY, etc.)
  • status: Filter by status (ACTIVE, PAUSED, INACTIVE)
  • external_system: Filter by external system integration

Response:

{
    "count": 1,
    "results": [
        {
            "id": "uuid",
            "name": "Primary On-Call Rotation",
            "description": "Primary rotation for incident response",
            "rotation_type": "WEEKLY",
            "status": "ACTIVE",
            "team_name": "Incident Response Team",
            "team_description": "Primary team responsible for incidents",
            "schedule_config": {
                "rotation_length_days": 7,
                "handoff_time": "09:00"
            },
            "timezone": "UTC",
            "external_system": "INTERNAL",
            "external_system_id": null,
            "integration_config": {},
            "current_oncall": {
                "user_id": "uuid",
                "username": "john.doe",
                "start_time": "2024-01-08T09:00:00Z",
                "end_time": "2024-01-15T09:00:00Z"
            },
            "created_at": "2024-01-01T00:00:00Z"
        }
    ]
}

GET /api/sla-oncall/api/v1/oncall-rotations/{id}/current_oncall/

Get the current on-call person for a rotation.

Response:

{
    "id": "uuid",
    "rotation": "uuid",
    "rotation_name": "Primary On-Call Rotation",
    "user": "uuid",
    "user_name": "john.doe",
    "user_email": "john.doe@company.com",
    "start_time": "2024-01-08T09:00:00Z",
    "end_time": "2024-01-15T09:00:00Z",
    "status": "ACTIVE",
    "incidents_handled": 5,
    "response_time_avg": "00:15:30"
}

GET /api/sla-oncall/api/v1/oncall-rotations/{id}/upcoming_assignments/

Get upcoming on-call assignments.

Query Parameters:

  • days: Number of days ahead to look (default: 30)
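
As an illustration of what a look-ahead over a rotation involves, the sketch below generates back-to-back shifts from a `rotation_length_days` value and a first handoff time, cycling through team members in order. The member list, the `upcoming_shifts` helper, and the choice to include the shift currently in progress are assumptions; the endpoint's actual behavior may differ.

```python
from datetime import datetime, timedelta, timezone


def upcoming_shifts(first_handoff, members, now, rotation_length_days=7, days=30):
    """Return (member, start, end) tuples for shifts overlapping [now, now + days)."""
    length = timedelta(days=rotation_length_days)
    horizon = now + timedelta(days=days)
    # Index of the shift in progress at `now`; shifts follow each other without gaps.
    index = max(0, (now - first_handoff) // length)
    start = first_handoff + index * length
    shifts = []
    while start < horizon:
        shifts.append((members[index % len(members)], start, start + length))
        index += 1
        start += length
    return shifts


first = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)  # handoff_time 09:00 UTC
team = ["john.doe", "jane.roe", "sam.lee"]  # hypothetical usernames
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
for member, start, end in upcoming_shifts(first, team, now):
    print(member, start.isoformat(), end.isoformat())
```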

SLA Instances

GET /api/sla-oncall/api/v1/sla-instances/

List all SLA instances.

Query Parameters:

  • status: Filter by status (ACTIVE, MET, BREACHED, CANCELLED)
  • escalation_triggered: Filter by escalation status
  • sla_definition: Filter by SLA definition

Response:

{
    "count": 10,
    "results": [
        {
            "id": "uuid",
            "sla_definition": "uuid",
            "sla_definition_name": "Critical Incident Response",
            "incident": "uuid",
            "incident_title": "Database Connection Failure",
            "status": "ACTIVE",
            "target_time": "2024-01-08T15:15:00Z",
            "started_at": "2024-01-08T15:00:00Z",
            "met_at": null,
            "breached_at": null,
            "escalation_policy": "uuid",
            "escalation_triggered": false,
            "escalation_triggered_at": null,
            "escalation_level": 0,
            "response_time": null,
            "resolution_time": null,
            "is_breached": false,
            "time_remaining": "00:12:30",
            "breach_time": "00:00:00",
            "created_at": "2024-01-08T15:00:00Z"
        }
    ]
}
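
The time fields in this response are related by simple arithmetic. The sketch below derives `target_time`, `time_remaining`, and `breach_time` from `started_at` and `target_duration_minutes`, and also computes the moment at which the escalation threshold is reached. It assumes a plain wall-clock calculation and ignores `business_hours_only`; the `sla_timing` and `format_hms` helpers and the `escalate_at` key are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def format_hms(delta):
    """Format a non-negative timedelta as HH:MM:SS."""
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def sla_timing(started_at, target_minutes, threshold_percent, now):
    """Derive SLA timing values, ignoring business hours."""
    zero = timedelta(0)
    target_time = started_at + timedelta(minutes=target_minutes)
    # The threshold is a percentage of the target duration, measured from the start.
    escalate_at = started_at + timedelta(minutes=target_minutes * threshold_percent / 100)
    return {
        "target_time": target_time,
        "escalate_at": escalate_at,
        "is_breached": now > target_time,
        "time_remaining": format_hms(max(target_time - now, zero)),
        "breach_time": format_hms(max(now - target_time, zero)),
    }


started = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
timing = sla_timing(started, 15, 75.0, now=started + timedelta(minutes=2, seconds=30))
print(timing["time_remaining"])  # 00:12:30
```

With a 15-minute target and a 75% threshold, escalation would be due 11 minutes 15 seconds after `started_at`.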

GET /api/sla-oncall/api/v1/sla-instances/breached/

Get all breached SLA instances.

GET /api/sla-oncall/api/v1/sla-instances/at_risk/

Get active SLA instances at risk of breaching (target time within the next 15 minutes).

POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_met/

Mark an SLA instance as met.

Response:

{
    "message": "SLA marked as met"
}

POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_breached/

Mark an SLA instance as breached.

On-Call Assignments

GET /api/sla-oncall/api/v1/oncall-assignments/

List all on-call assignments.

Query Parameters:

  • rotation: Filter by rotation
  • user: Filter by user
  • status: Filter by status (SCHEDULED, ACTIVE, COMPLETED, CANCELLED)

POST /api/sla-oncall/api/v1/oncall-assignments/

Create a new on-call assignment.

Request Body:

{
    "rotation": "uuid",
    "user": "uuid",
    "start_time": "2024-01-15T09:00:00Z",
    "end_time": "2024-01-22T09:00:00Z",
    "handoff_notes": "All systems stable, no pending incidents"
}

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/handoff/

Perform on-call handoff.

Request Body:

{
    "handoff_notes": "Handing off to next person. 3 active incidents."
}

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/activate/

Activate a scheduled assignment.

POST /api/sla-oncall/api/v1/oncall-assignments/{id}/complete/

Complete an active assignment.
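
The activate and complete actions imply a small status lifecycle: SCHEDULED to ACTIVE, then ACTIVE to COMPLETED. The sketch below models only those two transitions. Rejecting an action from any other status is an assumption about how the API behaves, and the `apply_action` helper is hypothetical.

```python
# Required current status and resulting status for each action described above.
ALLOWED_TRANSITIONS = {
    "activate": ("SCHEDULED", "ACTIVE"),
    "complete": ("ACTIVE", "COMPLETED"),
}


def apply_action(status, action):
    """Return the new status after `action`, or raise ValueError if it is not allowed."""
    if action not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown action: {action}")
    required, result = ALLOWED_TRANSITIONS[action]
    # Assumption: an action is rejected unless the assignment is in the required status.
    if status != required:
        raise ValueError(f"cannot {action} an assignment in status {status}")
    return result


status = apply_action("SCHEDULED", "activate")
status = apply_action(status, "complete")
print(status)  # COMPLETED
```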

Escalation Policies

GET /api/sla-oncall/api/v1/escalation-policies/

List all escalation policies.

Query Parameters:

  • escalation_type: Filter by escalation type
  • trigger_condition: Filter by trigger condition
  • is_active: Filter by active status

POST /api/sla-oncall/api/v1/escalation-policies/

Create a new escalation policy.

Request Body:

{
    "name": "Critical Escalation",
    "description": "Escalation for critical incidents",
    "escalation_type": "TIME_BASED",
    "trigger_condition": "SLA_THRESHOLD",
    "incident_severities": ["CRITICAL", "EMERGENCY"],
    "trigger_delay_minutes": 0,
    "escalation_steps": [
        {
            "level": 1,
            "delay_minutes": 5,
            "actions": ["notify_oncall", "notify_manager"],
            "channels": ["email", "sms"]
        },
        {
            "level": 2,
            "delay_minutes": 15,
            "actions": ["notify_director", "page_oncall"],
            "channels": ["email", "sms", "phone"]
        }
    ],
    "notification_channels": ["email", "sms", "phone"],
    "is_active": true
}
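
To show how the step delays in such a policy might translate into concrete times, the sketch below computes when each level would fire. It assumes every `delay_minutes` value is measured from the trigger time plus `trigger_delay_minutes`, not from the previous step; if the delays are cumulative in the real implementation, the times will differ. The `escalation_schedule` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def escalation_schedule(policy, triggered_at):
    """Return (level, fire_time) pairs, ordered by level."""
    base = triggered_at + timedelta(minutes=policy.get("trigger_delay_minutes", 0))
    steps = sorted(policy["escalation_steps"], key=lambda step: step["level"])
    # Assumption: each step's delay counts from `base`, not from the previous step.
    return [(step["level"], base + timedelta(minutes=step["delay_minutes"])) for step in steps]


policy = {
    "trigger_delay_minutes": 0,
    "escalation_steps": [
        {"level": 1, "delay_minutes": 5},
        {"level": 2, "delay_minutes": 15},
    ],
}
triggered = datetime(2024, 1, 8, 15, 11, 15, tzinfo=timezone.utc)
for level, fire_time in escalation_schedule(policy, triggered):
    print(level, fire_time.isoformat())
```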

Escalation Instances

GET /api/sla-oncall/api/v1/escalation-instances/

List all escalation instances.

Query Parameters:

  • status: Filter by status (PENDING, TRIGGERED, ACKNOWLEDGED, RESOLVED, CANCELLED)
  • escalation_level: Filter by escalation level
  • escalation_policy: Filter by escalation policy

POST /api/sla-oncall/api/v1/escalation-instances/{id}/acknowledge/

Acknowledge an escalation.

POST /api/sla-oncall/api/v1/escalation-instances/{id}/resolve/

Resolve an escalation.

Notification Templates

GET /api/sla-oncall/api/v1/notification-templates/

List all notification templates.

Query Parameters:

  • template_type: Filter by template type (ESCALATION, ONCALL_HANDOFF, etc.)
  • channel_type: Filter by channel type (EMAIL, SMS, SLACK, etc.)
  • is_active: Filter by active status

POST /api/sla-oncall/api/v1/notification-templates/

Create a new notification template.

Request Body:

{
    "name": "Email Escalation Alert",
    "template_type": "ESCALATION",
    "channel_type": "EMAIL",
    "subject_template": "URGENT: Incident #{incident_id} Escalated",
    "body_template": "Incident #{incident_id} has been escalated to Level {escalation_level}. Please respond immediately.",
    "variables": ["incident_id", "incident_title", "escalation_level"],
    "is_active": true,
    "is_default": true
}
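
The placeholders in the templates above look like Python `str.format` fields. Assuming that is how they are substituted, the sketch below renders a template and checks that every placeholder is both listed in `variables` and present in the supplied values. The `render_template` helper is hypothetical and is not the module's renderer.

```python
from string import Formatter


def render_template(template, declared_variables, context):
    """Render `template` with `context`, validating its placeholders first."""
    # Collect the field names used in the template, e.g. {"incident_id"}.
    used = {name for _, name, _, _ in Formatter().parse(template) if name}
    undeclared = used - set(declared_variables)
    if undeclared:
        raise ValueError(f"placeholders not listed in variables: {sorted(undeclared)}")
    missing = used - set(context)
    if missing:
        raise ValueError(f"no value supplied for: {sorted(missing)}")
    return template.format(**{name: context[name] for name in used})


subject = render_template(
    "URGENT: Incident #{incident_id} Escalated",
    ["incident_id", "incident_title", "escalation_level"],
    {"incident_id": 42, "escalation_level": 2},
)
print(subject)  # URGENT: Incident #42 Escalated
```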

Setup and Configuration

Initial Setup

Run the setup command to create default configurations:

python manage.py setup_sla_oncall

This command creates:

  • Default business hours configurations
  • Standard SLA definitions for different incident types
  • Default escalation policies
  • Notification templates
  • Sample on-call rotation (if users exist)

Configuration Examples

Business Hours for Different Teams

from datetime import time

# 24/7 Operations
business_hours = BusinessHours.objects.create(
    name='24/7 Operations',
    description='Always business hours',
    timezone='UTC',
    weekday_start=time(0, 0),
    weekday_end=time(23, 59),
    weekend_start=time(0, 0),
    weekend_end=time(23, 59),
)

# EMEA Business Hours
business_hours = BusinessHours.objects.create(
    name='EMEA Business Hours',
    description='EMEA timezone business hours',
    timezone='Europe/London',
    weekday_start=time(9, 0),
    weekday_end=time(17, 0),
    weekend_start=time(10, 0),
    weekend_end=time(16, 0),
    holiday_calendar=['2024-12-25', '2024-01-01', '2024-04-19'],
)

SLA Definitions

# Critical incidents - 15 minute response
critical_sla = SLADefinition.objects.create(
    name='Critical Incident Response',
    description='SLA for critical and emergency incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    incident_priorities=['P1'],
    target_duration_minutes=15,
    business_hours_only=False,
    escalation_enabled=True,
    escalation_threshold_percent=75.0,
)

# Medium incidents - 2 hour response during business hours
medium_sla = SLADefinition.objects.create(
    name='Medium Priority Response',
    description='SLA for medium priority incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['MEDIUM'],
    incident_priorities=['P3'],
    target_duration_minutes=120,
    business_hours_only=True,
    business_hours=business_hours,
    escalation_enabled=True,
    escalation_threshold_percent=85.0,
)

Escalation Policies

# Critical escalation policy
escalation_policy = EscalationPolicy.objects.create(
    name='Critical Incident Escalation',
    description='Escalation for critical incidents',
    escalation_type='TIME_BASED',
    trigger_condition='SLA_THRESHOLD',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    trigger_delay_minutes=0,
    escalation_steps=[
        {
            'level': 1,
            'delay_minutes': 5,
            'actions': ['notify_oncall', 'notify_manager'],
            'channels': ['email', 'sms']
        },
        {
            'level': 2,
            'delay_minutes': 15,
            'actions': ['notify_director', 'page_oncall'],
            'channels': ['email', 'sms', 'phone']
        },
        {
            'level': 3,
            'delay_minutes': 30,
            'actions': ['notify_executive', 'escalate_to_vendor'],
            'channels': ['email', 'phone', 'webhook']
        }
    ],
    notification_channels=['email', 'sms', 'phone'],
)

On-Call Rotations

# Weekly rotation
rotation = OnCallRotation.objects.create(
    name='Primary On-Call Rotation',
    description='Primary rotation for incident response',
    rotation_type='WEEKLY',
    team_name='Incident Response Team',
    schedule_config={
        'rotation_length_days': 7,
        'handoff_time': '09:00',
        'timezone': 'UTC'
    },
    timezone='UTC',
)

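from datetime import timedelta
from django.utils import timezone

# Create an assignment starting now; `user` is an existing user instance
now = timezone.now()
assignment = OnCallAssignment.objects.create(
    rotation=rotation,
    user=user,
    start_time=now,
    end_time=now + timedelta(days=7),
    status='ACTIVE'
)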

Integration with Other Modules

Incident Intelligence Integration

The SLA module automatically creates SLA instances when incidents are created:

# When an incident is created, applicable SLA definitions are found
# and SLA instances are automatically created
incident = Incident.objects.create(
    title='Database Connection Failure',
    description='Unable to connect to primary database',
    severity='CRITICAL',
    category='DATABASE',
    reporter=user,
)

# This automatically triggers SLA instance creation via signals

Automation Orchestration Integration

SLA breaches can trigger automation workflows:

# In automation_orchestration models, you can reference SLA instances
class RunbookExecution(models.Model):
    # ... existing fields ...
    sla_instance = models.ForeignKey(
        'sla_oncall.SLAInstance',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='runbook_executions'
    )

Security Integration

On-call assignments respect security clearances:

# Users with appropriate clearance levels can be assigned to sensitive incidents
if user.clearance_level.level >= incident.get_required_clearance_level():
    # User can be assigned to this incident
    assignment = OnCallAssignment.objects.create(...)

Monitoring and Alerting

SLA Breach Monitoring

Monitor SLA instances for breaches:

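from datetime import timedelta
from django.utils import timezone

# Get all breached SLAs
breached_slas = SLAInstance.objects.filter(status='BREACHED')

# Get SLAs at risk (target time within the next 15 minutes).
# ACTIVE instances already past their target time are matched as well.
warning_time = timezone.now() + timedelta(minutes=15)
at_risk_slas = SLAInstance.objects.filter(
    status='ACTIVE',
    target_time__lte=warning_time
)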

Escalation Monitoring

Monitor active escalations:

# Get all active escalations
active_escalations = EscalationInstance.objects.filter(
    status__in=['PENDING', 'TRIGGERED']
)

# Get escalations by level
level_2_escalations = EscalationInstance.objects.filter(
    escalation_level=2,
    status='TRIGGERED'
)

Performance Metrics

Track on-call performance:

from django.db.models import Avg
from django.utils import timezone

# Get current on-call assignments
now = timezone.now()
current_assignments = OnCallAssignment.objects.filter(
    status='ACTIVE',
    start_time__lte=now,
    end_time__gte=now
)

# Average the per-assignment response time averages
avg_response_time = current_assignments.aggregate(
    avg_response=Avg('response_time_avg')
)['avg_response']

Best Practices

SLA Definition Best Practices

  1. Start Simple: Begin with basic SLAs and add complexity as needed
  2. Business Hours Consideration: Use business hours for non-critical incidents
  3. Escalation Thresholds: Set escalation thresholds at 75-85% of the SLA target duration
  4. Regular Review: Review and adjust SLAs based on performance data

On-Call Management Best Practices

  1. Clear Handoffs: Use structured handoff processes with notes
  2. Rotation Length: Keep rotations between 1 and 2 weeks for optimal coverage
  3. Backup Coverage: Always have backup on-call personnel
  4. Training: Ensure on-call personnel are properly trained

Escalation Best Practices

  1. Progressive Escalation: Use multiple levels with increasing urgency
  2. Clear Actions: Define specific actions for each escalation level
  3. Multiple Channels: Use multiple notification channels for critical escalations
  4. Documentation: Document all escalation actions and outcomes

Troubleshooting

Common Issues

  1. SLA Not Created: Check if SLA definition criteria match incident attributes
  2. Escalation Not Triggered: Verify escalation policy is active and criteria match
  3. On-Call Not Found: Ensure active assignments exist for the rotation
  4. Business Hours Issues: Verify timezone configuration and business hours setup

Debugging Commands

# Open a Django shell (import the models first if your shell does not auto-import them)
python manage.py shell

# Check SLA instances for a specific incident
>>> incident = Incident.objects.get(id='incident-id')
>>> sla_instances = incident.sla_instances.all()
>>> for sla in sla_instances:
...     print(f"SLA: {sla.sla_definition.name}, Status: {sla.status}")

# Check current on-call for a rotation
>>> rotation = OnCallRotation.objects.get(id='rotation-id')
>>> current = rotation.get_current_oncall()
>>> print(f"Current on-call: {current.user.username if current else 'None'}")

# Check business hours
>>> from django.utils import timezone
>>> business_hours = BusinessHours.objects.get(id='business-hours-id')
>>> now = timezone.now()
>>> print(f"Is business hours: {business_hours.is_business_hours(now)}")