SLA & On-Call Management API Documentation
Overview
The SLA & On-Call Management module provides comprehensive Service Level Agreement (SLA) tracking, escalation policies, and on-call rotation management for enterprise incident management systems.
Features
Dynamic SLAs
- Incident Type-Based SLAs: Different SLA targets based on incident category, severity, and priority
- Business Hours Support: SLA calculations that respect business hours and timezones
- Multiple SLA Types: Response time, resolution time, acknowledgment time, and first response time
- Automatic SLA Instance Creation: SLAs are automatically created when incidents are reported
Escalation Policies
- Multi-Level Escalation: Configurable escalation steps with different actions and timing
- Condition-Based Triggering: Escalations triggered by SLA breaches, thresholds, or custom conditions
- Multi-Channel Notifications: Email, SMS, Slack, Teams, and webhook notifications
- Integration with On-Call: Automatic escalation to current on-call personnel
On-Call Rotation Management
- Flexible Scheduling: Weekly, daily, monthly, and custom rotation schedules
- External System Integration: Built-in support for PagerDuty and OpsGenie
- Handoff Management: Structured handoff processes with notes and tracking
- Performance Metrics: Track incident handling and response times
Business Hours Management
- Timezone Support: Multiple timezone configurations
- Holiday Calendar: Holiday and special day handling
- Day Overrides: Custom hours for specific dates
- Weekend Configuration: Separate weekend business hours
API Endpoints
Business Hours Management
GET /api/sla-oncall/api/v1/business-hours/
List all business hours configurations.
Query Parameters:
- is_active: Filter by active status
- is_default: Filter by default status
- timezone: Filter by timezone
- search: Search by name or description
Response:
{
"count": 2,
"next": null,
"previous": null,
"results": [
{
"id": "uuid",
"name": "Standard Business Hours",
"description": "Standard 9-5 business hours",
"timezone": "UTC",
"weekday_start": "09:00:00",
"weekday_end": "17:00:00",
"weekend_start": "10:00:00",
"weekend_end": "16:00:00",
"day_overrides": {},
"holiday_calendar": [],
"is_active": true,
"is_default": true,
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-01T00:00:00Z"
}
]
}
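List responses use a paginated envelope (count, next, previous, results). A client can collect every item by following next until it is null. The sketch below is a minimal illustration: fetch_page is a hypothetical callable that returns the decoded JSON for a URL, and the ?page=2 link in the fake data is an assumption for the demo, not a documented parameter.

```python
from typing import Callable, Iterator, Optional


def iter_results(fetch_page: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield every item from a paginated list endpoint by following `next`."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("results", [])
        url = page.get("next")  # JSON null decodes to None and ends the loop


# Two fake pages stand in for a live server (the page links are illustrative).
FAKE_PAGES = {
    "/api/sla-oncall/api/v1/business-hours/": {
        "count": 2,
        "next": "/api/sla-oncall/api/v1/business-hours/?page=2",
        "previous": None,
        "results": [{"name": "Standard Business Hours"}],
    },
    "/api/sla-oncall/api/v1/business-hours/?page=2": {
        "count": 2,
        "next": None,
        "previous": "/api/sla-oncall/api/v1/business-hours/",
        "results": [{"name": "Custom Business Hours"}],
    },
}

names = [
    item["name"]
    for item in iter_results(FAKE_PAGES.__getitem__, "/api/sla-oncall/api/v1/business-hours/")
]
print(names)
```

In real use, fetch_page would wrap whichever HTTP client and authentication scheme the deployment requires.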
POST /api/sla-oncall/api/v1/business-hours/
Create a new business hours configuration.
Request Body:
{
"name": "Custom Business Hours",
"description": "Custom business hours for special team",
"timezone": "America/New_York",
"weekday_start": "08:00:00",
"weekday_end": "18:00:00",
"weekend_start": "10:00:00",
"weekend_end": "16:00:00",
"holiday_calendar": ["2024-12-25", "2024-01-01"],
"is_active": true
}
POST /api/sla-oncall/api/v1/business-hours/{id}/test_business_hours/
Test if a given time is within business hours.
Request Body:
{
"test_time": "2024-01-08T14:30:00Z"
}
Response:
{
"is_business_hours": true,
"test_time": "2024-01-08T14:30:00Z"
}
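The server-side check is not shown in this document. As a rough sketch of how the documented fields could combine, the example below converts the test time to the configured timezone, rejects holidays, and then compares against the weekday or weekend window. BusinessHoursConfig is a hypothetical stand-in for the real model; treating the end time as exclusive and holidays as fully closed are assumptions, and day_overrides is ignored for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from datetime import timezone as dt_timezone
from zoneinfo import ZoneInfo


@dataclass
class BusinessHoursConfig:
    """Illustrative stand-in mirroring the documented configuration fields."""
    timezone: str = "UTC"
    weekday_start: time = time(9, 0)
    weekday_end: time = time(17, 0)
    weekend_start: time = time(10, 0)
    weekend_end: time = time(16, 0)
    holiday_calendar: list = field(default_factory=list)  # ISO date strings


def _resolve_tz(name: str):
    # UTC needs no time zone database lookup; other names go through zoneinfo.
    return dt_timezone.utc if name == "UTC" else ZoneInfo(name)


def is_business_hours(config: BusinessHoursConfig, moment: datetime) -> bool:
    """Return True if the timezone-aware `moment` falls inside business hours."""
    local = moment.astimezone(_resolve_tz(config.timezone))
    if local.date().isoformat() in config.holiday_calendar:
        return False  # assumption: holidays have no business hours
    if local.weekday() < 5:  # Monday=0 ... Friday=4
        start, end = config.weekday_start, config.weekday_end
    else:
        start, end = config.weekend_start, config.weekend_end
    return start <= local.time() < end  # assumption: end time is exclusive


standard = BusinessHoursConfig(holiday_calendar=["2024-12-25", "2024-01-01"])
monday_afternoon = datetime(2024, 1, 8, 14, 30, tzinfo=dt_timezone.utc)
saturday_morning = datetime(2024, 1, 13, 9, 30, tzinfo=dt_timezone.utc)
christmas_noon = datetime(2024, 12, 25, 12, 0, tzinfo=dt_timezone.utc)

print(is_business_hours(standard, monday_afternoon))
print(is_business_hours(standard, saturday_morning))
print(is_business_hours(standard, christmas_noon))
```

The first call mirrors the request shown above: 14:30 UTC on a Monday falls inside the 09:00 to 17:00 weekday window.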
SLA Definitions
GET /api/sla-oncall/api/v1/sla-definitions/
List all SLA definitions.
Query Parameters:
- sla_type: Filter by SLA type (RESPONSE_TIME, RESOLUTION_TIME, etc.)
- is_active: Filter by active status
- business_hours_only: Filter by business hours requirement
Response:
{
"count": 3,
"results": [
{
"id": "uuid",
"name": "Critical Incident Response",
"description": "SLA for critical incidents",
"sla_type": "RESPONSE_TIME",
"incident_categories": ["SYSTEM", "NETWORK"],
"incident_severities": ["CRITICAL", "EMERGENCY"],
"incident_priorities": ["P1"],
"target_duration_minutes": 15,
"business_hours_only": false,
"business_hours": null,
"business_hours_name": null,
"escalation_enabled": true,
"escalation_threshold_percent": 75.0,
"is_active": true,
"is_default": false,
"created_at": "2024-01-01T00:00:00Z"
}
]
}
POST /api/sla-oncall/api/v1/sla-definitions/
Create a new SLA definition.
Request Body:
{
"name": "High Priority Response",
"description": "SLA for high priority incidents",
"sla_type": "RESPONSE_TIME",
"incident_severities": ["HIGH"],
"incident_priorities": ["P2"],
"target_duration_minutes": 30,
"business_hours_only": false,
"escalation_enabled": true,
"escalation_threshold_percent": 80.0,
"is_active": true
}
POST /api/sla-oncall/api/v1/sla-definitions/{id}/test_applicability/
Test if SLA definition applies to a given incident.
Request Body:
{
"category": "SYSTEM",
"severity": "HIGH",
"priority": "P2"
}
Response:
{
"applies": true,
"incident_data": {
"category": "SYSTEM",
"severity": "HIGH",
"priority": "P2"
}
}
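The matching rules behind test_applicability are not spelled out here. One plausible reading, sketched below, is that each criteria list restricts the matching field and an empty or missing list means no restriction. That rule is an assumption, and sla_applies is a hypothetical helper rather than the module's own function.

```python
def sla_applies(definition: dict, incident: dict) -> bool:
    """Return True if every non-empty criteria list contains the incident's value."""
    criteria = (
        ("incident_categories", "category"),
        ("incident_severities", "severity"),
        ("incident_priorities", "priority"),
    )
    for list_field, incident_field in criteria:
        allowed = definition.get(list_field) or []
        # Assumption: an empty or missing list places no restriction on that field.
        if allowed and incident.get(incident_field) not in allowed:
            return False
    return True


high_priority = {
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
}
critical = {
    "incident_categories": ["SYSTEM", "NETWORK"],
    "incident_severities": ["CRITICAL", "EMERGENCY"],
    "incident_priorities": ["P1"],
}
incident = {"category": "SYSTEM", "severity": "HIGH", "priority": "P2"}

print(sla_applies(high_priority, incident))
print(sla_applies(critical, incident))
```

Under this reading, the High Priority Response definition applies to the sample incident, while the Critical Incident Response definition does not because the severity and priority differ.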
On-Call Rotations
GET /api/sla-oncall/api/v1/oncall-rotations/
List all on-call rotations.
Query Parameters:
- rotation_type: Filter by rotation type (WEEKLY, DAILY, etc.)
- status: Filter by status (ACTIVE, PAUSED, INACTIVE)
- external_system: Filter by external system integration
Response:
{
"count": 1,
"results": [
{
"id": "uuid",
"name": "Primary On-Call Rotation",
"description": "Primary rotation for incident response",
"rotation_type": "WEEKLY",
"status": "ACTIVE",
"team_name": "Incident Response Team",
"team_description": "Primary team responsible for incidents",
"schedule_config": {
"rotation_length_days": 7,
"handoff_time": "09:00"
},
"timezone": "UTC",
"external_system": "INTERNAL",
"external_system_id": null,
"integration_config": {},
"current_oncall": {
"user_id": "uuid",
"username": "john.doe",
"start_time": "2024-01-08T09:00:00Z",
"end_time": "2024-01-15T09:00:00Z"
},
"created_at": "2024-01-01T00:00:00Z"
}
]
}
GET /api/sla-oncall/api/v1/oncall-rotations/{id}/current_oncall/
Get the current on-call person for a rotation.
Response:
{
"id": "uuid",
"rotation": "uuid",
"rotation_name": "Primary On-Call Rotation",
"user": "uuid",
"user_name": "john.doe",
"user_email": "john.doe@company.com",
"start_time": "2024-01-08T09:00:00Z",
"end_time": "2024-01-15T09:00:00Z",
"status": "ACTIVE",
"incidents_handled": 5,
"response_time_avg": "00:15:30"
}
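Duration values such as response_time_avg arrive as strings. For client-side reporting they can be parsed into timedelta objects and averaged. The sketch below assumes the plain HH:MM:SS form shown above (no day component or fractional seconds), and parse_duration is a hypothetical helper.

```python
from datetime import timedelta


def parse_duration(value: str) -> timedelta:
    """Parse an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def average_duration(values: list) -> timedelta:
    """Average a non-empty list of 'HH:MM:SS' strings."""
    durations = [parse_duration(value) for value in values]
    return sum(durations, timedelta()) / len(durations)


samples = ["00:15:30", "00:10:00", "00:20:30"]  # illustrative values
average = average_duration(samples)
print(average)
```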
GET /api/sla-oncall/api/v1/oncall-rotations/{id}/upcoming_assignments/
Get upcoming on-call assignments.
Query Parameters:
- days: Number of days ahead to look (default: 30)
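How upcoming assignments are derived is not documented here. As an illustration only, the sketch below generates a round-robin schedule from a rotation length and a first handoff time, keeping shifts that start within the look-ahead window. The round-robin order, the upcoming_assignments helper, and the usernames are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def upcoming_assignments(members, first_handoff, rotation_length_days=7, days=30):
    """Return round-robin shifts that start within `days` days of first_handoff."""
    if not members:
        raise ValueError("at least one team member is required")
    horizon = first_handoff + timedelta(days=days)
    shifts = []
    start = first_handoff
    index = 0
    while start < horizon:
        end = start + timedelta(days=rotation_length_days)
        shifts.append(
            {"user": members[index % len(members)], "start_time": start, "end_time": end}
        )
        start = end
        index += 1
    return shifts


team = ["john.doe", "jane.roe", "sam.lee"]  # hypothetical usernames
first = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
schedule = upcoming_assignments(team, first)
for shift in schedule:
    print(shift["user"], shift["start_time"].isoformat(), "->", shift["end_time"].isoformat())
```

The arithmetic is done in UTC. A rotation that hands off at a fixed local time in a zone with daylight saving would need to recompute each handoff in that zone instead of adding fixed 24-hour days.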
SLA Instances
GET /api/sla-oncall/api/v1/sla-instances/
List all SLA instances.
Query Parameters:
- status: Filter by status (ACTIVE, MET, BREACHED, CANCELLED)
- escalation_triggered: Filter by escalation status
- sla_definition: Filter by SLA definition
Response:
{
"count": 10,
"results": [
{
"id": "uuid",
"sla_definition": "uuid",
"sla_definition_name": "Critical Incident Response",
"incident": "uuid",
"incident_title": "Database Connection Failure",
"status": "ACTIVE",
"target_time": "2024-01-08T15:15:00Z",
"started_at": "2024-01-08T15:00:00Z",
"met_at": null,
"breached_at": null,
"escalation_policy": "uuid",
"escalation_triggered": false,
"escalation_triggered_at": null,
"escalation_level": 0,
"response_time": null,
"resolution_time": null,
"is_breached": false,
"time_remaining": "00:12:30",
"breach_time": "00:00:00",
"created_at": "2024-01-08T15:00:00Z"
}
]
}
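The timing fields in this response follow from the SLA definition. For an SLA that is not restricted to business hours, the target time is the start time plus target_duration_minutes, and the escalation threshold is a percentage of that duration. The sketch below works through the numbers from the example; sla_timing and the escalate_at key are hypothetical names, and treating a breach as strictly after the target time is an assumption.

```python
from datetime import datetime, timedelta, timezone


def sla_timing(started_at, target_duration_minutes, escalation_threshold_percent, now):
    """Compute wall-clock SLA timing values (no business hours adjustment)."""
    target_time = started_at + timedelta(minutes=target_duration_minutes)
    escalate_at = started_at + timedelta(
        minutes=target_duration_minutes * escalation_threshold_percent / 100
    )
    zero = timedelta(0)
    return {
        "target_time": target_time,
        "escalate_at": escalate_at,  # hypothetical field, not part of the API response
        "time_remaining": max(target_time - now, zero),
        "breach_time": max(now - target_time, zero),
        "is_breached": now > target_time,
    }


started = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 8, 15, 2, 30, tzinfo=timezone.utc)
timing = sla_timing(started, target_duration_minutes=15, escalation_threshold_percent=75.0, now=now)
for key, value in timing.items():
    print(key, value)
```

With a 15 minute target and a 75% threshold, escalation would start 11 minutes 15 seconds after the SLA begins, leaving 3 minutes 45 seconds before the breach.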
GET /api/sla-oncall/api/v1/sla-instances/breached/
Get all breached SLA instances.
GET /api/sla-oncall/api/v1/sla-instances/at_risk/
Get active SLA instances at risk of breaching (target time within the next 15 minutes).
POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_met/
Mark an SLA instance as met.
Response:
{
"message": "SLA marked as met"
}
POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_breached/
Mark an SLA instance as breached.
On-Call Assignments
GET /api/sla-oncall/api/v1/oncall-assignments/
List all on-call assignments.
Query Parameters:
- rotation: Filter by rotation
- user: Filter by user
- status: Filter by status (SCHEDULED, ACTIVE, COMPLETED, CANCELLED)
POST /api/sla-oncall/api/v1/oncall-assignments/
Create a new on-call assignment.
Request Body:
{
"rotation": "uuid",
"user": "uuid",
"start_time": "2024-01-15T09:00:00Z",
"end_time": "2024-01-22T09:00:00Z",
"handoff_notes": "All systems stable, no pending incidents"
}
POST /api/sla-oncall/api/v1/oncall-assignments/{id}/handoff/
Perform on-call handoff.
Request Body:
{
"handoff_notes": "Handing off to next person. 3 active incidents."
}
POST /api/sla-oncall/api/v1/oncall-assignments/{id}/activate/
Activate a scheduled assignment.
POST /api/sla-oncall/api/v1/oncall-assignments/{id}/complete/
Complete an active assignment.
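The activate and complete actions move an assignment between the documented statuses. The exact rules the server enforces are not listed here; the sketch below shows one plausible transition table, in which only a SCHEDULED assignment can be activated and only an ACTIVE one can be completed. Both the table and the apply_action helper are assumptions.

```python
# Assumed transition table built from the documented status values.
ALLOWED_TRANSITIONS = {
    "SCHEDULED": {"ACTIVE", "CANCELLED"},
    "ACTIVE": {"COMPLETED", "CANCELLED"},
    "COMPLETED": set(),
    "CANCELLED": set(),
}

# Status each documented action moves the assignment to.
ACTION_TARGETS = {"activate": "ACTIVE", "complete": "COMPLETED"}


def apply_action(current_status: str, action: str) -> str:
    """Return the new status, or raise ValueError if the action is not allowed."""
    target = ACTION_TARGETS[action]
    if target not in ALLOWED_TRANSITIONS[current_status]:
        raise ValueError(f"cannot {action} an assignment in status {current_status}")
    return target


status = apply_action("SCHEDULED", "activate")
print(status)
status = apply_action(status, "complete")
print(status)
```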
Escalation Policies
GET /api/sla-oncall/api/v1/escalation-policies/
List all escalation policies.
Query Parameters:
- escalation_type: Filter by escalation type
- trigger_condition: Filter by trigger condition
- is_active: Filter by active status
POST /api/sla-oncall/api/v1/escalation-policies/
Create a new escalation policy.
Request Body:
{
"name": "Critical Escalation",
"description": "Escalation for critical incidents",
"escalation_type": "TIME_BASED",
"trigger_condition": "SLA_THRESHOLD",
"incident_severities": ["CRITICAL", "EMERGENCY"],
"trigger_delay_minutes": 0,
"escalation_steps": [
{
"level": 1,
"delay_minutes": 5,
"actions": ["notify_oncall", "notify_manager"],
"channels": ["email", "sms"]
},
{
"level": 2,
"delay_minutes": 15,
"actions": ["notify_director", "page_oncall"],
"channels": ["email", "sms", "phone"]
}
],
"notification_channels": ["email", "sms", "phone"],
"is_active": true
}
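To make the step timing concrete, the sketch below turns a policy like the one above into a list of fire times. It assumes each step's delay_minutes is measured from the moment the policy triggers (after trigger_delay_minutes), not from the previous step; the document does not say which applies. escalation_schedule and the fire_at key are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def escalation_schedule(policy: dict, triggered_at: datetime) -> list:
    """Return the escalation steps in level order with an absolute fire time."""
    base = triggered_at + timedelta(minutes=policy.get("trigger_delay_minutes", 0))
    steps = sorted(policy["escalation_steps"], key=lambda step: step["level"])
    return [
        {
            "level": step["level"],
            # Assumption: delays are relative to the trigger, not cumulative.
            "fire_at": base + timedelta(minutes=step["delay_minutes"]),
            "actions": step["actions"],
            "channels": step["channels"],
        }
        for step in steps
    ]


policy = {
    "trigger_delay_minutes": 0,
    "escalation_steps": [
        {"level": 1, "delay_minutes": 5,
         "actions": ["notify_oncall", "notify_manager"], "channels": ["email", "sms"]},
        {"level": 2, "delay_minutes": 15,
         "actions": ["notify_director", "page_oncall"], "channels": ["email", "sms", "phone"]},
    ],
}
triggered = datetime(2024, 1, 8, 15, 11, 15, tzinfo=timezone.utc)
plan = escalation_schedule(policy, triggered)
for step in plan:
    print(step["level"], step["fire_at"].isoformat(), step["actions"])
```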
Escalation Instances
GET /api/sla-oncall/api/v1/escalation-instances/
List all escalation instances.
Query Parameters:
- status: Filter by status (PENDING, TRIGGERED, ACKNOWLEDGED, RESOLVED, CANCELLED)
- escalation_level: Filter by escalation level
- escalation_policy: Filter by escalation policy
POST /api/sla-oncall/api/v1/escalation-instances/{id}/acknowledge/
Acknowledge an escalation.
POST /api/sla-oncall/api/v1/escalation-instances/{id}/resolve/
Resolve an escalation.
Notification Templates
GET /api/sla-oncall/api/v1/notification-templates/
List all notification templates.
Query Parameters:
- template_type: Filter by template type (ESCALATION, ONCALL_HANDOFF, etc.)
- channel_type: Filter by channel type (EMAIL, SMS, SLACK, etc.)
- is_active: Filter by active status
POST /api/sla-oncall/api/v1/notification-templates/
Create a new notification template.
Request Body:
{
"name": "Email Escalation Alert",
"template_type": "ESCALATION",
"channel_type": "EMAIL",
"subject_template": "URGENT: Incident #{incident_id} Escalated",
"body_template": "Incident #{incident_id} has been escalated to Level {escalation_level}. Please respond immediately.",
"variables": ["incident_id", "incident_title", "escalation_level"],
"is_active": true,
"is_default": true
}
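The placeholders in subject_template and body_template use a {name} form that happens to be compatible with Python's str.format. The rendering engine the module actually uses is not documented here, so the sketch below is only an illustration of how the variables list could be validated and applied. render_template and the sample values are hypothetical.

```python
import string


def render_template(template: str, variables: list, context: dict) -> str:
    """Fill {name} placeholders after checking them against the declared variables."""
    used = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    undeclared = used - set(variables)
    if undeclared:
        raise ValueError(f"template uses undeclared variables: {sorted(undeclared)}")
    missing = used - set(context)
    if missing:
        raise KeyError(f"missing values for: {sorted(missing)}")
    return template.format(**{name: context[name] for name in used})


variables = ["incident_id", "incident_title", "escalation_level"]
context = {  # illustrative values
    "incident_id": "1042",
    "incident_title": "Database Connection Failure",
    "escalation_level": 2,
}
subject = render_template("URGENT: Incident #{incident_id} Escalated", variables, context)
body = render_template(
    "Incident #{incident_id} has been escalated to Level {escalation_level}. "
    "Please respond immediately.",
    variables,
    context,
)
print(subject)
print(body)
```

Declared variables that a template does not use, such as incident_title here, are simply ignored.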
Setup and Configuration
Initial Setup
Run the setup command to create default configurations:
python manage.py setup_sla_oncall
This command creates:
- Default business hours configurations
- Standard SLA definitions for different incident types
- Default escalation policies
- Notification templates
- Sample on-call rotation (if users exist)
Configuration Examples
Business Hours for Different Teams
from datetime import time

# 24/7 Operations
business_hours = BusinessHours.objects.create(
    name='24/7 Operations',
    description='Always business hours',
    timezone='UTC',
    weekday_start=time(0, 0),
    weekday_end=time(23, 59),
    weekend_start=time(0, 0),
    weekend_end=time(23, 59),
)
# EMEA Business Hours
business_hours = BusinessHours.objects.create(
    name='EMEA Business Hours',
    description='EMEA timezone business hours',
    timezone='Europe/London',
    weekday_start=time(9, 0),
    weekday_end=time(17, 0),
    weekend_start=time(10, 0),
    weekend_end=time(16, 0),
    holiday_calendar=['2024-12-25', '2024-01-01', '2024-04-19'],
)
SLA Definitions
# Critical incidents - 15 minute response
critical_sla = SLADefinition.objects.create(
    name='Critical Incident Response',
    description='SLA for critical and emergency incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    incident_priorities=['P1'],
    target_duration_minutes=15,
    business_hours_only=False,
    escalation_enabled=True,
    escalation_threshold_percent=75.0,
)
# Medium incidents - 2 hour response during business hours
medium_sla = SLADefinition.objects.create(
    name='Medium Priority Response',
    description='SLA for medium priority incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['MEDIUM'],
    incident_priorities=['P3'],
    target_duration_minutes=120,
    business_hours_only=True,
    business_hours=business_hours,
    escalation_enabled=True,
    escalation_threshold_percent=85.0,
)
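With business_hours_only=True, the 120 minute target only counts time inside business hours, so the deadline can land on a later day. The module's own calculation is not shown in this document; the sketch below is one straightforward way to do it, walking day by day and consuming the open window of each. add_business_minutes is a hypothetical helper, the start time is assumed to already be in the configuration's timezone, and day_overrides are ignored.

```python
from datetime import datetime, time, timedelta, timezone


def add_business_minutes(start, minutes, weekday_hours, weekend_hours, holidays=frozenset()):
    """Return the moment when `minutes` of business time have elapsed after `start`."""
    remaining = timedelta(minutes=minutes)
    current = start
    for _ in range(366):  # safety bound so a misconfiguration cannot loop forever
        day = current.date()
        if day.isoformat() not in holidays:  # assumption: holidays are fully closed
            opens, closes = weekday_hours if day.weekday() < 5 else weekend_hours
            window_open = datetime.combine(day, opens, tzinfo=current.tzinfo)
            window_close = datetime.combine(day, closes, tzinfo=current.tzinfo)
            begin = max(current, window_open)
            if begin < window_close:
                available = window_close - begin
                if remaining <= available:
                    return begin + remaining
                remaining -= available
        # Nothing (or not enough) left today: continue from the next midnight.
        current = datetime.combine(day + timedelta(days=1), time(0, 0), tzinfo=current.tzinfo)
    raise ValueError("target not reached within 366 days")


weekday_hours = (time(9, 0), time(17, 0))
weekend_hours = (time(10, 0), time(16, 0))

friday_late = datetime(2024, 1, 12, 16, 30, tzinfo=timezone.utc)
deadline = add_business_minutes(friday_late, 120, weekday_hours, weekend_hours)
print(deadline.isoformat())

christmas_eve = datetime(2024, 12, 24, 16, 0, tzinfo=timezone.utc)
after_holiday = add_business_minutes(
    christmas_eve, 120, weekday_hours, weekend_hours, holidays={"2024-12-25"}
)
print(after_holiday.isoformat())
```

In the first case, 30 minutes are consumed on Friday before 17:00 and the remaining 90 minutes run from Saturday 10:00, giving a deadline of 11:30 on Saturday. In the second, the holiday is skipped entirely and the deadline falls at 10:00 on 26 December.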
Escalation Policies
# Critical escalation policy
escalation_policy = EscalationPolicy.objects.create(
    name='Critical Incident Escalation',
    description='Escalation for critical incidents',
    escalation_type='TIME_BASED',
    trigger_condition='SLA_THRESHOLD',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    trigger_delay_minutes=0,
    escalation_steps=[
        {
            'level': 1,
            'delay_minutes': 5,
            'actions': ['notify_oncall', 'notify_manager'],
            'channels': ['email', 'sms']
        },
        {
            'level': 2,
            'delay_minutes': 15,
            'actions': ['notify_director', 'page_oncall'],
            'channels': ['email', 'sms', 'phone']
        },
        {
            'level': 3,
            'delay_minutes': 30,
            'actions': ['notify_executive', 'escalate_to_vendor'],
            'channels': ['email', 'phone', 'webhook']
        }
    ],
    notification_channels=['email', 'sms', 'phone'],
)
On-Call Rotations
from datetime import timedelta

from django.utils import timezone

# Weekly rotation
rotation = OnCallRotation.objects.create(
    name='Primary On-Call Rotation',
    description='Primary rotation for incident response',
    rotation_type='WEEKLY',
    team_name='Incident Response Team',
    schedule_config={
        'rotation_length_days': 7,
        'handoff_time': '09:00',
        'timezone': 'UTC'
    },
    timezone='UTC',
)
# Create assignments (read the clock once so the shift is exactly 7 days long)
now = timezone.now()
assignment = OnCallAssignment.objects.create(
    rotation=rotation,
    user=user,
    start_time=now,
    end_time=now + timedelta(days=7),
    status='ACTIVE'
)
Integration with Other Modules
Incident Intelligence Integration
The SLA module automatically creates SLA instances when incidents are created:
# When an incident is created, applicable SLA definitions are found
# and SLA instances are automatically created
incident = Incident.objects.create(
    title='Database Connection Failure',
    description='Unable to connect to primary database',
    severity='CRITICAL',
    category='DATABASE',
    reporter=user,
)
# This automatically triggers SLA instance creation via signals
Automation Orchestration Integration
SLA breaches can trigger automation workflows:
# In automation_orchestration models, you can reference SLA instances
from django.db import models


class RunbookExecution(models.Model):
    # ... existing fields ...
    sla_instance = models.ForeignKey(
        'sla_oncall.SLAInstance',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='runbook_executions'
    )
Security Integration
On-call assignments respect security clearances:
# Users with appropriate clearance levels can be assigned to sensitive incidents
if user.clearance_level.level >= incident.get_required_clearance_level():
    # User can be assigned to this incident
    assignment = OnCallAssignment.objects.create(...)
Monitoring and Alerting
SLA Breach Monitoring
Monitor SLA instances for breaches:
from datetime import timedelta

from django.utils import timezone

# Get all breached SLAs
breached_slas = SLAInstance.objects.filter(status='BREACHED')
# Get SLAs at risk (within 15 minutes of breach)
warning_time = timezone.now() + timedelta(minutes=15)
at_risk_slas = SLAInstance.objects.filter(
    status='ACTIVE',
    target_time__lte=warning_time
)
Escalation Monitoring
Monitor active escalations:
# Get all active escalations
active_escalations = EscalationInstance.objects.filter(
    status__in=['PENDING', 'TRIGGERED']
)
# Get escalations by level
level_2_escalations = EscalationInstance.objects.filter(
    escalation_level=2,
    status='TRIGGERED'
)
Performance Metrics
Track on-call performance:
from django.db.models import Avg
from django.utils import timezone

# Get current on-call assignments (read the clock once for both bounds)
now = timezone.now()
current_assignments = OnCallAssignment.objects.filter(
    status='ACTIVE',
    start_time__lte=now,
    end_time__gte=now
)
# Calculate average response time
avg_response_time = current_assignments.aggregate(
    avg_response=Avg('response_time_avg')
)['avg_response']
Best Practices
SLA Definition Best Practices
- Start Simple: Begin with basic SLAs and add complexity as needed
- Business Hours Consideration: Use business hours for non-critical incidents
- Escalation Thresholds: Set escalation thresholds at 75-85% of SLA time
- Regular Review: Review and adjust SLAs based on performance data
On-Call Management Best Practices
- Clear Handoffs: Use structured handoff processes with notes
- Rotation Length: Keep rotations between 1 and 2 weeks for optimal coverage
- Backup Coverage: Always have backup on-call personnel
- Training: Ensure on-call personnel are properly trained
Escalation Best Practices
- Progressive Escalation: Use multiple levels with increasing urgency
- Clear Actions: Define specific actions for each escalation level
- Multiple Channels: Use multiple notification channels for critical escalations
- Documentation: Document all escalation actions and outcomes
Troubleshooting
Common Issues
- SLA Not Created: Check if SLA definition criteria match incident attributes
- Escalation Not Triggered: Verify escalation policy is active and criteria match
- On-Call Not Found: Ensure active assignments exist for the rotation
- Business Hours Issues: Verify timezone configuration and business hours setup
Debugging Commands
# Check SLA instances for a specific incident
python manage.py shell
>>> incident = Incident.objects.get(id='incident-id')
>>> sla_instances = incident.sla_instances.all()
>>> for sla in sla_instances:
...     print(f"SLA: {sla.sla_definition.name}, Status: {sla.status}")
...
# Check current on-call for a rotation
>>> rotation = OnCallRotation.objects.get(id='rotation-id')
>>> current = rotation.get_current_oncall()
>>> print(f"Current on-call: {current.user.username if current else 'None'}")
# Check business hours
>>> from django.utils import timezone
>>> business_hours = BusinessHours.objects.get(id='business-hours-id')
>>> now = timezone.now()
>>> print(f"Is business hours: {business_hours.is_business_hours(now)}")