# SLA & On-Call Management API Documentation

## Overview

The SLA & On-Call Management module provides comprehensive Service Level Agreement (SLA) tracking, escalation policies, and on-call rotation management for enterprise incident management systems.

## Features

### Dynamic SLAs

- **Incident Type-Based SLAs**: Different SLA targets based on incident category, severity, and priority
- **Business Hours Support**: SLA calculations that respect business hours and timezones
- **Multiple SLA Types**: Response time, resolution time, acknowledgment time, and first response time
- **Automatic SLA Instance Creation**: SLAs are automatically created when incidents are reported

### Escalation Policies

- **Multi-Level Escalation**: Configurable escalation steps with different actions and timing
- **Condition-Based Triggering**: Escalations triggered by SLA breaches, thresholds, or custom conditions
- **Multi-Channel Notifications**: Email, SMS, Slack, Teams, and webhook notifications
- **Integration with On-Call**: Automatic escalation to current on-call personnel

### On-Call Rotation Management

- **Flexible Scheduling**: Weekly, daily, monthly, and custom rotation schedules
- **External System Integration**: Built-in support for PagerDuty and OpsGenie
- **Handoff Management**: Structured handoff processes with notes and tracking
- **Performance Metrics**: Track incident handling and response times

### Business Hours Management

- **Timezone Support**: Multiple timezone configurations
- **Holiday Calendar**: Holiday and special day handling
- **Day Overrides**: Custom hours for specific dates
- **Weekend Configuration**: Separate weekend business hours

## API Endpoints

### Business Hours Management

#### GET /api/sla-oncall/api/v1/business-hours/

List all business hours configurations.

**Query Parameters:**

- `is_active`: Filter by active status
- `is_default`: Filter by default status
- `timezone`: Filter by timezone
- `search`: Search by name or description

**Response:**

```json
{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "uuid",
      "name": "Standard Business Hours",
      "description": "Standard 9-5 business hours",
      "timezone": "UTC",
      "weekday_start": "09:00:00",
      "weekday_end": "17:00:00",
      "weekend_start": "10:00:00",
      "weekend_end": "16:00:00",
      "day_overrides": {},
      "holiday_calendar": [],
      "is_active": true,
      "is_default": true,
      "created_at": "2024-01-01T00:00:00Z",
      "updated_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```

#### POST /api/sla-oncall/api/v1/business-hours/

Create a new business hours configuration.

**Request Body:**

```json
{
  "name": "Custom Business Hours",
  "description": "Custom business hours for special team",
  "timezone": "America/New_York",
  "weekday_start": "08:00:00",
  "weekday_end": "18:00:00",
  "weekend_start": "10:00:00",
  "weekend_end": "16:00:00",
  "holiday_calendar": ["2024-12-25", "2024-01-01"],
  "is_active": true
}
```

#### POST /api/sla-oncall/api/v1/business-hours/{id}/test_business_hours/

Test if a given time is within business hours.

**Request Body:**

```json
{
  "test_time": "2024-01-08T14:30:00Z"
}
```

**Response:**

```json
{
  "is_business_hours": true,
  "test_time": "2024-01-08T14:30:00Z"
}
```

### SLA Definitions

#### GET /api/sla-oncall/api/v1/sla-definitions/

List all SLA definitions.

**Query Parameters:**

- `sla_type`: Filter by SLA type (RESPONSE_TIME, RESOLUTION_TIME, etc.)
- `is_active`: Filter by active status
- `business_hours_only`: Filter by business hours requirement

**Response:**

```json
{
  "count": 3,
  "results": [
    {
      "id": "uuid",
      "name": "Critical Incident Response",
      "description": "SLA for critical incidents",
      "sla_type": "RESPONSE_TIME",
      "incident_categories": ["SYSTEM", "NETWORK"],
      "incident_severities": ["CRITICAL", "EMERGENCY"],
      "incident_priorities": ["P1"],
      "target_duration_minutes": 15,
      "business_hours_only": false,
      "business_hours": null,
      "business_hours_name": null,
      "escalation_enabled": true,
      "escalation_threshold_percent": 75.0,
      "is_active": true,
      "is_default": false,
      "created_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```

#### POST /api/sla-oncall/api/v1/sla-definitions/

Create a new SLA definition.

**Request Body:**

```json
{
  "name": "High Priority Response",
  "description": "SLA for high priority incidents",
  "sla_type": "RESPONSE_TIME",
  "incident_severities": ["HIGH"],
  "incident_priorities": ["P2"],
  "target_duration_minutes": 30,
  "business_hours_only": false,
  "escalation_enabled": true,
  "escalation_threshold_percent": 80.0,
  "is_active": true
}
```

#### POST /api/sla-oncall/api/v1/sla-definitions/{id}/test_applicability/

Test if SLA definition applies to a given incident.

**Request Body:**

```json
{
  "category": "SYSTEM",
  "severity": "HIGH",
  "priority": "P2"
}
```

**Response:**

```json
{
  "applies": true,
  "incident_data": {
    "category": "SYSTEM",
    "severity": "HIGH",
    "priority": "P2"
  }
}
```

### On-Call Rotations

#### GET /api/sla-oncall/api/v1/oncall-rotations/

List all on-call rotations.

**Query Parameters:**

- `rotation_type`: Filter by rotation type (WEEKLY, DAILY, etc.)
- `status`: Filter by status (ACTIVE, PAUSED, INACTIVE)
- `external_system`: Filter by external system integration

**Response:**

```json
{
  "count": 1,
  "results": [
    {
      "id": "uuid",
      "name": "Primary On-Call Rotation",
      "description": "Primary rotation for incident response",
      "rotation_type": "WEEKLY",
      "status": "ACTIVE",
      "team_name": "Incident Response Team",
      "team_description": "Primary team responsible for incidents",
      "schedule_config": {
        "rotation_length_days": 7,
        "handoff_time": "09:00"
      },
      "timezone": "UTC",
      "external_system": "INTERNAL",
      "external_system_id": null,
      "integration_config": {},
      "current_oncall": {
        "user_id": "uuid",
        "username": "john.doe",
        "start_time": "2024-01-08T09:00:00Z",
        "end_time": "2024-01-15T09:00:00Z"
      },
      "created_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```

#### GET /api/sla-oncall/api/v1/oncall-rotations/{id}/current_oncall/

Get the current on-call person for a rotation.

**Response:**

```json
{
  "id": "uuid",
  "rotation": "uuid",
  "rotation_name": "Primary On-Call Rotation",
  "user": "uuid",
  "user_name": "john.doe",
  "user_email": "john.doe@company.com",
  "start_time": "2024-01-08T09:00:00Z",
  "end_time": "2024-01-15T09:00:00Z",
  "status": "ACTIVE",
  "incidents_handled": 5,
  "response_time_avg": "00:15:30"
}
```

#### GET /api/sla-oncall/api/v1/oncall-rotations/{id}/upcoming_assignments/

Get upcoming on-call assignments.

**Query Parameters:**

- `days`: Number of days ahead to look (default: 30)

### SLA Instances

#### GET /api/sla-oncall/api/v1/sla-instances/

List all SLA instances.

**Query Parameters:**

- `status`: Filter by status (ACTIVE, MET, BREACHED, CANCELLED)
- `escalation_triggered`: Filter by escalation status
- `sla_definition`: Filter by SLA definition

**Response:**

```json
{
  "count": 10,
  "results": [
    {
      "id": "uuid",
      "sla_definition": "uuid",
      "sla_definition_name": "Critical Incident Response",
      "incident": "uuid",
      "incident_title": "Database Connection Failure",
      "status": "ACTIVE",
      "target_time": "2024-01-08T15:15:00Z",
      "started_at": "2024-01-08T15:00:00Z",
      "met_at": null,
      "breached_at": null,
      "escalation_policy": "uuid",
      "escalation_triggered": false,
      "escalation_triggered_at": null,
      "escalation_level": 0,
      "response_time": null,
      "resolution_time": null,
      "is_breached": false,
      "time_remaining": "00:12:30",
      "breach_time": "00:00:00",
      "created_at": "2024-01-08T15:00:00Z"
    }
  ]
}
```

#### GET /api/sla-oncall/api/v1/sla-instances/breached/

Get all breached SLA instances.

#### GET /api/sla-oncall/api/v1/sla-instances/at_risk/

Get SLA instances at risk of breaching (within 15 minutes).

#### POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_met/

Mark an SLA instance as met.

**Response:**

```json
{
  "message": "SLA marked as met"
}
```

#### POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_breached/

Mark an SLA instance as breached.

### On-Call Assignments

#### GET /api/sla-oncall/api/v1/oncall-assignments/

List all on-call assignments.

**Query Parameters:**

- `rotation`: Filter by rotation
- `user`: Filter by user
- `status`: Filter by status (SCHEDULED, ACTIVE, COMPLETED, CANCELLED)

#### POST /api/sla-oncall/api/v1/oncall-assignments/

Create a new on-call assignment.

**Request Body:**

```json
{
  "rotation": "uuid",
  "user": "uuid",
  "start_time": "2024-01-15T09:00:00Z",
  "end_time": "2024-01-22T09:00:00Z",
  "handoff_notes": "All systems stable, no pending incidents"
}
```

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/handoff/

Perform on-call handoff.

**Request Body:**

```json
{
  "handoff_notes": "Handing off to next person. 3 active incidents."
}
```

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/activate/

Activate a scheduled assignment.

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/complete/

Complete an active assignment.

### Escalation Policies

#### GET /api/sla-oncall/api/v1/escalation-policies/

List all escalation policies.

**Query Parameters:**

- `escalation_type`: Filter by escalation type
- `trigger_condition`: Filter by trigger condition
- `is_active`: Filter by active status

#### POST /api/sla-oncall/api/v1/escalation-policies/

Create a new escalation policy.

**Request Body:**

```json
{
  "name": "Critical Escalation",
  "description": "Escalation for critical incidents",
  "escalation_type": "TIME_BASED",
  "trigger_condition": "SLA_THRESHOLD",
  "incident_severities": ["CRITICAL", "EMERGENCY"],
  "trigger_delay_minutes": 0,
  "escalation_steps": [
    {
      "level": 1,
      "delay_minutes": 5,
      "actions": ["notify_oncall", "notify_manager"],
      "channels": ["email", "sms"]
    },
    {
      "level": 2,
      "delay_minutes": 15,
      "actions": ["notify_director", "page_oncall"],
      "channels": ["email", "sms", "phone"]
    }
  ],
  "notification_channels": ["email", "sms", "phone"],
  "is_active": true
}
```

### Escalation Instances

#### GET /api/sla-oncall/api/v1/escalation-instances/

List all escalation instances.

**Query Parameters:**

- `status`: Filter by status (PENDING, TRIGGERED, ACKNOWLEDGED, RESOLVED, CANCELLED)
- `escalation_level`: Filter by escalation level
- `escalation_policy`: Filter by escalation policy

#### POST /api/sla-oncall/api/v1/escalation-instances/{id}/acknowledge/

Acknowledge an escalation.

#### POST /api/sla-oncall/api/v1/escalation-instances/{id}/resolve/

Resolve an escalation.

### Notification Templates

#### GET /api/sla-oncall/api/v1/notification-templates/

List all notification templates.

**Query Parameters:**

- `template_type`: Filter by template type (ESCALATION, ONCALL_HANDOFF, etc.)
- `channel_type`: Filter by channel type (EMAIL, SMS, SLACK, etc.)
- `is_active`: Filter by active status

#### POST /api/sla-oncall/api/v1/notification-templates/

Create a new notification template.

**Request Body:**

```json
{
  "name": "Email Escalation Alert",
  "template_type": "ESCALATION",
  "channel_type": "EMAIL",
  "subject_template": "URGENT: Incident #{incident_id} Escalated",
  "body_template": "Incident #{incident_id} has been escalated to Level {escalation_level}. Please respond immediately.",
  "variables": ["incident_id", "incident_title", "escalation_level"],
  "is_active": true,
  "is_default": true
}
```

## Setup and Configuration

### Initial Setup

Run the setup command to create default configurations:

```bash
python manage.py setup_sla_oncall
```

This command creates:

- Default business hours configurations
- Standard SLA definitions for different incident types
- Default escalation policies
- Notification templates
- Sample on-call rotation (if users exist)

### Configuration Examples

#### Business Hours for Different Teams

```python
from datetime import time

# 24/7 Operations
business_hours = BusinessHours.objects.create(
    name='24/7 Operations',
    description='Always business hours',
    timezone='UTC',
    weekday_start=time(0, 0),
    weekday_end=time(23, 59),
    weekend_start=time(0, 0),
    weekend_end=time(23, 59),
)

# EMEA Business Hours
business_hours = BusinessHours.objects.create(
    name='EMEA Business Hours',
    description='EMEA timezone business hours',
    timezone='Europe/London',
    weekday_start=time(9, 0),
    weekday_end=time(17, 0),
    weekend_start=time(10, 0),
    weekend_end=time(16, 0),
    holiday_calendar=['2024-12-25', '2024-01-01', '2024-04-19'],
)
```

#### SLA Definitions

```python
# Critical incidents - 15 minute response
critical_sla = SLADefinition.objects.create(
    name='Critical Incident Response',
    description='SLA for critical and emergency incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    incident_priorities=['P1'],
    target_duration_minutes=15,
    business_hours_only=False,
    escalation_enabled=True,
    escalation_threshold_percent=75.0,
)

# Medium incidents - 2 hour response during business hours
# (business_hours is a BusinessHours instance, such as the EMEA one above)
medium_sla = SLADefinition.objects.create(
    name='Medium Priority Response',
    description='SLA for medium priority incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['MEDIUM'],
    incident_priorities=['P3'],
    target_duration_minutes=120,
    business_hours_only=True,
    business_hours=business_hours,
    escalation_enabled=True,
    escalation_threshold_percent=85.0,
)
```

#### Escalation Policies

```python
# Critical escalation policy
escalation_policy = EscalationPolicy.objects.create(
    name='Critical Incident Escalation',
    description='Escalation for critical incidents',
    escalation_type='TIME_BASED',
    trigger_condition='SLA_THRESHOLD',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    trigger_delay_minutes=0,
    escalation_steps=[
        {
            'level': 1,
            'delay_minutes': 5,
            'actions': ['notify_oncall', 'notify_manager'],
            'channels': ['email', 'sms']
        },
        {
            'level': 2,
            'delay_minutes': 15,
            'actions': ['notify_director', 'page_oncall'],
            'channels': ['email', 'sms', 'phone']
        },
        {
            'level': 3,
            'delay_minutes': 30,
            'actions': ['notify_executive', 'escalate_to_vendor'],
            'channels': ['email', 'phone', 'webhook']
        }
    ],
    notification_channels=['email', 'sms', 'phone'],
)
```

#### On-Call Rotations

```python
from datetime import timedelta

from django.utils import timezone

# Weekly rotation
rotation = OnCallRotation.objects.create(
    name='Primary On-Call Rotation',
    description='Primary rotation for incident response',
    rotation_type='WEEKLY',
    team_name='Incident Response Team',
    schedule_config={
        'rotation_length_days': 7,
        'handoff_time': '09:00',
        'timezone': 'UTC'
    },
    timezone='UTC',
)

# Create assignments (user is an existing user instance)
assignment = OnCallAssignment.objects.create(
    rotation=rotation,
    user=user,
    start_time=timezone.now(),
    end_time=timezone.now() + timedelta(days=7),
    status='ACTIVE'
)
```

## Integration with Other Modules

### Incident Intelligence Integration

The SLA module automatically creates SLA instances when incidents are created:

```python
# When an incident is created, applicable SLA definitions are found
# and SLA instances are automatically created
incident = Incident.objects.create(
    title='Database Connection Failure',
    description='Unable to connect to primary database',
    severity='CRITICAL',
    category='DATABASE',
    reporter=user,
)
# This automatically triggers SLA instance creation via signals
```

### Automation Orchestration Integration

SLA breaches can trigger automation workflows:

```python
from django.db import models

# In automation_orchestration models, you can reference SLA instances
class RunbookExecution(models.Model):
    # ... existing fields ...
    sla_instance = models.ForeignKey(
        'sla_oncall.SLAInstance',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='runbook_executions'
    )
```

### Security Integration

On-call assignments respect security clearances:

```python
# Users with appropriate clearance levels can be assigned to sensitive incidents
if user.clearance_level.level >= incident.get_required_clearance_level():
    # User can be assigned to this incident
    assignment = OnCallAssignment.objects.create(...)
```

## Monitoring and Alerting

### SLA Breach Monitoring

Monitor SLA instances for breaches:

```python
from datetime import timedelta

from django.utils import timezone

# Get all breached SLAs
breached_slas = SLAInstance.objects.filter(status='BREACHED')

# Get SLAs at risk (within 15 minutes of breach)
warning_time = timezone.now() + timedelta(minutes=15)
at_risk_slas = SLAInstance.objects.filter(
    status='ACTIVE',
    target_time__lte=warning_time
)
```

### Escalation Monitoring

Monitor active escalations:

```python
# Get all active escalations
active_escalations = EscalationInstance.objects.filter(
    status__in=['PENDING', 'TRIGGERED']
)

# Get escalations by level
level_2_escalations = EscalationInstance.objects.filter(
    escalation_level=2,
    status='TRIGGERED'
)
```

### Performance Metrics

Track on-call performance:

```python
from django.db.models import Avg
from django.utils import timezone

# Get current on-call assignments
current_assignments = OnCallAssignment.objects.filter(
    status='ACTIVE',
    start_time__lte=timezone.now(),
    end_time__gte=timezone.now()
)

# Calculate average response time
avg_response_time = current_assignments.aggregate(
    avg_response=Avg('response_time_avg')
)['avg_response']
```

## Best Practices

### SLA Definition Best Practices

1. **Start Simple**: Begin with basic SLAs and add complexity as needed
2. **Business Hours Consideration**: Use business hours for non-critical incidents
3. **Escalation Thresholds**: Set escalation thresholds at 75-85% of SLA time
4. **Regular Review**: Review and adjust SLAs based on performance data

### On-Call Management Best Practices

1. **Clear Handoffs**: Use structured handoff processes with notes
2. **Rotation Length**: Keep rotations between 1 and 2 weeks for optimal coverage
3. **Backup Coverage**: Always have backup on-call personnel
4. **Training**: Ensure on-call personnel are properly trained

### Escalation Best Practices

1. **Progressive Escalation**: Use multiple levels with increasing urgency
2. **Clear Actions**: Define specific actions for each escalation level
3. **Multiple Channels**: Use multiple notification channels for critical escalations
4. **Documentation**: Document all escalation actions and outcomes

## Troubleshooting

### Common Issues

1. **SLA Not Created**: Check if SLA definition criteria match incident attributes
2. **Escalation Not Triggered**: Verify escalation policy is active and criteria match
3. **On-Call Not Found**: Ensure active assignments exist for the rotation
4. **Business Hours Issues**: Verify timezone configuration and business hours setup

### Debugging Commands

```bash
# Check SLA instances for a specific incident
python manage.py shell
>>> incident = Incident.objects.get(id='incident-id')
>>> sla_instances = incident.sla_instances.all()
>>> for sla in sla_instances:
...     print(f"SLA: {sla.sla_definition.name}, Status: {sla.status}")

# Check current on-call for a rotation
>>> rotation = OnCallRotation.objects.get(id='rotation-id')
>>> current = rotation.get_current_oncall()
>>> print(f"Current on-call: {current.user.username if current else 'None'}")

# Check business hours
>>> from django.utils import timezone
>>> business_hours = BusinessHours.objects.get(id='business-hours-id')
>>> now = timezone.now()
>>> print(f"Is business hours: {business_hours.is_business_hours(now)}")
```
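
When business hours results look wrong, it can help to reproduce the check outside Django. The following is a minimal sketch, not the module's actual implementation: `BusinessHoursConfig` and the standalone `is_business_hours` function are hypothetical names that mirror the fields of the business hours payload, and two behaviors are assumptions made for illustration (holidays count as fully closed, and the end time is exclusive). Day overrides are not handled.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from zoneinfo import ZoneInfo


@dataclass
class BusinessHoursConfig:
    """Hypothetical stand-in that mirrors the fields of the API payload."""
    timezone: str = "UTC"
    weekday_start: time = time(9, 0)
    weekday_end: time = time(17, 0)
    weekend_start: time = time(10, 0)
    weekend_end: time = time(16, 0)
    holiday_calendar: list[str] = field(default_factory=list)  # ISO dates


def is_business_hours(config: BusinessHoursConfig, moment: datetime) -> bool:
    """Return True if the timezone-aware moment falls inside the configured hours."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")

    # Convert to the configuration's timezone before comparing wall-clock times.
    local = moment.astimezone(ZoneInfo(config.timezone))

    # Assumption: a holiday is entirely outside business hours.
    if local.date().isoformat() in config.holiday_calendar:
        return False

    # weekday() returns 0 for Monday through 6 for Sunday.
    if local.weekday() >= 5:
        start, end = config.weekend_start, config.weekend_end
    else:
        start, end = config.weekday_start, config.weekday_end

    # Assumption: start is inclusive, end is exclusive.
    return start <= local.time() < end


# Same values as the test_business_hours request example (a Monday, 14:30 UTC).
standard = BusinessHoursConfig()
test_time = datetime(2024, 1, 8, 14, 30, tzinfo=ZoneInfo("UTC"))
print(is_business_hours(standard, test_time))  # prints True
```

Converting to the configured timezone first is the step most often missed: the same UTC instant can fall inside business hours for a `UTC` configuration and outside them for an `America/New_York` one.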