Updates
ETB-API/sla_oncall/Documentations/SLA_ONCALL_API.md
# SLA & On-Call Management API Documentation

## Overview

The SLA & On-Call Management module provides comprehensive Service Level Agreement (SLA) tracking, escalation policies, and on-call rotation management for enterprise incident management systems.

## Features

### Dynamic SLAs

- **Incident Type-Based SLAs**: Different SLA targets based on incident category, severity, and priority
- **Business Hours Support**: SLA calculations that respect business hours and timezones
- **Multiple SLA Types**: Response time, resolution time, acknowledgment time, and first response time
- **Automatic SLA Instance Creation**: SLAs are automatically created when incidents are reported

### Escalation Policies

- **Multi-Level Escalation**: Configurable escalation steps with different actions and timing
- **Condition-Based Triggering**: Escalations triggered by SLA breaches, thresholds, or custom conditions
- **Multi-Channel Notifications**: Email, SMS, Slack, Teams, and webhook notifications
- **Integration with On-Call**: Automatic escalation to current on-call personnel

### On-Call Rotation Management

- **Flexible Scheduling**: Weekly, daily, monthly, and custom rotation schedules
- **External System Integration**: Built-in support for PagerDuty and OpsGenie
- **Handoff Management**: Structured handoff processes with notes and tracking
- **Performance Metrics**: Track incident handling and response times

### Business Hours Management

- **Timezone Support**: Multiple timezone configurations
- **Holiday Calendar**: Holiday and special day handling
- **Day Overrides**: Custom hours for specific dates
- **Weekend Configuration**: Separate weekend business hours

## API Endpoints

### Business Hours Management

#### GET /api/sla-oncall/api/v1/business-hours/

List all business hours configurations.

**Query Parameters:**

- `is_active`: Filter by active status
- `is_default`: Filter by default status
- `timezone`: Filter by timezone
- `search`: Search by name or description

**Response:**

```json
{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "uuid",
      "name": "Standard Business Hours",
      "description": "Standard 9-5 business hours",
      "timezone": "UTC",
      "weekday_start": "09:00:00",
      "weekday_end": "17:00:00",
      "weekend_start": "10:00:00",
      "weekend_end": "16:00:00",
      "day_overrides": {},
      "holiday_calendar": [],
      "is_active": true,
      "is_default": true,
      "created_at": "2024-01-01T00:00:00Z",
      "updated_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```
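
The list endpoints in this document return a `count` / `next` / `previous` / `results` envelope and accept filters as query parameters. The following is a minimal client-side sketch of building a filtered list URL and following `next` links. The host `https://etb.example.com`, the helper names `list_url` and `iter_results`, and the lowercase `true`/`false` encoding of boolean filters are assumptions for illustration, not part of the documented API; authentication is omitted because this document does not describe it.

```python
from urllib.parse import urlencode

API_PREFIX = "/api/sla-oncall/api/v1"  # path prefix taken from the endpoints above


def list_url(base_url, resource, **filters):
    """Build a list URL such as .../business-hours/?is_active=true (hypothetical helper)."""
    # Assumption: boolean filters are sent as lowercase "true"/"false".
    params = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in filters.items()
    }
    query = urlencode(params)
    url = f"{base_url}{API_PREFIX}/{resource}/"
    return f"{url}?{query}" if query else url


def iter_results(fetch_json, url):
    """Yield every item across pages by following the `next` link until it is null."""
    while url:
        page = fetch_json(url)  # fetch_json: any callable that returns the parsed JSON body
        yield from page["results"]
        url = page.get("next")


# Demonstration with an in-memory stand-in for HTTP, so the sketch runs offline.
first = list_url("https://etb.example.com", "business-hours", is_active=True)
fake_pages = {
    first: {"count": 2, "next": first + "&page=2", "results": [{"name": "Standard Business Hours"}]},
    first + "&page=2": {"count": 2, "next": None, "results": [{"name": "Custom Business Hours"}]},
}
names = [item["name"] for item in iter_results(fake_pages.__getitem__, first)]
print(names)
```

In a real client, `fetch_json` would wrap whatever HTTP library and credentials the deployment uses.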

#### POST /api/sla-oncall/api/v1/business-hours/

Create a new business hours configuration.

**Request Body:**

```json
{
  "name": "Custom Business Hours",
  "description": "Custom business hours for special team",
  "timezone": "America/New_York",
  "weekday_start": "08:00:00",
  "weekday_end": "18:00:00",
  "weekend_start": "10:00:00",
  "weekend_end": "16:00:00",
  "holiday_calendar": ["2024-12-25", "2024-01-01"],
  "is_active": true
}
```

#### POST /api/sla-oncall/api/v1/business-hours/{id}/test_business_hours/

Test whether a given time falls within the business hours of the configuration identified by `{id}`.

**Request Body:**

```json
{
  "test_time": "2024-01-08T14:30:00Z"
}
```

**Response:**

```json
{
  "is_business_hours": true,
  "test_time": "2024-01-08T14:30:00Z"
}
```
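
To make the check concrete, here is a minimal sketch of how such a test could be evaluated from the fields of a business hours configuration. It is not the module's implementation: the function name, the half-open `start <= t < end` window, and the choice to treat every date in `holiday_calendar` as outside business hours are assumptions, and `day_overrides` is ignored.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo


def is_business_hours(moment, tz_name, weekday_hours, weekend_hours, holidays=()):
    """Return True if the timezone-aware `moment` falls inside the configured window.

    weekday_hours and weekend_hours are (start, end) tuples of datetime.time values.
    """
    # Convert to the configuration's timezone before comparing wall-clock times.
    local = moment.astimezone(ZoneInfo(tz_name))
    # Assumption: holidays are never business hours.
    if local.date().isoformat() in holidays:
        return False
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    start, end = weekend_hours if local.weekday() >= 5 else weekday_hours
    return start <= local.time() < end


# 2024-01-08 is a Monday; 14:30 UTC lies inside the 09:00-17:00 UTC weekday window.
test_time = datetime.fromisoformat("2024-01-08T14:30:00+00:00")
print(is_business_hours(test_time, "UTC", (time(9), time(17)), (time(10), time(16))))
```

Converting to the configured timezone first matters: the same instant can be inside the window for a `UTC` configuration and outside it for an `America/New_York` one.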

### SLA Definitions

#### GET /api/sla-oncall/api/v1/sla-definitions/

List all SLA definitions.

**Query Parameters:**

- `sla_type`: Filter by SLA type (RESPONSE_TIME, RESOLUTION_TIME, etc.)
- `is_active`: Filter by active status
- `business_hours_only`: Filter by business hours requirement

**Response:**

```json
{
  "count": 3,
  "results": [
    {
      "id": "uuid",
      "name": "Critical Incident Response",
      "description": "SLA for critical incidents",
      "sla_type": "RESPONSE_TIME",
      "incident_categories": ["SYSTEM", "NETWORK"],
      "incident_severities": ["CRITICAL", "EMERGENCY"],
      "incident_priorities": ["P1"],
      "target_duration_minutes": 15,
      "business_hours_only": false,
      "business_hours": null,
      "business_hours_name": null,
      "escalation_enabled": true,
      "escalation_threshold_percent": 75.0,
      "is_active": true,
      "is_default": false,
      "created_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```

#### POST /api/sla-oncall/api/v1/sla-definitions/

Create a new SLA definition.

**Request Body:**

```json
{
  "name": "High Priority Response",
  "description": "SLA for high priority incidents",
  "sla_type": "RESPONSE_TIME",
  "incident_severities": ["HIGH"],
  "incident_priorities": ["P2"],
  "target_duration_minutes": 30,
  "business_hours_only": false,
  "escalation_enabled": true,
  "escalation_threshold_percent": 80.0,
  "is_active": true
}
```

#### POST /api/sla-oncall/api/v1/sla-definitions/{id}/test_applicability/

Test whether the SLA definition identified by `{id}` applies to an incident with the given attributes.

**Request Body:**

```json
{
  "category": "SYSTEM",
  "severity": "HIGH",
  "priority": "P2"
}
```

**Response:**

```json
{
  "applies": true,
  "incident_data": {
    "category": "SYSTEM",
    "severity": "HIGH",
    "priority": "P2"
  }
}
```
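
One plausible reading of the matching rule is sketched below: each of the three criteria lists must contain the incident's value, and an empty or missing list matches anything. That second rule is an assumption; it is consistent with the "High Priority Response" request body above, which omits `incident_categories`, but this document does not state it. The function name `sla_applies` is hypothetical.

```python
# Pairs of (SLA definition field, incident attribute) compared during matching.
CRITERIA = (
    ("incident_categories", "category"),
    ("incident_severities", "severity"),
    ("incident_priorities", "priority"),
)


def sla_applies(definition, incident):
    """Return True if every non-empty criteria list contains the incident's value."""
    for list_field, attribute in CRITERIA:
        allowed = definition.get(list_field) or []
        # Assumption: an empty or missing list places no restriction on that attribute.
        if allowed and incident.get(attribute) not in allowed:
            return False
    return True


high_priority_sla = {
    "incident_severities": ["HIGH"],
    "incident_priorities": ["P2"],
}
incident = {"category": "SYSTEM", "severity": "HIGH", "priority": "P2"}
print(sla_applies(high_priority_sla, incident))
```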

### On-Call Rotations

#### GET /api/sla-oncall/api/v1/oncall-rotations/

List all on-call rotations.

**Query Parameters:**

- `rotation_type`: Filter by rotation type (WEEKLY, DAILY, etc.)
- `status`: Filter by status (ACTIVE, PAUSED, INACTIVE)
- `external_system`: Filter by external system integration

**Response:**

```json
{
  "count": 1,
  "results": [
    {
      "id": "uuid",
      "name": "Primary On-Call Rotation",
      "description": "Primary rotation for incident response",
      "rotation_type": "WEEKLY",
      "status": "ACTIVE",
      "team_name": "Incident Response Team",
      "team_description": "Primary team responsible for incidents",
      "schedule_config": {
        "rotation_length_days": 7,
        "handoff_time": "09:00"
      },
      "timezone": "UTC",
      "external_system": "INTERNAL",
      "external_system_id": null,
      "integration_config": {},
      "current_oncall": {
        "user_id": "uuid",
        "username": "john.doe",
        "start_time": "2024-01-08T09:00:00Z",
        "end_time": "2024-01-15T09:00:00Z"
      },
      "created_at": "2024-01-01T00:00:00Z"
    }
  ]
}
```

#### GET /api/sla-oncall/api/v1/oncall-rotations/{id}/current_oncall/

Get the current on-call person for a rotation.

**Response:**

```json
{
  "id": "uuid",
  "rotation": "uuid",
  "rotation_name": "Primary On-Call Rotation",
  "user": "uuid",
  "user_name": "john.doe",
  "user_email": "john.doe@company.com",
  "start_time": "2024-01-08T09:00:00Z",
  "end_time": "2024-01-15T09:00:00Z",
  "status": "ACTIVE",
  "incidents_handled": 5,
  "response_time_avg": "00:15:30"
}
```

#### GET /api/sla-oncall/api/v1/oncall-rotations/{id}/upcoming_assignments/

Get upcoming on-call assignments.

**Query Parameters:**

- `days`: Number of days ahead to look (default: 30)
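
As an illustration of what a `days` look-ahead covers, the sketch below lays out a simple round-robin schedule from a `rotation_length_days` value like the one in `schedule_config`. The endpoint itself presumably returns stored assignments rather than generating them, so treat this only as a planning aid; the function name, the round-robin ordering, and the usernames are assumptions.

```python
from datetime import datetime, timedelta, timezone


def plan_assignments(users, first_handoff, rotation_length_days=7, days=30):
    """Lay out back-to-back shifts starting at first_handoff, cycling through users.

    Only shifts that start within `days` of first_handoff are included.
    """
    horizon = first_handoff + timedelta(days=days)
    shift = timedelta(days=rotation_length_days)
    assignments = []
    start = first_handoff
    index = 0
    while start < horizon:
        assignments.append({
            "user": users[index % len(users)],  # round-robin over the team
            "start_time": start,
            "end_time": start + shift,
        })
        start += shift
        index += 1
    return assignments


first_handoff = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
plan = plan_assignments(["john.doe", "jane.roe"], first_handoff)
for entry in plan:
    print(entry["user"], entry["start_time"].isoformat(), entry["end_time"].isoformat())
```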

### SLA Instances

#### GET /api/sla-oncall/api/v1/sla-instances/

List all SLA instances.

**Query Parameters:**

- `status`: Filter by status (ACTIVE, MET, BREACHED, CANCELLED)
- `escalation_triggered`: Filter by escalation status
- `sla_definition`: Filter by SLA definition

**Response:**

```json
{
  "count": 10,
  "results": [
    {
      "id": "uuid",
      "sla_definition": "uuid",
      "sla_definition_name": "Critical Incident Response",
      "incident": "uuid",
      "incident_title": "Database Connection Failure",
      "status": "ACTIVE",
      "target_time": "2024-01-08T15:15:00Z",
      "started_at": "2024-01-08T15:00:00Z",
      "met_at": null,
      "breached_at": null,
      "escalation_policy": "uuid",
      "escalation_triggered": false,
      "escalation_triggered_at": null,
      "escalation_level": 0,
      "response_time": null,
      "resolution_time": null,
      "is_breached": false,
      "time_remaining": "00:12:30",
      "breach_time": "00:00:00",
      "created_at": "2024-01-08T15:00:00Z"
    }
  ]
}
```

#### GET /api/sla-oncall/api/v1/sla-instances/breached/

Get all breached SLA instances.

#### GET /api/sla-oncall/api/v1/sla-instances/at_risk/

Get active SLA instances at risk of breaching, that is, instances whose target time falls within the next 15 minutes.
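
The timing fields of an SLA instance can be related to each other with a short worked example. The sketch below assumes a wall-clock SLA (`business_hours_only` is false), assumes `escalation_threshold_percent` is the percentage of the target duration that may elapse before escalation, and uses the same 15-minute window for "at risk". The function name `sla_timing` and the `escalate_at` key are hypothetical.

```python
from datetime import datetime, timedelta, timezone

AT_RISK_WINDOW = timedelta(minutes=15)


def sla_timing(started_at, target_duration_minutes, escalation_threshold_percent, now):
    """Derive deadline-related values for a wall-clock SLA (business hours ignored)."""
    target_time = started_at + timedelta(minutes=target_duration_minutes)
    # Assumption: escalation fires once this share of the target duration has elapsed.
    escalate_at = started_at + timedelta(
        minutes=target_duration_minutes * escalation_threshold_percent / 100
    )
    return {
        "target_time": target_time,
        "escalate_at": escalate_at,
        "time_remaining": max(target_time - now, timedelta(0)),
        "at_risk": now < target_time <= now + AT_RISK_WINDOW,
        "is_breached": now >= target_time,
    }


# Values from the sample SLA instance: 15-minute target started at 15:00 UTC,
# 75% escalation threshold, evaluated 2.5 minutes after the start.
started_at = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
now = started_at + timedelta(minutes=2, seconds=30)
timing = sla_timing(started_at, 15, 75.0, now)
print(timing["target_time"].isoformat())  # 2024-01-08T15:15:00+00:00
print(timing["time_remaining"])           # 0:12:30
print(timing["escalate_at"].isoformat())  # 2024-01-08T15:11:15+00:00
```

With a 15-minute target, every active instance is inside the 15-minute at-risk window from the moment it starts, which is worth keeping in mind when alerting on the `at_risk` endpoint for short SLAs.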

#### POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_met/

Mark an SLA instance as met.

**Response:**

```json
{
  "message": "SLA marked as met"
}
```

#### POST /api/sla-oncall/api/v1/sla-instances/{id}/mark_breached/

Mark an SLA instance as breached.

### On-Call Assignments

#### GET /api/sla-oncall/api/v1/oncall-assignments/

List all on-call assignments.

**Query Parameters:**

- `rotation`: Filter by rotation
- `user`: Filter by user
- `status`: Filter by status (SCHEDULED, ACTIVE, COMPLETED, CANCELLED)

#### POST /api/sla-oncall/api/v1/oncall-assignments/

Create a new on-call assignment.

**Request Body:**

```json
{
  "rotation": "uuid",
  "user": "uuid",
  "start_time": "2024-01-15T09:00:00Z",
  "end_time": "2024-01-22T09:00:00Z",
  "handoff_notes": "All systems stable, no pending incidents"
}
```

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/handoff/

Perform on-call handoff.

**Request Body:**

```json
{
  "handoff_notes": "Handing off to next person. 3 active incidents."
}
```

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/activate/

Activate a scheduled assignment.

#### POST /api/sla-oncall/api/v1/oncall-assignments/{id}/complete/

Complete an active assignment.

### Escalation Policies

#### GET /api/sla-oncall/api/v1/escalation-policies/

List all escalation policies.

**Query Parameters:**

- `escalation_type`: Filter by escalation type
- `trigger_condition`: Filter by trigger condition
- `is_active`: Filter by active status

#### POST /api/sla-oncall/api/v1/escalation-policies/

Create a new escalation policy.

**Request Body:**

```json
{
  "name": "Critical Escalation",
  "description": "Escalation for critical incidents",
  "escalation_type": "TIME_BASED",
  "trigger_condition": "SLA_THRESHOLD",
  "incident_severities": ["CRITICAL", "EMERGENCY"],
  "trigger_delay_minutes": 0,
  "escalation_steps": [
    {
      "level": 1,
      "delay_minutes": 5,
      "actions": ["notify_oncall", "notify_manager"],
      "channels": ["email", "sms"]
    },
    {
      "level": 2,
      "delay_minutes": 15,
      "actions": ["notify_director", "page_oncall"],
      "channels": ["email", "sms", "phone"]
    }
  ],
  "notification_channels": ["email", "sms", "phone"],
  "is_active": true
}
```
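
A small sketch can show how `escalation_steps` might be evaluated over time. It assumes each step's `delay_minutes` is counted from the moment the policy is triggered rather than from the previous level, and that level 0 means no step is due yet; neither rule is spelled out in this document, and the function names are hypothetical.

```python
def current_escalation_level(escalation_steps, minutes_since_trigger):
    """Return the highest level whose delay has elapsed, or 0 if none has."""
    due_levels = [
        step["level"]
        for step in escalation_steps
        # Assumption: delays are measured from the trigger time, not cumulatively.
        if step["delay_minutes"] <= minutes_since_trigger
    ]
    return max(due_levels, default=0)


def channels_for_level(escalation_steps, level):
    """Return the notification channels configured for the given level."""
    for step in escalation_steps:
        if step["level"] == level:
            return step["channels"]
    return []


steps = [
    {"level": 1, "delay_minutes": 5, "actions": ["notify_oncall", "notify_manager"],
     "channels": ["email", "sms"]},
    {"level": 2, "delay_minutes": 15, "actions": ["notify_director", "page_oncall"],
     "channels": ["email", "sms", "phone"]},
]

for minutes in (3, 10, 20):
    level = current_escalation_level(steps, minutes)
    print(minutes, level, channels_for_level(steps, level))
```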

### Escalation Instances

#### GET /api/sla-oncall/api/v1/escalation-instances/

List all escalation instances.

**Query Parameters:**

- `status`: Filter by status (PENDING, TRIGGERED, ACKNOWLEDGED, RESOLVED, CANCELLED)
- `escalation_level`: Filter by escalation level
- `escalation_policy`: Filter by escalation policy

#### POST /api/sla-oncall/api/v1/escalation-instances/{id}/acknowledge/

Acknowledge an escalation.

#### POST /api/sla-oncall/api/v1/escalation-instances/{id}/resolve/

Resolve an escalation.

### Notification Templates

#### GET /api/sla-oncall/api/v1/notification-templates/

List all notification templates.

**Query Parameters:**

- `template_type`: Filter by template type (ESCALATION, ONCALL_HANDOFF, etc.)
- `channel_type`: Filter by channel type (EMAIL, SMS, SLACK, etc.)
- `is_active`: Filter by active status

#### POST /api/sla-oncall/api/v1/notification-templates/

Create a new notification template.

**Request Body:**

```json
{
  "name": "Email Escalation Alert",
  "template_type": "ESCALATION",
  "channel_type": "EMAIL",
  "subject_template": "URGENT: Incident #{incident_id} Escalated",
  "body_template": "Incident #{incident_id} has been escalated to Level {escalation_level}. Please respond immediately.",
  "variables": ["incident_id", "incident_title", "escalation_level"],
  "is_active": true,
  "is_default": true
}
```
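
The `{name}` placeholders in the template strings look like Python `str.format` fields, with the `#` in `#{incident_id}` being literal text. Under that assumption (this document does not say which template engine the module uses), the sketch below checks a template against its declared `variables` and renders it. The function name `render_template` and the sample incident ID are hypothetical.

```python
import string


def render_template(template, variables, context):
    """Render a str.format-style template after validating its placeholders."""
    # string.Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    used = {field for _, field, _, _ in string.Formatter().parse(template) if field}
    undeclared = used - set(variables)
    if undeclared:
        raise ValueError(f"placeholders not declared in variables: {sorted(undeclared)}")
    # Context keys that the template does not use are simply ignored.
    return template.format(**context)


variables = ["incident_id", "incident_title", "escalation_level"]
context = {
    "incident_id": "INC-1042",
    "incident_title": "Database Connection Failure",
    "escalation_level": 2,
}
subject = render_template("URGENT: Incident #{incident_id} Escalated", variables, context)
print(subject)  # URGENT: Incident #INC-1042 Escalated
```

Validating placeholders against `variables` when a template is saved would surface typos before an escalation tries to send the message.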

## Setup and Configuration

### Initial Setup

Run the setup command to create default configurations:

```bash
python manage.py setup_sla_oncall
```

This command creates:

- Default business hours configurations
- Standard SLA definitions for different incident types
- Default escalation policies
- Notification templates
- Sample on-call rotation (if users exist)

### Configuration Examples

#### Business Hours for Different Teams

```python
from datetime import time

from sla_oncall.models import BusinessHours  # import path assumed from the app label

# 24/7 Operations
always_on_hours = BusinessHours.objects.create(
    name='24/7 Operations',
    description='Always business hours',
    timezone='UTC',
    weekday_start=time(0, 0),
    weekday_end=time(23, 59),
    weekend_start=time(0, 0),
    weekend_end=time(23, 59),
)

# EMEA Business Hours
business_hours = BusinessHours.objects.create(
    name='EMEA Business Hours',
    description='EMEA timezone business hours',
    timezone='Europe/London',
    weekday_start=time(9, 0),
    weekday_end=time(17, 0),
    weekend_start=time(10, 0),
    weekend_end=time(16, 0),
    holiday_calendar=['2024-12-25', '2024-01-01', '2024-04-19'],
)
```

#### SLA Definitions

```python
from sla_oncall.models import SLADefinition  # import path assumed from the app label

# Critical incidents - 15 minute response
critical_sla = SLADefinition.objects.create(
    name='Critical Incident Response',
    description='SLA for critical and emergency incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    incident_priorities=['P1'],
    target_duration_minutes=15,
    business_hours_only=False,
    escalation_enabled=True,
    escalation_threshold_percent=75.0,
)

# Medium incidents - 2 hour response during business hours
medium_sla = SLADefinition.objects.create(
    name='Medium Priority Response',
    description='SLA for medium priority incidents',
    sla_type='RESPONSE_TIME',
    incident_severities=['MEDIUM'],
    incident_priorities=['P3'],
    target_duration_minutes=120,
    business_hours_only=True,
    business_hours=business_hours,  # a BusinessHours instance, e.g. the EMEA one created above
    escalation_enabled=True,
    escalation_threshold_percent=85.0,
)
```

#### Escalation Policies

```python
from sla_oncall.models import EscalationPolicy  # import path assumed from the app label

# Critical escalation policy
escalation_policy = EscalationPolicy.objects.create(
    name='Critical Incident Escalation',
    description='Escalation for critical incidents',
    escalation_type='TIME_BASED',
    trigger_condition='SLA_THRESHOLD',
    incident_severities=['CRITICAL', 'EMERGENCY'],
    trigger_delay_minutes=0,
    escalation_steps=[
        {
            'level': 1,
            'delay_minutes': 5,
            'actions': ['notify_oncall', 'notify_manager'],
            'channels': ['email', 'sms'],
        },
        {
            'level': 2,
            'delay_minutes': 15,
            'actions': ['notify_director', 'page_oncall'],
            'channels': ['email', 'sms', 'phone'],
        },
        {
            'level': 3,
            'delay_minutes': 30,
            'actions': ['notify_executive', 'escalate_to_vendor'],
            'channels': ['email', 'phone', 'webhook'],
        },
    ],
    notification_channels=['email', 'sms', 'phone'],
)
```

#### On-Call Rotations

```python
from datetime import timedelta

from django.utils import timezone

from sla_oncall.models import OnCallAssignment, OnCallRotation  # import path assumed

# Weekly rotation
rotation = OnCallRotation.objects.create(
    name='Primary On-Call Rotation',
    description='Primary rotation for incident response',
    rotation_type='WEEKLY',
    team_name='Incident Response Team',
    schedule_config={
        'rotation_length_days': 7,
        'handoff_time': '09:00',
        'timezone': 'UTC',
    },
    timezone='UTC',
)

# Create assignments (`user` is an existing user instance)
now = timezone.now()  # read the clock once so the shift is exactly 7 days long
assignment = OnCallAssignment.objects.create(
    rotation=rotation,
    user=user,
    start_time=now,
    end_time=now + timedelta(days=7),
    status='ACTIVE',
)
```

## Integration with Other Modules

### Incident Intelligence Integration

The SLA module automatically creates SLA instances when incidents are created:

```python
# When an incident is created, applicable SLA definitions are found
# and SLA instances are automatically created
incident = Incident.objects.create(
    title='Database Connection Failure',
    description='Unable to connect to primary database',
    severity='CRITICAL',
    category='DATABASE',
    reporter=user,
)

# This automatically triggers SLA instance creation via signals
```

### Automation Orchestration Integration

SLA breaches can trigger automation workflows:

```python
from django.db import models


# In automation_orchestration models, you can reference SLA instances
class RunbookExecution(models.Model):
    # ... existing fields ...
    sla_instance = models.ForeignKey(
        'sla_oncall.SLAInstance',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='runbook_executions',
    )
```

### Security Integration

On-call assignments respect security clearances:

```python
# Users with appropriate clearance levels can be assigned to sensitive incidents
if user.clearance_level.level >= incident.get_required_clearance_level():
    # User can be assigned to this incident
    assignment = OnCallAssignment.objects.create(...)
```

## Monitoring and Alerting

### SLA Breach Monitoring

Monitor SLA instances for breaches:

```python
from datetime import timedelta

from django.utils import timezone

from sla_oncall.models import SLAInstance  # import path assumed from the app label

# Get all breached SLAs
breached_slas = SLAInstance.objects.filter(status='BREACHED')

# Get SLAs at risk (within 15 minutes of breach).
# Active instances already past their target time also match this filter.
warning_time = timezone.now() + timedelta(minutes=15)
at_risk_slas = SLAInstance.objects.filter(
    status='ACTIVE',
    target_time__lte=warning_time,
)
```

### Escalation Monitoring

Monitor active escalations:

```python
# Get all active escalations
active_escalations = EscalationInstance.objects.filter(
    status__in=['PENDING', 'TRIGGERED']
)

# Get escalations by level
level_2_escalations = EscalationInstance.objects.filter(
    escalation_level=2,
    status='TRIGGERED'
)
```

### Performance Metrics

Track on-call performance:

```python
from django.db.models import Avg
from django.utils import timezone

from sla_oncall.models import OnCallAssignment  # import path assumed from the app label

# Get current on-call assignments
now = timezone.now()
current_assignments = OnCallAssignment.objects.filter(
    status='ACTIVE',
    start_time__lte=now,
    end_time__gte=now,
)

# Calculate average response time across those assignments
avg_response_time = current_assignments.aggregate(
    avg_response=Avg('response_time_avg')
)['avg_response']
```

## Best Practices

### SLA Definition Best Practices

1. **Start Simple**: Begin with basic SLAs and add complexity as needed
2. **Business Hours Consideration**: Use business hours for non-critical incidents
3. **Escalation Thresholds**: Set escalation thresholds at 75-85% of SLA time
4. **Regular Review**: Review and adjust SLAs based on performance data

### On-Call Management Best Practices

1. **Clear Handoffs**: Use structured handoff processes with notes
2. **Rotation Length**: Keep rotations between one and two weeks for optimal coverage
3. **Backup Coverage**: Always have backup on-call personnel
4. **Training**: Ensure on-call personnel are properly trained

### Escalation Best Practices

1. **Progressive Escalation**: Use multiple levels with increasing urgency
2. **Clear Actions**: Define specific actions for each escalation level
3. **Multiple Channels**: Use multiple notification channels for critical escalations
4. **Documentation**: Document all escalation actions and outcomes

## Troubleshooting

### Common Issues

1. **SLA Not Created**: Check if SLA definition criteria match incident attributes
2. **Escalation Not Triggered**: Verify escalation policy is active and criteria match
3. **On-Call Not Found**: Ensure active assignments exist for the rotation
4. **Business Hours Issues**: Verify timezone configuration and business hours setup

### Debugging Commands

```bash
# Open a Django shell; the >>> lines below are entered at its prompt
# (import the models and django.utils.timezone first)
python manage.py shell

# Check SLA instances for a specific incident
>>> incident = Incident.objects.get(id='incident-id')
>>> sla_instances = incident.sla_instances.all()
>>> for sla in sla_instances:
...     print(f"SLA: {sla.sla_definition.name}, Status: {sla.status}")

# Check current on-call for a rotation
>>> rotation = OnCallRotation.objects.get(id='rotation-id')
>>> current = rotation.get_current_oncall()
>>> print(f"Current on-call: {current.user.username if current else 'None'}")

# Check business hours
>>> business_hours = BusinessHours.objects.get(id='business-hours-id')
>>> now = timezone.now()
>>> print(f"Is business hours: {business_hours.is_business_hours(now)}")
```