Automation & Orchestration API Documentation
Overview
The Automation & Orchestration module provides comprehensive automation capabilities for incident management, including runbooks, integrations with external systems, ChatOps functionality, auto-remediation, and maintenance window management.
Features
1. Runbooks Automation
- Predefined Response Steps: Create and manage automated response procedures
- Multiple Trigger Types: Manual, automatic, scheduled, webhook, and ChatOps triggers
- Execution Tracking: Monitor runbook execution status and performance
- Version Control: Track runbook versions and changes
2. External System Integrations
- ITSM Tools: Jira, ServiceNow integration
- CI/CD Tools: GitHub, Jenkins, Ansible, Terraform
- Chat Platforms: Slack, Microsoft Teams, Discord, Mattermost
- Generic APIs: Webhook and API integrations
- Health Monitoring: Integration health checks and status tracking
3. ChatOps Integration
- Command Execution: Trigger workflows from chat platforms
- Security Controls: User and channel-based access control
- Command History: Track and audit ChatOps commands
- Multi-Platform Support: Slack, Teams, Discord, Mattermost
4. Auto-Remediation
- Automatic Response: Trigger remediation actions based on incident conditions
- Safety Controls: Approval workflows and execution limits
- Multiple Remediation Types: Service restart, deployment rollback, scaling, etc.
- Execution Tracking: Monitor remediation success rates and performance
5. Maintenance Windows
- Scheduled Suppression: Suppress alerts during planned maintenance
- Service-Specific: Target specific services and components
- Flexible Configuration: Control incident creation, notifications, and escalations
- Status Management: Automatic status updates based on schedule
6. Workflow Templates
- Reusable Workflows: Create templates for common automation scenarios
- Parameterized Execution: Support for input parameters and output schemas
- Template Types: Incident response, deployment, maintenance, scaling, monitoring
- Usage Tracking: Monitor template usage and performance
API Endpoints
Runbooks
List Runbooks
GET /api/automation/runbooks/
Query Parameters:
- status: Filter by status (DRAFT, ACTIVE, INACTIVE, DEPRECATED)
- trigger_type: Filter by trigger type (MANUAL, AUTOMATIC, SCHEDULED, WEBHOOK, CHATOPS)
- category: Filter by category
- is_public: Filter by public/private status
- search: Search in name, description, and category
Response:
{
"count": 10,
"next": null,
"previous": null,
"results": [
{
"id": "uuid",
"name": "Database Service Restart",
"description": "Automated runbook for restarting database services",
"version": "1.0",
"trigger_type": "AUTOMATIC",
"trigger_conditions": {
"severity": ["CRITICAL", "EMERGENCY"],
"category": "database"
},
"steps": [...],
"estimated_duration": "00:05:00",
"category": "database",
"tags": ["database", "restart", "automation"],
"status": "ACTIVE",
"is_public": true,
"execution_count": 5,
"success_rate": 0.8,
"can_trigger": true,
"created_at": "2024-01-15T10:00:00Z",
"updated_at": "2024-01-15T10:00:00Z"
}
]
}
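The filters above combine into an ordinary query string. The following is a minimal sketch of building the List Runbooks URL; the helper name `build_runbook_list_url`, the `base_url` argument, and the lowercase `true`/`false` encoding for `is_public` are assumptions for illustration, not part of the documented API.

```python
from urllib.parse import urlencode

# Filter values taken from the query parameter descriptions above.
RUNBOOK_STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}
TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}


def build_runbook_list_url(base_url, status=None, trigger_type=None,
                           category=None, is_public=None, search=None):
    """Return the List Runbooks URL with only the filters that were supplied."""
    if status is not None and status not in RUNBOOK_STATUSES:
        raise ValueError(f"unknown status: {status}")
    if trigger_type is not None and trigger_type not in TRIGGER_TYPES:
        raise ValueError(f"unknown trigger_type: {trigger_type}")
    params = {
        "status": status,
        "trigger_type": trigger_type,
        "category": category,
        # Assumption: booleans are sent as lowercase "true"/"false".
        "is_public": None if is_public is None else str(is_public).lower(),
        "search": search,
    }
    # Drop filters that were not supplied so the URL stays minimal.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/api/automation/runbooks/"
    return f"{url}?{query}" if query else url
```

Validating `status` and `trigger_type` on the client side is optional; it only catches typos before a request is sent.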
Create Runbook
POST /api/automation/runbooks/
Request Body:
{
"name": "New Runbook",
"description": "Description of the runbook",
"version": "1.0",
"trigger_type": "MANUAL",
"trigger_conditions": {
"severity": ["HIGH", "CRITICAL"]
},
"steps": [
{
"name": "Step 1",
"action": "check_status",
"timeout": 30,
"parameters": {"service": "web"}
}
],
"estimated_duration": "00:05:00",
"category": "web",
"tags": ["web", "restart"],
"status": "DRAFT",
"is_public": true
}
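A request body like the one above can be checked locally before it is sent. This sketch assumes that `name`, a known `trigger_type`, and a `name` and `action` on every step are required, and that `timeout` is a positive integer; the documentation does not state which fields are mandatory, so treat these rules and the helper name `validate_runbook_payload` as hypothetical.

```python
TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}
STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}


def validate_runbook_payload(payload):
    """Return a list of problems found in a Create Runbook request body."""
    problems = []
    if not payload.get("name"):
        problems.append("name is required")
    if payload.get("trigger_type") not in TRIGGER_TYPES:
        problems.append("trigger_type is missing or unknown")
    # Assumption: a missing status is acceptable and treated as DRAFT.
    if payload.get("status", "DRAFT") not in STATUSES:
        problems.append("status is unknown")
    for index, step in enumerate(payload.get("steps", [])):
        for key in ("name", "action"):
            if not step.get(key):
                problems.append(f"steps[{index}].{key} is required")
        timeout = step.get("timeout")
        if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
            problems.append(f"steps[{index}].timeout must be a positive integer")
    return problems
```

An empty list means the payload passed these local checks; the server may still apply stricter validation.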
Execute Runbook
POST /api/automation/runbooks/{id}/execute/
Request Body:
{
"trigger_data": {
"incident_id": "uuid",
"context": "additional context"
}
}
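As an illustration, the execute call can be prepared with the Python standard library. The `Token <key>` authorization scheme, the `base_url` argument, and the helper name `build_execute_request` are assumptions; this document only states that token and session authentication are supported.

```python
import json
from urllib.request import Request


def build_execute_request(base_url, runbook_id, incident_id, context, token):
    """Build (but do not send) a POST request that executes a runbook."""
    url = f"{base_url.rstrip('/')}/api/automation/runbooks/{runbook_id}/execute/"
    body = {"trigger_data": {"incident_id": incident_id, "context": context}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token auth uses the "Token <key>" scheme.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps the sketch easy to inspect and test.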
Integrations
List Integrations
GET /api/automation/integrations/
Query Parameters:
- integration_type: Filter by type (JIRA, GITHUB, JENKINS, etc.)
- status: Filter by status (ACTIVE, INACTIVE, ERROR, CONFIGURING)
- health_status: Filter by health status (HEALTHY, WARNING, ERROR, UNKNOWN)
Test Integration Connection
POST /api/automation/integrations/{id}/test_connection/
Perform Health Check
POST /api/automation/integrations/{id}/health_check/
ChatOps
List ChatOps Integrations
GET /api/automation/chatops-integrations/
List ChatOps Commands
GET /api/automation/chatops-commands/
Query Parameters:
- status: Filter by execution status
- chatops_integration: Filter by integration
- command: Filter by command name
- user_id: Filter by user ID
- channel_id: Filter by channel ID
Auto-Remediation
List Auto-Remediations
GET /api/automation/auto-remediations/
Query Parameters:
- remediation_type: Filter by type (SERVICE_RESTART, DEPLOYMENT_ROLLBACK, etc.)
- trigger_condition_type: Filter by trigger condition type
- is_active: Filter by active status
- requires_approval: Filter by approval requirement
Approve Auto-Remediation Execution
POST /api/automation/auto-remediation-executions/{id}/approve/
Request Body:
{
"approval_notes": "Approved for execution"
}
Reject Auto-Remediation Execution
POST /api/automation/auto-remediation-executions/{id}/reject/
Request Body:
{
"rejection_notes": "Rejected due to risk concerns"
}
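The approve and reject endpoints differ only in the final path segment and the name of the notes field, so one helper can produce both. This is a small sketch; `build_approval_decision` is a hypothetical name, and it returns the path and body rather than sending anything.

```python
def build_approval_decision(execution_id, approve, notes):
    """Return (path, body) for approving or rejecting a pending execution."""
    action = "approve" if approve else "reject"
    notes_key = "approval_notes" if approve else "rejection_notes"
    path = f"/api/automation/auto-remediation-executions/{execution_id}/{action}/"
    return path, {notes_key: notes}
```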
Maintenance Windows
List Maintenance Windows
GET /api/automation/maintenance-windows/
Get Active Maintenance Windows
GET /api/automation/maintenance-windows/active/
Get Upcoming Maintenance Windows
GET /api/automation/maintenance-windows/upcoming/
Workflow Templates
List Workflow Templates
GET /api/automation/workflow-templates/
Query Parameters:
- template_type: Filter by type (INCIDENT_RESPONSE, DEPLOYMENT, etc.)
- is_public: Filter by public/private status
Data Models
Runbook
- id: UUID primary key
- name: Unique name for the runbook
- description: Detailed description
- version: Version string
- trigger_type: How the runbook is triggered
- trigger_conditions: JSON conditions for triggering
- steps: JSON array of execution steps
- estimated_duration: Expected execution time, formatted as HH:MM:SS (for example, 00:05:00)
- category: Categorization
- tags: JSON array of tags
- status: Current status
- is_public: Public/private visibility
- execution_count: Number of executions
- success_rate: Success rate (0.0-1.0)
Integration
- id: UUID primary key
- name: Unique name for the integration
- integration_type: Type of integration (JIRA, GITHUB, etc.)
- description: Description
- configuration: JSON configuration data
- authentication_config: JSON authentication data
- status: Integration status
- health_status: Health status
- request_count: Number of requests made
- last_used_at: Last usage timestamp
ChatOpsIntegration
- id: UUID primary key
- name: Unique name
- platform: Chat platform (SLACK, TEAMS, etc.)
- webhook_url: Webhook URL
- bot_token: Bot authentication token
- channel_id: Default channel ID
- command_prefix: Command prefix character
- available_commands: JSON array of available commands
- allowed_users: JSON array of allowed user IDs
- allowed_channels: JSON array of allowed channel IDs
- is_active: Active status
AutoRemediation
- id: UUID primary key
- name: Unique name
- description: Description
- remediation_type: Type of remediation action
- trigger_conditions: JSON trigger conditions
- trigger_condition_type: Type of trigger condition
- remediation_config: JSON remediation configuration
- timeout_seconds: Execution timeout
- requires_approval: Whether approval is required
- approval_users: Many-to-many relationship with users
- max_executions_per_incident: Maximum executions per incident
- is_active: Active status
- execution_count: Number of executions
- success_count: Number of successful executions
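The two counters above are enough to derive a success rate on the same 0.0-1.0 scale that the Runbook model reports. A minimal sketch, assuming a remediation that has never run should report 0.0 (the module's actual behavior for that case is not documented):

```python
def success_rate(success_count, execution_count):
    """Return the fraction of executions that succeeded, between 0.0 and 1.0."""
    if execution_count == 0:
        # Assumption: no executions yet is reported as 0.0 rather than an error.
        return 0.0
    return success_count / execution_count
```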
MaintenanceWindow
- id: UUID primary key
- name: Name of the maintenance window
- description: Description
- start_time: Start datetime
- end_time: End datetime
- timezone: Timezone
- affected_services: JSON array of affected services
- affected_components: JSON array of affected components
- suppress_incident_creation: Whether to suppress incident creation
- suppress_notifications: Whether to suppress notifications
- suppress_escalations: Whether to suppress escalations
- status: Current status
- incidents_suppressed: Count of suppressed incidents
- notifications_suppressed: Count of suppressed notifications
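To show how these fields might interact, the sketch below decides whether an incident for a given service should be suppressed at a given moment. The dataclass mirrors only a subset of the model, and three behaviors are assumptions: the end time is exclusive, all datetimes are timezone-aware, and an empty `affected_services` list matches no service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MaintenanceWindow:
    """Subset of the MaintenanceWindow fields needed for suppression checks."""
    name: str
    start_time: datetime
    end_time: datetime
    affected_services: list = field(default_factory=list)
    suppress_incident_creation: bool = True


def should_suppress_incident(window, service, now):
    """Return True if an incident for `service` at `now` falls inside the window."""
    # Assumption: the window covers start_time up to, but not including, end_time.
    in_window = window.start_time <= now < window.end_time
    return (in_window
            and window.suppress_incident_creation
            and service in window.affected_services)


window = MaintenanceWindow(
    name="Database upgrade",
    start_time=datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
    end_time=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
    affected_services=["database"],
)
```

The same comparison could drive `suppress_notifications` and `suppress_escalations` by checking the corresponding flag.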
WorkflowTemplate
- id: UUID primary key
- name: Unique name
- description: Description
- template_type: Type of workflow template
- workflow_steps: JSON array of workflow steps
- input_parameters: JSON array of input parameters
- output_schema: JSON output schema
- usage_count: Number of times used
- is_public: Public/private visibility
Security Features
Access Control
- User Permissions: Role-based access control
- Data Classification: Integration with security module
- Audit Logging: Comprehensive audit trails
- API Authentication: Token and session authentication
ChatOps Security
- User Whitelisting: Restrict commands to specific users
- Channel Restrictions: Limit commands to specific channels
- Command Validation: Validate command parameters
- Execution Logging: Log all command executions
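The controls above can be pictured as a single gate that runs before any command executes. The sketch below uses the ChatOpsIntegration field names from the data model, but the logic is an assumption: it treats `available_commands` as a list of command names and denies access when an allow-list is empty (fail closed). The module's real behavior may differ.

```python
def parse_command(text, prefix):
    """Split '!restart web' into ('restart', ['web']); return None if it is not a command."""
    if not prefix or not text.startswith(prefix):
        return None
    parts = text[len(prefix):].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def is_command_allowed(integration, user_id, channel_id, text):
    """Apply the active flag, command list, user list, and channel list in turn."""
    if not integration.get("is_active", False):
        return False
    parsed = parse_command(text, integration.get("command_prefix", ""))
    if parsed is None:
        return False
    command, _args = parsed
    # Assumption: an empty allow-list denies everyone rather than allowing everyone.
    return (command in integration.get("available_commands", [])
            and user_id in integration.get("allowed_users", [])
            and channel_id in integration.get("allowed_channels", []))
```

Failing closed means a misconfigured integration rejects commands instead of silently accepting them from anyone.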
Auto-Remediation Safety
- Approval Workflows: Require manual approval for sensitive actions
- Execution Limits: Limit executions per incident
- Timeout Controls: Prevent runaway executions
- Rollback Capabilities: Support for rollback operations
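One way to read the approval and execution-limit controls is as a decision made before each remediation attempt. The following sketch uses the `is_active`, `max_executions_per_incident`, and `requires_approval` fields from the AutoRemediation model; the function name and the returned strings (`skip`, `await_approval`, `execute`) are hypothetical labels, not values defined by the API.

```python
def decide_remediation(remediation, executions_for_incident, approved=False):
    """Return 'skip', 'await_approval', or 'execute' for one remediation attempt."""
    if not remediation["is_active"]:
        return "skip"
    # Execution limit: stop once the per-incident cap has been reached.
    if executions_for_incident >= remediation["max_executions_per_incident"]:
        return "skip"
    # Approval workflow: hold sensitive actions until someone approves them.
    if remediation["requires_approval"] and not approved:
        return "await_approval"
    return "execute"
```

Checking the limit before the approval flag avoids asking approvers to review an action that would be skipped anyway.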
Integration with Other Modules
Incident Intelligence Integration
- Automatic Triggering: Trigger runbooks based on incident characteristics
- AI Suggestions: AI-driven runbook recommendations
- Correlation: Link automation actions to incident patterns
- Maintenance Suppression: Suppress incidents during maintenance windows
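Automatic triggering depends on comparing an incident with a runbook's `trigger_conditions`. The matching rules are not specified in this document, so the sketch below assumes the simplest reading of the example conditions shown for the List Runbooks response: every key must match, a list means "any of these values", and a scalar means equality.

```python
def matches_trigger(conditions, incident):
    """Return True if the incident satisfies every condition (assumed semantics)."""
    for key, expected in conditions.items():
        actual = incident.get(key)
        if isinstance(expected, list):
            # A list of values is treated as "incident value must be one of these".
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


conditions = {"severity": ["CRITICAL", "EMERGENCY"], "category": "database"}
```

Under these assumptions an empty `trigger_conditions` object matches every incident, which is worth guarding against for AUTOMATIC runbooks.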
Security Module Integration
- Access Control: Use security module for authentication and authorization
- Data Classification: Apply data classification to automation data
- Audit Integration: Integrate with security audit trails
- MFA Support: Support multi-factor authentication for sensitive operations
Best Practices
Runbook Design
- Clear Steps: Define clear, atomic steps
- Error Handling: Include error handling and rollback procedures
- Timeout Management: Set appropriate timeouts for each step
- Documentation: Provide clear documentation for each step
- Testing: Test runbooks in non-production environments
Integration Management
- Health Monitoring: Regularly monitor integration health
- Credential Management: Securely store and rotate credentials
- Rate Limiting: Implement appropriate rate limiting
- Error Handling: Handle integration failures gracefully
- Monitoring: Monitor integration usage and performance
Auto-Remediation
- Conservative Approach: Start with low-risk remediations
- Approval Workflows: Use approval workflows for high-risk actions
- Monitoring: Monitor remediation success rates
- Documentation: Document all remediation actions
- Testing: Test remediations in controlled environments
Maintenance Windows
- Communication: Communicate maintenance windows to stakeholders
- Scope Definition: Clearly define affected services and components
- Rollback Plans: Have rollback plans for maintenance activities
- Monitoring: Monitor system health during maintenance
- Documentation: Document maintenance activities and outcomes
Error Handling
Common Error Scenarios
- Integration Failures: Handle external system unavailability
- Authentication Errors: Handle credential expiration
- Timeout Errors: Handle execution timeouts
- Permission Errors: Handle insufficient permissions
- Data Validation Errors: Handle invalid input data
Error Response Format
{
"error": "Error message",
"code": "ERROR_CODE",
"details": {
"field": "specific field error"
},
"timestamp": "2024-01-15T10:00:00Z"
}
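A client can turn this error body into an exception so callers handle failures uniformly. This is a sketch only: the class name `AutomationAPIError` and the fallback values for missing fields are assumptions, not something the API defines.

```python
import json


class AutomationAPIError(Exception):
    """Carries the fields of the error response format shown above."""

    def __init__(self, message, code, details, timestamp):
        super().__init__(f"{code}: {message}")
        self.message = message
        self.code = code
        self.details = details
        self.timestamp = timestamp


def parse_error(body):
    """Build an AutomationAPIError from a JSON error response body."""
    data = json.loads(body)
    return AutomationAPIError(
        message=data.get("error", "Unknown error"),
        code=data.get("code", "UNKNOWN"),
        details=data.get("details") or {},
        timestamp=data.get("timestamp"),
    )
```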
Rate Limiting
Default Limits
- API Requests: 1000 requests per hour per user
- Runbook Executions: 10 executions per hour per user
- Integration Calls: 100 calls per hour per integration
- ChatOps Commands: 50 commands per hour per user
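For clients that want to stay under these limits, a local counter is often enough. The sketch below is a fixed-window limiter; this document does not say whether the server counts per fixed hour or over a sliding window, so the windowing strategy and the class name `FixedWindowLimiter` are assumptions.

```python
class FixedWindowLimiter:
    """Allow at most `limit` calls per key within each fixed time window."""

    def __init__(self, limit, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts = {}  # key -> (window index, calls made in that window)

    def allow(self, key, now):
        """Record a call for `key` at time `now` (seconds); return False if over the limit."""
        window = int(now // self.window_seconds)
        current_window, count = self._counts.get(key, (window, 0))
        if current_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            return False
        self._counts[key] = (window, count + 1)
        return True


# Example: the default of 10 runbook executions per hour per user.
runbook_limiter = FixedWindowLimiter(limit=10)
```

Passing `now` explicitly (for example `time.time()`) keeps the limiter deterministic and easy to test.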
Custom Limits
- Configure custom rate limits per user role
- Set different limits for different integration types
- Implement burst allowances for emergency situations
Monitoring and Alerting
Key Metrics
- Runbook Success Rate: Track runbook execution success
- Integration Health: Monitor integration availability
- Auto-Remediation Effectiveness: Track remediation success
- ChatOps Usage: Monitor ChatOps command usage
- Maintenance Window Impact: Track maintenance window effectiveness
Alerting
- Integration Failures: Alert on integration health issues
- Runbook Failures: Alert on runbook execution failures
- Auto-Remediation Issues: Alert on remediation failures
- Rate Limit Exceeded: Alert on rate limit violations
- Security Issues: Alert on security-related events
Troubleshooting
Common Issues
- Runbook Execution Failures: Check step configurations and permissions
- Integration Connection Issues: Verify credentials and network connectivity
- ChatOps Command Failures: Check user permissions and command syntax
- Auto-Remediation Not Triggering: Verify trigger conditions and permissions
- Maintenance Window Not Working: Check timezone and schedule configuration
Debug Information
- Enable debug logging for detailed execution information
- Use execution logs to trace runbook and workflow execution
- Check integration health status and error messages
- Review audit logs for security and access issues
- Monitor system metrics for performance issues
Future Enhancements
Planned Features
- Visual Workflow Builder: Drag-and-drop workflow creation
- Advanced AI Integration: Enhanced AI-driven automation suggestions
- Multi-Cloud Support: Support for multiple cloud providers
- Advanced Analytics: Enhanced reporting and analytics capabilities
- Mobile Support: Mobile app for automation management
Integration Roadmap
- Additional ITSM Tools: ServiceNow, Remedy, etc.
- Cloud Platforms: AWS, Azure, GCP integrations
- Monitoring Tools: Prometheus, Grafana, DataDog
- Communication Platforms: Additional chat platforms
- Development Tools: GitLab, Bitbucket, CircleCI