This commit is contained in:
Iliyan Angelov
2025-09-19 11:58:53 +03:00
parent 306b20e24a
commit 6b247e5b9f
11423 changed files with 1500615 additions and 778 deletions

@@ -0,0 +1,477 @@
# Automation & Orchestration API Documentation
## Overview
The Automation & Orchestration module automates incident management. It covers runbooks, integrations with external systems, ChatOps, auto-remediation, maintenance window management, and reusable workflow templates.
## Features
### 1. Runbooks Automation
- **Predefined Response Steps**: Create and manage automated response procedures
- **Multiple Trigger Types**: Manual, automatic, scheduled, webhook, and ChatOps triggers
- **Execution Tracking**: Monitor runbook execution status and performance
- **Version Control**: Track runbook versions and changes
### 2. External System Integrations
- **ITSM Tools**: Jira, ServiceNow
- **CI/CD Tools**: GitHub, Jenkins, Ansible, Terraform
- **Chat Platforms**: Slack, Microsoft Teams, Discord, Mattermost
- **Generic APIs**: Webhook and API integrations
- **Health Monitoring**: Integration health checks and status tracking
### 3. ChatOps Integration
- **Command Execution**: Trigger workflows from chat platforms
- **Security Controls**: User and channel-based access control
- **Command History**: Track and audit ChatOps commands
- **Multi-Platform Support**: Slack, Teams, Discord, Mattermost
### 4. Auto-Remediation
- **Automatic Response**: Trigger remediation actions based on incident conditions
- **Safety Controls**: Approval workflows and execution limits
- **Multiple Remediation Types**: Service restart, deployment rollback, scaling, etc.
- **Execution Tracking**: Monitor remediation success rates and performance
### 5. Maintenance Windows
- **Scheduled Suppression**: Suppress alerts during planned maintenance
- **Service-Specific**: Target specific services and components
- **Flexible Configuration**: Control incident creation, notifications, and escalations
- **Status Management**: Automatic status updates based on schedule
### 6. Workflow Templates
- **Reusable Workflows**: Create templates for common automation scenarios
- **Parameterized Execution**: Support for input parameters and output schemas
- **Template Types**: Incident response, deployment, maintenance, scaling, monitoring
- **Usage Tracking**: Monitor template usage and performance
## API Endpoints
### Runbooks
#### List Runbooks
```
GET /api/automation/runbooks/
```
**Query Parameters:**
- `status`: Filter by status (DRAFT, ACTIVE, INACTIVE, DEPRECATED)
- `trigger_type`: Filter by trigger type (MANUAL, AUTOMATIC, SCHEDULED, WEBHOOK, CHATOPS)
- `category`: Filter by category
- `is_public`: Filter by public/private status
- `search`: Search in name, description, category
**Response:**
```json
{
  "count": 10,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "uuid",
      "name": "Database Service Restart",
      "description": "Automated runbook for restarting database services",
      "version": "1.0",
      "trigger_type": "AUTOMATIC",
      "trigger_conditions": {
        "severity": ["CRITICAL", "EMERGENCY"],
        "category": "database"
      },
      "steps": [...],
      "estimated_duration": "00:05:00",
      "category": "database",
      "tags": ["database", "restart", "automation"],
      "status": "ACTIVE",
      "is_public": true,
      "execution_count": 5,
      "success_rate": 0.8,
      "can_trigger": true,
      "created_at": "2024-01-15T10:00:00Z",
      "updated_at": "2024-01-15T10:00:00Z"
    }
  ]
}
```
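The following is a minimal client-side sketch, not the module's own client. It builds the list URL from the documented query parameters and walks the `next` links of the paginated response. The host `https://example.invalid`, the helper names `build_runbooks_url` and `iter_runbooks`, the `fetch_page` callable, and the lowercase boolean encoding are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterator, Optional
from urllib.parse import urlencode

BASE_URL = "https://example.invalid"  # hypothetical host, not from the docs
RUNBOOK_FILTERS = {"status", "trigger_type", "category", "is_public", "search"}


def build_runbooks_url(**filters: Any) -> str:
    """Return the list URL, accepting only the documented query parameters."""
    unknown = set(filters) - RUNBOOK_FILTERS
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {}
    for key in sorted(filters):
        value = filters[key]
        # Assumption: the server accepts lowercase "true"/"false" for booleans.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    url = f"{BASE_URL}/api/automation/runbooks/"
    return f"{url}?{urlencode(params)}" if params else url


def iter_runbooks(
    first_url: str, fetch_page: Callable[[str], Dict[str, Any]]
) -> Iterator[Dict[str, Any]]:
    """Yield every runbook, following "next" until it is null."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)  # caller supplies the HTTP GET + JSON decode
        yield from page["results"]
        url = page.get("next")
```

Passing the HTTP call in as `fetch_page` keeps the pagination logic independent of any particular HTTP library.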
#### Create Runbook
```
POST /api/automation/runbooks/
```
**Request Body:**
```json
{
  "name": "New Runbook",
  "description": "Description of the runbook",
  "version": "1.0",
  "trigger_type": "MANUAL",
  "trigger_conditions": {
    "severity": ["HIGH", "CRITICAL"]
  },
  "steps": [
    {
      "name": "Step 1",
      "action": "check_status",
      "timeout": 30,
      "parameters": {"service": "web"}
    }
  ],
  "estimated_duration": "00:05:00",
  "category": "web",
  "tags": ["web", "restart"],
  "status": "DRAFT",
  "is_public": true
}
```
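Checking a payload locally before sending it can catch obvious mistakes early. The sketch below validates the enumerated fields against the values listed in this document. Which fields are mandatory and the rule that each step needs a `name` and an `action` are assumptions, since the server-side validation rules are not documented here; `validate_runbook` is a hypothetical helper.

```python
from typing import Any, Dict, List

STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}
TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}


def validate_runbook(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []
    if not payload.get("name"):
        errors.append("name is required")
    if payload.get("trigger_type") not in TRIGGER_TYPES:
        errors.append("trigger_type must be one of " + ", ".join(sorted(TRIGGER_TYPES)))
    # Assumption: a missing status is treated as DRAFT.
    if payload.get("status", "DRAFT") not in STATUSES:
        errors.append("status must be one of " + ", ".join(sorted(STATUSES)))
    steps = payload.get("steps")
    if not isinstance(steps, list) or not steps:
        errors.append("steps must be a non-empty list")
    else:
        for index, step in enumerate(steps):
            # Assumption: every step carries at least a name and an action.
            if not isinstance(step, dict) or not step.get("name") or not step.get("action"):
                errors.append(f"steps[{index}] needs a name and an action")
    return errors
```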
#### Execute Runbook
```
POST /api/automation/runbooks/{id}/execute/
```
**Request Body:**
```json
{
  "trigger_data": {
    "incident_id": "uuid",
    "context": "additional context"
  }
}
```
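As an illustration, the execute call can be prepared with the Python standard library. The sketch only builds the request object and does not send it. The `Authorization: Token ...` header format is an assumption (this document mentions token authentication but not the header scheme), and `build_execute_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_execute_request(
    base_url: str, runbook_id: str, incident_id: str, token: str, context: str = ""
) -> Request:
    """Prepare (but do not send) a POST to the runbook execute endpoint."""
    body = {"trigger_data": {"incident_id": incident_id, "context": context}}
    return Request(
        url=f"{base_url.rstrip('/')}/api/automation/runbooks/{runbook_id}/execute/",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token auth uses the "Token <key>" scheme.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once the host and credentials are real.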
### Integrations
#### List Integrations
```
GET /api/automation/integrations/
```
**Query Parameters:**
- `integration_type`: Filter by type (JIRA, GITHUB, JENKINS, etc.)
- `status`: Filter by status (ACTIVE, INACTIVE, ERROR, CONFIGURING)
- `health_status`: Filter by health status (HEALTHY, WARNING, ERROR, UNKNOWN)
#### Test Integration Connection
```
POST /api/automation/integrations/{id}/test_connection/
```
#### Perform Health Check
```
POST /api/automation/integrations/{id}/health_check/
```
### ChatOps
#### List ChatOps Integrations
```
GET /api/automation/chatops-integrations/
```
#### List ChatOps Commands
```
GET /api/automation/chatops-commands/
```
**Query Parameters:**
- `status`: Filter by execution status
- `chatops_integration`: Filter by integration
- `command`: Filter by command name
- `user_id`: Filter by user ID
- `channel_id`: Filter by channel ID
### Auto-Remediation
#### List Auto-Remediations
```
GET /api/automation/auto-remediations/
```
**Query Parameters:**
- `remediation_type`: Filter by type (SERVICE_RESTART, DEPLOYMENT_ROLLBACK, etc.)
- `trigger_condition_type`: Filter by trigger condition type
- `is_active`: Filter by active status
- `requires_approval`: Filter by approval requirement
#### Approve Auto-Remediation Execution
```
POST /api/automation/auto-remediation-executions/{id}/approve/
```
**Request Body:**
```json
{
  "approval_notes": "Approved for execution"
}
```
#### Reject Auto-Remediation Execution
```
POST /api/automation/auto-remediation-executions/{id}/reject/
```
**Request Body:**
```json
{
  "rejection_notes": "Rejected due to risk concerns"
}
```
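The approve and reject endpoints differ only in the final path segment and the name of the notes field. A small sketch that captures this, with `build_decision` as a hypothetical helper name:

```python
from typing import Dict, Tuple


def build_decision(execution_id: str, approve: bool, notes: str) -> Tuple[str, Dict[str, str]]:
    """Return the (path, JSON body) pair for approving or rejecting an execution."""
    action = "approve" if approve else "reject"
    notes_key = "approval_notes" if approve else "rejection_notes"
    path = f"/api/automation/auto-remediation-executions/{execution_id}/{action}/"
    return path, {notes_key: notes}
```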
### Maintenance Windows
#### List Maintenance Windows
```
GET /api/automation/maintenance-windows/
```
#### Get Active Maintenance Windows
```
GET /api/automation/maintenance-windows/active/
```
#### Get Upcoming Maintenance Windows
```
GET /api/automation/maintenance-windows/upcoming/
```
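How the server decides which windows are "active" or "upcoming" is not specified here. One plausible reading, sketched below as an assumption, compares the current time with `start_time` and `end_time`. The state names returned by the hypothetical `window_state` helper are illustrative and are not the module's actual status values.

```python
from datetime import datetime, timedelta, timezone


def window_state(start_time: datetime, end_time: datetime, now: datetime) -> str:
    """Classify a window relative to `now` (all datetimes timezone-aware)."""
    if now < start_time:
        return "upcoming"
    if now < end_time:  # start is inclusive, end is exclusive
        return "active"
    return "completed"
```

Using timezone-aware datetimes throughout avoids comparing values from different zones by accident.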
### Workflow Templates
#### List Workflow Templates
```
GET /api/automation/workflow-templates/
```
**Query Parameters:**
- `template_type`: Filter by type (INCIDENT_RESPONSE, DEPLOYMENT, etc.)
- `is_public`: Filter by public/private status
## Data Models
### Runbook
- **id**: UUID primary key
- **name**: Unique name for the runbook
- **description**: Detailed description
- **version**: Version string
- **trigger_type**: How the runbook is triggered
- **trigger_conditions**: JSON conditions for triggering
- **steps**: JSON array of execution steps
- **estimated_duration**: Expected execution time
- **category**: Categorization
- **tags**: JSON array of tags
- **status**: Current status
- **is_public**: Public/private visibility
- **execution_count**: Number of executions
- **success_rate**: Success rate (0.0-1.0)
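
The `trigger_conditions` examples in this document map a field name to either a single value or a list of accepted values. The matching semantics are not documented, so the sketch below is an assumption: every key must match, a list means "any of these", and a scalar means equality. `matches_conditions` is a hypothetical helper.

```python
from typing import Any, Dict


def matches_conditions(conditions: Dict[str, Any], incident: Dict[str, Any]) -> bool:
    """Return True when the incident satisfies every trigger condition."""
    for key, expected in conditions.items():
        actual = incident.get(key)
        if isinstance(expected, list):
            if actual not in expected:  # list = any of the listed values
                return False
        elif actual != expected:  # scalar = exact match
            return False
    return True
```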
### Integration
- **id**: UUID primary key
- **name**: Unique name for the integration
- **integration_type**: Type of integration (JIRA, GITHUB, etc.)
- **description**: Description
- **configuration**: JSON configuration data
- **authentication_config**: JSON authentication data
- **status**: Integration status
- **health_status**: Health status
- **request_count**: Number of requests made
- **last_used_at**: Last usage timestamp
### ChatOpsIntegration
- **id**: UUID primary key
- **name**: Unique name
- **platform**: Chat platform (SLACK, TEAMS, etc.)
- **webhook_url**: Webhook URL
- **bot_token**: Bot authentication token
- **channel_id**: Default channel ID
- **command_prefix**: Command prefix character
- **available_commands**: JSON array of available commands
- **allowed_users**: JSON array of allowed user IDs
- **allowed_channels**: JSON array of allowed channel IDs
- **is_active**: Active status
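
To show how `command_prefix`, `available_commands`, `allowed_users`, and `allowed_channels` might work together, here is a minimal sketch. It assumes a fail-closed policy in which an empty allow-list denies everyone; the real module may treat empty lists differently. `ChatOpsConfig`, `parse_command`, and `is_authorized` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ChatOpsConfig:
    command_prefix: str = "!"
    available_commands: List[str] = field(default_factory=list)
    allowed_users: List[str] = field(default_factory=list)
    allowed_channels: List[str] = field(default_factory=list)
    is_active: bool = True


def parse_command(cfg: ChatOpsConfig, text: str) -> Optional[Tuple[str, List[str]]]:
    """Split "<prefix><command> <args...>" into (command, args), or None."""
    if not text.startswith(cfg.command_prefix):
        return None
    parts = text[len(cfg.command_prefix):].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def is_authorized(cfg: ChatOpsConfig, user_id: str, channel_id: str, command: str) -> bool:
    """Fail closed: every check must pass (assumed policy, see lead-in)."""
    return (
        cfg.is_active
        and command in cfg.available_commands
        and user_id in cfg.allowed_users
        and channel_id in cfg.allowed_channels
    )
```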
### AutoRemediation
- **id**: UUID primary key
- **name**: Unique name
- **description**: Description
- **remediation_type**: Type of remediation action
- **trigger_conditions**: JSON trigger conditions
- **trigger_condition_type**: Type of trigger condition
- **remediation_config**: JSON remediation configuration
- **timeout_seconds**: Execution timeout
- **requires_approval**: Whether approval is required
- **approval_users**: Many-to-many relationship with users
- **max_executions_per_incident**: Maximum executions per incident
- **is_active**: Active status
- **execution_count**: Number of executions
- **success_count**: Number of successful executions
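
The fields `is_active`, `requires_approval`, and `max_executions_per_incident` suggest a gating step before a remediation runs. The sketch below is one possible reading, not the module's implementation. The order of the checks and the return values `"skip"`, `"await_approval"`, and `"execute"` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AutoRemediation:
    is_active: bool
    requires_approval: bool
    max_executions_per_incident: int
    execution_count: int = 0
    success_count: int = 0

    @property
    def success_rate(self) -> float:
        """Fraction of successful executions; 0.0 when nothing has run yet."""
        if self.execution_count == 0:
            return 0.0
        return self.success_count / self.execution_count


def next_action(rule: AutoRemediation, executions_for_incident: int) -> str:
    """Decide what to do for one incident (hypothetical decision values)."""
    if not rule.is_active:
        return "skip"
    if executions_for_incident >= rule.max_executions_per_incident:
        return "skip"  # execution limit for this incident reached
    return "await_approval" if rule.requires_approval else "execute"
```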
### MaintenanceWindow
- **id**: UUID primary key
- **name**: Name of the maintenance window
- **description**: Description
- **start_time**: Start datetime
- **end_time**: End datetime
- **timezone**: Timezone
- **affected_services**: JSON array of affected services
- **affected_components**: JSON array of affected components
- **suppress_incident_creation**: Whether to suppress incident creation
- **suppress_notifications**: Whether to suppress notifications
- **suppress_escalations**: Whether to suppress escalations
- **status**: Current status
- **incidents_suppressed**: Count of suppressed incidents
- **notifications_suppressed**: Count of suppressed notifications
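
A sketch of how the three `suppress_*` flags could be evaluated for an incoming event. It assumes suppression applies only while the window is open and only to services listed in `affected_services`; the default flag values and the `suppressed_actions` helper are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Set


@dataclass
class MaintenanceWindow:
    start_time: datetime
    end_time: datetime
    affected_services: List[str] = field(default_factory=list)
    suppress_incident_creation: bool = True   # defaults are assumptions
    suppress_notifications: bool = True
    suppress_escalations: bool = True


def suppressed_actions(window: MaintenanceWindow, service: str, now: datetime) -> Set[str]:
    """Return which actions to suppress for `service` at time `now`."""
    in_window = window.start_time <= now < window.end_time
    if not in_window or service not in window.affected_services:
        return set()
    flags = {
        "incident_creation": window.suppress_incident_creation,
        "notifications": window.suppress_notifications,
        "escalations": window.suppress_escalations,
    }
    return {name for name, enabled in flags.items() if enabled}
```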
### WorkflowTemplate
- **id**: UUID primary key
- **name**: Unique name
- **description**: Description
- **template_type**: Type of workflow template
- **workflow_steps**: JSON array of workflow steps
- **input_parameters**: JSON array of input parameters
- **output_schema**: JSON output schema
- **usage_count**: Number of times used
- **is_public**: Public/private visibility
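
The shape of each `input_parameters` entry is not documented here. Purely as an illustration, the sketch below assumes entries of the form `{"name": ..., "required": ..., "default": ...}` and resolves caller-supplied values against them. Both the entry shape and the `resolve_parameters` helper are hypothetical.

```python
from typing import Any, Dict, List


def resolve_parameters(
    input_parameters: List[Dict[str, Any]], supplied: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge supplied values with defaults; raise if a required one is missing."""
    resolved: Dict[str, Any] = {}
    missing: List[str] = []
    for spec in input_parameters:
        name = spec["name"]
        if name in supplied:
            resolved[name] = supplied[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        elif spec.get("required", False):
            missing.append(name)
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return resolved
```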
## Security Features
### Access Control
- **User Permissions**: Role-based access control
- **Data Classification**: Integration with security module
- **Audit Logging**: Comprehensive audit trails
- **API Authentication**: Token and session authentication
### ChatOps Security
- **User Whitelisting**: Restrict commands to specific users
- **Channel Restrictions**: Limit commands to specific channels
- **Command Validation**: Validate command parameters
- **Execution Logging**: Log all command executions
### Auto-Remediation Safety
- **Approval Workflows**: Require manual approval for sensitive actions
- **Execution Limits**: Limit executions per incident
- **Timeout Controls**: Prevent runaway executions
- **Rollback Capabilities**: Support for rollback operations
## Integration with Other Modules
### Incident Intelligence Integration
- **Automatic Triggering**: Trigger runbooks based on incident characteristics
- **AI Suggestions**: AI-driven runbook recommendations
- **Correlation**: Link automation actions to incident patterns
- **Maintenance Suppression**: Suppress incidents during maintenance windows
### Security Module Integration
- **Access Control**: Use security module for authentication and authorization
- **Data Classification**: Apply data classification to automation data
- **Audit Integration**: Integrate with security audit trails
- **MFA Support**: Support multi-factor authentication for sensitive operations
## Best Practices
### Runbook Design
1. **Clear Steps**: Define clear, atomic steps
2. **Error Handling**: Include error handling and rollback procedures
3. **Timeout Management**: Set appropriate timeouts for each step
4. **Documentation**: Provide clear documentation for each step
5. **Testing**: Test runbooks in non-production environments
### Integration Management
1. **Health Monitoring**: Regularly monitor integration health
2. **Credential Management**: Securely store and rotate credentials
3. **Rate Limiting**: Implement appropriate rate limiting
4. **Error Handling**: Handle integration failures gracefully
5. **Monitoring**: Monitor integration usage and performance
### Auto-Remediation
1. **Conservative Approach**: Start with low-risk remediations
2. **Approval Workflows**: Use approval workflows for high-risk actions
3. **Monitoring**: Monitor remediation success rates
4. **Documentation**: Document all remediation actions
5. **Testing**: Test remediations in controlled environments
### Maintenance Windows
1. **Communication**: Communicate maintenance windows to stakeholders
2. **Scope Definition**: Clearly define affected services and components
3. **Rollback Plans**: Have rollback plans for maintenance activities
4. **Monitoring**: Monitor system health during maintenance
5. **Documentation**: Document maintenance activities and outcomes
## Error Handling
### Common Error Scenarios
1. **Integration Failures**: Handle external system unavailability
2. **Authentication Errors**: Handle credential expiration
3. **Timeout Errors**: Handle execution timeouts
4. **Permission Errors**: Handle insufficient permissions
5. **Data Validation Errors**: Handle invalid input data
### Error Response Format
```json
{
  "error": "Error message",
  "code": "ERROR_CODE",
  "details": {
    "field": "specific field error"
  },
  "timestamp": "2024-01-15T10:00:00Z"
}
```
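A client can turn this payload into a typed exception so that callers can branch on `code`. The sketch below is a minimal example; the class name `AutomationAPIError`, the `parse_error` helper, and the fallback values for missing fields are assumptions.

```python
import json
from typing import Any, Dict, Optional


class AutomationAPIError(Exception):
    """Carries the fields of the documented error response."""

    def __init__(self, message: str, code: str, details: Dict[str, Any], timestamp: Optional[str]):
        super().__init__(f"{code}: {message}")
        self.message = message
        self.code = code
        self.details = details
        self.timestamp = timestamp


def parse_error(body: str) -> AutomationAPIError:
    """Build an exception from a JSON error body, tolerating missing fields."""
    data = json.loads(body)
    return AutomationAPIError(
        message=data.get("error", "Unknown error"),
        code=data.get("code", "UNKNOWN"),
        details=data.get("details") or {},
        timestamp=data.get("timestamp"),
    )
```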
## Rate Limiting
### Default Limits
- **API Requests**: 1000 requests per hour per user
- **Runbook Executions**: 10 executions per hour per user
- **Integration Calls**: 100 calls per hour per integration
- **ChatOps Commands**: 50 commands per hour per user
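
This document does not say which algorithm enforces these limits. As one way to reason about an "N per hour per key" limit, the sketch below uses a sliding-window log: it keeps the timestamps of recent events per key and rejects a new event once the window is full. The class name `SlidingWindowLimiter` and its interface are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` events per `window_seconds` for each key."""

    def __init__(self, limit: int, window_seconds: float = 3600.0):
        self.limit = limit
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        """Record and allow the event at time `now`, or reject it."""
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

For example, `SlidingWindowLimiter(limit=10)` keyed by user ID would model the runbook execution limit above. Passing `now` explicitly keeps the class easy to test; production code would typically pass `time.monotonic()`.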
### Custom Limits
- Configure custom rate limits per user role
- Set different limits for different integration types
- Implement burst allowances for emergency situations
## Monitoring and Alerting
### Key Metrics
- **Runbook Success Rate**: Track runbook execution success
- **Integration Health**: Monitor integration availability
- **Auto-Remediation Effectiveness**: Track remediation success
- **ChatOps Usage**: Monitor ChatOps command usage
- **Maintenance Window Impact**: Track maintenance window effectiveness
### Alerting
- **Integration Failures**: Alert on integration health issues
- **Runbook Failures**: Alert on runbook execution failures
- **Auto-Remediation Issues**: Alert on remediation failures
- **Rate Limit Exceeded**: Alert on rate limit violations
- **Security Issues**: Alert on security-related events
## Troubleshooting
### Common Issues
1. **Runbook Execution Failures**: Check step configurations and permissions
2. **Integration Connection Issues**: Verify credentials and network connectivity
3. **ChatOps Command Failures**: Check user permissions and command syntax
4. **Auto-Remediation Not Triggering**: Verify trigger conditions and permissions
5. **Maintenance Window Not Working**: Check timezone and schedule configuration
### Debug Information
- Enable debug logging for detailed execution information
- Use execution logs to trace runbook and workflow execution
- Check integration health status and error messages
- Review audit logs for security and access issues
- Monitor system metrics for performance issues
## Future Enhancements
### Planned Features
1. **Visual Workflow Builder**: Drag-and-drop workflow creation
2. **Advanced AI Integration**: Enhanced AI-driven automation suggestions
3. **Multi-Cloud Support**: Support for multiple cloud providers
4. **Advanced Analytics**: Enhanced reporting and analytics capabilities
5. **Mobile Support**: Mobile app for automation management
### Integration Roadmap
1. **Additional ITSM Tools**: Remedy and other tools beyond the existing Jira and ServiceNow integrations
2. **Cloud Platforms**: AWS, Azure, GCP integrations
3. **Monitoring Tools**: Prometheus, Grafana, DataDog
4. **Communication Platforms**: Additional chat platforms
5. **Development Tools**: GitLab, Bitbucket, CircleCI