Updates
# Automation & Orchestration API Documentation

## Overview

The Automation & Orchestration module provides comprehensive automation capabilities for incident management, including runbooks, integrations with external systems, ChatOps functionality, auto-remediation, and maintenance window management.

## Features

### 1. Runbooks Automation
- **Predefined Response Steps**: Create and manage automated response procedures
- **Multiple Trigger Types**: Manual, automatic, scheduled, webhook, and ChatOps triggers
- **Execution Tracking**: Monitor runbook execution status and performance
- **Version Control**: Track runbook versions and changes

### 2. External System Integrations
- **ITSM Tools**: Jira, ServiceNow integration
- **CI/CD Tools**: GitHub, Jenkins, Ansible, Terraform
- **Chat Platforms**: Slack, Microsoft Teams, Discord, Mattermost
- **Generic APIs**: Webhook and API integrations
- **Health Monitoring**: Integration health checks and status tracking

### 3. ChatOps Integration
- **Command Execution**: Trigger workflows from chat platforms
- **Security Controls**: User and channel-based access control
- **Command History**: Track and audit ChatOps commands
- **Multi-Platform Support**: Slack, Teams, Discord, Mattermost

### 4. Auto-Remediation
- **Automatic Response**: Trigger remediation actions based on incident conditions
- **Safety Controls**: Approval workflows and execution limits
- **Multiple Remediation Types**: Service restart, deployment rollback, scaling, etc.
- **Execution Tracking**: Monitor remediation success rates and performance

### 5. Maintenance Windows
- **Scheduled Suppression**: Suppress alerts during planned maintenance
- **Service-Specific**: Target specific services and components
- **Flexible Configuration**: Control incident creation, notifications, and escalations
- **Status Management**: Automatic status updates based on schedule

### 6. Workflow Templates
- **Reusable Workflows**: Create templates for common automation scenarios
- **Parameterized Execution**: Support for input parameters and output schemas
- **Template Types**: Incident response, deployment, maintenance, scaling, monitoring
- **Usage Tracking**: Monitor template usage and performance

## API Endpoints

### Runbooks

#### List Runbooks
```
GET /api/automation/runbooks/
```

**Query Parameters:**
- `status`: Filter by status (DRAFT, ACTIVE, INACTIVE, DEPRECATED)
- `trigger_type`: Filter by trigger type (MANUAL, AUTOMATIC, SCHEDULED, WEBHOOK, CHATOPS)
- `category`: Filter by category
- `is_public`: Filter by public/private status
- `search`: Search in name, description, category

**Response:**
```json
{
  "count": 10,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "uuid",
      "name": "Database Service Restart",
      "description": "Automated runbook for restarting database services",
      "version": "1.0",
      "trigger_type": "AUTOMATIC",
      "trigger_conditions": {
        "severity": ["CRITICAL", "EMERGENCY"],
        "category": "database"
      },
      "steps": [...],
      "estimated_duration": "00:05:00",
      "category": "database",
      "tags": ["database", "restart", "automation"],
      "status": "ACTIVE",
      "is_public": true,
      "execution_count": 5,
      "success_rate": 0.8,
      "can_trigger": true,
      "created_at": "2024-01-15T10:00:00Z",
      "updated_at": "2024-01-15T10:00:00Z"
    }
  ]
}
```
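
The filters above combine into an ordinary query string. The following is a minimal Python sketch, not the module's own client. It assumes the server accepts lowercase `true`/`false` for `is_public`, which this document does not state. The helper names `build_runbook_list_url` and `runbook_names` and the example host are hypothetical.

```python
from urllib.parse import urlencode

# Allowed values copied from the query-parameter list above.
RUNBOOK_STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}
TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}


def build_runbook_list_url(base_url, status=None, trigger_type=None,
                           category=None, is_public=None, search=None):
    """Return the List Runbooks URL with only the filters that were set."""
    if status is not None and status not in RUNBOOK_STATUSES:
        raise ValueError(f"unknown status: {status}")
    if trigger_type is not None and trigger_type not in TRIGGER_TYPES:
        raise ValueError(f"unknown trigger_type: {trigger_type}")
    params = {
        "status": status,
        "trigger_type": trigger_type,
        "category": category,
        # Assumption: booleans travel as lowercase "true"/"false".
        "is_public": None if is_public is None else str(is_public).lower(),
        "search": search,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/api/automation/runbooks/"
    return f"{url}?{query}" if query else url


def runbook_names(page):
    """Extract runbook names from one paginated response body."""
    return [item["name"] for item in page["results"]]
```

For example, `build_runbook_list_url("https://example.invalid", status="ACTIVE", is_public=True)` yields a URL ending in `?status=ACTIVE&is_public=true`. Rejecting unknown enum values on the client avoids a round trip for an obviously malformed filter.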

#### Create Runbook
```
POST /api/automation/runbooks/
```

**Request Body:**
```json
{
  "name": "New Runbook",
  "description": "Description of the runbook",
  "version": "1.0",
  "trigger_type": "MANUAL",
  "trigger_conditions": {
    "severity": ["HIGH", "CRITICAL"]
  },
  "steps": [
    {
      "name": "Step 1",
      "action": "check_status",
      "timeout": 30,
      "parameters": {"service": "web"}
    }
  ],
  "estimated_duration": "00:05:00",
  "category": "web",
  "tags": ["web", "restart"],
  "status": "DRAFT",
  "is_public": true
}
```
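
A client can catch obvious mistakes in a request body like the one above before sending it. The sketch below is a hypothetical pre-flight check, not the server's validation. Which fields are mandatory, and that `timeout` must be a positive integer, are assumptions. The enum values come from the List Runbooks filters.

```python
TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}
STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}


def validate_runbook(payload):
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    # Assumption: name, trigger_type, and steps are required.
    if not payload.get("name"):
        problems.append("name is required")
    if payload.get("trigger_type") not in TRIGGER_TYPES:
        problems.append("trigger_type is missing or unknown")
    # Assumption: status may be omitted; if present it must be a known value.
    if "status" in payload and payload["status"] not in STATUSES:
        problems.append("status is unknown")
    steps = payload.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("steps must be a non-empty list")
        return problems
    for index, step in enumerate(steps, start=1):
        if not step.get("name") or not step.get("action"):
            problems.append(f"step {index} needs a name and an action")
        timeout = step.get("timeout")
        if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
            problems.append(f"step {index} timeout must be a positive integer")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.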

#### Execute Runbook
```
POST /api/automation/runbooks/{id}/execute/
```

**Request Body:**
```json
{
  "trigger_data": {
    "incident_id": "uuid",
    "context": "additional context"
  }
}
```
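
As an illustration, the execute call can be prepared with Python's standard library. This sketch only builds the request object and does not send it. The `Authorization: Token ...` header scheme is an assumption (this document mentions token authentication but not the header format), and `build_execute_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request


def build_execute_request(base_url, runbook_id, token, incident_id, context=""):
    """Prepare (but do not send) a POST to the Execute Runbook endpoint."""
    url = f"{base_url.rstrip('/')}/api/automation/runbooks/{runbook_id}/execute/"
    body = {"trigger_data": {"incident_id": incident_id, "context": context}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token header scheme; adjust to your deployment.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Keeping construction separate from sending makes the request easy to inspect in tests.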

### Integrations

#### List Integrations
```
GET /api/automation/integrations/
```

**Query Parameters:**
- `integration_type`: Filter by type (JIRA, GITHUB, JENKINS, etc.)
- `status`: Filter by status (ACTIVE, INACTIVE, ERROR, CONFIGURING)
- `health_status`: Filter by health status (HEALTHY, WARNING, ERROR, UNKNOWN)

#### Test Integration Connection
```
POST /api/automation/integrations/{id}/test_connection/
```

#### Perform Health Check
```
POST /api/automation/integrations/{id}/health_check/
```

### ChatOps

#### List ChatOps Integrations
```
GET /api/automation/chatops-integrations/
```

#### List ChatOps Commands
```
GET /api/automation/chatops-commands/
```

**Query Parameters:**
- `status`: Filter by execution status
- `chatops_integration`: Filter by integration
- `command`: Filter by command name
- `user_id`: Filter by user ID
- `channel_id`: Filter by channel ID

### Auto-Remediation

#### List Auto-Remediations
```
GET /api/automation/auto-remediations/
```

**Query Parameters:**
- `remediation_type`: Filter by type (SERVICE_RESTART, DEPLOYMENT_ROLLBACK, etc.)
- `trigger_condition_type`: Filter by trigger condition type
- `is_active`: Filter by active status
- `requires_approval`: Filter by approval requirement

#### Approve Auto-Remediation Execution
```
POST /api/automation/auto-remediation-executions/{id}/approve/
```

**Request Body:**
```json
{
  "approval_notes": "Approved for execution"
}
```

#### Reject Auto-Remediation Execution
```
POST /api/automation/auto-remediation-executions/{id}/reject/
```

**Request Body:**
```json
{
  "rejection_notes": "Rejected due to risk concerns"
}
```
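
The approve and reject endpoints differ only in the path suffix and the name of the notes field. A small sketch of that symmetry follows; `build_decision` is a hypothetical helper, and the paths and field names are taken from the two endpoints above.

```python
def build_decision(execution_id, approved, notes):
    """Return (path, body) for approving or rejecting a pending execution."""
    action = "approve" if approved else "reject"
    notes_field = "approval_notes" if approved else "rejection_notes"
    path = f"/api/automation/auto-remediation-executions/{execution_id}/{action}/"
    return path, {notes_field: notes}
```

Both results are meant to be sent as a `POST` with a JSON body.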

### Maintenance Windows

#### List Maintenance Windows
```
GET /api/automation/maintenance-windows/
```

#### Get Active Maintenance Windows
```
GET /api/automation/maintenance-windows/active/
```

#### Get Upcoming Maintenance Windows
```
GET /api/automation/maintenance-windows/upcoming/
```

### Workflow Templates

#### List Workflow Templates
```
GET /api/automation/workflow-templates/
```

**Query Parameters:**
- `template_type`: Filter by type (INCIDENT_RESPONSE, DEPLOYMENT, etc.)
- `is_public`: Filter by public/private status

## Data Models

### Runbook
- **id**: UUID primary key
- **name**: Unique name for the runbook
- **description**: Detailed description
- **version**: Version string
- **trigger_type**: How the runbook is triggered
- **trigger_conditions**: JSON conditions for triggering
- **steps**: JSON array of execution steps
- **estimated_duration**: Expected execution time
- **category**: Categorization
- **tags**: JSON array of tags
- **status**: Current status
- **is_public**: Public/private visibility
- **execution_count**: Number of executions
- **success_rate**: Success rate (0.0-1.0)
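
This document does not define how `trigger_conditions` are evaluated. One plausible reading of the example values (`"severity": ["CRITICAL", "EMERGENCY"]`, `"category": "database"`) is that every key must match, with a list meaning "any of these values". The sketch below implements that assumed semantics only. `matches_trigger` and the shape of the incident dictionary are hypothetical.

```python
def matches_trigger(conditions, incident):
    """Return True if the incident satisfies every condition (assumed AND semantics)."""
    for key, expected in conditions.items():
        actual = incident.get(key)
        if isinstance(expected, list):
            # Assumption: a list means the incident value may be any listed item.
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    # Note: an empty conditions object matches every incident under this reading.
    return True
```

Under this reading, a `CRITICAL` database incident matches the example conditions and a `HIGH` one does not.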

### Integration
- **id**: UUID primary key
- **name**: Unique name for the integration
- **integration_type**: Type of integration (JIRA, GITHUB, etc.)
- **description**: Description
- **configuration**: JSON configuration data
- **authentication_config**: JSON authentication data
- **status**: Integration status
- **health_status**: Health status
- **request_count**: Number of requests made
- **last_used_at**: Last usage timestamp

### ChatOpsIntegration
- **id**: UUID primary key
- **name**: Unique name
- **platform**: Chat platform (SLACK, TEAMS, etc.)
- **webhook_url**: Webhook URL
- **bot_token**: Bot authentication token
- **channel_id**: Default channel ID
- **command_prefix**: Command prefix character
- **available_commands**: JSON array of available commands
- **allowed_users**: JSON array of allowed user IDs
- **allowed_channels**: JSON array of allowed channel IDs
- **is_active**: Active status
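
The fields above suggest how an incoming chat message could be gated before it triggers anything. The sketch below is illustrative only. It assumes `available_commands` is an array of plain command names and that an empty allow-list means "no restriction", neither of which this document confirms. `parse_command` and `is_authorized` are hypothetical names.

```python
def parse_command(text, prefix):
    """Split a prefixed message into (command, args); return None if it is not a command."""
    if not text.startswith(prefix):
        return None
    parts = text[len(prefix):].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def is_authorized(integration, user_id, channel_id, command):
    """Apply the is_active, allowed_users, allowed_channels, and command checks."""
    if not integration.get("is_active", False):
        return False
    # Assumption: an empty or missing allow-list imposes no restriction.
    users = integration.get("allowed_users") or []
    channels = integration.get("allowed_channels") or []
    if users and user_id not in users:
        return False
    if channels and channel_id not in channels:
        return False
    # Assumption: available_commands holds bare command names.
    return command in integration.get("available_commands", [])
```

A deployment that prefers deny-by-default would treat an empty allow-list as "nobody" instead; the two early returns are the only lines that would change.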

### AutoRemediation
- **id**: UUID primary key
- **name**: Unique name
- **description**: Description
- **remediation_type**: Type of remediation action
- **trigger_conditions**: JSON trigger conditions
- **trigger_condition_type**: Type of trigger condition
- **remediation_config**: JSON remediation configuration
- **timeout_seconds**: Execution timeout
- **requires_approval**: Whether approval is required
- **approval_users**: Many-to-many relationship with users
- **max_executions_per_incident**: Maximum executions per incident
- **is_active**: Active status
- **execution_count**: Number of executions
- **success_count**: Number of successful executions
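
The safety-related fields above (`is_active`, `max_executions_per_incident`, `requires_approval`) can be read as a small decision procedure, and the two counters give a success rate. The sketch below assumes that order of checks and the three outcome labels; neither is specified in this document, and the function names are hypothetical.

```python
def success_rate(execution_count, success_count):
    """Fraction of successful executions, or 0.0 when nothing has run yet."""
    return success_count / execution_count if execution_count else 0.0


def next_action(remediation, executions_for_incident):
    """Return 'skip', 'await_approval', or 'execute' for one incident."""
    if not remediation["is_active"]:
        return "skip"
    # Execution limit: stop once the per-incident cap is reached.
    if executions_for_incident >= remediation["max_executions_per_incident"]:
        return "skip"
    if remediation["requires_approval"]:
        return "await_approval"
    return "execute"
```

Checking the execution limit before the approval flag keeps approvers from being asked about a run that would be refused anyway.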

### MaintenanceWindow
- **id**: UUID primary key
- **name**: Name of the maintenance window
- **description**: Description
- **start_time**: Start datetime
- **end_time**: End datetime
- **timezone**: Timezone
- **affected_services**: JSON array of affected services
- **affected_components**: JSON array of affected components
- **suppress_incident_creation**: Whether to suppress incident creation
- **suppress_notifications**: Whether to suppress notifications
- **suppress_escalations**: Whether to suppress escalations
- **status**: Current status
- **incidents_suppressed**: Count of suppressed incidents
- **notifications_suppressed**: Count of suppressed notifications
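
To make the suppression fields concrete, here is a hedged sketch of how a window might be evaluated. It assumes `start_time` and `end_time` are ISO 8601 strings with a UTC offset (like the timestamps elsewhere in this document), that the window is half-open (`start <= now < end`), and that `affected_services` holds service names. The separate `timezone` field is ignored here, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse an ISO 8601 timestamp such as '2024-01-15T10:00:00Z'."""
    # Replace the trailing 'Z' so older Python versions can parse it too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_window_active(window, now):
    """True while now falls inside the window; now must be timezone-aware."""
    return parse_utc(window["start_time"]) <= now < parse_utc(window["end_time"])


def should_create_incident(window, service, now):
    """False when an active window covers the service and suppresses incidents."""
    covered = service in window["affected_services"]
    suppressed = (
        window["suppress_incident_creation"]
        and covered
        and is_window_active(window, now)
    )
    return not suppressed
```

A call such as `is_window_active(window, datetime.now(timezone.utc))` checks the current moment. The same shape of check would apply to `suppress_notifications` and `suppress_escalations`.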

### WorkflowTemplate
- **id**: UUID primary key
- **name**: Unique name
- **description**: Description
- **template_type**: Type of workflow template
- **workflow_steps**: JSON array of workflow steps
- **input_parameters**: JSON array of input parameters
- **output_schema**: JSON output schema
- **usage_count**: Number of times used
- **is_public**: Public/private visibility

## Security Features

### Access Control
- **User Permissions**: Role-based access control
- **Data Classification**: Integration with security module
- **Audit Logging**: Comprehensive audit trails
- **API Authentication**: Token and session authentication

### ChatOps Security
- **User Whitelisting**: Restrict commands to specific users
- **Channel Restrictions**: Limit commands to specific channels
- **Command Validation**: Validate command parameters
- **Execution Logging**: Log all command executions

### Auto-Remediation Safety
- **Approval Workflows**: Require manual approval for sensitive actions
- **Execution Limits**: Limit executions per incident
- **Timeout Controls**: Prevent runaway executions
- **Rollback Capabilities**: Support for rollback operations

## Integration with Other Modules

### Incident Intelligence Integration
- **Automatic Triggering**: Trigger runbooks based on incident characteristics
- **AI Suggestions**: AI-driven runbook recommendations
- **Correlation**: Link automation actions to incident patterns
- **Maintenance Suppression**: Suppress incidents during maintenance windows

### Security Module Integration
- **Access Control**: Use security module for authentication and authorization
- **Data Classification**: Apply data classification to automation data
- **Audit Integration**: Integrate with security audit trails
- **MFA Support**: Support multi-factor authentication for sensitive operations

## Best Practices

### Runbook Design
1. **Clear Steps**: Define clear, atomic steps
2. **Error Handling**: Include error handling and rollback procedures
3. **Timeout Management**: Set appropriate timeouts for each step
4. **Documentation**: Provide clear documentation for each step
5. **Testing**: Test runbooks in non-production environments

### Integration Management
1. **Health Monitoring**: Regularly monitor integration health
2. **Credential Management**: Securely store and rotate credentials
3. **Rate Limiting**: Implement appropriate rate limiting
4. **Error Handling**: Handle integration failures gracefully
5. **Monitoring**: Monitor integration usage and performance

### Auto-Remediation
1. **Conservative Approach**: Start with low-risk remediations
2. **Approval Workflows**: Use approval workflows for high-risk actions
3. **Monitoring**: Monitor remediation success rates
4. **Documentation**: Document all remediation actions
5. **Testing**: Test remediations in controlled environments

### Maintenance Windows
1. **Communication**: Communicate maintenance windows to stakeholders
2. **Scope Definition**: Clearly define affected services and components
3. **Rollback Plans**: Have rollback plans for maintenance activities
4. **Monitoring**: Monitor system health during maintenance
5. **Documentation**: Document maintenance activities and outcomes

## Error Handling

### Common Error Scenarios
1. **Integration Failures**: Handle external system unavailability
2. **Authentication Errors**: Handle credential expiration
3. **Timeout Errors**: Handle execution timeouts
4. **Permission Errors**: Handle insufficient permissions
5. **Data Validation Errors**: Handle invalid input data

### Error Response Format
```json
{
  "error": "Error message",
  "code": "ERROR_CODE",
  "details": {
    "field": "specific field error"
  },
  "timestamp": "2024-01-15T10:00:00Z"
}
```
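
A client can turn the error body above into a typed exception so callers branch on `code` instead of parsing messages. This is a minimal sketch: `AutomationAPIError` and `parse_error` are hypothetical names, and the fallback values used for missing keys are assumptions.

```python
import json


class AutomationAPIError(Exception):
    """Carries the fields of the documented error response."""

    def __init__(self, message, code, details, timestamp):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details
        self.timestamp = timestamp


def parse_error(body):
    """Build an AutomationAPIError from a JSON error response body."""
    data = json.loads(body)
    return AutomationAPIError(
        data.get("error", "unknown error"),  # assumed fallback
        data.get("code", "UNKNOWN"),         # assumed fallback
        data.get("details", {}),
        data.get("timestamp"),
    )
```

The function returns the exception instead of raising it, which leaves the decision to raise, log, or retry with the caller.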

## Rate Limiting

### Default Limits
- **API Requests**: 1000 requests per hour per user
- **Runbook Executions**: 10 executions per hour per user
- **Integration Calls**: 100 calls per hour per integration
- **ChatOps Commands**: 50 commands per hour per user
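
This document gives the hourly quotas but not the algorithm that enforces them. As an illustration only, the sketch below applies the default limits with a fixed one-hour window per (kind, subject) pair. The fixed-window choice, the key names in `DEFAULT_LIMITS`, and the `FixedWindowLimiter` class are all assumptions; a real deployment might use a sliding window or token bucket instead.

```python
import time

# Quotas copied from the default limits above; the key names are hypothetical.
DEFAULT_LIMITS = {
    "api_requests": 1000,
    "runbook_executions": 10,
    "integration_calls": 100,
    "chatops_commands": 50,
}
WINDOW_SECONDS = 3600


class FixedWindowLimiter:
    """Counts events per (kind, subject) within the current one-hour window."""

    def __init__(self, limits=None, clock=time.time):
        self.limits = dict(DEFAULT_LIMITS if limits is None else limits)
        self.clock = clock
        self.counters = {}  # (kind, subject) -> (window index, count)

    def allow(self, kind, subject):
        """Record one event and return True, or return False if over quota."""
        window = int(self.clock() // WINDOW_SECONDS)
        seen_window, count = self.counters.get((kind, subject), (window, 0))
        if seen_window != window:
            count = 0  # a new hour has started, so the counter resets
        if count >= self.limits[kind]:
            return False
        self.counters[(kind, subject)] = (window, count + 1)
        return True
```

The injectable `clock` makes the window boundary testable without waiting an hour. A fixed window permits short bursts across a boundary, which may or may not be acceptable for runbook executions.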

### Custom Limits
- Configure custom rate limits per user role
- Set different limits for different integration types
- Implement burst allowances for emergency situations

## Monitoring and Alerting

### Key Metrics
- **Runbook Success Rate**: Track runbook execution success
- **Integration Health**: Monitor integration availability
- **Auto-Remediation Effectiveness**: Track remediation success
- **ChatOps Usage**: Monitor ChatOps command usage
- **Maintenance Window Impact**: Track maintenance window effectiveness

### Alerting
- **Integration Failures**: Alert on integration health issues
- **Runbook Failures**: Alert on runbook execution failures
- **Auto-Remediation Issues**: Alert on remediation failures
- **Rate Limit Exceeded**: Alert on rate limit violations
- **Security Issues**: Alert on security-related events

## Troubleshooting

### Common Issues
1. **Runbook Execution Failures**: Check step configurations and permissions
2. **Integration Connection Issues**: Verify credentials and network connectivity
3. **ChatOps Command Failures**: Check user permissions and command syntax
4. **Auto-Remediation Not Triggering**: Verify trigger conditions and permissions
5. **Maintenance Window Not Working**: Check timezone and schedule configuration

### Debug Information
- Enable debug logging for detailed execution information
- Use execution logs to trace runbook and workflow execution
- Check integration health status and error messages
- Review audit logs for security and access issues
- Monitor system metrics for performance issues

## Future Enhancements

### Planned Features
1. **Visual Workflow Builder**: Drag-and-drop workflow creation
2. **Advanced AI Integration**: Enhanced AI-driven automation suggestions
3. **Multi-Cloud Support**: Support for multiple cloud providers
4. **Advanced Analytics**: Enhanced reporting and analytics capabilities
5. **Mobile Support**: Mobile app for automation management

### Integration Roadmap
1. **Additional ITSM Tools**: ServiceNow, Remedy, etc.
2. **Cloud Platforms**: AWS, Azure, GCP integrations
3. **Monitoring Tools**: Prometheus, Grafana, DataDog
4. **Communication Platforms**: Additional chat platforms
5. **Development Tools**: GitLab, Bitbucket, CircleCI