ETB/ETB-API/automation_orchestration/Documentations/AUTOMATION_ORCHESTRATION_API.md
Iliyan Angelov 6b247e5b9f Updates
2025-09-19 11:58:53 +03:00

Automation & Orchestration API Documentation

Overview

The Automation & Orchestration module automates incident management. It covers runbooks, integrations with external systems, ChatOps commands, auto-remediation, maintenance windows, and reusable workflow templates.

Features

1. Runbooks Automation

  • Predefined Response Steps: Create and manage automated response procedures
  • Multiple Trigger Types: Manual, automatic, scheduled, webhook, and ChatOps triggers
  • Execution Tracking: Monitor runbook execution status and performance
  • Version Control: Track runbook versions and changes

2. External System Integrations

  • ITSM Tools: Jira, ServiceNow
  • CI/CD Tools: GitHub, Jenkins, Ansible, Terraform
  • Chat Platforms: Slack, Microsoft Teams, Discord, Mattermost
  • Generic APIs: Webhook and API integrations
  • Health Monitoring: Integration health checks and status tracking

3. ChatOps Integration

  • Command Execution: Trigger workflows from chat platforms
  • Security Controls: User and channel-based access control
  • Command History: Track and audit ChatOps commands
  • Multi-Platform Support: Slack, Teams, Discord, Mattermost

4. Auto-Remediation

  • Automatic Response: Trigger remediation actions based on incident conditions
  • Safety Controls: Approval workflows and execution limits
  • Multiple Remediation Types: Service restart, deployment rollback, scaling, etc.
  • Execution Tracking: Monitor remediation success rates and performance

5. Maintenance Windows

  • Scheduled Suppression: Suppress alerts during planned maintenance
  • Service-Specific: Target specific services and components
  • Flexible Configuration: Control incident creation, notifications, and escalations
  • Status Management: Automatic status updates based on schedule

6. Workflow Templates

  • Reusable Workflows: Create templates for common automation scenarios
  • Parameterized Execution: Support for input parameters and output schemas
  • Template Types: Incident response, deployment, maintenance, scaling, monitoring
  • Usage Tracking: Monitor template usage and performance

API Endpoints

Runbooks

List Runbooks

GET /api/automation/runbooks/

Query Parameters:

  • status: Filter by status (DRAFT, ACTIVE, INACTIVE, DEPRECATED)
  • trigger_type: Filter by trigger type (MANUAL, AUTOMATIC, SCHEDULED, WEBHOOK, CHATOPS)
  • category: Filter by category
  • is_public: Filter by public/private status
  • search: Search in name, description, category

Response:

{
  "count": 10,
  "next": null,
  "previous": null,
  "results": [
    {
      "id": "uuid",
      "name": "Database Service Restart",
      "description": "Automated runbook for restarting database services",
      "version": "1.0",
      "trigger_type": "AUTOMATIC",
      "trigger_conditions": {
        "severity": ["CRITICAL", "EMERGENCY"],
        "category": "database"
      },
      "steps": [...],
      "estimated_duration": "00:05:00",
      "category": "database",
      "tags": ["database", "restart", "automation"],
      "status": "ACTIVE",
      "is_public": true,
      "execution_count": 5,
      "success_rate": 0.8,
      "can_trigger": true,
      "created_at": "2024-01-15T10:00:00Z",
      "updated_at": "2024-01-15T10:00:00Z"
    }
  ]
}
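
The filters above combine into a single query string. The following is a minimal sketch of building such a URL. The host `https://etb.example.com` and the helper name `build_runbook_list_url` are hypothetical, and sending booleans as lowercase strings is an assumption; only the path and the parameter names come from this document.

```python
from urllib.parse import urlencode

BASE_URL = "https://etb.example.com"  # assumption: replace with your own host
RUNBOOKS_PATH = "/api/automation/runbooks/"
ALLOWED_FILTERS = {"status", "trigger_type", "category", "is_public", "search"}


def build_runbook_list_url(**filters):
    """Return the list-runbooks URL with only the filters that were set."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    # Assumption: booleans are encoded as "true"/"false".
    params = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in sorted(filters.items())
        if value is not None
    }
    query = urlencode(params)
    return f"{BASE_URL}{RUNBOOKS_PATH}?{query}" if query else f"{BASE_URL}{RUNBOOKS_PATH}"


url = build_runbook_list_url(status="ACTIVE", trigger_type="AUTOMATIC", is_public=True)
print(url)
```

Sorting the filters keeps the generated URL stable, which helps when URLs are logged or cached.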

Create Runbook

POST /api/automation/runbooks/

Request Body:

{
  "name": "New Runbook",
  "description": "Description of the runbook",
  "version": "1.0",
  "trigger_type": "MANUAL",
  "trigger_conditions": {
    "severity": ["HIGH", "CRITICAL"]
  },
  "steps": [
    {
      "name": "Step 1",
      "action": "check_status",
      "timeout": 30,
      "parameters": {"service": "web"}
    }
  ],
  "estimated_duration": "00:05:00",
  "category": "web",
  "tags": ["web", "restart"],
  "status": "DRAFT",
  "is_public": true
}
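
A client may want to check a payload before sending it. The sketch below is one way to do that; it is not the server's validation logic. The status and trigger-type values are taken from the list filters above, while the helper name `check_runbook_payload` and the rules about required fields and positive timeouts are assumptions.

```python
VALID_STATUSES = {"DRAFT", "ACTIVE", "INACTIVE", "DEPRECATED"}
VALID_TRIGGER_TYPES = {"MANUAL", "AUTOMATIC", "SCHEDULED", "WEBHOOK", "CHATOPS"}


def check_runbook_payload(payload):
    """Return a list of problems found in a create-runbook payload."""
    problems = []
    if not payload.get("name"):
        problems.append("name is required")  # assumption: name is mandatory
    if payload.get("status") not in VALID_STATUSES:
        problems.append("status must be one of " + ", ".join(sorted(VALID_STATUSES)))
    if payload.get("trigger_type") not in VALID_TRIGGER_TYPES:
        problems.append(
            "trigger_type must be one of " + ", ".join(sorted(VALID_TRIGGER_TYPES))
        )
    for index, step in enumerate(payload.get("steps", [])):
        if not step.get("action"):
            problems.append(f"steps[{index}] has no action")
        timeout = step.get("timeout")
        # Assumption: a timeout, when present, must be a positive number.
        if timeout is not None and timeout <= 0:
            problems.append(f"steps[{index}] timeout must be positive")
    return problems


payload = {
    "name": "New Runbook",
    "trigger_type": "MANUAL",
    "status": "DRAFT",
    "steps": [{"name": "Step 1", "action": "check_status", "timeout": 30}],
}
print(check_runbook_payload(payload))
```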

Execute Runbook

POST /api/automation/runbooks/{id}/execute/

Request Body:

{
  "trigger_data": {
    "incident_id": "uuid",
    "context": "additional context"
  }
}
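
As an illustration, the request can be assembled with the Python standard library. The host, the helper name `build_execute_request`, and the `Authorization: Token ...` header scheme are assumptions; this document only states that token and session authentication are supported. The sketch builds the request without sending it.

```python
import json
from urllib.request import Request


def build_execute_request(base_url, runbook_id, token, incident_id, context=""):
    """Build (but do not send) a POST request that executes a runbook."""
    url = f"{base_url}/api/automation/runbooks/{runbook_id}/execute/"
    body = {"trigger_data": {"incident_id": incident_id, "context": context}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the exact header scheme is not given in this document.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


request = build_execute_request(
    "https://etb.example.com", "abc", "secret", "inc-1", "manual run"
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; that call is left out so the sketch runs without network access.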

Integrations

List Integrations

GET /api/automation/integrations/

Query Parameters:

  • integration_type: Filter by type (JIRA, GITHUB, JENKINS, etc.)
  • status: Filter by status (ACTIVE, INACTIVE, ERROR, CONFIGURING)
  • health_status: Filter by health status (HEALTHY, WARNING, ERROR, UNKNOWN)

Test Integration Connection

POST /api/automation/integrations/{id}/test_connection/

Perform Health Check

POST /api/automation/integrations/{id}/health_check/
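
One possible use of these endpoints is to re-check every integration that is not currently healthy. The sketch below assumes each item in the list response carries the `id` and `health_status` fields described under Data Models; the response shape is not shown in this document, and `health_check_paths` is a hypothetical helper name.

```python
def health_check_paths(integrations):
    """Return health-check paths for integrations that are not HEALTHY."""
    return [
        f"/api/automation/integrations/{item['id']}/health_check/"
        for item in integrations
        if item.get("health_status") != "HEALTHY"
    ]


integrations = [
    {"id": "a1", "health_status": "HEALTHY"},
    {"id": "b2", "health_status": "ERROR"},
    {"id": "c3", "health_status": "UNKNOWN"},
]
for path in health_check_paths(integrations):
    print("POST", path)
```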

ChatOps

List ChatOps Integrations

GET /api/automation/chatops-integrations/

List ChatOps Commands

GET /api/automation/chatops-commands/

Query Parameters:

  • status: Filter by execution status
  • chatops_integration: Filter by integration
  • command: Filter by command name
  • user_id: Filter by user ID
  • channel_id: Filter by channel ID

Auto-Remediation

List Auto-Remediations

GET /api/automation/auto-remediations/

Query Parameters:

  • remediation_type: Filter by type (SERVICE_RESTART, DEPLOYMENT_ROLLBACK, etc.)
  • trigger_condition_type: Filter by trigger condition type
  • is_active: Filter by active status
  • requires_approval: Filter by approval requirement

Approve Auto-Remediation Execution

POST /api/automation/auto-remediation-executions/{id}/approve/

Request Body:

{
  "approval_notes": "Approved for execution"
}

Reject Auto-Remediation Execution

POST /api/automation/auto-remediation-executions/{id}/reject/

Request Body:

{
  "rejection_notes": "Rejected due to risk concerns"
}
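
The approve and reject calls differ only in the final path segment and the name of the notes field. A small sketch that captures this, with `approval_request` as a hypothetical helper name:

```python
def approval_request(execution_id, approve, notes):
    """Return the (path, body) pair for approving or rejecting an execution."""
    action = "approve" if approve else "reject"
    notes_field = "approval_notes" if approve else "rejection_notes"
    path = f"/api/automation/auto-remediation-executions/{execution_id}/{action}/"
    return path, {notes_field: notes}


path, body = approval_request("exec-42", approve=False, notes="Rejected due to risk concerns")
print(path, body)
```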

Maintenance Windows

List Maintenance Windows

GET /api/automation/maintenance-windows/

Get Active Maintenance Windows

GET /api/automation/maintenance-windows/active/

Get Upcoming Maintenance Windows

GET /api/automation/maintenance-windows/upcoming/

Workflow Templates

List Workflow Templates

GET /api/automation/workflow-templates/

Query Parameters:

  • template_type: Filter by type (INCIDENT_RESPONSE, DEPLOYMENT, etc.)
  • is_public: Filter by public/private status

Data Models

Runbook

  • id: UUID primary key
  • name: Unique name for the runbook
  • description: Detailed description
  • version: Version string
  • trigger_type: How the runbook is triggered
  • trigger_conditions: JSON conditions for triggering
  • steps: JSON array of execution steps
  • estimated_duration: Expected execution time (formatted as HH:MM:SS in the examples, such as "00:05:00")
  • category: Categorization
  • tags: JSON array of tags
  • status: Current status
  • is_public: Public/private visibility
  • execution_count: Number of executions
  • success_rate: Success rate (0.0-1.0)
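
The `trigger_conditions` field in the examples maps incident fields to either a single value or a list of accepted values. How the server evaluates these conditions is not specified here; the sketch below assumes that every condition must hold, that a list means "any of these values", and that a scalar means equality. `matches_trigger` is a hypothetical helper name.

```python
def matches_trigger(conditions, incident):
    """Return True if the incident satisfies every trigger condition."""
    for field_name, expected in conditions.items():
        actual = incident.get(field_name)
        if isinstance(expected, list):
            # Assumption: a list matches when the incident value is one of its items.
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


conditions = {"severity": ["CRITICAL", "EMERGENCY"], "category": "database"}
print(matches_trigger(conditions, {"severity": "CRITICAL", "category": "database"}))
print(matches_trigger(conditions, {"severity": "HIGH", "category": "database"}))
```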

Integration

  • id: UUID primary key
  • name: Unique name for the integration
  • integration_type: Type of integration (JIRA, GITHUB, etc.)
  • description: Description
  • configuration: JSON configuration data
  • authentication_config: JSON authentication data
  • status: Integration status
  • health_status: Health status
  • request_count: Number of requests made
  • last_used_at: Last usage timestamp

ChatOpsIntegration

  • id: UUID primary key
  • name: Unique name
  • platform: Chat platform (SLACK, TEAMS, etc.)
  • webhook_url: Webhook URL
  • bot_token: Bot authentication token
  • channel_id: Default channel ID
  • command_prefix: Command prefix character
  • available_commands: JSON array of available commands
  • allowed_users: JSON array of allowed user IDs
  • allowed_channels: JSON array of allowed channel IDs
  • is_active: Active status
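
To illustrate how `command_prefix`, `allowed_users`, and `allowed_channels` could work together, here is a sketch of parsing and authorizing an incoming chat message. Treating an empty allow list as "no restriction" is an assumption, as are the helper names `parse_command` and `is_authorized`.

```python
def parse_command(text, prefix):
    """Split a chat message into (command, arguments), or return None."""
    if not text.startswith(prefix):
        return None
    parts = text[len(prefix):].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def is_authorized(integration, user_id, channel_id):
    """Check the user and channel against the integration's allow lists."""
    if not integration["is_active"]:
        return False
    allowed_users = integration["allowed_users"]
    allowed_channels = integration["allowed_channels"]
    # Assumption: an empty list means the dimension is unrestricted.
    user_ok = not allowed_users or user_id in allowed_users
    channel_ok = not allowed_channels or channel_id in allowed_channels
    return user_ok and channel_ok


integration = {
    "is_active": True,
    "command_prefix": "!",
    "allowed_users": ["U1", "U2"],
    "allowed_channels": [],
}
print(parse_command("!restart web --force", integration["command_prefix"]))
print(is_authorized(integration, "U3", "C9"))
```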

AutoRemediation

  • id: UUID primary key
  • name: Unique name
  • description: Description
  • remediation_type: Type of remediation action
  • trigger_conditions: JSON trigger conditions
  • trigger_condition_type: Type of trigger condition
  • remediation_config: JSON remediation configuration
  • timeout_seconds: Execution timeout
  • requires_approval: Whether approval is required
  • approval_users: Users who may approve executions (many-to-many relationship with users)
  • max_executions_per_incident: Maximum executions per incident
  • is_active: Active status
  • execution_count: Number of executions
  • success_count: Number of successful executions
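
The safety fields above suggest a simple gate before each execution. The following sketch shows one way `is_active`, `max_executions_per_incident`, and `requires_approval` could be combined, and how a success rate can be derived from the two counters. The decision labels (`skip`, `await_approval`, `execute`) and the function names are hypothetical.

```python
def next_action(remediation, executions_for_incident):
    """Decide what to do with a remediation that has just been triggered."""
    if not remediation["is_active"]:
        return "skip"
    if executions_for_incident >= remediation["max_executions_per_incident"]:
        return "skip"  # execution limit for this incident already reached
    if remediation["requires_approval"]:
        return "await_approval"
    return "execute"


def success_rate(remediation):
    """Return success_count / execution_count, or 0.0 before any execution."""
    count = remediation["execution_count"]
    return remediation["success_count"] / count if count else 0.0


remediation = {
    "is_active": True,
    "requires_approval": True,
    "max_executions_per_incident": 3,
    "execution_count": 5,
    "success_count": 4,
}
print(next_action(remediation, executions_for_incident=1))
print(success_rate(remediation))
```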

MaintenanceWindow

  • id: UUID primary key
  • name: Name of the maintenance window
  • description: Description
  • start_time: Start datetime
  • end_time: End datetime
  • timezone: Timezone
  • affected_services: JSON array of affected services
  • affected_components: JSON array of affected components
  • suppress_incident_creation: Whether to suppress incident creation
  • suppress_notifications: Whether to suppress notifications
  • suppress_escalations: Whether to suppress escalations
  • status: Current status
  • incidents_suppressed: Count of suppressed incidents
  • notifications_suppressed: Count of suppressed notifications
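
A sketch of how a window's schedule and suppression flags could be evaluated for an incoming incident. It assumes `start_time` and `end_time` have already been parsed into timezone-aware `datetime` objects and that the end time is exclusive; the phase labels and function names are hypothetical and are not the API's `status` values.

```python
from datetime import datetime, timezone


def window_phase(start_time, end_time, now):
    """Classify a window relative to `now` as upcoming, active, or past."""
    if now < start_time:
        return "upcoming"
    if now < end_time:  # assumption: end_time is exclusive
        return "active"
    return "past"


def should_suppress_incident(window, service, now):
    """Return True if incident creation for `service` should be suppressed."""
    return (
        window["suppress_incident_creation"]
        and service in window["affected_services"]
        and window_phase(window["start_time"], window["end_time"], now) == "active"
    )


window = {
    "start_time": datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
    "end_time": datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
    "affected_services": ["database"],
    "suppress_incident_creation": True,
}
now = datetime(2024, 1, 15, 11, 0, tzinfo=timezone.utc)
print(should_suppress_incident(window, "database", now))
```

Comparing aware datetimes avoids mistakes when the window's `timezone` differs from the server's.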

WorkflowTemplate

  • id: UUID primary key
  • name: Unique name
  • description: Description
  • template_type: Type of workflow template
  • workflow_steps: JSON array of workflow steps
  • input_parameters: JSON array of input parameters
  • output_schema: JSON output schema
  • usage_count: Number of times used
  • is_public: Public/private visibility
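
The structure of `input_parameters` is not specified in this document. Purely as an illustration, the sketch below assumes each entry is an object with a `name`, an optional `required` flag, and an optional `default`, and shows how supplied values could be resolved against such a list. All of these keys and the helper name `resolve_inputs` are hypothetical.

```python
def resolve_inputs(input_parameters, supplied):
    """Merge supplied values with defaults; raise if a required one is missing."""
    resolved = {}
    missing = []
    for spec in input_parameters:
        name = spec["name"]
        if name in supplied:
            resolved[name] = supplied[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        elif spec.get("required", False):
            missing.append(name)
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return resolved


# Hypothetical parameter list for a scaling template.
input_parameters = [
    {"name": "service", "required": True},
    {"name": "replicas", "default": 2},
    {"name": "note"},
]
print(resolve_inputs(input_parameters, {"service": "web"}))
```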

Security Features

Access Control

  • User Permissions: Role-based access control
  • Data Classification: Integration with security module
  • Audit Logging: Comprehensive audit trails
  • API Authentication: Token and session authentication

ChatOps Security

  • User Allowlisting: Restrict commands to specific users (see allowed_users)
  • Channel Restrictions: Limit commands to specific channels
  • Command Validation: Validate command parameters
  • Execution Logging: Log all command executions

Auto-Remediation Safety

  • Approval Workflows: Require manual approval for sensitive actions
  • Execution Limits: Limit executions per incident
  • Timeout Controls: Prevent runaway executions
  • Rollback Capabilities: Support for rollback operations

Integration with Other Modules

Incident Intelligence Integration

  • Automatic Triggering: Trigger runbooks based on incident characteristics
  • AI Suggestions: AI-driven runbook recommendations
  • Correlation: Link automation actions to incident patterns
  • Maintenance Suppression: Suppress incidents during maintenance windows

Security Module Integration

  • Access Control: Use security module for authentication and authorization
  • Data Classification: Apply data classification to automation data
  • Audit Integration: Integrate with security audit trails
  • MFA Support: Support multi-factor authentication for sensitive operations

Best Practices

Runbook Design

  1. Clear Steps: Define clear, atomic steps
  2. Error Handling: Include error handling and rollback procedures
  3. Timeout Management: Set appropriate timeouts for each step
  4. Documentation: Provide clear documentation for each step
  5. Testing: Test runbooks in non-production environments

Integration Management

  1. Health Monitoring: Regularly monitor integration health
  2. Credential Management: Securely store and rotate credentials
  3. Rate Limiting: Implement appropriate rate limiting
  4. Error Handling: Handle integration failures gracefully
  5. Monitoring: Monitor integration usage and performance

Auto-Remediation

  1. Conservative Approach: Start with low-risk remediations
  2. Approval Workflows: Use approval workflows for high-risk actions
  3. Monitoring: Monitor remediation success rates
  4. Documentation: Document all remediation actions
  5. Testing: Test remediations in controlled environments

Maintenance Windows

  1. Communication: Communicate maintenance windows to stakeholders
  2. Scope Definition: Clearly define affected services and components
  3. Rollback Plans: Have rollback plans for maintenance activities
  4. Monitoring: Monitor system health during maintenance
  5. Documentation: Document maintenance activities and outcomes

Error Handling

Common Error Scenarios

  1. Integration Failures: Handle external system unavailability
  2. Authentication Errors: Handle credential expiration
  3. Timeout Errors: Handle execution timeouts
  4. Permission Errors: Handle insufficient permissions
  5. Data Validation Errors: Handle invalid input data

Error Response Format

{
  "error": "Error message",
  "code": "ERROR_CODE",
  "details": {
    "field": "specific field error"
  },
  "timestamp": "2024-01-15T10:00:00Z"
}
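
A client can map this format onto a small structure so that callers can branch on `code` and show `details` per field. This is a minimal sketch; the class and function names are hypothetical, and treating every field as optional is a defensive assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ApiError:
    message: str
    code: str
    details: dict = field(default_factory=dict)
    timestamp: str = ""


def parse_error(body):
    """Parse an error response body (a JSON string) into an ApiError."""
    data = json.loads(body)
    return ApiError(
        message=data.get("error", ""),
        code=data.get("code", ""),
        details=data.get("details") or {},
        timestamp=data.get("timestamp", ""),
    )


raw = (
    '{"error": "Error message", "code": "ERROR_CODE", '
    '"details": {"field": "specific field error"}, '
    '"timestamp": "2024-01-15T10:00:00Z"}'
)
error = parse_error(raw)
print(error.code, error.details)
```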

Rate Limiting

Default Limits

  • API Requests: 1000 requests per hour per user
  • Runbook Executions: 10 executions per hour per user
  • Integration Calls: 100 calls per hour per integration
  • ChatOps Commands: 50 commands per hour per user
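
This document does not say how the limits are enforced. As an illustration only, a per-key sliding-window counter is one way to express "N events per hour"; the class below is a sketch under that assumption and is not taken from the module's code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within any `window_seconds` span."""

    def __init__(self, limit, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of allowed events

    def allow(self, key, now):
        """Record and allow the event at time `now` (seconds), or refuse it."""
        events = self._events[key]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


# Runbook executions: 10 per hour per user.
limiter = SlidingWindowLimiter(limit=10)
results = [limiter.allow("user-1", now=second) for second in range(11)]
print(results.count(True), results[-1])
```

Passing `now` explicitly keeps the limiter deterministic in tests; a caller would typically supply `time.monotonic()`.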

Custom Limits

  • Configure custom rate limits per user role
  • Set different limits for different integration types
  • Implement burst allowances for emergency situations

Monitoring and Alerting

Key Metrics

  • Runbook Success Rate: Track runbook execution success
  • Integration Health: Monitor integration availability
  • Auto-Remediation Effectiveness: Track remediation success
  • ChatOps Usage: Monitor ChatOps command usage
  • Maintenance Window Impact: Track maintenance window effectiveness

Alerting

  • Integration Failures: Alert on integration health issues
  • Runbook Failures: Alert on runbook execution failures
  • Auto-Remediation Issues: Alert on remediation failures
  • Rate Limit Exceeded: Alert on rate limit violations
  • Security Issues: Alert on security-related events

Troubleshooting

Common Issues

  1. Runbook Execution Failures: Check step configurations and permissions
  2. Integration Connection Issues: Verify credentials and network connectivity
  3. ChatOps Command Failures: Check user permissions and command syntax
  4. Auto-Remediation Not Triggering: Verify trigger conditions and permissions
  5. Maintenance Window Not Working: Check timezone and schedule configuration

Debug Information

  • Enable debug logging for detailed execution information
  • Use execution logs to trace runbook and workflow execution
  • Check integration health status and error messages
  • Review audit logs for security and access issues
  • Monitor system metrics for performance issues

Future Enhancements

Planned Features

  1. Visual Workflow Builder: Drag-and-drop workflow creation
  2. Advanced AI Integration: Enhanced AI-driven automation suggestions
  3. Multi-Cloud Support: Support for multiple cloud providers
  4. Advanced Analytics: Enhanced reporting and analytics capabilities
  5. Mobile Support: Mobile app for automation management

Integration Roadmap

  1. Additional ITSM Tools: ServiceNow, Remedy, etc.
  2. Cloud Platforms: AWS, Azure, GCP integrations
  3. Monitoring Tools: Prometheus, Grafana, DataDog
  4. Communication Platforms: Additional chat platforms
  5. Development Tools: GitLab, Bitbucket, CircleCI