ETB/ETB-API/incident_intelligence/Documentations/INCIDENT_INTELLIGENCE_API.md
Iliyan Angelov 6b247e5b9f Updates
2025-09-19 11:58:53 +03:00


# Incident Intelligence API Documentation
## Overview
The Incident Intelligence module provides AI-driven capabilities for incident management, including:
- **AI-driven incident classification** using NLP to categorize incidents from free text
- **Automated severity suggestion** based on impact analysis
- **Correlation engine** for linking related incidents and problem detection
- **Duplication detection** for merging incidents that describe the same outage
## Features
### 1. AI-Driven Incident Classification
Automatically classifies incidents into categories and subcategories based on their content:
- **Categories**: Infrastructure, Application, Security, User Experience, Data, Integration
- **Subcategories**: Specific types within each category (e.g., API_ISSUE, DATABASE_ISSUE)
- **Confidence Scoring**: AI confidence level for each classification
- **Keyword Extraction**: Identifies relevant keywords from incident text
- **Sentiment Analysis**: Analyzes the sentiment of incident descriptions
- **Urgency Detection**: Identifies urgency indicators in the text
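The keyword-driven part of this classification can be sketched as follows. This is a minimal illustration, not the module's `IncidentClassifier`: the keyword lists, the `classify` function, and the confidence formula are all assumptions, and only four of the six categories are shown.

```python
import re

# Hypothetical keyword lists; the module's real vocabulary is not shown in this document.
CATEGORY_KEYWORDS = {
    "Infrastructure": {"server", "network", "disk", "cpu", "outage"},
    "Application": {"api", "bug", "crash", "exception", "timeout"},
    "Security": {"breach", "unauthorized", "malware", "phishing"},
    "Data": {"database", "corruption", "backup", "query"},
}


def classify(text):
    """Return (category, confidence, matched_keywords) for a free-text report."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = {cat: tokens & words for cat, words in CATEGORY_KEYWORDS.items()}
    total = sum(len(matched) for matched in hits.values())
    if total == 0:
        return None, 0.0, []
    best = max(hits, key=lambda cat: len(hits[cat]))
    # Assumed confidence: share of all keyword hits that fall in the winning category.
    return best, len(hits[best]) / total, sorted(hits[best])


category, confidence, keywords = classify("API crash with exception, database unreachable")
```

A real classifier would likely weight keywords and break ties deliberately; here a tie simply resolves to the first category in the dictionary.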
### 2. Automated Severity Suggestion
Suggests incident severity based on multiple factors:
- **User Impact Analysis**: Number of affected users and impact level
- **Business Impact Assessment**: Revenue and operational impact
- **Technical Impact Evaluation**: System and infrastructure impact
- **Text Analysis**: Severity indicators in incident descriptions
- **Confidence Scoring**: AI confidence in severity suggestions
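One way to combine the three impact scores into a single suggestion is a weighted sum followed by thresholding. The weights, thresholds, severity labels, and the `suggest_severity` name below are illustrative assumptions; the document does not state the values the module uses.

```python
# Assumed weights and cut-offs, chosen only for illustration.
WEIGHTS = {"user": 0.4, "business": 0.4, "technical": 0.2}
THRESHOLDS = [(0.75, "CRITICAL"), (0.5, "HIGH"), (0.25, "MEDIUM")]


def suggest_severity(user_impact, business_impact, technical_impact):
    """Return (severity_label, combined_score) from three scores in the 0.0-1.0 range."""
    scores = {
        "user": user_impact,
        "business": business_impact,
        "technical": technical_impact,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} impact must be within 0.0-1.0, got {value}")
    combined = sum(WEIGHTS[name] * value for name, value in scores.items())
    # Walk the thresholds from most to least severe and return the first match.
    for floor, label in THRESHOLDS:
        if combined >= floor:
            return label, combined
    return "LOW", combined


label, score = suggest_severity(0.9, 0.8, 0.5)
```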
### 3. Correlation Engine
Links related incidents and detects patterns:
- **Correlation Types**: Same Service, Same Component, Temporal, Pattern Match, Dependency, Cascade
- **Problem Detection**: Flags correlations that suggest a larger underlying problem
- **Time-based Analysis**: Considers temporal proximity of incidents
- **Service Similarity**: Analyzes shared services and components
- **Pattern Recognition**: Detects recurring issues and trends
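The time-based and service-based signals above could be combined along these lines. This is a sketch under stated assumptions: the 24-hour window, the equal weighting, and the dictionary keys `created_at` and `services` are hypothetical and not taken from the module.

```python
from datetime import datetime, timedelta


def temporal_proximity(a, b, window=timedelta(hours=24)):
    """1.0 for simultaneous incidents, falling linearly to 0.0 at the window edge."""
    gap = abs(a - b)
    return max(0.0, 1.0 - gap / window)


def service_overlap(services_a, services_b):
    """Jaccard overlap between the two sets of affected services."""
    a, b = set(services_a), set(services_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def correlation_score(inc_a, inc_b):
    """Average of temporal proximity and service overlap (equal weights are an assumption)."""
    temporal = temporal_proximity(inc_a["created_at"], inc_b["created_at"])
    services = service_overlap(inc_a["services"], inc_b["services"])
    return 0.5 * temporal + 0.5 * services


first = {"created_at": datetime(2025, 9, 19, 12, 0), "services": ["auth", "db"]}
second = {"created_at": datetime(2025, 9, 19, 18, 0), "services": ["db"]}
score = correlation_score(first, second)
```

With a six-hour gap and one of two services shared, the two components are 0.75 and 0.5.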
### 4. Duplication Detection
Identifies and manages duplicate incidents:
- **Duplication Types**: Exact, Near Duplicate, Similar, Potential Duplicate
- **Similarity Analysis**: Text, temporal, and service similarity scoring
- **Merge Recommendations**: Suggests actions (Merge, Link, Review, No Action)
- **Confidence Scoring**: AI confidence in duplication detection
- **Shared Elements**: Identifies common elements between incidents
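To make the link between a similarity score, a duplication type, and a recommended action concrete, here is a small sketch built on the standard library's `difflib.SequenceMatcher`. The thresholds and the type-to-action mapping are assumptions for illustration; the module's own algorithms and cut-offs are not described in this document.

```python
from difflib import SequenceMatcher

# Assumed cut-offs, ordered from strictest to loosest.
RULES = [
    (0.95, "Exact", "Merge"),
    (0.80, "Near Duplicate", "Merge"),
    (0.60, "Similar", "Link"),
    (0.40, "Potential Duplicate", "Review"),
]


def text_similarity(a, b):
    """Case-insensitive similarity ratio between two texts, in the range 0.0-1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def recommend_action(similarity):
    """Map a similarity score to (duplication_type, recommended_action)."""
    for floor, duplication_type, action in RULES:
        if similarity >= floor:
            return duplication_type, action
    return None, "No Action"


similarity = text_similarity("Database Down", "database down")
decision = recommend_action(similarity)
```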
## API Endpoints
### Incidents
#### Create Incident
```http
POST /api/incidents/incidents/
Content-Type: application/json

{
  "title": "Database Connection Timeout",
  "description": "Users are experiencing timeouts when trying to access the database",
  "free_text": "Database is down, can't connect, getting timeout errors",
  "affected_users": 150,
  "business_impact": "Critical business operations are affected",
  "reporter": 1
}
```
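From Python, the same request could be assembled with the standard library as sketched below. The base URL and the `Token` authorization scheme are assumptions (this document says endpoints require authentication but does not name the scheme), and the request is only built here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed development host


def build_create_incident_request(payload, token=None):
    """Build (but do not send) the POST request that creates an incident."""
    headers = {"Content-Type": "application/json"}
    if token:
        # The auth scheme is not specified in this document; "Token" is an assumption.
        headers["Authorization"] = f"Token {token}"
    return Request(
        f"{BASE_URL}/api/incidents/incidents/",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_create_incident_request(
    {
        "title": "Database Connection Timeout",
        "description": "Users are experiencing timeouts when trying to access the database",
        "free_text": "Database is down, can't connect, getting timeout errors",
        "affected_users": 150,
        "business_impact": "Critical business operations are affected",
        "reporter": 1,
    }
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running server.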
#### Get Incident Analysis
```http
GET /api/incidents/incidents/{id}/analysis/
```
Returns the AI analysis for the incident, including:
- Classification results
- Severity suggestions
- Correlations with other incidents
- Potential duplicates
- Associated patterns
#### Trigger AI Analysis
```http
POST /api/incidents/incidents/{id}/analyze/
```
Manually triggers AI analysis for a specific incident.
#### Get Incident Statistics
```http
GET /api/incidents/incidents/stats/
```
Returns statistics including:
- Total incidents by status and severity
- Average resolution time
- AI processing statistics
- Duplicate and correlation counts
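The aggregation behind such a response can be pictured with a plain-Python sketch. The input shape, the `created_at` and `resolved_at` keys, and the severity labels are assumptions for illustration; the actual endpoint presumably aggregates in the database.

```python
from collections import Counter
from datetime import datetime


def summarize_incidents(incidents):
    """Count incidents by status and severity and average the resolution time in hours."""
    hours = [
        (item["resolved_at"] - item["created_at"]).total_seconds() / 3600
        for item in incidents
        if item.get("resolved_at") is not None
    ]
    return {
        "by_status": dict(Counter(item["status"] for item in incidents)),
        "by_severity": dict(Counter(item["severity"] for item in incidents)),
        "avg_resolution_hours": sum(hours) / len(hours) if hours else None,
    }


sample = [
    {"status": "Resolved", "severity": "HIGH",
     "created_at": datetime(2025, 9, 1, 8, 0), "resolved_at": datetime(2025, 9, 1, 12, 0)},
    {"status": "Resolved", "severity": "LOW",
     "created_at": datetime(2025, 9, 2, 8, 0), "resolved_at": datetime(2025, 9, 2, 10, 0)},
    {"status": "Open", "severity": "HIGH",
     "created_at": datetime(2025, 9, 3, 8, 0), "resolved_at": None},
]
stats = summarize_incidents(sample)
```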
### Correlations
#### Get Correlations
```http
GET /api/incidents/correlations/
```
#### Get Problem Indicators
```http
GET /api/incidents/correlations/problem_indicators/
```
Returns only the correlations flagged as problem indicators (`is_problem_indicator`), that is, those suggesting a larger underlying problem.
### Duplications
#### Get Duplications
```http
GET /api/incidents/duplications/
```
#### Approve Merge
```http
POST /api/incidents/duplications/{id}/approve_merge/
```
#### Reject Merge
```http
POST /api/incidents/duplications/{id}/reject_merge/
```
### Patterns
#### Get Patterns
```http
GET /api/incidents/patterns/
```
#### Get Active Patterns
```http
GET /api/incidents/patterns/active_patterns/
```
#### Resolve Pattern
```http
POST /api/incidents/patterns/{id}/resolve_pattern/
```
## Data Models
### Incident
- **id**: UUID primary key
- **title**: Incident title
- **description**: Detailed description
- **free_text**: Original free text from user
- **category**: AI-classified category
- **subcategory**: AI-classified subcategory
- **severity**: Current severity level
- **suggested_severity**: AI-suggested severity
- **status**: Current status (Open, In Progress, Resolved, Closed)
- **assigned_to**: Assigned user
- **reporter**: User who reported the incident
- **affected_users**: Number of affected users
- **business_impact**: Business impact description
- **ai_processed**: Whether AI analysis has been completed
- **is_duplicate**: Whether this is a duplicate incident
### IncidentClassification
- **incident**: Related incident
- **predicted_category**: AI-predicted category
- **predicted_subcategory**: AI-predicted subcategory
- **confidence_score**: AI confidence (0.0-1.0)
- **alternative_categories**: Alternative predictions
- **extracted_keywords**: Keywords extracted from text
- **sentiment_score**: Sentiment analysis score (-1 to 1)
- **urgency_indicators**: Detected urgency indicators
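The documented value ranges can be enforced at construction time. The dataclass below is a plain-Python mirror of the fields listed above, written as an illustration; it is not the module's model class, and the `ClassificationResult` name and list field types are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    """Plain-Python mirror of the classification fields, with range checks."""
    predicted_category: str
    predicted_subcategory: str
    confidence_score: float
    sentiment_score: float = 0.0
    alternative_categories: list = field(default_factory=list)
    extracted_keywords: list = field(default_factory=list)
    urgency_indicators: list = field(default_factory=list)

    def __post_init__(self):
        # Ranges follow the field descriptions: confidence 0.0-1.0, sentiment -1 to 1.
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within 0.0-1.0")
        if not -1.0 <= self.sentiment_score <= 1.0:
            raise ValueError("sentiment_score must be within -1 to 1")


result = ClassificationResult(
    predicted_category="Application",
    predicted_subcategory="API_ISSUE",
    confidence_score=0.82,
    sentiment_score=-0.4,
    extracted_keywords=["api", "slow"],
)
```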
### SeveritySuggestion
- **incident**: Related incident
- **suggested_severity**: AI-suggested severity
- **confidence_score**: AI confidence (0.0-1.0)
- **user_impact_score**: User impact score (0.0-1.0)
- **business_impact_score**: Business impact score (0.0-1.0)
- **technical_impact_score**: Technical impact score (0.0-1.0)
- **reasoning**: AI explanation for suggestion
- **impact_factors**: Factors that influenced the severity
### IncidentCorrelation
- **primary_incident**: Primary incident in correlation
- **related_incident**: Related incident
- **correlation_type**: Type of correlation
- **confidence_score**: Correlation confidence (0.0-1.0)
- **correlation_strength**: Strength of correlation
- **shared_keywords**: Keywords shared between incidents
- **time_difference**: Time difference between incidents
- **similarity_score**: Overall similarity score
- **is_problem_indicator**: Whether this suggests a larger problem
### DuplicationDetection
- **incident_a**: First incident in pair
- **incident_b**: Second incident in pair
- **duplication_type**: Type of duplication
- **similarity_score**: Overall similarity score
- **confidence_score**: Duplication confidence (0.0-1.0)
- **text_similarity**: Text similarity score
- **temporal_proximity**: Temporal proximity score
- **service_similarity**: Service similarity score
- **recommended_action**: Recommended action (Merge, Link, Review, No Action)
- **status**: Current status (Detected, Reviewed, Merged, Rejected)
### IncidentPattern
- **name**: Pattern name
- **pattern_type**: Type of pattern (Recurring, Seasonal, Trend, Anomaly)
- **description**: Pattern description
- **frequency**: How often the pattern occurs
- **affected_services**: Services affected by the pattern
- **common_keywords**: Common keywords in pattern incidents
- **incidents**: Related incidents
- **confidence_score**: Pattern confidence (0.0-1.0)
- **is_active**: Whether the pattern is active
- **is_resolved**: Whether the pattern is resolved
## AI Components
### IncidentClassifier
- **Categories**: Predefined categories with keywords
- **Keyword Extraction**: Extracts relevant keywords from text
- **Sentiment Analysis**: Analyzes sentiment of incident text
- **Urgency Detection**: Identifies urgency indicators
- **Confidence Scoring**: Provides confidence scores for classifications
### SeverityAnalyzer
- **Impact Analysis**: Analyzes user, business, and technical impact
- **Severity Indicators**: Identifies severity keywords in text
- **Weighted Scoring**: Combines multiple factors for severity suggestion
- **Reasoning Generation**: Provides explanations for severity suggestions
### IncidentCorrelationEngine
- **Similarity Analysis**: Calculates various similarity metrics
- **Temporal Analysis**: Considers time-based correlations
- **Service Analysis**: Analyzes service and component similarities
- **Problem Detection**: Identifies patterns that suggest larger problems
- **Cluster Detection**: Groups related incidents into clusters
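Cluster detection over pairwise correlations can be illustrated with a union-find pass that merges every correlated pair into one group. This is a generic connected-components sketch, named plainly as such; whether `IncidentCorrelationEngine` uses this technique is not stated in this document, and `cluster_incidents` is a hypothetical helper.

```python
def cluster_incidents(pairs):
    """Group incident IDs into clusters, given an iterable of correlated (a, b) pairs."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then follow links with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(members) for members in clusters.values()), key=lambda c: c[0])


clusters = cluster_incidents([("A", "B"), ("B", "C"), ("D", "E")])
```

Incidents A, B, and C end up in one cluster because B links the other two, even though A and C were never directly correlated.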
### DuplicationDetector
- **Text Similarity**: Multiple text similarity algorithms
- **Temporal Proximity**: Time-based duplication detection
- **Service Similarity**: Service and component similarity
- **Metadata Similarity**: Similarity based on incident metadata
- **Merge Recommendations**: Suggests appropriate actions
## Background Processing
The module uses Celery for background processing of AI analysis:
### Tasks
- **process_incident_ai**: Processes a single incident with AI analysis
- **batch_process_incidents_ai**: Processes multiple incidents
- **find_correlations**: Finds correlations for an incident
- **find_duplicates**: Finds duplicates for an incident
- **detect_all_duplicates**: Batch duplicate detection
- **correlate_all_incidents**: Batch correlation analysis
- **merge_duplicate_incidents**: Merges duplicate incidents
### Processing Logs
All AI processing activities are logged in the `AIProcessingLog` model for audit and debugging purposes.
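The kind of record such a log might capture can be sketched with a context manager that stores status and duration for each step. The in-memory `processing_log` list and the entry keys are stand-ins chosen for illustration; the fields of `AIProcessingLog` are not listed in this document.

```python
import time
from contextlib import contextmanager

processing_log = []  # stand-in for rows in the AIProcessingLog model


@contextmanager
def logged_step(incident_id, step):
    """Record the status and duration of one processing step, even when it fails."""
    entry = {"incident_id": incident_id, "step": step, "status": "started"}
    started = time.perf_counter()
    try:
        yield entry
        entry["status"] = "completed"
    except Exception as exc:
        entry["status"] = "failed"
        entry["error"] = str(exc)
        raise
    finally:
        entry["duration_seconds"] = time.perf_counter() - started
        processing_log.append(entry)


with logged_step("incident-1", "classification"):
    pass  # the actual analysis would run here
```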
## Setup and Configuration
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Migrations
```bash
python manage.py makemigrations incident_intelligence
python manage.py migrate
```
### 3. Create Sample Data
```bash
python manage.py setup_incident_intelligence --create-sample-data --create-patterns
```
### 4. Run AI Analysis
```bash
python manage.py setup_incident_intelligence --run-ai-analysis
```
### 5. Start Celery Worker
```bash
celery -A core worker -l info
```
## Usage Examples
### Creating an Incident with AI Analysis
```python
from incident_intelligence.models import Incident
from incident_intelligence.tasks import process_incident_ai

# Create the incident record
incident = Incident.objects.create(
    title="API Response Slow",
    description="The user service API is responding slowly",
    free_text="API is slow, taking forever to respond",
    affected_users=50,
    business_impact="User experience is degraded",
)

# Queue the AI analysis as a background Celery task
process_incident_ai.delay(incident.id)
```
### Finding Correlations
```python
from incident_intelligence.ai.correlation import IncidentCorrelationEngine
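
# incident_data and all_incidents are not defined in this snippet; the caller
# must prepare them, and their exact shape is not specified in this document.
engine = IncidentCorrelationEngine()
correlations = engine.find_related_incidents(incident_data, all_incidents)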
```
### Detecting Duplicates
```python
from incident_intelligence.ai.duplication import DuplicationDetector
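
# incident_data and all_incidents are not defined in this snippet; the caller
# must prepare them, and their exact shape is not specified in this document.
detector = DuplicationDetector()
duplicates = detector.find_duplicate_candidates(incident_data, all_incidents)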
```
## Performance Considerations
- **Batch Processing**: Use batch operations for large datasets
- **Caching**: Consider caching frequently accessed data
- **Indexing**: Database indexes are configured for optimal query performance
- **Background Tasks**: AI processing runs asynchronously to avoid blocking requests
- **Rate Limiting**: Consider implementing rate limiting for API endpoints
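For the batch-processing point, a common approach is to split a large list of incident IDs into fixed-size chunks and queue one background task per chunk. The `chunked` helper below is a hypothetical illustration, and the call shown in the comment assumes `batch_process_incidents_ai` accepts a list of IDs, which this document does not confirm.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


incident_ids = list(range(1, 8))
batches = list(chunked(incident_ids, 3))

# Each batch could then be queued separately, for example (assumed signature):
# batch_process_incidents_ai.delay(batch)
```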
## Security Considerations
- **Authentication**: All endpoints require authentication
- **Authorization**: Users can only access incidents they have permission to view
- **Data Privacy**: Sensitive information is handled according to data classification levels
- **Audit Logging**: All AI processing activities are logged for audit purposes
## Monitoring and Maintenance
- **Processing Logs**: Monitor AI processing logs for errors and performance
- **Model Performance**: Track AI model accuracy and update as needed
- **Database Maintenance**: Regular cleanup of old processing logs and resolved incidents
- **Health Checks**: Monitor Celery workers and Redis for background processing health
## Future Enhancements
- **Machine Learning Models**: Integration with more sophisticated ML models
- **Real-time Processing**: Real-time incident analysis and correlation
- **Advanced NLP**: More sophisticated natural language processing
- **Predictive Analytics**: Predictive incident analysis and prevention
- **Integration APIs**: APIs for integrating with external incident management systems