# Incident Intelligence API Documentation

## Overview

The Incident Intelligence module provides AI-driven capabilities for incident management, including:

- **AI-driven incident classification** using NLP to categorize incidents from free text
- **Automated severity suggestion** based on impact analysis
- **Correlation engine** for linking related incidents and detecting underlying problems
- **Duplication detection** for merging incidents that describe the same outage

## Features

### 1. AI-Driven Incident Classification

Automatically classifies incidents into categories and subcategories based on their content:

- **Categories**: Infrastructure, Application, Security, User Experience, Data, Integration
- **Subcategories**: Specific types within each category (e.g., API_ISSUE, DATABASE_ISSUE)
- **Confidence Scoring**: AI confidence level for each classification
- **Keyword Extraction**: Identifies relevant keywords from incident text
- **Sentiment Analysis**: Analyzes the sentiment of incident descriptions
- **Urgency Detection**: Identifies urgency indicators in the text

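
As a rough illustration of how keyword-based classification with a confidence score could work, the sketch below counts keyword hits per category and reports the winning category's share of all hits. The `CATEGORY_KEYWORDS` table, the `classify` function, and the scoring rule are assumptions made for this example; they are not the module's actual implementation.

```python
import re

# Hypothetical keyword table; the module's real keyword lists are not documented here.
CATEGORY_KEYWORDS = {
    "Infrastructure": {"server", "network", "disk", "cpu", "outage"},
    "Application": {"api", "crash", "bug", "exception", "timeout"},
    "Security": {"breach", "unauthorized", "malware", "phishing"},
    "Data": {"database", "corruption", "backup", "query"},
}


def classify(text):
    """Return (category, confidence, matched_keywords) for free text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = {cat: tokens & kws for cat, kws in CATEGORY_KEYWORDS.items()}
    total = sum(len(matched) for matched in hits.values())
    if total == 0:
        return None, 0.0, []
    best = max(hits, key=lambda cat: len(hits[cat]))
    # Confidence: share of all keyword hits that belong to the winning category.
    confidence = len(hits[best]) / total
    return best, confidence, sorted(hits[best])


category, confidence, keywords = classify("API timeout and crash after database query")
print(category, round(confidence, 2), keywords)
```

A production classifier would likely weight keywords and handle ties explicitly; this sketch simply picks the first category with the most hits.
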
### 2. Automated Severity Suggestion

Suggests incident severity based on multiple factors:

- **User Impact Analysis**: Number of affected users and impact level
- **Business Impact Assessment**: Revenue and operational impact
- **Technical Impact Evaluation**: System and infrastructure impact
- **Text Analysis**: Severity indicators in incident descriptions
- **Confidence Scoring**: AI confidence in severity suggestions

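
One way to combine the impact factors above is a weighted sum mapped onto severity bands. The following is a minimal sketch under that assumption: the weights, thresholds, severity labels, and the `suggest_severity` name are all hypothetical and chosen only for illustration.

```python
# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {"user": 0.4, "business": 0.4, "technical": 0.2}
THRESHOLDS = [(0.75, "CRITICAL"), (0.5, "HIGH"), (0.25, "MEDIUM"), (0.0, "LOW")]


def suggest_severity(user_impact, business_impact, technical_impact):
    """Combine three impact scores (each 0.0-1.0) into (label, combined_score)."""
    scores = {
        "user": user_impact,
        "business": business_impact,
        "technical": technical_impact,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} impact must be within 0.0-1.0, got {value}")
    combined = sum(WEIGHTS[name] * value for name, value in scores.items())
    # Thresholds are ordered from highest to lowest, so the first match wins.
    for floor, label in THRESHOLDS:
        if combined >= floor:
            return label, combined


label, combined = suggest_severity(0.9, 0.8, 0.5)
print(label, round(combined, 2))
```
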
### 3. Correlation Engine

Links related incidents and detects patterns:

- **Correlation Types**: Same Service, Same Component, Temporal, Pattern Match, Dependency, Cascade
- **Problem Detection**: Identifies when correlations suggest larger problems
- **Time-based Analysis**: Considers temporal proximity of incidents
- **Service Similarity**: Analyzes shared services and components
- **Pattern Recognition**: Detects recurring issues and trends

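
To make the idea of combining temporal proximity, shared keywords, and service similarity concrete, here is a small sketch. The dictionary layout of an incident, the one-hour window, and the equal weighting are assumptions for this example and may differ from what `IncidentCorrelationEngine` does internally.

```python
from datetime import datetime, timedelta


def correlation_score(a, b, window=timedelta(hours=1)):
    """Score two incidents given as dicts with 'created_at', 'keywords', 'service'."""
    # Temporal proximity: 1.0 for simultaneous incidents, 0.0 at or beyond the window.
    gap = abs(a["created_at"] - b["created_at"])
    temporal = max(0.0, 1.0 - gap / window)
    # Keyword overlap measured as Jaccard similarity.
    union = a["keywords"] | b["keywords"]
    shared = a["keywords"] & b["keywords"]
    keyword = len(shared) / len(union) if union else 0.0
    same_service = 1.0 if a["service"] == b["service"] else 0.0
    # Equal weighting of the three signals is an arbitrary choice for this sketch.
    return (temporal + keyword + same_service) / 3, sorted(shared)


first = {
    "created_at": datetime(2024, 1, 1, 10, 0),
    "keywords": {"database", "timeout"},
    "service": "user-service",
}
second = {
    "created_at": datetime(2024, 1, 1, 10, 30),
    "keywords": {"database", "latency"},
    "service": "user-service",
}
score, shared_keywords = correlation_score(first, second)
print(round(score, 3), shared_keywords)
```
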
### 4. Duplication Detection

Identifies and manages duplicate incidents:

- **Duplication Types**: Exact, Near Duplicate, Similar, Potential Duplicate
- **Similarity Analysis**: Text, temporal, and service similarity scoring
- **Merge Recommendations**: Suggests actions (Merge, Link, Review, No Action)
- **Confidence Scoring**: AI confidence in duplication detection
- **Shared Elements**: Identifies common elements between incidents

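
The mapping from a similarity score to a duplication type and a recommended action can be pictured as a threshold table. The sketch below uses `difflib.SequenceMatcher` from the Python standard library for text similarity; the thresholds and the `assess_duplicate` name are hypothetical, and the real detector also weighs temporal and service similarity.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds mapping a similarity ratio to (type, recommended action).
RULES = [
    (1.0, "Exact", "Merge"),
    (0.9, "Near Duplicate", "Merge"),
    (0.7, "Similar", "Link"),
    (0.5, "Potential Duplicate", "Review"),
]


def assess_duplicate(text_a, text_b):
    """Return (duplication_type, recommended_action, ratio) for two descriptions."""
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    # Rules are ordered from strictest to loosest, so the first match wins.
    for floor, dup_type, action in RULES:
        if ratio >= floor:
            return dup_type, action, ratio
    return None, "No Action", ratio


print(assess_duplicate("Database is down", "database is down"))
print(assess_duplicate("abc", "xyz"))
```
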
## API Endpoints

### Incidents

#### Create Incident
```http
POST /api/incidents/incidents/
Content-Type: application/json

{
  "title": "Database Connection Timeout",
  "description": "Users are experiencing timeouts when trying to access the database",
  "free_text": "Database is down, can't connect, getting timeout errors",
  "affected_users": 150,
  "business_impact": "Critical business operations are affected",
  "reporter": 1
}
```

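
The request above could be built from Python with the standard library as sketched below. The host, the token value, and the `Token` authorization scheme are placeholders assumed for this example; substitute whatever your deployment uses. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://example.com"  # hypothetical host
TOKEN = "replace-me"              # hypothetical credential


def build_create_incident_request(payload):
    """Build (but do not send) a POST request for the Create Incident endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/incidents/incidents/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The token scheme is an assumption; use what your deployment expects.
            "Authorization": f"Token {TOKEN}",
        },
        method="POST",
    )


request = build_create_incident_request(
    {
        "title": "Database Connection Timeout",
        "description": "Users are experiencing timeouts when trying to access the database",
        "free_text": "Database is down, can't connect, getting timeout errors",
        "affected_users": 150,
        "business_impact": "Critical business operations are affected",
        "reporter": 1,
    }
)
# To send it: urllib.request.urlopen(request)
```
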
#### Get Incident Analysis
```http
GET /api/incidents/incidents/{id}/analysis/
```

Returns comprehensive AI analysis including:
- Classification results
- Severity suggestions
- Correlations with other incidents
- Potential duplicates
- Associated patterns

#### Trigger AI Analysis
```http
POST /api/incidents/incidents/{id}/analyze/
```

Manually triggers AI analysis for a specific incident.

#### Get Incident Statistics
```http
GET /api/incidents/incidents/stats/
```

Returns statistics including:
- Total incidents by status and severity
- Average resolution time
- AI processing statistics
- Duplicate and correlation counts

### Correlations

#### Get Correlations
```http
GET /api/incidents/correlations/
```

#### Get Problem Indicators
```http
GET /api/incidents/correlations/problem_indicators/
```

Returns correlations that indicate larger problems.

### Duplications

#### Get Duplications
```http
GET /api/incidents/duplications/
```

#### Approve Merge
```http
POST /api/incidents/duplications/{id}/approve_merge/
```

#### Reject Merge
```http
POST /api/incidents/duplications/{id}/reject_merge/
```

### Patterns

#### Get Patterns
```http
GET /api/incidents/patterns/
```

#### Get Active Patterns
```http
GET /api/incidents/patterns/active_patterns/
```

#### Resolve Pattern
```http
POST /api/incidents/patterns/{id}/resolve_pattern/
```

## Data Models

### Incident
- **id**: UUID primary key
- **title**: Incident title
- **description**: Detailed description
- **free_text**: Original free text from user
- **category**: AI-classified category
- **subcategory**: AI-classified subcategory
- **severity**: Current severity level
- **suggested_severity**: AI-suggested severity
- **status**: Current status (Open, In Progress, Resolved, Closed)
- **assigned_to**: Assigned user
- **reporter**: User who reported the incident
- **affected_users**: Number of affected users
- **business_impact**: Business impact description
- **ai_processed**: Whether AI analysis has been completed
- **is_duplicate**: Whether this is a duplicate incident

### IncidentClassification
- **incident**: Related incident
- **predicted_category**: AI-predicted category
- **predicted_subcategory**: AI-predicted subcategory
- **confidence_score**: AI confidence (0.0-1.0)
- **alternative_categories**: Alternative predictions
- **extracted_keywords**: Keywords extracted from text
- **sentiment_score**: Sentiment analysis score (-1 to 1)
- **urgency_indicators**: Detected urgency indicators

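
The documented value ranges can be enforced on the client side before results are stored or displayed. The dataclass below is a plain-Python stand-in for a subset of the `IncidentClassification` fields, written only to illustrate the ranges; the class name `ClassificationResult` is hypothetical and the real model is a database model, not this dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    """Illustrative stand-in for some IncidentClassification fields."""

    predicted_category: str
    predicted_subcategory: str
    confidence_score: float
    sentiment_score: float
    extracted_keywords: list = field(default_factory=list)
    urgency_indicators: list = field(default_factory=list)

    def __post_init__(self):
        # Ranges taken from the field descriptions above.
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be within 0.0-1.0")
        if not -1.0 <= self.sentiment_score <= 1.0:
            raise ValueError("sentiment_score must be within -1 to 1")


result = ClassificationResult(
    predicted_category="Application",
    predicted_subcategory="API_ISSUE",
    confidence_score=0.82,
    sentiment_score=-0.4,
    extracted_keywords=["api", "slow"],
)
```
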
### SeveritySuggestion
- **incident**: Related incident
- **suggested_severity**: AI-suggested severity
- **confidence_score**: AI confidence (0.0-1.0)
- **user_impact_score**: User impact score (0.0-1.0)
- **business_impact_score**: Business impact score (0.0-1.0)
- **technical_impact_score**: Technical impact score (0.0-1.0)
- **reasoning**: AI explanation for suggestion
- **impact_factors**: Factors that influenced the severity

### IncidentCorrelation
- **primary_incident**: Primary incident in correlation
- **related_incident**: Related incident
- **correlation_type**: Type of correlation
- **confidence_score**: Correlation confidence (0.0-1.0)
- **correlation_strength**: Strength of correlation
- **shared_keywords**: Keywords shared between incidents
- **time_difference**: Time difference between incidents
- **similarity_score**: Overall similarity score
- **is_problem_indicator**: Whether this suggests a larger problem

### DuplicationDetection
- **incident_a**: First incident in pair
- **incident_b**: Second incident in pair
- **duplication_type**: Type of duplication
- **similarity_score**: Overall similarity score
- **confidence_score**: Duplication confidence (0.0-1.0)
- **text_similarity**: Text similarity score
- **temporal_proximity**: Temporal proximity score
- **service_similarity**: Service similarity score
- **recommended_action**: Recommended action (Merge, Link, Review, No Action)
- **status**: Current status (Detected, Reviewed, Merged, Rejected)

### IncidentPattern
- **name**: Pattern name
- **pattern_type**: Type of pattern (Recurring, Seasonal, Trend, Anomaly)
- **description**: Pattern description
- **frequency**: How often the pattern occurs
- **affected_services**: Services affected by the pattern
- **common_keywords**: Common keywords in pattern incidents
- **incidents**: Related incidents
- **confidence_score**: Pattern confidence (0.0-1.0)
- **is_active**: Whether the pattern is active
- **is_resolved**: Whether the pattern is resolved

## AI Components

### IncidentClassifier
- **Categories**: Predefined categories with keywords
- **Keyword Extraction**: Extracts relevant keywords from text
- **Sentiment Analysis**: Analyzes sentiment of incident text
- **Urgency Detection**: Identifies urgency indicators
- **Confidence Scoring**: Provides confidence scores for classifications

### SeverityAnalyzer
- **Impact Analysis**: Analyzes user, business, and technical impact
- **Severity Indicators**: Identifies severity keywords in text
- **Weighted Scoring**: Combines multiple factors for severity suggestion
- **Reasoning Generation**: Provides explanations for severity suggestions

### IncidentCorrelationEngine
- **Similarity Analysis**: Calculates various similarity metrics
- **Temporal Analysis**: Considers time-based correlations
- **Service Analysis**: Analyzes service and component similarities
- **Problem Detection**: Identifies patterns that suggest larger problems
- **Cluster Detection**: Groups related incidents into clusters

### DuplicationDetector
- **Text Similarity**: Multiple text similarity algorithms
- **Temporal Proximity**: Time-based duplication detection
- **Service Similarity**: Service and component similarity
- **Metadata Similarity**: Similarity based on incident metadata
- **Merge Recommendations**: Suggests appropriate actions

## Background Processing

The module uses Celery for background processing of AI analysis:

### Tasks
- **process_incident_ai**: Processes a single incident with AI analysis
- **batch_process_incidents_ai**: Processes multiple incidents
- **find_correlations**: Finds correlations for an incident
- **find_duplicates**: Finds duplicates for an incident
- **detect_all_duplicates**: Batch duplicate detection
- **correlate_all_incidents**: Batch correlation analysis
- **merge_duplicate_incidents**: Merges duplicate incidents

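
When queuing work for a batch task, it often helps to split a large set of incident IDs into fixed-size chunks so that no single task runs too long. The helper below is a generic sketch; the `chunked` name is hypothetical, and the argument shape accepted by `batch_process_incidents_ai` is an assumption, which is why the Celery call appears only in a comment.

```python
from itertools import islice


def chunked(ids, size):
    """Yield successive lists of at most `size` incident IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while batch := list(islice(iterator, size)):
        yield batch


# Placeholder integer IDs; real incidents use UUID primary keys.
batches = list(chunked(range(1, 8), 3))
print(batches)

# Each batch could then be queued, for example:
#   batch_process_incidents_ai.delay(batch)   # argument shape is an assumption
```
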
### Processing Logs
All AI processing activities are logged in the `AIProcessingLog` model for audit and debugging purposes.

## Setup and Configuration

### 1. Install Dependencies
```bash
pip install -r requirements.txt
```

### 2. Run Migrations
```bash
python manage.py makemigrations incident_intelligence
python manage.py migrate
```

### 3. Create Sample Data
```bash
python manage.py setup_incident_intelligence --create-sample-data --create-patterns
```

### 4. Run AI Analysis
```bash
python manage.py setup_incident_intelligence --run-ai-analysis
```

### 5. Start Celery Worker
```bash
celery -A core worker -l info
```

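
The worker needs a message broker to receive tasks. The settings fragment below is a sketch of a typical Redis-backed configuration, assuming the Celery app in `core` loads Django settings with `namespace="CELERY"` and that Redis runs locally on its default port. The URLs and database numbers are placeholders, not values taken from this project.

```python
# settings.py (fragment). Assumes the Celery app reads Django settings with
# namespace="CELERY"; the Redis URLs below are placeholders for a local setup.
CELERY_BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"
CELERY_TASK_SERIALIZER = "json"
CELERY_ACCEPT_CONTENT = ["json"]
```
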
## Usage Examples

### Creating an Incident with AI Analysis
```python
from incident_intelligence.models import Incident
from incident_intelligence.tasks import process_incident_ai

# Create the incident record
incident = Incident.objects.create(
    title="API Response Slow",
    description="The user service API is responding slowly",
    free_text="API is slow, taking forever to respond",
    affected_users=50,
    business_impact="User experience is degraded",
)

# Queue AI analysis on a Celery worker; the call returns immediately
process_incident_ai.delay(incident.id)
```

### Finding Correlations
```python
from incident_intelligence.ai.correlation import IncidentCorrelationEngine

# incident_data (the incident to analyze) and all_incidents (the candidates to
# compare against) are assumed to have been prepared by the caller.
engine = IncidentCorrelationEngine()
correlations = engine.find_related_incidents(incident_data, all_incidents)
```

### Detecting Duplicates
```python
from incident_intelligence.ai.duplication import DuplicationDetector

# incident_data (the incident to check) and all_incidents (the candidates to
# compare against) are assumed to have been prepared by the caller.
detector = DuplicationDetector()
duplicates = detector.find_duplicate_candidates(incident_data, all_incidents)
```

## Performance Considerations

- **Batch Processing**: Use batch operations for large datasets
- **Caching**: Consider caching frequently accessed data
- **Indexing**: Database indexes are configured for optimal query performance
- **Background Tasks**: AI processing runs asynchronously to avoid blocking requests
- **Rate Limiting**: Consider implementing rate limiting for API endpoints

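
For the rate-limiting suggestion, a token bucket is one common approach: each request spends a token, and tokens refill at a fixed rate up to a burst capacity. The class below is a minimal single-process sketch with hypothetical names; it is not part of the module, and a multi-worker deployment would need shared state (for example in Redis) instead of in-memory counters.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested deterministically
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and spend a token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
if bucket.allow():
    print("request accepted")
```
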
## Security Considerations

- **Authentication**: All endpoints require authentication
- **Authorization**: Users can only access incidents they have permission to view
- **Data Privacy**: Sensitive information is handled according to data classification levels
- **Audit Logging**: All AI processing activities are logged for audit purposes

## Monitoring and Maintenance

- **Processing Logs**: Monitor AI processing logs for errors and performance
- **Model Performance**: Track AI model accuracy and update as needed
- **Database Maintenance**: Regular cleanup of old processing logs and resolved incidents
- **Health Checks**: Monitor Celery workers and Redis for background processing health

## Future Enhancements

- **Machine Learning Models**: Integration with more sophisticated ML models
- **Real-time Processing**: Real-time incident analysis and correlation
- **Advanced NLP**: More sophisticated natural language processing
- **Predictive Analytics**: Predictive incident analysis and prevention
- **Integration APIs**: APIs for integrating with external incident management systems