Incident Intelligence API Documentation
Overview
The Incident Intelligence module provides AI-driven capabilities for incident management, including:
- AI-driven incident classification using NLP to categorize incidents from free text
- Automated severity suggestion based on impact analysis
- Correlation engine for linking related incidents and problem detection
- Duplication detection for merging incidents that describe the same outage
Features
1. AI-Driven Incident Classification
Automatically classifies incidents into categories and subcategories based on their content:
- Categories: Infrastructure, Application, Security, User Experience, Data, Integration
- Subcategories: Specific types within each category (e.g., API_ISSUE, DATABASE_ISSUE)
- Confidence Scoring: AI confidence level for each classification
- Keyword Extraction: Identifies relevant keywords from incident text
- Sentiment Analysis: Analyzes the sentiment of incident descriptions
- Urgency Detection: Identifies urgency indicators in the text
2. Automated Severity Suggestion
Suggests incident severity based on multiple factors:
- User Impact Analysis: Number of affected users and impact level
- Business Impact Assessment: Revenue and operational impact
- Technical Impact Evaluation: System and infrastructure impact
- Text Analysis: Severity indicators in incident descriptions
- Confidence Scoring: AI confidence in severity suggestions
3. Correlation Engine
Links related incidents and detects patterns:
- Correlation Types: Same Service, Same Component, Temporal, Pattern Match, Dependency, Cascade
- Problem Detection: Identifies when correlations suggest larger problems
- Time-based Analysis: Considers temporal proximity of incidents
- Service Similarity: Analyzes shared services and components
- Pattern Recognition: Detects recurring issues and trends
4. Duplication Detection
Identifies and manages duplicate incidents:
- Duplication Types: Exact, Near Duplicate, Similar, Potential Duplicate
- Similarity Analysis: Text, temporal, and service similarity scoring
- Merge Recommendations: Suggests actions (Merge, Link, Review, No Action)
- Confidence Scoring: AI confidence in duplication detection
- Shared Elements: Identifies common elements between incidents
API Endpoints
Incidents
Create Incident
POST /api/incidents/incidents/
Content-Type: application/json
{
    "title": "Database Connection Timeout",
    "description": "Users are experiencing timeouts when trying to access the database",
    "free_text": "Database is down, can't connect, getting timeout errors",
    "affected_users": 150,
    "business_impact": "Critical business operations are affected",
    "reporter": 1
}
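The request above can be assembled from Python with the standard library alone. The following is a minimal sketch, not the module's own client: the host in BASE_URL, the helper name build_create_incident_request, and the shape of the Authorization header are assumptions, since this document only states that endpoints require authentication and does not name the scheme.

```python
import json
import urllib.request

# Hypothetical host; replace with the address of your deployment.
BASE_URL = "https://example.invalid"


def build_create_incident_request(payload, auth_header_value):
    """Build (but do not send) a POST request for the Create Incident endpoint.

    auth_header_value is passed through unchanged because the
    authentication scheme is not specified in this document.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/incidents/incidents/",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header_value,
        },
    )


payload = {
    "title": "Database Connection Timeout",
    "description": "Users are experiencing timeouts when trying to access the database",
    "free_text": "Database is down, can't connect, getting timeout errors",
    "affected_users": 150,
    "business_impact": "Critical business operations are affected",
    "reporter": 1,
}
request = build_create_incident_request(payload, "Token <your-token>")
```

Passing the resulting object to urllib.request.urlopen sends the request; building it separately makes the payload and headers easy to inspect first.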
Get Incident Analysis
GET /api/incidents/incidents/{id}/analysis/
Returns comprehensive AI analysis including:
- Classification results
- Severity suggestions
- Correlations with other incidents
- Potential duplicates
- Associated patterns
Trigger AI Analysis
POST /api/incidents/incidents/{id}/analyze/
Manually triggers AI analysis for a specific incident.
Get Incident Statistics
GET /api/incidents/incidents/stats/
Returns statistics including:
- Total incidents by status and severity
- Average resolution time
- AI processing statistics
- Duplicate and correlation counts
Correlations
Get Correlations
GET /api/incidents/correlations/
Get Problem Indicators
GET /api/incidents/correlations/problem_indicators/
Returns correlations that indicate larger problems.
Duplications
Get Duplications
GET /api/incidents/duplications/
Approve Merge
POST /api/incidents/duplications/{id}/approve_merge/
Reject Merge
POST /api/incidents/duplications/{id}/reject_merge/
Patterns
Get Patterns
GET /api/incidents/patterns/
Get Active Patterns
GET /api/incidents/patterns/active_patterns/
Resolve Pattern
POST /api/incidents/patterns/{id}/resolve_pattern/
Data Models
Incident
- id: UUID primary key
- title: Incident title
- description: Detailed description
- free_text: Original free text from user
- category: AI-classified category
- subcategory: AI-classified subcategory
- severity: Current severity level
- suggested_severity: AI-suggested severity
- status: Current status (Open, In Progress, Resolved, Closed)
- assigned_to: Assigned user
- reporter: User who reported the incident
- affected_users: Number of affected users
- business_impact: Business impact description
- ai_processed: Whether AI analysis has been completed
- is_duplicate: Whether this is a duplicate incident
IncidentClassification
- incident: Related incident
- predicted_category: AI-predicted category
- predicted_subcategory: AI-predicted subcategory
- confidence_score: AI confidence (0.0-1.0)
- alternative_categories: Alternative predictions
- extracted_keywords: Keywords extracted from text
- sentiment_score: Sentiment analysis score (-1 to 1)
- urgency_indicators: Detected urgency indicators
SeveritySuggestion
- incident: Related incident
- suggested_severity: AI-suggested severity
- confidence_score: AI confidence (0.0-1.0)
- user_impact_score: User impact score (0.0-1.0)
- business_impact_score: Business impact score (0.0-1.0)
- technical_impact_score: Technical impact score (0.0-1.0)
- reasoning: AI explanation for suggestion
- impact_factors: Factors that influenced the severity
IncidentCorrelation
- primary_incident: Primary incident in correlation
- related_incident: Related incident
- correlation_type: Type of correlation
- confidence_score: Correlation confidence (0.0-1.0)
- correlation_strength: Strength of correlation
- shared_keywords: Keywords shared between incidents
- time_difference: Time difference between incidents
- similarity_score: Overall similarity score
- is_problem_indicator: Whether this suggests a larger problem
DuplicationDetection
- incident_a: First incident in pair
- incident_b: Second incident in pair
- duplication_type: Type of duplication
- similarity_score: Overall similarity score
- confidence_score: Duplication confidence (0.0-1.0)
- text_similarity: Text similarity score
- temporal_proximity: Temporal proximity score
- service_similarity: Service similarity score
- recommended_action: Recommended action (Merge, Link, Review, No Action)
- status: Current status (Detected, Reviewed, Merged, Rejected)
IncidentPattern
- name: Pattern name
- pattern_type: Type of pattern (Recurring, Seasonal, Trend, Anomaly)
- description: Pattern description
- frequency: How often the pattern occurs
- affected_services: Services affected by the pattern
- common_keywords: Common keywords in pattern incidents
- incidents: Related incidents
- confidence_score: Pattern confidence (0.0-1.0)
- is_active: Whether the pattern is active
- is_resolved: Whether the pattern is resolved
AI Components
IncidentClassifier
- Categories: Predefined categories with keywords
- Keyword Extraction: Extracts relevant keywords from text
- Sentiment Analysis: Analyzes sentiment of incident text
- Urgency Detection: Identifies urgency indicators
- Confidence Scoring: Provides confidence scores for classifications
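To make the keyword-driven approach concrete, here is a minimal sketch of classification with a confidence score. It is an illustration only: the keyword lists in CATEGORY_KEYWORDS, the function names, and the way confidence is computed (share of keyword hits belonging to the winning category) are assumptions, not the actual IncidentClassifier implementation.

```python
import re
from collections import Counter

# Hypothetical keyword map; the module's real keyword lists are not shown here.
CATEGORY_KEYWORDS = {
    "Infrastructure": {"server", "network", "disk", "cpu", "outage"},
    "Application": {"api", "error", "crash", "bug", "slow"},
    "Security": {"breach", "unauthorized", "attack", "phishing"},
    "Data": {"database", "corruption", "backup", "query"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def classify(text):
    """Return the best-matching category, a confidence score, and matched keywords."""
    tokens = tokenize(text)
    counts = Counter()
    matched = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = [token for token in tokens if token in keywords]
        if hits:
            counts[category] = len(hits)
            matched[category] = sorted(set(hits))
    if not counts:
        return {"category": None, "confidence": 0.0, "keywords": []}
    category, top_hits = counts.most_common(1)[0]
    # Assumed confidence measure: winning hits divided by all keyword hits.
    confidence = top_hits / sum(counts.values())
    return {
        "category": category,
        "confidence": round(confidence, 2),
        "keywords": matched[category],
    }


result = classify("Database is down, can't connect, getting timeout errors")
```

With the keyword map above, the free text from the Create Incident example matches only the Data category, through the keyword "database".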
SeverityAnalyzer
- Impact Analysis: Analyzes user, business, and technical impact
- Severity Indicators: Identifies severity keywords in text
- Weighted Scoring: Combines multiple factors for severity suggestion
- Reasoning Generation: Provides explanations for severity suggestions
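Weighted scoring can be sketched as a weighted sum of the three impact scores mapped onto a severity label. The weights, the thresholds, the severity labels, and the function name suggest_severity below are all assumptions chosen for illustration; this document does not specify the values SeverityAnalyzer uses.

```python
# Assumed weights and thresholds, for illustration only.
WEIGHTS = {"user": 0.4, "business": 0.35, "technical": 0.25}
THRESHOLDS = [(0.8, "Critical"), (0.6, "High"), (0.3, "Medium"), (0.0, "Low")]


def suggest_severity(user_impact, business_impact, technical_impact):
    """Combine impact scores (each 0.0-1.0) into a severity suggestion."""
    scores = {
        "user": user_impact,
        "business": business_impact,
        "technical": technical_impact,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} impact score must be within 0.0-1.0")

    combined = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Pick the first label whose lower bound the combined score reaches.
    label = next(lbl for floor, lbl in THRESHOLDS if combined >= floor)
    # The factor with the largest weighted contribution drives the reasoning text.
    top_factor = max(scores, key=lambda name: WEIGHTS[name] * scores[name])
    reasoning = (
        f"Combined score {combined:.2f}; "
        f"largest contribution from {top_factor} impact."
    )
    return {
        "suggested_severity": label,
        "score": round(combined, 3),
        "reasoning": reasoning,
    }


suggestion = suggest_severity(0.5, 0.5, 0.5)
```

Keeping the weights and thresholds in module-level tables makes them easy to tune without touching the scoring logic.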
IncidentCorrelationEngine
- Similarity Analysis: Calculates various similarity metrics
- Temporal Analysis: Considers time-based correlations
- Service Analysis: Analyzes service and component similarities
- Problem Detection: Identifies patterns that suggest larger problems
- Cluster Detection: Groups related incidents into clusters
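One simple way to combine keyword overlap with temporal proximity is sketched below. It assumes Jaccard similarity for keywords, a linear decay over a 24-hour window for time, and a 60/40 weighting; the dictionary keys, function names, and all constants are hypothetical and not taken from IncidentCorrelationEngine.

```python
from datetime import datetime, timedelta


def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def temporal_proximity(t1, t2, window=timedelta(hours=24)):
    """Score 1.0 for simultaneous incidents, decaying linearly to 0.0 at the window edge."""
    gap = abs(t1 - t2)
    if gap >= window:
        return 0.0
    return 1.0 - gap / window


def correlate(primary, related, keyword_weight=0.6, time_weight=0.4):
    """Return shared keywords, time difference, and a combined similarity score."""
    shared = sorted(set(primary["keywords"]) & set(related["keywords"]))
    keyword_score = jaccard(primary["keywords"], related["keywords"])
    time_score = temporal_proximity(primary["created_at"], related["created_at"])
    score = keyword_weight * keyword_score + time_weight * time_score
    return {
        "shared_keywords": shared,
        "time_difference": abs(primary["created_at"] - related["created_at"]),
        "similarity_score": round(score, 3),
    }


first = {"keywords": ["database", "timeout"], "created_at": datetime(2024, 1, 1, 10, 0)}
second = {
    "keywords": ["database", "timeout", "api"],
    "created_at": datetime(2024, 1, 1, 22, 0),
}
correlation = correlate(first, second)
```

The returned keys mirror the shared_keywords, time_difference, and similarity_score fields of the IncidentCorrelation model described earlier.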
DuplicationDetector
- Text Similarity: Multiple text similarity algorithms
- Temporal Proximity: Time-based duplication detection
- Service Similarity: Service and component similarity
- Metadata Similarity: Similarity based on incident metadata
- Merge Recommendations: Suggests appropriate actions
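As an illustration of how a text similarity score could be turned into a recommended action, the sketch below uses difflib.SequenceMatcher from the Python standard library on word tokens. The choice of algorithm, the cut-off values in ACTION_THRESHOLDS, and the function names are assumptions; DuplicationDetector may use different algorithms and thresholds.

```python
from difflib import SequenceMatcher

# Assumed cut-offs mapping a similarity score to a recommended action.
ACTION_THRESHOLDS = [(0.9, "Merge"), (0.75, "Link"), (0.5, "Review")]


def text_similarity(a, b):
    """Case-insensitive, word-level similarity ratio between two texts (0.0-1.0)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def recommend_action(similarity):
    """Return the first action whose threshold the similarity reaches."""
    for floor, action in ACTION_THRESHOLDS:
        if similarity >= floor:
            return action
    return "No Action"


similarity = text_similarity("Database is down", "database is DOWN")
action = recommend_action(similarity)
```

Comparing word tokens instead of raw characters keeps the score insensitive to casing and spacing differences between two reports of the same outage.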
Background Processing
The module uses Celery for background processing of AI analysis:
Tasks
- process_incident_ai: Processes a single incident with AI analysis
- batch_process_incidents_ai: Processes multiple incidents
- find_correlations: Finds correlations for an incident
- find_duplicates: Finds duplicates for an incident
- detect_all_duplicates: Batch duplicate detection
- correlate_all_incidents: Batch correlation analysis
- merge_duplicate_incidents: Merges duplicate incidents
Processing Logs
All AI processing activities are logged in the AIProcessingLog model for audit and debugging purposes.
Setup and Configuration
1. Install Dependencies
pip install -r requirements.txt
2. Run Migrations
python manage.py makemigrations incident_intelligence
python manage.py migrate
3. Create Sample Data
python manage.py setup_incident_intelligence --create-sample-data --create-patterns
4. Run AI Analysis
python manage.py setup_incident_intelligence --run-ai-analysis
5. Start Celery Worker
celery -A core worker -l info
Usage Examples
Creating an Incident with AI Analysis
from incident_intelligence.models import Incident
from incident_intelligence.tasks import process_incident_ai

# Create incident
incident = Incident.objects.create(
    title="API Response Slow",
    description="The user service API is responding slowly",
    free_text="API is slow, taking forever to respond",
    affected_users=50,
    business_impact="User experience is degraded"
)

# Trigger AI analysis asynchronously via Celery
process_incident_ai.delay(incident.id)
Finding Correlations
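from incident_intelligence.ai.correlation import IncidentCorrelationEngine

# incident_data (the incident under analysis) and all_incidents (the candidates
# to compare against) must be prepared by the caller before this call.
engine = IncidentCorrelationEngine()
correlations = engine.find_related_incidents(incident_data, all_incidents)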
Detecting Duplicates
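from incident_intelligence.ai.duplication import DuplicationDetector

# incident_data (the incident under analysis) and all_incidents (the candidates
# to compare against) must be prepared by the caller before this call.
detector = DuplicationDetector()
duplicates = detector.find_duplicate_candidates(incident_data, all_incidents)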
Performance Considerations
- Batch Processing: Use batch operations for large datasets
- Caching: Consider caching frequently accessed data
- Indexing: Database indexes are configured for optimal query performance
- Background Tasks: AI processing runs asynchronously to avoid blocking requests
- Rate Limiting: Consider implementing rate limiting for API endpoints
Security Considerations
- Authentication: All endpoints require authentication
- Authorization: Users can only access incidents they have permission to view
- Data Privacy: Sensitive information is handled according to data classification levels
- Audit Logging: All AI processing activities are logged for audit purposes
Monitoring and Maintenance
- Processing Logs: Monitor AI processing logs for errors and performance
- Model Performance: Track AI model accuracy and update as needed
- Database Maintenance: Regular cleanup of old processing logs and resolved incidents
- Health Checks: Monitor Celery workers and Redis for background processing health
Future Enhancements
- Machine Learning Models: Integration with more sophisticated ML models
- Real-time Processing: Real-time incident analysis and correlation
- Advanced NLP: More sophisticated natural language processing
- Predictive Analytics: Predictive incident analysis and prevention
- Integration APIs: APIs for integrating with external incident management systems