Iliyan Angelov 6b247e5b9f Updates

2025-09-19 11:58:53 +03:00

Incident Intelligence API Documentation

Overview

The Incident Intelligence module provides AI-driven capabilities for incident management, including:

AI-driven incident classification using NLP to categorize incidents from free text
Automated severity suggestion based on impact analysis
Correlation engine for linking related incidents and detecting underlying problems
Duplication detection for merging incidents that describe the same outage

Features

1. AI-Driven Incident Classification

Automatically classifies incidents into categories and subcategories based on their content:

Categories: Infrastructure, Application, Security, User Experience, Data, Integration
Subcategories: Specific types within each category (e.g., API_ISSUE, DATABASE_ISSUE)
Confidence Scoring: AI confidence level for each classification
Keyword Extraction: Identifies relevant keywords from incident text
Sentiment Analysis: Analyzes the sentiment of incident descriptions
Urgency Detection: Identifies urgency indicators in the text
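
To illustrate how keyword-based classification with a confidence score can work, here is a minimal sketch. The keyword lists, the `classify` function, and the confidence formula are assumptions made for illustration; the module's actual classifier and keyword sets are not shown in this document.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical keyword lists, not the module's real ones.
CATEGORY_KEYWORDS = {
    "Infrastructure": {"server", "network", "disk", "cpu", "outage"},
    "Application": {"api", "bug", "crash", "exception", "slow"},
    "Security": {"breach", "unauthorized", "malware", "phishing"},
    "Data": {"database", "corrupt", "backup", "query"},
}


def classify(text: str) -> Tuple[Optional[str], float, List[str]]:
    """Return (category, confidence, matched keywords) for free text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits: Counter = Counter()
    matched = []
    for token in tokens:
        for category, keywords in CATEGORY_KEYWORDS.items():
            if token in keywords:
                hits[category] += 1
                matched.append(token)
    if not hits:
        return None, 0.0, []
    category, count = hits.most_common(1)[0]
    # Confidence is the winning category's share of all keyword hits.
    confidence = count / sum(hits.values())
    return category, round(confidence, 2), sorted(set(matched))
```

For example, `classify("API crash and slow database")` matches three Application keywords and one Data keyword, so it returns `"Application"` with a confidence of 0.75.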

2. Automated Severity Suggestion

Suggests incident severity based on multiple factors:

User Impact Analysis: Number of affected users and impact level
Business Impact Assessment: Revenue and operational impact
Technical Impact Evaluation: System and infrastructure impact
Text Analysis: Severity indicators in incident descriptions
Confidence Scoring: AI confidence in severity suggestions
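
The factors above can be combined with a weighted sum. The sketch below is one possible approach, not the module's implementation: the weights, thresholds, severity labels, and the `suggest_severity` name are all hypothetical.

```python
from typing import Tuple

# Hypothetical weights and thresholds; tune them to your own data.
WEIGHTS = {"user": 0.4, "business": 0.4, "technical": 0.2}
THRESHOLDS = [(0.8, "Critical"), (0.6, "High"), (0.3, "Medium"), (0.0, "Low")]


def suggest_severity(
    user_impact: float, business_impact: float, technical_impact: float
) -> Tuple[str, float]:
    """Combine three impact scores (each 0.0-1.0) into a severity label."""
    scores = {
        "user": user_impact,
        "business": business_impact,
        "technical": technical_impact,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} impact must be between 0.0 and 1.0")
    combined = sum(WEIGHTS[name] * value for name, value in scores.items())
    # Thresholds are ordered from highest to lowest, so the first match wins.
    for threshold, label in THRESHOLDS:
        if combined >= threshold:
            return label, round(combined, 3)
    return "Low", round(combined, 3)
```

Keeping the weights in one table makes it easy to explain a suggestion: each factor's contribution is simply its weight times its score.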

3. Correlation Engine

Links related incidents and detects patterns:

Correlation Types: Same Service, Same Component, Temporal, Pattern Match, Dependency, Cascade
Problem Detection: Identifies when correlations suggest larger problems
Time-based Analysis: Considers temporal proximity of incidents
Service Similarity: Analyzes shared services and components
Pattern Recognition: Detects recurring issues and trends
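
As a rough illustration of combining temporal proximity with keyword overlap, consider the sketch below. The `IncidentSummary` type, the one-hour window, and the equal weighting are assumptions for this example only; the engine's real scoring is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class IncidentSummary:
    """Hypothetical, minimal view of an incident used for correlation."""
    id: str
    service: str
    created_at: datetime
    keywords: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of keywords the two sets have in common (0.0-1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def correlate(
    a: IncidentSummary,
    b: IncidentSummary,
    window: timedelta = timedelta(hours=1),
) -> Optional[dict]:
    """Return a correlation record, or None if the incidents look unrelated."""
    gap = abs(a.created_at - b.created_at)
    if a.service == b.service:
        correlation_type = "Same Service"
    elif gap <= window:
        correlation_type = "Temporal"
    else:
        return None
    # Temporal score decays linearly from 1.0 (simultaneous) to 0.0 (window edge).
    temporal = max(0.0, 1.0 - gap / window)
    keyword = jaccard(a.keywords, b.keywords)
    return {
        "correlation_type": correlation_type,
        "time_difference": gap,
        "shared_keywords": sorted(a.keywords & b.keywords),
        "similarity_score": round(0.5 * temporal + 0.5 * keyword, 3),
    }
```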

4. Duplication Detection

Identifies and manages duplicate incidents:

Duplication Types: Exact, Near Duplicate, Similar, Potential Duplicate
Similarity Analysis: Text, temporal, and service similarity scoring
Merge Recommendations: Suggests actions (Merge, Link, Review, No Action)
Confidence Scoring: AI confidence in duplication detection
Shared Elements: Identifies common elements between incidents
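
The mapping from a similarity score to a duplication type and a recommended action might look like the following sketch. It uses `difflib.SequenceMatcher` from the Python standard library for text similarity; the thresholds and the `assess_duplicate` function are hypothetical, and the module itself may use different algorithms.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical thresholds: (minimum score, duplication type, recommended action).
RULES = [
    (0.95, "Exact", "Merge"),
    (0.85, "Near Duplicate", "Merge"),
    (0.70, "Similar", "Link"),
    (0.50, "Potential Duplicate", "Review"),
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect the score."""
    return " ".join(text.lower().split())


def assess_duplicate(text_a: str, text_b: str) -> Tuple[Optional[str], str, float]:
    """Return (duplication type, recommended action, text similarity)."""
    score = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
    for minimum, duplication_type, action in RULES:
        if score >= minimum:
            return duplication_type, action, round(score, 3)
    return None, "No Action", round(score, 3)
```

A production detector would likely blend this text score with the temporal and service similarity scores listed above before choosing an action.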

API Endpoints

Incidents

Create Incident

POST /api/incidents/incidents/
Content-Type: application/json

{
    "title": "Database Connection Timeout",
    "description": "Users are experiencing timeouts when trying to access the database",
    "free_text": "Database is down, can't connect, getting timeout errors",
    "affected_users": 150,
    "business_impact": "Critical business operations are affected",
    "reporter": 1
}
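
A client could build this request with the Python standard library as sketched below. The base URL, the `build_create_incident_request` helper, and the `Token` authorization scheme are assumptions; this document does not specify which authentication scheme the API uses.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.com"  # hypothetical host


def build_create_incident_request(payload: dict, token: str) -> Request:
    """Build (but do not send) the POST request that creates an incident."""
    return Request(
        url=f"{BASE_URL}/api/incidents/incidents/",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed scheme; replace with whatever your deployment requires.
            "Authorization": f"Token {token}",
        },
    )


request = build_create_incident_request(
    {
        "title": "Database Connection Timeout",
        "description": "Users are experiencing timeouts when trying to access the database",
        "free_text": "Database is down, can't connect, getting timeout errors",
        "affected_users": 150,
        "business_impact": "Critical business operations are affected",
        "reporter": 1,
    },
    token="YOUR_TOKEN",
)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without a live server.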

Get Incident Analysis

GET /api/incidents/incidents/{id}/analysis/

Returns the complete AI analysis for the incident, including:

Classification results
Severity suggestions
Correlations with other incidents
Potential duplicates
Associated patterns

Trigger AI Analysis

POST /api/incidents/incidents/{id}/analyze/

Manually triggers AI analysis for a specific incident.

Get Incident Statistics

GET /api/incidents/incidents/stats/

Returns statistics including:

Total incidents by status and severity
Average resolution time
AI processing statistics
Duplicate and correlation counts

Correlations

Get Correlations

GET /api/incidents/correlations/

Get Problem Indicators

GET /api/incidents/correlations/problem_indicators/

Returns correlations that indicate larger problems.

Duplications

Get Duplications

GET /api/incidents/duplications/

Approve Merge

POST /api/incidents/duplications/{id}/approve_merge/

Reject Merge

POST /api/incidents/duplications/{id}/reject_merge/

Patterns

Get Patterns

GET /api/incidents/patterns/

Get Active Patterns

GET /api/incidents/patterns/active_patterns/

Resolve Pattern

POST /api/incidents/patterns/{id}/resolve_pattern/

Data Models

Incident

id: UUID primary key
title: Incident title
description: Detailed description
free_text: Original free text from user
category: AI-classified category
subcategory: AI-classified subcategory
severity: Current severity level
suggested_severity: AI-suggested severity
status: Current status (Open, In Progress, Resolved, Closed)
assigned_to: Assigned user
reporter: User who reported the incident
affected_users: Number of affected users
business_impact: Business impact description
ai_processed: Whether AI analysis has been completed
is_duplicate: Whether this is a duplicate incident

IncidentClassification

incident: Related incident
predicted_category: AI-predicted category
predicted_subcategory: AI-predicted subcategory
confidence_score: AI confidence (0.0-1.0)
alternative_categories: Alternative predictions
extracted_keywords: Keywords extracted from text
sentiment_score: Sentiment analysis score (-1.0 to 1.0)
urgency_indicators: Detected urgency indicators

SeveritySuggestion

incident: Related incident
suggested_severity: AI-suggested severity
confidence_score: AI confidence (0.0-1.0)
user_impact_score: User impact score (0.0-1.0)
business_impact_score: Business impact score (0.0-1.0)
technical_impact_score: Technical impact score (0.0-1.0)
reasoning: AI explanation for suggestion
impact_factors: Factors that influenced the severity

IncidentCorrelation

primary_incident: Primary incident in correlation
related_incident: Related incident
correlation_type: Type of correlation
confidence_score: Correlation confidence (0.0-1.0)
correlation_strength: Strength of correlation
shared_keywords: Keywords shared between incidents
time_difference: Time difference between incidents
similarity_score: Overall similarity score
is_problem_indicator: Whether this suggests a larger problem

DuplicationDetection

incident_a: First incident in pair
incident_b: Second incident in pair
duplication_type: Type of duplication
similarity_score: Overall similarity score
confidence_score: Duplication confidence (0.0-1.0)
text_similarity: Text similarity score
temporal_proximity: Temporal proximity score
service_similarity: Service similarity score
recommended_action: Recommended action (Merge, Link, Review, No Action)
status: Current status (Detected, Reviewed, Merged, Rejected)

IncidentPattern

name: Pattern name
pattern_type: Type of pattern (Recurring, Seasonal, Trend, Anomaly)
description: Pattern description
frequency: How often the pattern occurs
affected_services: Services affected by the pattern
common_keywords: Common keywords in pattern incidents
incidents: Related incidents
confidence_score: Pattern confidence (0.0-1.0)
is_active: Whether the pattern is active
is_resolved: Whether the pattern is resolved

AI Components

IncidentClassifier

Categories: Predefined categories with keywords
Keyword Extraction: Extracts relevant keywords from text
Sentiment Analysis: Analyzes sentiment of incident text
Urgency Detection: Identifies urgency indicators
Confidence Scoring: Provides confidence scores for classifications

SeverityAnalyzer

Impact Analysis: Analyzes user, business, and technical impact
Severity Indicators: Identifies severity keywords in text
Weighted Scoring: Combines multiple factors for severity suggestion
Reasoning Generation: Provides explanations for severity suggestions

IncidentCorrelationEngine

Similarity Analysis: Calculates various similarity metrics
Temporal Analysis: Considers time-based correlations
Service Analysis: Analyzes service and component similarities
Problem Detection: Identifies patterns that suggest larger problems
Cluster Detection: Groups related incidents into clusters
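
Cluster detection can be understood as finding connected components over pairwise correlations. The sketch below uses a union-find structure for this; `cluster_incidents` is a hypothetical helper and is not necessarily how the engine groups incidents.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_incidents(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group incident IDs into clusters given pairs of correlated incidents."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        """Return the cluster root of a node, compressing the path as we go."""
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the clusters of every correlated pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted((sorted(members) for members in clusters.values()), key=lambda c: c[0])
```

A large cluster is a natural signal for problem detection: many incidents linked together, even indirectly, often share a root cause.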

DuplicationDetector

Text Similarity: Multiple text similarity algorithms
Temporal Proximity: Time-based duplication detection
Service Similarity: Service and component similarity
Metadata Similarity: Similarity based on incident metadata
Merge Recommendations: Suggests appropriate actions

Background Processing

The module uses Celery to run AI analysis in the background, outside the request cycle.

Tasks

process_incident_ai: Processes a single incident with AI analysis
batch_process_incidents_ai: Processes multiple incidents
find_correlations: Finds correlations for an incident
find_duplicates: Finds duplicates for an incident
detect_all_duplicates: Batch duplicate detection
correlate_all_incidents: Batch correlation analysis
merge_duplicate_incidents: Merges duplicate incidents

Processing Logs

All AI processing activities are logged in the AIProcessingLog model for audit and debugging purposes.

Setup and Configuration

1. Install Dependencies

pip install -r requirements.txt

2. Run Migrations

python manage.py makemigrations incident_intelligence
python manage.py migrate

3. Create Sample Data

python manage.py setup_incident_intelligence --create-sample-data --create-patterns

4. Run AI Analysis

python manage.py setup_incident_intelligence --run-ai-analysis

5. Start Celery Worker

celery -A core worker -l info

Usage Examples

Creating an Incident with AI Analysis

from incident_intelligence.models import Incident
from incident_intelligence.tasks import process_incident_ai

# Create incident
incident = Incident.objects.create(
    title="API Response Slow",
    description="The user service API is responding slowly",
    free_text="API is slow, taking forever to respond",
    affected_users=50,
    business_impact="User experience is degraded"
)

# Queue AI analysis on a Celery worker; delay() returns immediately
process_incident_ai.delay(incident.id)

Finding Correlations

from incident_intelligence.ai.correlation import IncidentCorrelationEngine

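# incident_data describes the incident under analysis and all_incidents is the
# candidate set to compare against; both must be prepared by the caller.
engine = IncidentCorrelationEngine()
correlations = engine.find_related_incidents(incident_data, all_incidents)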

Detecting Duplicates

from incident_intelligence.ai.duplication import DuplicationDetector

# As above, incident_data and all_incidents must be prepared by the caller.
detector = DuplicationDetector()
duplicates = detector.find_duplicate_candidates(incident_data, all_incidents)

Performance Considerations

Batch Processing: Use batch operations for large datasets
Caching: Consider caching frequently accessed data
Indexing: Database indexes are configured for optimal query performance
Background Tasks: AI processing runs asynchronously to avoid blocking requests
Rate Limiting: Consider implementing rate limiting for API endpoints
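
For the batch-processing advice above, one simple pattern is to split incident IDs into fixed-size chunks and queue one background job per chunk. The `chunked` helper below is a hypothetical, standard-library-only sketch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be handed to a batch task such as `batch_process_incidents_ai`, whose exact signature is not shown in this document.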

Security Considerations

Authentication: All endpoints require authentication
Authorization: Users can only access incidents they have permission to view
Data Privacy: Sensitive information is handled according to data classification levels
Audit Logging: All AI processing activities are logged for audit purposes

Monitoring and Maintenance

Processing Logs: Monitor AI processing logs for errors and performance
Model Performance: Track AI model accuracy and update as needed
Database Maintenance: Regularly clean up old processing logs and resolved incidents
Health Checks: Monitor Celery workers and Redis for background processing health

Future Enhancements

Machine Learning Models: Integration with more sophisticated ML models
Real-time Processing: Real-time incident analysis and correlation
Advanced NLP: More sophisticated natural language processing
Predictive Analytics: Predictive incident analysis and prevention
Integration APIs: APIs for integrating with external incident management systems
