AI Root Cause Analysis
Automated root cause analysis using LLM-optimized trace payloads — local by default, with optional external LLM API integration and strict privacy guarantees.
CoreSDK's AI Root Cause Analysis (RCA) feature feeds a structured, LLM-optimised trace payload to a language model and returns a root cause hypothesis, a fix suggestion, and a confidence score. The analysis runs locally by default using an embedded model, so no request data ever leaves your infrastructure unless you explicitly opt in to an external LLM endpoint.
Phase note. AI Root Cause Analysis is available from Phase 2. The embedded model requires the Standard plan or above. External LLM API integration is a Phase 3 enterprise feature.
How it works
When an anomaly is detected — elevated error rate, latency spike, policy evaluation timeout, or a manually triggered analysis — CoreSDK:
- Collects the relevant trace window from the local OTEL buffer (default: last 5 minutes, configurable).
- Builds an LLM-optimised trace payload via the `llm-trace-export` pipeline. This payload contains span structure, timing, status codes, error messages, and policy decision outcomes. Variable values and request bodies are never included.
- Submits the payload to the configured model (embedded or external).
- Returns a structured RCA response with a root cause, fix suggestion, confidence score, and supporting span references.
The LLM-optimised payload is a lossy projection of the full trace — it retains causality and timing relationships while stripping all data that could contain personal information.
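The idea of a lossy, value-free projection can be sketched in a few lines of Python. This is an illustrative model only — the field names below are hypothetical and do not reflect the real `llm-trace-export` schema:

```python
# Illustrative sketch: project raw spans into a value-free payload that keeps
# causality (parent/child links) and timing, but discards attribute values.
# Field names are hypothetical, not the actual llm-trace-export schema.

def project_span(span: dict) -> dict:
    """Keep structure, timing, and status; drop every attribute value."""
    return {
        "span_id": span["span_id"],
        "parent_span_id": span.get("parent_span_id"),
        "name": span["name"],
        "kind": span.get("kind", "internal"),
        "duration_ms": span["end_ms"] - span["start_ms"],
        "status": span.get("status", "ok"),
        # attribute names survive; values are intentionally discarded
        "attribute_names": sorted(span.get("attributes", {})),
    }


def build_payload(spans: list[dict]) -> dict:
    """Assemble the scrubbed spans into a single analysis payload."""
    projected = [project_span(s) for s in spans]
    return {"span_count": len(projected), "spans": projected}
```

Note that the projection keeps attribute *names* but not values, so the model can still see, for example, that a `policy.path` attribute was present on a slow span without ever seeing its contents.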
Privacy guarantees
The following categories of data are never included in the LLM trace payload:
- JWT claim values (user IDs, email addresses, and roles are replaced with type labels: `<string>`, `<uuid>`, etc.)
- Request path parameters and query string values
- Request and response bodies
- HTTP headers (only header names are included, not values)
- Policy input document field values (field names and types are included; values are omitted)
- Span attribute values that match PII masking rules (same ruleset as the OTEL `SpanProcessor`)
What the payload does include:
- Span names, kinds, and parent-child relationships
- Timing: start time, duration, and relative ordering
- Status codes (HTTP and gRPC) and error flags
- Error messages from CoreSDK internals (auth failures, policy denials, circuit breaker events)
- Metric summaries: request count, error rate, p50/p95/p99 latency
This design means the analysis degrades gracefully — the model reasons about structure and timing, not values — which is sufficient for the majority of operational failure modes CoreSDK addresses.
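The type-label substitution described in the claim-scrubbing rule above can be sketched as follows. The function is illustrative, not CoreSDK's implementation, and the `<email>` label is an assumption under the documented "etc." of additional labels:

```python
import re
import uuid

# Illustrative sketch of replacing claim values with type labels, as described
# above. CoreSDK's actual masking ruleset is richer; this mirrors the idea only.

def type_label(value: object) -> str:
    """Map a concrete claim value to a privacy-safe type label."""
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return "<bool>"
    if isinstance(value, int):
        return "<int>"
    if isinstance(value, str):
        try:
            uuid.UUID(value)
            return "<uuid>"
        except ValueError:
            pass
        if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
            return "<email>"
        return "<string>"
    return f"<{type(value).__name__}>"


def scrub_claims(claims: dict) -> dict:
    """Replace every claim value with its type label; keep claim names."""
    return {name: type_label(v) for name, v in claims.items()}
```

Applied to a token payload like `{"sub": "123e4567-…", "email": "alice@example.com", "role": "admin"}`, only the labels `<uuid>`, `<email>`, and `<string>` would reach the model.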
Configuration
Local embedded model (default)
No configuration is required to use the embedded model. It is enabled by default when the `ai_rca` block is present:

```yaml
# coresdk.yaml
ai_rca:
  enabled: true
  trace_window_seconds: 300   # how far back to collect spans for a single analysis
  auto_trigger:
    error_rate_threshold: 0.05   # trigger automatically when error rate exceeds 5%
    p99_latency_ms: 2000         # or when p99 exceeds 2000 ms
```

The embedded model is a quantised GGUF model bundled with the sidecar binary. It runs on CPU via llama.cpp and requires no GPU. Inference takes 2–8 seconds depending on host hardware.
External LLM API (opt-in)
To use an external LLM — OpenAI, Anthropic, Azure OpenAI, or any OpenAI-compatible endpoint — set CORESDK_LLM_ENDPOINT and the corresponding API key:
```shell
export CORESDK_LLM_ENDPOINT="https://api.openai.com/v1/chat/completions"
export CORESDK_LLM_API_KEY="sk-..."
export CORESDK_LLM_MODEL="gpt-4o"
```

Or configure in `coresdk.yaml`:
```yaml
ai_rca:
  enabled: true
  llm:
    endpoint: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4o"
    api_key_env: "CORESDK_LLM_API_KEY"   # read from env, never hardcode
    timeout_seconds: 30
    max_tokens: 2048
```

When an external endpoint is configured, the privacy-scrubbed LLM payload is transmitted to that endpoint. No raw span data, variable values, or PII-adjacent fields are included. Review the Privacy guarantees section and your LLM provider's data processing terms before enabling external integration.
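For reference, a chat-completion request to an OpenAI-compatible endpoint of the kind configured above can be sketched as below. The prompt wording and the shape of `scrubbed_payload` are illustrative; the sidecar builds its own prompt internally:

```python
import json
import os
import urllib.request

# Minimal sketch of a chat-completion call to the configured endpoint. Only the
# privacy-scrubbed payload is ever sent; the prompt text here is illustrative,
# not the sidecar's actual prompt.

def build_rca_request(scrubbed_payload: dict) -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat-completion request."""
    endpoint = os.environ["CORESDK_LLM_ENDPOINT"]
    body = {
        "model": os.environ.get("CORESDK_LLM_MODEL", "gpt-4o"),
        "max_tokens": 2048,
        "messages": [
            {"role": "system",
             "content": "You are a root-cause analyst. Reply with JSON."},
            {"role": "user",
             "content": json.dumps(scrubbed_payload)},
        ],
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CORESDK_LLM_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a single `urllib.request.urlopen(req)` call, subject to the configured `timeout_seconds`.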
To use Azure OpenAI:
```yaml
ai_rca:
  enabled: true
  llm:
    endpoint: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
    model: "gpt-4o"
    api_key_env: "AZURE_OPENAI_API_KEY"
```

Triggering an analysis
Automatic triggering
When auto_trigger thresholds are configured, CoreSDK triggers an analysis automatically and writes the result to the audit log and the configured RCA sink.
Manual trigger via CLI
```shell
core rca analyze --window 10m
core rca analyze --trace-id 01HX9K3ZQVFM4NP8WJTY6EGCD2
core rca analyze --window 10m --output rca-report.json
```

Manual trigger via API
```shell
curl -X POST http://localhost:7000/v1/rca/analyze \
  -H "Authorization: Bearer $CORESDK_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"window_seconds": 600}'
```

Python example
```python
from coresdk.client import CoreSDKClient

sdk = CoreSDKClient.from_env()

# Trigger analysis over the last 10 minutes
result = sdk.rca.analyze(window_seconds=600)

print(f"Root cause: {result.root_cause}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Fix: {result.fix_suggestion}")
for span_ref in result.supporting_spans:
    print(f"  - {span_ref.span_name} ({span_ref.trace_id[:8]}): {span_ref.annotation}")
```

Trigger analysis for a specific trace ID:
```python
result = sdk.rca.analyze_trace(trace_id="01HX9K3ZQVFM4NP8WJTY6EGCD2")
if result.confidence >= 0.8:
    # High-confidence result — page on-call
    pager.trigger(
        title=f"RCA: {result.root_cause}",
        body=result.fix_suggestion,
        severity="high",
    )
```

Go example
```go
import (
    "context"
    "fmt"
    "time"

    coresdk "github.com/coresdk/coresdk-go"
)

func runRCA(ctx context.Context, sdk *coresdk.SDK) {
    result, err := sdk.RCA().Analyze(ctx, coresdk.RCAOptions{
        Window: 10 * time.Minute,
    })
    if err != nil {
        // RCA failure is non-fatal — log and continue
        sdk.Logger().Warn("rca analysis failed", "error", err)
        return
    }

    fmt.Printf("Root cause: %s\n", result.RootCause)
    fmt.Printf("Confidence: %.0f%%\n", result.Confidence*100)
    fmt.Printf("Fix suggestion: %s\n", result.FixSuggestion)
    for _, span := range result.SupportingSpans {
        fmt.Printf("  span=%s trace=%s annotation=%s\n",
            span.SpanName, span.TraceID[:8], span.Annotation)
    }
}

// Analyze a specific trace
func analyzeTrace(ctx context.Context, sdk *coresdk.SDK, traceID string) (*coresdk.RCAResult, error) {
    return sdk.RCA().AnalyzeTrace(ctx, traceID)
}
```

Sample RCA response
The following is a representative RCA response JSON returned by the API or the analyze() SDK method:
```json
{
  "analysis_id": "rca_01HX9K3ZQVFM4NP8WJTY6EGCD2",
  "generated_at": "2026-03-19T14:32:07Z",
  "window": {
    "start": "2026-03-19T14:22:07Z",
    "end": "2026-03-19T14:32:07Z"
  },
  "root_cause": "Policy engine pool exhaustion: all 8 worker threads were blocked on synchronous Rego evaluation for >1500ms. Incoming requests queued and timed out after the circuit breaker threshold.",
  "confidence": 0.91,
  "fix_suggestion": "Increase CORESDK_POLICY_POOL_SIZE from 8 to 16, or reduce Rego policy complexity — the 'data.authz.rbac.allow' rule appears to be performing O(n) iteration over a large data set on each evaluation. Consider indexing the roles array.",
  "category": "resource_exhaustion",
  "severity": "high",
  "supporting_spans": [
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD2",
      "span_id": "a1b2c3d4e5f60001",
      "span_name": "policy.evaluate",
      "duration_ms": 1847,
      "status": "timeout",
      "annotation": "Policy evaluation exceeded 1500ms SLO; circuit breaker opened after 10 consecutive timeouts"
    },
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD3",
      "span_id": "a1b2c3d4e5f60002",
      "span_name": "policy.evaluate",
      "duration_ms": 1923,
      "status": "timeout",
      "annotation": "Concurrent evaluation; pool saturation confirmed"
    },
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD4",
      "span_id": "a1b2c3d4e5f60003",
      "span_name": "circuit_breaker.open",
      "duration_ms": 0,
      "status": "event",
      "annotation": "Circuit breaker transitioned to open state; all subsequent policy calls fail-open per CORESDK_FAIL_MODE=open"
    }
  ],
  "metrics_summary": {
    "error_rate_percent": 18.4,
    "p50_latency_ms": 142,
    "p95_latency_ms": 1680,
    "p99_latency_ms": 2100,
    "request_count": 4821,
    "policy_timeout_count": 47
  },
  "model": {
    "name": "coresdk-rca-v1",
    "source": "embedded",
    "version": "1.0.3"
  }
}
```

Sinking RCA results
RCA results can be written to the same sinks as audit events, or to a dedicated sink:
```yaml
ai_rca:
  enabled: true
  sink:
    type: s3
    s3:
      bucket: "my-rca-results"
      prefix: "coresdk-rca/"
      region: "us-east-1"
```

Results are also available via the admin API at `GET /v1/rca/results` and via the CLI:
```shell
core rca list --limit 10
core rca get rca_01HX9K3ZQVFM4NP8WJTY6EGCD2
```

Limitations
- The embedded model is optimised for CoreSDK failure patterns — policy exhaustion, auth misconfiguration, circuit breaker events, PII masking errors, and mTLS handshake failures. Generic application errors in your own code may produce lower-confidence results.
- Confidence scores below 0.5 indicate insufficient signal in the trace window. Increase `trace_window_seconds` or wait for more traffic before re-triggering.
- RCA does not execute remediation actions. It produces a hypothesis and suggestion only. All changes remain the operator's responsibility.
- The embedded model is updated with each CoreSDK release. Pin your sidecar version if you require reproducible RCA output across deployments.
Next steps
- OTEL Integration — configuring the trace pipeline that feeds RCA
- Alerting — automatic alert triggers that can initiate an RCA analysis
- Metrics — the metric summaries included in RCA payloads
- Compliance Controls — how RCA audit trail entries support SOC 2 CC7.2