AI Root Cause Analysis
Automated root cause analysis using LLM-optimized trace payloads — local by default, with optional external LLM API integration and strict privacy guarantees.
CoreSDK's AI Root Cause Analysis (RCA) feature feeds a structured, LLM-optimised trace payload to a language model and returns a root cause hypothesis, a fix suggestion, and a confidence score. The analysis runs locally by default using an embedded model, so no request data ever leaves your infrastructure unless you explicitly opt in to an external LLM endpoint.
Phase note. AI Root Cause Analysis is available from Phase 2. The embedded model requires the Standard plan or above. External LLM API integration is a Phase 3 enterprise feature.
How it works
When an anomaly is detected — elevated error rate, latency spike, policy evaluation timeout, or a manually triggered analysis — CoreSDK:
- Collects the relevant trace window from the local OTEL buffer (default: last 5 minutes, configurable).
- Builds an LLM-optimised trace payload via the `llm-trace-export` pipeline. This payload contains span structure, timing, status codes, error messages, and policy decision outcomes. Variable values and request bodies are never included.
- Submits the payload to the configured model (embedded or external).
- Returns a structured RCA response with a root cause, fix suggestion, confidence score, and supporting span references.
The LLM-optimised payload is a lossy projection of the full trace — it retains causality and timing relationships while stripping all data that could contain personal information.
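The idea of a lossy, value-free projection can be sketched in a few lines of Python. This is an illustrative model only — the field names below are hypothetical and do not reflect the real `llm-trace-export` schema:

```python
# Illustrative sketch: project raw spans into a value-free payload that keeps
# causality (parent/child links) and timing, but discards attribute values.
# Field names are hypothetical, not the actual llm-trace-export schema.

def project_span(span: dict) -> dict:
    """Keep structure, timing, and status; drop every attribute value."""
    return {
        "span_id": span["span_id"],
        "parent_span_id": span.get("parent_span_id"),
        "name": span["name"],
        "kind": span.get("kind", "internal"),
        "duration_ms": span["end_ms"] - span["start_ms"],
        "status": span.get("status", "ok"),
        # attribute names survive; values are intentionally discarded
        "attribute_names": sorted(span.get("attributes", {})),
    }


def build_payload(spans: list[dict]) -> dict:
    """Assemble the scrubbed spans into a single analysis payload."""
    projected = [project_span(s) for s in spans]
    return {"span_count": len(projected), "spans": projected}
```

Note that the projection keeps attribute *names* but not values, so the model can still see, for example, that a `policy.path` attribute was present on a slow span without ever seeing its contents.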
Privacy guarantees
The following categories of data are never included in the LLM trace payload:
- JWT claim values (user IDs, email addresses, and roles are replaced with type labels: `<string>`, `<uuid>`, etc.)
- Request path parameters and query string values
- Request and response bodies
- HTTP headers (only header names are included, not values)
- Policy input document field values (field names and types are included; values are omitted)
- Span attribute values that match PII masking rules (same ruleset as the OTEL `SpanProcessor`)
What the payload does include:
- Span names, kinds, and parent-child relationships
- Timing: start time, duration, and relative ordering
- Status codes (HTTP and gRPC) and error flags
- Error messages from CoreSDK internals (auth failures, policy denials, circuit breaker events)
- Metric summaries: request count, error rate, p50/p95/p99 latency
This design means the analysis degrades gracefully — the model reasons about structure and timing, not values — which is sufficient for the majority of operational failure modes CoreSDK addresses.
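The type-label substitution described in the claim-scrubbing rule above can be sketched as follows. The function is illustrative, not CoreSDK's implementation, and the `<email>` label is an assumption under the documented "etc." of additional labels:

```python
import re
import uuid

# Illustrative sketch of replacing claim values with type labels, as described
# above. CoreSDK's actual masking ruleset is richer; this mirrors the idea only.

def type_label(value: object) -> str:
    """Map a concrete claim value to a privacy-safe type label."""
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return "<bool>"
    if isinstance(value, int):
        return "<int>"
    if isinstance(value, str):
        try:
            uuid.UUID(value)
            return "<uuid>"
        except ValueError:
            pass
        if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
            return "<email>"
        return "<string>"
    return f"<{type(value).__name__}>"


def scrub_claims(claims: dict) -> dict:
    """Replace every claim value with its type label; keep claim names."""
    return {name: type_label(v) for name, v in claims.items()}
```

Applied to a token payload like `{"sub": "123e4567-…", "email": "alice@example.com", "role": "admin"}`, only the labels `<uuid>`, `<email>`, and `<string>` would reach the model.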
Configuration
Local embedded model (default)
No configuration is required to use the embedded model. It is enabled by default when the `ai_rca` block is present:

```yaml
# coresdk.yaml
ai_rca:
  enabled: true
  trace_window_seconds: 300   # how far back to collect spans for a single analysis
  auto_trigger:
    error_rate_threshold: 0.05   # trigger automatically when error rate exceeds 5%
    p99_latency_ms: 2000         # or when p99 exceeds 2000 ms
```

The embedded model is a quantised GGUF model bundled with the sidecar binary. It runs on CPU via llama.cpp and requires no GPU. Inference takes 2–8 seconds depending on host hardware.
External LLM API (opt-in)
To use an external LLM — OpenAI, Anthropic, Azure OpenAI, or any OpenAI-compatible endpoint — set CORESDK_LLM_ENDPOINT and the corresponding API key:
```shell
export CORESDK_LLM_ENDPOINT="https://api.openai.com/v1/chat/completions"
export CORESDK_LLM_API_KEY="sk-..."
export CORESDK_LLM_MODEL="gpt-4o"
```

Or configure in `coresdk.yaml`:
```yaml
ai_rca:
  enabled: true
  llm:
    endpoint: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4o"
    api_key_env: "CORESDK_LLM_API_KEY"   # read from env, never hardcode
    timeout_seconds: 30
    max_tokens: 2048
```

When an external endpoint is configured, the privacy-scrubbed LLM payload is transmitted to that endpoint. No raw span data, variable values, or PII-adjacent fields are included. Review the Privacy guarantees section and your LLM provider's data processing terms before enabling external integration.
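For reference, a chat-completion request to an OpenAI-compatible endpoint of the kind configured above can be sketched as below. The prompt wording and the shape of `scrubbed_payload` are illustrative; the sidecar builds its own prompt internally:

```python
import json
import os
import urllib.request

# Minimal sketch of a chat-completion call to the configured endpoint. Only the
# privacy-scrubbed payload is ever sent; the prompt text here is illustrative,
# not the sidecar's actual prompt.

def build_rca_request(scrubbed_payload: dict) -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat-completion request."""
    endpoint = os.environ["CORESDK_LLM_ENDPOINT"]
    body = {
        "model": os.environ.get("CORESDK_LLM_MODEL", "gpt-4o"),
        "max_tokens": 2048,
        "messages": [
            {"role": "system",
             "content": "You are a root-cause analyst. Reply with JSON."},
            {"role": "user",
             "content": json.dumps(scrubbed_payload)},
        ],
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['CORESDK_LLM_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a single `urllib.request.urlopen(req)` call, subject to the configured `timeout_seconds`.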
To use Azure OpenAI:
```yaml
ai_rca:
  enabled: true
  llm:
    endpoint: "https://my-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
    model: "gpt-4o"
    api_key_env: "AZURE_OPENAI_API_KEY"
```

Triggering an analysis
Automatic triggering
When auto_trigger thresholds are configured, CoreSDK triggers an analysis automatically and writes the result to the audit log and the configured RCA sink.
Manual trigger via CLI
```shell
core rca analyze --window 10m
core rca analyze --trace-id 01HX9K3ZQVFM4NP8WJTY6EGCD2
core rca analyze --window 10m --output rca-report.json
```

Manual trigger via API
```shell
curl -X POST http://localhost:7000/v1/rca/analyze \
  -H "Authorization: Bearer $CORESDK_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"window_seconds": 600}'
```

Python example
```python
from coresdk.client import CoreSDKClient

sdk = CoreSDKClient.from_env()

# Trigger analysis over the last 10 minutes
result = sdk.rca.analyze(window_seconds=600)

print(f"Root cause: {result.root_cause}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Fix: {result.fix_suggestion}")
for span_ref in result.supporting_spans:
    print(f"  - {span_ref.span_name} ({span_ref.trace_id[:8]}): {span_ref.annotation}")
```

Trigger analysis for a specific trace ID:
```python
result = sdk.rca.analyze_trace(trace_id="01HX9K3ZQVFM4NP8WJTY6EGCD2")
if result.confidence >= 0.8:
    # High-confidence result — page on-call
    pager.trigger(
        title=f"RCA: {result.root_cause}",
        body=result.fix_suggestion,
        severity="high",
    )
```

Go example
```go
import (
    "context"
    "fmt"
    "time"

    coresdk "github.com/coresdk/coresdk-go"
)

func runRCA(ctx context.Context, sdk *coresdk.SDK) {
    result, err := sdk.RCA().Analyze(ctx, coresdk.RCAOptions{
        Window: 10 * time.Minute,
    })
    if err != nil {
        // RCA failure is non-fatal — log and continue
        sdk.Logger().Warn("rca analysis failed", "error", err)
        return
    }

    fmt.Printf("Root cause: %s\n", result.RootCause)
    fmt.Printf("Confidence: %.0f%%\n", result.Confidence*100)
    fmt.Printf("Fix suggestion: %s\n", result.FixSuggestion)
    for _, span := range result.SupportingSpans {
        fmt.Printf("  span=%s trace=%s annotation=%s\n",
            span.SpanName, span.TraceID[:8], span.Annotation)
    }
}

// Analyze a specific trace
func analyzeTrace(ctx context.Context, sdk *coresdk.SDK, traceID string) (*coresdk.RCAResult, error) {
    return sdk.RCA().AnalyzeTrace(ctx, traceID)
}
```

Sample RCA response
The following is a representative RCA response JSON returned by the API or the analyze() SDK method:
```json
{
  "analysis_id": "rca_01HX9K3ZQVFM4NP8WJTY6EGCD2",
  "generated_at": "2026-03-19T14:32:07Z",
  "window": {
    "start": "2026-03-19T14:22:07Z",
    "end": "2026-03-19T14:32:07Z"
  },
  "root_cause": "Policy engine pool exhaustion: all 8 worker threads were blocked on synchronous Rego evaluation for >1500ms. Incoming requests queued and timed out after the circuit breaker threshold.",
  "confidence": 0.91,
  "fix_suggestion": "Increase CORESDK_POLICY_POOL_SIZE from 8 to 16, or reduce Rego policy complexity — the 'data.authz.rbac.allow' rule appears to be performing O(n) iteration over a large data set on each evaluation. Consider indexing the roles array.",
  "category": "resource_exhaustion",
  "severity": "high",
  "supporting_spans": [
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD2",
      "span_id": "a1b2c3d4e5f60001",
      "span_name": "policy.evaluate",
      "duration_ms": 1847,
      "status": "timeout",
      "annotation": "Policy evaluation exceeded 1500ms SLO; circuit breaker opened after 10 consecutive timeouts"
    },
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD3",
      "span_id": "a1b2c3d4e5f60002",
      "span_name": "policy.evaluate",
      "duration_ms": 1923,
      "status": "timeout",
      "annotation": "Concurrent evaluation; pool saturation confirmed"
    },
    {
      "trace_id": "01HX9K3ZQVFM4NP8WJTY6EGCD4",
      "span_id": "a1b2c3d4e5f60003",
      "span_name": "circuit_breaker.open",
      "duration_ms": 0,
      "status": "event",
      "annotation": "Circuit breaker transitioned to open state; all subsequent policy calls fail-open per CORESDK_FAIL_MODE=open"
    }
  ],
  "metrics_summary": {
    "error_rate_percent": 18.4,
    "p50_latency_ms": 142,
    "p95_latency_ms": 1680,
    "p99_latency_ms": 2100,
    "request_count": 4821,
    "policy_timeout_count": 47
  },
  "model": {
    "name": "coresdk-rca-v1",
    "source": "embedded",
    "version": "1.0.3"
  }
}
```

Sinking RCA results
RCA results can be written to the same sinks as audit events, or to a dedicated sink:
```yaml
ai_rca:
  enabled: true
  sink:
    type: s3
    s3:
      bucket: "my-rca-results"
      prefix: "coresdk-rca/"
      region: "us-east-1"
```

Results are also available via the admin API at `GET /v1/rca/results` and via the CLI:
```shell
core rca list --limit 10
core rca get rca_01HX9K3ZQVFM4NP8WJTY6EGCD2
```

Limitations
- The embedded model is optimised for CoreSDK failure patterns — policy exhaustion, auth misconfiguration, circuit breaker events, PII masking errors, and mTLS handshake failures. Generic application errors in your own code may produce lower-confidence results.
- Confidence scores below 0.5 indicate insufficient signal in the trace window. Increase `trace_window_seconds` or wait for more traffic before re-triggering.
- RCA does not execute remediation actions. It produces a hypothesis and suggestion only. All changes remain the operator's responsibility.
- The embedded model is updated with each CoreSDK release. Pin your sidecar version if you require reproducible RCA output across deployments.
Next steps
- OTEL Integration — configuring the trace pipeline that feeds RCA
- Alerting — automatic alert triggers that can initiate an RCA analysis
- Metrics — the metric summaries included in RCA payloads
- Compliance Controls — how RCA audit trail entries support SOC 2 CC7.2