
SLA / SLO Tracking

Map CoreSDK OTel metrics to SLO signals, configure Prometheus recording rules for burn rate, and integrate with Nobl9, Datadog SLOs, and Google Cloud Monitoring.

Phase note: ships in Phase 2.


CoreSDK emits the OTel metrics you need to define and track SLOs without additional instrumentation. Every metric carries tenant_id and service_name labels so you can define per-tenant SLOs and compare against overall service targets.

SLO tracking in CoreSDK covers three primary reliability signals:

Signal                SLO Type       CoreSDK Metric
Auth latency          Latency        coresdk.auth.latency_ms
Policy eval latency   Latency        coresdk.policy.eval_latency_ms
Sidecar uptime        Availability   coresdk.sidecar.uptime
Request error rate    Error rate     coresdk.requests.error_rate

Rust API — SloTracker

SloTracker is the embedded SLO engine that computes budget consumption and breach state from the OTel metrics stream. It is used internally by the sidecar and can be embedded directly in Rust services:

use coresdk_slo::{SloTracker, SloConfig, SloWindow};
use std::time::Duration;

// Define an SLO
let config = SloConfig {
    name:        "auth-latency-p99".to_string(),
    objective:   0.999,               // 99.9% of requests must succeed
    window:      SloWindow::Rolling(Duration::from_secs(30 * 24 * 3600)), // 30 days
    // A "good event" is any auth request completing under 10ms
    good_metric: "coresdk.auth.latency_ms".to_string(),
    good_threshold_ms: Some(10),
    total_metric: "coresdk.auth.latency_ms".to_string(),
};

let tracker = SloTracker::new(config);

// Feed metric observations — call this from your OTel metric exporter
tracker.record_good(1);   // one request under threshold
tracker.record_total(1);  // one request total

// Query the current state
let budget  = tracker.error_budget_remaining(); // fraction remaining; 1.0 = untouched, negative when overspent
let breach  = tracker.is_breaching();           // true if budget < 0
let avail   = tracker.availability();           // e.g. 0.9994 (good / total)

println!("Error budget remaining: {:.1}%", budget * 100.0);
println!("Availability:           {:.4}%", avail * 100.0);
println!("Breaching:              {}", breach);

SloWindow

Variant                          Description
SloWindow::Rolling(Duration)     Sliding window; most common for latency / error-rate SLOs
SloWindow::Calendar { period }   Calendar-aligned (month, quarter); used for uptime SLAs
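
A short sketch of constructing each variant. CalendarPeriod::Month is an assumed name for the calendar period type; check the coresdk_slo API docs for the exact spelling:

use coresdk_slo::{CalendarPeriod, SloWindow}; // CalendarPeriod is an assumed type name
use std::time::Duration;

// Sliding 7-day window, re-evaluated continuously
let rolling = SloWindow::Rolling(Duration::from_secs(7 * 24 * 3600));

// Calendar-aligned monthly window, resetting at the start of each month
let monthly = SloWindow::Calendar { period: CalendarPeriod::Month };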

Integration with OTel metrics

SloTracker reads from the same OTel metric stream that CoreSDK exports. Attach it to the sidecar metrics pipeline via the sidecar YAML:

slo:
  - name: auth-latency-p99
    objective: 0.999
    window: 30d
    good_metric: coresdk.auth.latency_ms
    good_threshold_ms: 10
    total_metric: coresdk.auth.latency_ms

The sidecar exposes breach state and budget remaining as OTel gauge metrics (coresdk.slo.budget_remaining, coresdk.slo.is_breaching) so they flow into your existing dashboards without additional instrumentation.
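
These gauges can feed alerts directly. A minimal sketch of a Prometheus alert on the remaining-budget gauge, assuming the configured SLO name is exposed in an slo label (the label name is an assumption):

- alert: CoreSDKSloBudgetLow
  # Fires when less than 20% of the error budget is left for this SLO
  expr: coresdk_slo_budget_remaining{slo="auth-latency-p99"} < 0.2
  for: 10m
  labels:
    severity: warning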


Built-in SLO metrics

coresdk.auth.latency_ms

Histogram of JWT validation and authorization latency in milliseconds, measured at the sidecar.

Label          Values                   Description
tenant_id      string                   Tenant context for the request
service_name   string                   OTel service.name of the caller
outcome        allowed, denied, error   Auth decision result

Use the p99 bucket to drive latency SLOs. Example SLO objective: p99 auth latency < 10ms over a 30-day window.
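
For dashboards, the p99 itself can be computed from the exported histogram with histogram_quantile; this sketch assumes the Prometheus-translated metric name used in the recording rules later on this page:

# p99 auth latency (ms) over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(coresdk_auth_latency_ms_bucket[5m])))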

coresdk.policy.eval_latency_ms

Histogram of Rego policy evaluation latency in milliseconds. The engine pool ensures p99 < 2ms for policies under 50 rules on warm cache.

Label       Values               Description
tenant_id   string               Tenant context
rule        string               Fully-qualified Rego rule path (e.g. data.billing.can_upgrade)
outcome     allow, deny, error   Policy decision
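
The same histogram_quantile pattern, grouped by the rule label, verifies the p99 < 2ms target per Rego rule (again assuming Prometheus metric naming):

# p99 policy eval latency (ms) per rule
histogram_quantile(0.99, sum by (le, rule) (rate(coresdk_policy_eval_latency_ms_bucket[5m])))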

coresdk.sidecar.uptime

Gauge. Value is 1 when the sidecar is healthy (passes /readyz), 0 otherwise. The scrape interval sets availability resolution: at a 15s interval, each missed scrape represents roughly 0.017% of a day (15 / 86,400).
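
A rolling 30-day availability figure comes straight from avg_over_time over the gauge:

# Fraction of the last 30 days the sidecar reported healthy
avg_over_time(coresdk_sidecar_uptime[30d])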

coresdk.requests.error_rate

Counter of requests that resulted in an RFC 9457 error response. Take rate() of it and of coresdk.requests.total, then divide for a ratio, as in the recording rules below.


Prometheus recording rules

Define recording rules to pre-compute SLO burn rate. Recording rules are evaluated at the rule group's interval (30s below) and stored as new time series, so dashboards and alerts query the pre-computed series instead of re-evaluating the expression on every query.

groups:
  - name: coresdk_slo
    interval: 30s
    rules:

      # Auth latency — fraction of requests exceeding 10ms threshold
      - record: coresdk:auth_latency_slo_error_ratio:5m
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[5m])
            /
            rate(coresdk_auth_latency_ms_count[5m])
          )

      # Policy eval latency — fraction of evals exceeding 2ms threshold
      - record: coresdk:policy_latency_slo_error_ratio:5m
        expr: |
          1 - (
            rate(coresdk_policy_eval_latency_ms_bucket{le="2"}[5m])
            /
            rate(coresdk_policy_eval_latency_ms_count[5m])
          )

      # Sidecar availability — fraction of scrapes where the sidecar was healthy
      - record: coresdk:sidecar_availability:5m
        expr: avg_over_time(coresdk_sidecar_uptime[5m])

      # Error rate ratio
      - record: coresdk:request_error_ratio:5m
        expr: |
          rate(coresdk_requests_error_rate_total[5m])
          /
          rate(coresdk_requests_total[5m])

      # Longer windows for the multi-window burn rate alerts below
      - record: coresdk:auth_latency_slo_error_ratio:1h
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[1h])
            /
            rate(coresdk_auth_latency_ms_count[1h])
          )

      - record: coresdk:auth_latency_slo_error_ratio:6h
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[6h])
            /
            rate(coresdk_auth_latency_ms_count[6h])
          )

Multi-window burn rate alerts

Multi-window burn rate alerting catches SLO budget exhaustion at two timescales: a 1h window detects fast burns and a 6h window detects slow burns that a single short window would miss. Each alert also requires the short 5m ratio to stay elevated, so alerts resolve quickly once the burn stops.

groups:
  - name: coresdk_slo_alerts
    rules:

      # Fast burn: consuming 14.4× the error budget over 1h
      # This burns the full 30-day budget in ~2 days
      - alert: CoreSDKAuthLatencySLOFastBurn
        expr: |
          coresdk:auth_latency_slo_error_ratio:5m > (14.4 * 0.001)
          and
          coresdk:auth_latency_slo_error_ratio:1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: auth_latency
        annotations:
          summary: "CoreSDK auth latency SLO fast burn ({{ $labels.service_name }})"
          description: >
            Error ratio {{ $value | humanizePercentage }} is burning the 30-day
            SLO budget at 14.4× rate. Budget exhaustion in ~2 days.

      # Slow burn: consuming 6× the error budget over 6h
      # This burns the full 30-day budget in ~5 days
      - alert: CoreSDKAuthLatencySLOSlowBurn
        expr: |
          coresdk:auth_latency_slo_error_ratio:5m > (6 * 0.001)
          and
          coresdk:auth_latency_slo_error_ratio:6h > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: auth_latency
        annotations:
          summary: "CoreSDK auth latency SLO slow burn ({{ $labels.service_name }})"

      # Sidecar availability SLO breach
      - alert: CoreSDKSidecarAvailabilitySLOBreach
        expr: coresdk:sidecar_availability:5m < 0.999
        for: 5m
        labels:
          severity: critical
          slo: sidecar_availability
        annotations:
          summary: "CoreSDK sidecar availability below 99.9% ({{ $labels.service_name }})"

Grafana dashboard setup

Import the CoreSDK SLO dashboard or build your own from the recording rules.

Auth latency SLO compliance panel

# PromQL for SLO compliance gauge (target: 99.9%)
1 - coresdk:auth_latency_slo_error_ratio:5m

Use a stat panel with thresholds: green at or above 0.999, yellow between 0.995 and 0.999, red below 0.995.

Error budget remaining (30-day)

# Error budget remaining (99.9% target → 0.1% budget)
1 - (
  avg_over_time(coresdk:auth_latency_slo_error_ratio:5m[30d])
  / 0.001
)

This shows what fraction of the 0.1% error budget remains in the current 30-day window; avg_over_time makes the result independent of how often the recording rule is sampled. For scale, a 99.9% objective over 30 days allows about 43 minutes of bad time (0.001 × 30 × 24 × 60 ≈ 43.2).

Policy eval latency heatmap

sum by (le) (rate(coresdk_policy_eval_latency_ms_bucket[5m]))

Use a heatmap panel to visualize the latency distribution over time; add a tenant_id label filter to compare individual tenants.


Integration with external SLO systems

Nobl9

Nobl9 can scrape CoreSDK metrics directly from Prometheus. Define an SLO resource in Nobl9's YAML format:

apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: coresdk-auth-latency
spec:
  service: orders-service
  indicator:
    rawMetric:
      query:
        prometheus:
          promql: |
            rate(coresdk_auth_latency_ms_bucket{le="10"}[{{.Window}}])
            /
            rate(coresdk_auth_latency_ms_count[{{.Window}}])
  objectives:
    - displayName: "99.9% of auth calls under 10ms"
      target: 0.999
      value: 1.0
      op: lte
  timeWindows:
    - duration: 30d
      isRolling: true
  budgetingMethod: Occurrences
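
Apply the resource with Nobl9's sloctl CLI (the manifest filename here is illustrative):

sloctl apply -f coresdk-auth-latency.yaml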

Datadog SLOs

Use Datadog's Metric-Based SLO with the CoreSDK Prometheus metrics forwarded via the Datadog Agent's OpenMetrics integration.

1. Configure the Datadog Agent to scrape http://localhost:9091/metrics (the sidecar Prometheus endpoint).
2. Create a Metric-Based SLO in the Datadog UI:
   • Good events: coresdk.auth.latency_ms.count{outcome:allowed} with histogram bucket filter le:10
   • Total events: coresdk.auth.latency_ms.count
   • Target: 99.9% over 30 days

Google Cloud Monitoring

Use the OpenTelemetry Collector's googlecloud exporter to forward CoreSDK metrics to Cloud Monitoring, then define a Service-Level Objective in Cloud Monitoring:

# otel-collector-config.yaml (relevant snippet)
exporters:
  googlecloud:
    project: my-gcp-project
    metric:
      prefix: custom.googleapis.com/coresdk

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [googlecloud]

Once metrics are in Cloud Monitoring, create an SLO from the Cloud Console under Monitoring → Services → Create SLO, selecting custom.googleapis.com/coresdk/auth/latency_ms as the indicator metric.

