SLA / SLO Tracking
Map CoreSDK OTel metrics to SLO signals, configure Prometheus recording rules for burn rate, and integrate with Nobl9, Datadog SLOs, and Google Cloud Monitoring.
Phase note. Ships in Phase 2.
CoreSDK emits the OTel metrics you need to define and track SLOs without additional instrumentation. Every metric carries tenant_id and service_name labels so you can define per-tenant SLOs and compare against overall service targets.
SLO tracking in CoreSDK covers three primary reliability signals:
| Signal | SLO Type | CoreSDK Metric |
|---|---|---|
| Auth latency | Latency | coresdk.auth.latency_ms |
| Policy eval latency | Latency | coresdk.policy.eval_latency_ms |
| Sidecar uptime | Availability | coresdk.sidecar.uptime |
| Request error rate | Error rate | coresdk.requests.error_rate |
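Because every metric carries the tenant_id label, a per-tenant SLI can be computed straight from the exported series. A minimal sketch, assuming the metrics land in Prometheus under the usual OTel name mapping (coresdk_auth_latency_ms_bucket / _count, the same names the recording rules later on this page use):

# Per-tenant fraction of auth requests completing under the 10ms threshold (5m window)
sum by (tenant_id) (rate(coresdk_auth_latency_ms_bucket{le="10"}[5m]))
/
sum by (tenant_id) (rate(coresdk_auth_latency_ms_count[5m]))

Comparing the per-tenant ratio against the overall service target highlights tenants that are burning budget faster than the fleet.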
Rust API — SloTracker
SloTracker is the embedded SLO engine that computes budget consumption and breach state from the OTel metrics stream. It is used internally by the sidecar and can be embedded directly in Rust services:
use coresdk_slo::{SloTracker, SloConfig, SloWindow};
use std::time::Duration;

// Define an SLO
let config = SloConfig {
    name: "auth-latency-p99".to_string(),
    objective: 0.999, // 99.9% of requests must succeed
    window: SloWindow::Rolling(Duration::from_secs(30 * 24 * 3600)), // 30 days
    // A "good event" is any auth request completing under 10ms
    good_metric: "coresdk.auth.latency_ms".to_string(),
    good_threshold_ms: Some(10),
    total_metric: "coresdk.auth.latency_ms".to_string(),
};

let tracker = SloTracker::new(config);

// Feed metric observations — call this from your OTel metric exporter
tracker.record_good(1);  // one request under threshold
tracker.record_total(1); // one request total

// Query the current state
let budget = tracker.error_budget_remaining(); // 0.0 – 1.0 fraction remaining
let breach = tracker.is_breaching();           // true if budget < 0
let avail = tracker.availability();            // e.g. 0.9994 (good / total)

println!("Error budget remaining: {:.1}%", budget * 100.0);
println!("Availability: {:.4}%", avail * 100.0);
println!("Breaching: {}", breach);

SloWindow
| Variant | Description |
|---|---|
| SloWindow::Rolling(Duration) | Sliding window; most common for latency / error-rate SLOs |
| SloWindow::Calendar { period } | Calendar-aligned (month, quarter); used for uptime SLAs |
Integration with OTel metrics
SloTracker reads from the same OTel metric stream that CoreSDK exports. Attach it to the sidecar metrics pipeline via the sidecar YAML:
slo:
  - name: auth-latency-p99
    objective: 0.999
    window: 30d
    good_metric: coresdk.auth.latency_ms
    good_threshold_ms: 10
    total_metric: coresdk.auth.latency_ms

The sidecar exposes breach state and budget remaining as OTel gauge metrics (coresdk.slo.budget_remaining, coresdk.slo.is_breaching) so they flow into your existing dashboards without additional instrumentation.
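Those gauges can be queried directly from Prometheus. A sketch, assuming the default name mapping (coresdk_slo_budget_remaining, coresdk_slo_is_breaching) and an illustrative slo label carrying the SLO name:

# Auth-latency SLO with more than 75% of its error budget consumed (slo label is illustrative)
coresdk_slo_budget_remaining{slo="auth-latency-p99"} < 0.25

# SLOs currently in breach
coresdk_slo_is_breaching == 1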
Built-in SLO metrics
coresdk.auth.latency_ms
Histogram of JWT validation and authorization latency in milliseconds, measured at the sidecar.
| Label | Values | Description |
|---|---|---|
| tenant_id | string | Tenant context for the request |
| service_name | string | OTel service.name of the caller |
| outcome | allowed, denied, error | Auth decision result |
Use the p99 bucket to drive latency SLOs. Example SLO objective: p99 auth latency < 10ms over a 30-day window.
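For ad-hoc checks against that objective, an approximate p99 can be derived from the exported buckets (assuming the Prometheus name mapping used throughout this page):

# Approximate p99 auth latency per service over the last 5m
histogram_quantile(
  0.99,
  sum by (service_name, le) (rate(coresdk_auth_latency_ms_bucket[5m]))
)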
coresdk.policy.eval_latency_ms
Histogram of Rego policy evaluation latency in milliseconds. The engine pool ensures p99 < 2ms for policies under 50 rules on warm cache.
| Label | Values | Description |
|---|---|---|
| tenant_id | string | Tenant context |
| rule | string | Fully-qualified Rego rule path (e.g. data.billing.can_upgrade) |
| outcome | allow, deny, error | Policy decision |
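To check the p99 < 2ms figure for a specific policy, the same histogram_quantile pattern can be grouped by rule (again assuming the standard bucket metric name):

# Approximate p99 policy evaluation latency per Rego rule over the last 5m
histogram_quantile(
  0.99,
  sum by (rule, le) (rate(coresdk_policy_eval_latency_ms_bucket[5m]))
)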
coresdk.sidecar.uptime
Gauge. Value is 1 when the sidecar is healthy (passes /readyz), 0 otherwise. The scrape interval sets availability granularity: a 15s scrape interval resolves availability to roughly 0.017% per day (15s out of 86,400s).
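For a quick 30-day availability figure outside the recording rules, the gauge can be averaged directly (assuming the exported name coresdk_sidecar_uptime):

# 30-day sidecar availability: fraction of samples reporting healthy
avg_over_time(coresdk_sidecar_uptime[30d])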
coresdk.requests.error_rate
Counter of requests that resulted in an RFC 9457 error response. Divide by coresdk.requests.total for a ratio.
Prometheus recording rules
Define recording rules to pre-compute SLO burn rate. Recording rules are evaluated at the rule group's evaluation interval (30s below) and stored as new time series — dashboards and alerts query the pre-computed series instead of recalculating on each request.
groups:
  - name: coresdk_slo
    interval: 30s
    rules:
      # Auth latency — fraction of requests exceeding 10ms threshold
      - record: coresdk:auth_latency_slo_error_ratio:5m
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[5m])
            /
            rate(coresdk_auth_latency_ms_count[5m])
          )
      # Longer windows of the same ratio, used by the multi-window burn rate alerts below
      - record: coresdk:auth_latency_slo_error_ratio:1h
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[1h])
            /
            rate(coresdk_auth_latency_ms_count[1h])
          )
      - record: coresdk:auth_latency_slo_error_ratio:6h
        expr: |
          1 - (
            rate(coresdk_auth_latency_ms_bucket{le="10"}[6h])
            /
            rate(coresdk_auth_latency_ms_count[6h])
          )
      # Policy eval latency — fraction of evals exceeding 2ms threshold
      - record: coresdk:policy_latency_slo_error_ratio:5m
        expr: |
          1 - (
            rate(coresdk_policy_eval_latency_ms_bucket{le="2"}[5m])
            /
            rate(coresdk_policy_eval_latency_ms_count[5m])
          )
      # Sidecar availability — fraction of the window the sidecar reported healthy
      - record: coresdk:sidecar_availability:5m
        expr: avg_over_time(coresdk_sidecar_uptime[5m])
      # Error rate ratio
      - record: coresdk:request_error_ratio:5m
        expr: |
          rate(coresdk_requests_error_rate_total[5m])
          /
          rate(coresdk_requests_total[5m])

Multi-window burn rate alerts
Multi-window burn rate alerting catches SLO budget exhaustion at two timescales simultaneously: the fast-burn alert pairs the 5m and 1h error ratios to catch rapid budget consumption, and the slow-burn alert pairs the 5m and 6h ratios to catch sustained, lower-grade burns the fast alert would miss. Requiring both windows to exceed the threshold keeps alerts from firing on short spikes and lets them clear quickly once the problem is fixed.
groups:
  - name: coresdk_slo_alerts
    rules:
      # Fast burn: consuming 14.4× the error budget over 1h
      # This burns the full 30-day budget in ~2 days
      - alert: CoreSDKAuthLatencySLOFastBurn
        expr: |
          coresdk:auth_latency_slo_error_ratio:5m > (14.4 * 0.001)
          and
          coresdk:auth_latency_slo_error_ratio:1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: auth_latency
        annotations:
          summary: "CoreSDK auth latency SLO fast burn ({{ $labels.service_name }})"
          description: >
            Error ratio {{ $value | humanizePercentage }} is burning the 30-day
            SLO budget at 14.4× rate. Budget exhaustion in ~2 days.
      # Slow burn: consuming 6× the error budget over 6h
      # This burns the full 30-day budget in ~5 days
      - alert: CoreSDKAuthLatencySLOSlowBurn
        expr: |
          coresdk:auth_latency_slo_error_ratio:5m > (6 * 0.001)
          and
          coresdk:auth_latency_slo_error_ratio:6h > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: auth_latency
        annotations:
          summary: "CoreSDK auth latency SLO slow burn ({{ $labels.service_name }})"
      # Sidecar availability SLO breach
      - alert: CoreSDKSidecarAvailabilitySLOBreach
        expr: coresdk:sidecar_availability:5m < 0.999
        for: 5m
        labels:
          severity: critical
          slo: sidecar_availability
        annotations:
          summary: "CoreSDK sidecar availability below 99.9% ({{ $labels.service_name }})"

Grafana dashboard setup
Import the CoreSDK SLO dashboard or build your own from the recording rules.
Recommended panels
Auth latency SLO compliance panel
# PromQL for SLO compliance gauge (target: 99.9%)
1 - coresdk:auth_latency_slo_error_ratio:5m

Use a stat panel with thresholds: green above 0.999, yellow above 0.995, red below 0.995.
Error budget remaining (30-day)
# Error budget remaining (fraction of the 0.1% budget left)
1 - (
  avg_over_time(coresdk:auth_latency_slo_error_ratio:5m[30d])
  / 0.001
)

This shows what fraction of the 0.1% error budget remains in the current 30-day window.
Policy eval latency heatmap
rate(coresdk_policy_eval_latency_ms_bucket[5m])

Use a heatmap panel to visualize latency distribution over time per tenant_id.
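A fourth panel worth considering (a suggestion, not part of the shipped dashboard) is the instantaneous burn rate, i.e. the current error ratio expressed as a multiple of the 0.1% budget; values above 1 mean budget is being consumed faster than the SLO allows:

# Current burn rate relative to the 99.9% objective
coresdk:auth_latency_slo_error_ratio:5m / 0.001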
Integration with external SLO systems
Nobl9
Nobl9 can scrape CoreSDK metrics directly from Prometheus. Define an SLO resource using Nobl9's CRD:
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: coresdk-auth-latency
spec:
  service: orders-service
  indicator:
    rawMetric:
      query:
        prometheus:
          promql: |
            rate(coresdk_auth_latency_ms_bucket{le="10"}[{{.Window}}])
            /
            rate(coresdk_auth_latency_ms_count[{{.Window}}])
  objectives:
    - displayName: "99.9% of auth calls under 10ms"
      target: 0.999
      value: 1.0
      op: lte
  timeWindows:
    - duration: 30d
      isRolling: true
  budgetingMethod: Occurrences

Datadog SLOs
Use Datadog's Metric-Based SLO with the CoreSDK Prometheus metrics forwarded via the Datadog Agent's OpenMetrics integration.
- Configure the Datadog Agent to scrape http://localhost:9091/metrics (the sidecar Prometheus endpoint).
- Create a Metric-Based SLO in the Datadog UI:
  - Good events: coresdk.auth.latency_ms.count{outcome:allowed} with histogram filter le:10
  - Total events: coresdk.auth.latency_ms.count
  - Target: 99.9% over 30 days
Google Cloud Monitoring
Use the OpenTelemetry Collector's googlecloud exporter to forward CoreSDK metrics to Cloud Monitoring, then define a Service-Level Objective in Cloud Monitoring:
# otel-collector-config.yaml (relevant snippet)
exporters:
  googlecloud:
    project: my-gcp-project
    metric:
      prefix: custom.googleapis.com/coresdk
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [googlecloud]

Once metrics are in Cloud Monitoring, create an SLO from the Cloud Console under Monitoring → Services → Create SLO, selecting custom.googleapis.com/coresdk/auth/latency_ms as the indicator metric.
Next steps
- Metrics — full metric catalog and Prometheus scrape config
- Alerts & Anomaly Detection — burn rate alerts and PagerDuty/Slack sinks
- OpenTelemetry — trace context linking to SLO signals