Resilience Primitives
Fail-open vs fail-closed behavior on sidecar partition. HMAC-verified cache. Phase 1 Rust crate coresdk-resilience. Python retry/circuit-breaker config ships Phase 2.
CoreSDK wraps every call to the Sidecar and control plane in a resilience layer. The default configuration handles transient failures transparently. When failures are sustained, the SDK falls back to a local HMAC-verified cache rather than blocking requests or crashing.
All failures surface as RFC 9457 Problem Details when they reach your handler.
Fail-open vs fail-closed
When the sidecar is unreachable and the local cache is absent or expired, the SDK applies the CORESDK_FAIL_MODE policy.
| Mode | Behavior | Use case |
|---|---|---|
| open (default) | Allow the request; record the partition in telemetry | Consumer services where availability outweighs security |
| closed | Reject with a 503 Service Unavailable error (RFC 9457 Problem Details) | Enterprise and regulated environments |
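The two modes in the table can be sketched as a small decision function. This is an illustrative sketch only, not the SDK's implementation: `fail_mode_from_env`, `apply_fail_mode`, the returned dictionary shape, and the `detail` wording are all hypothetical. The `type`, `title`, `status`, and `detail` members are the standard RFC 9457 fields.

```python
from typing import Mapping


def fail_mode_from_env(env: Mapping[str, str]) -> str:
    """Read CORESDK_FAIL_MODE, falling back to the documented default of "open"."""
    return env.get("CORESDK_FAIL_MODE", "open")


def apply_fail_mode(mode: str) -> dict:
    """Outcome when the Sidecar is unreachable and no usable cache entry exists.

    Hypothetical helper: the real SDK may represent this decision differently.
    """
    if mode == "open":
        # Fail-open: let the request through and flag the partition for telemetry.
        return {"allow": True, "record_partition": True}
    if mode == "closed":
        # Fail-closed: reject with a 503 expressed as RFC 9457 Problem Details.
        return {
            "allow": False,
            "problem": {
                "type": "about:blank",
                "title": "Service Unavailable",
                "status": 503,
                # Assumed wording; the SDK's actual detail text is not documented here.
                "detail": "Sidecar unreachable and no valid cached decision",
            },
        }
    raise ValueError(f"unsupported fail mode: {mode!r}")
```

In a service this might be called as `apply_fail_mode(fail_mode_from_env(os.environ))`, so that the environment variable alone switches the behavior.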
Python: configuring fail mode
Pass fail_mode in SDKConfig:
from coresdk import CoreSDKClient, SDKConfig
_sdk = CoreSDKClient(SDKConfig(
    sidecar_addr="127.0.0.1:50051",
    tenant_id="acme",
    service_name="orders-api",
    fail_mode="closed",  # "open" (default) or "closed"
))
Or via environment variable (no code change required):
export CORESDK_FAIL_MODE=closed
Rust: configuring fail mode
use coresdk_engine::{Engine, EngineConfig};
// Via environment variable:
// CORESDK_FAIL_MODE=closed
// Or in EngineConfig (see coresdk-resilience crate):
let engine = Engine::from_env()?;
HMAC-verified cache
When the SDK cannot reach the Sidecar, it serves policy and auth decisions from a local in-process cache. Each cache entry is signed with an HMAC key delivered over the mTLS channel at startup.
On a cache hit the SDK:
- Verifies the HMAC signature of the cached decision
- If the signature is valid and the entry is not expired, uses the cached result
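- If the signature is valid but the entry has expired, treats the lookup as a miss and applies CORESDK_FAIL_MODE
- If HMAC verification fails for any reason, rejects the cache entry and fails closed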
HMAC mismatch always fails closed, regardless of CORESDK_FAIL_MODE. A tampered cache entry is treated as a hard failure.
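The verification rules above can be sketched with Python's standard `hmac` module. This is a minimal illustration under stated assumptions, not the SDK's cache format: the `CacheEntry` layout, signing the expiry together with the payload, the use of SHA-256, and the function names are all hypothetical.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    payload: bytes      # serialized policy/auth decision (format assumed)
    expires_at: float   # Unix timestamp after which the entry is stale
    signature: bytes    # HMAC over expiry and payload


def _message(payload: bytes, expires_at: float) -> bytes:
    # Cover the expiry as well, so it cannot be extended without the key.
    return repr(expires_at).encode() + b"|" + payload


def sign_entry(key: bytes, payload: bytes, expires_at: float) -> CacheEntry:
    sig = hmac.new(key, _message(payload, expires_at), hashlib.sha256).digest()
    return CacheEntry(payload, expires_at, sig)


def check_entry(key: bytes, entry: CacheEntry, fail_mode: str,
                now: Optional[float] = None) -> str:
    """Return "use_cached", "allow", or "reject" for a cached decision."""
    now = time.time() if now is None else now
    expected = hmac.new(key, _message(entry.payload, entry.expires_at),
                        hashlib.sha256).digest()
    # Constant-time comparison; a mismatch fails closed regardless of fail_mode.
    if not hmac.compare_digest(expected, entry.signature):
        return "reject"
    if now >= entry.expires_at:
        # Valid but expired: treated as a miss, so the fail mode decides.
        return "allow" if fail_mode == "open" else "reject"
    return "use_cached"
```

The check order matters in this sketch: the signature is verified before the expiry is trusted, so a tampered entry is rejected even when `fail_mode` is `"open"`.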
Cache hit
│
├── HMAC valid + not expired ──► use cached decision
│
├── HMAC valid + expired ───────► miss; apply fail mode
│
└── HMAC invalid ───────────────► REJECT; fail closed (always)
Rust crate: coresdk-resilience
The coresdk-resilience crate provides the circuit-breaker, retry, and timeout primitives used internally by coresdk-engine. It is also available as a standalone dependency for Rust services that want these primitives without the full engine.
use coresdk_resilience::{CircuitBreaker, RetryPolicy, TimeoutConfig};
// All three are configured via CORESDK_* env vars or EngineConfig.
// The engine applies them automatically — no manual wiring required.
let engine = coresdk_engine::Engine::from_env()?;
Retry: exponential backoff with jitter
Transient errors (connection reset, timeout, 503) are retried automatically. The retry delay uses full jitter:
delay = random(0, min(cap, base * 2^attempt))
Default parameters:
| Parameter | Default | Env var |
|---|---|---|
| max_attempts | 3 | CORESDK_RETRY_MAX_ATTEMPTS |
| base_delay_ms | 100 | CORESDK_RETRY_BASE_DELAY_MS |
| max_delay_ms | 2000 | CORESDK_RETRY_MAX_DELAY_MS |
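To make the full-jitter formula concrete, here is a short Python sketch using the default parameters. It assumes that `base` and `cap` in the formula correspond to `base_delay_ms` and `max_delay_ms`, and that `attempt` is counted from zero. Both are assumptions, and the function names are hypothetical.

```python
import random


def delay_upper_bound_ms(attempt: int, base_ms: int = 100, cap_ms: int = 2000) -> int:
    """Upper bound of the jitter range: min(cap, base * 2^attempt)."""
    return min(cap_ms, base_ms * 2 ** attempt)


def full_jitter_delay_ms(attempt: int, base_ms: int = 100, cap_ms: int = 2000) -> float:
    """Pick a delay uniformly between 0 and the capped exponential bound."""
    return random.uniform(0, delay_upper_bound_ms(attempt, base_ms, cap_ms))


# Upper bounds for attempts 0..5 with the defaults:
# 100, 200, 400, 800, 1600, then capped at 2000.
bounds = [delay_upper_bound_ms(a) for a in range(6)]
```

Because each delay is drawn from the whole range rather than clustered near the bound, clients that fail at the same moment tend to spread their retries out instead of retrying in lockstep.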
Circuit breaker
The circuit breaker prevents repeated calls to a failing dependency.
          failures >= threshold
Closed ──────────────────────────► Open
  ▲                                  │
  │ probe succeeds                   │ after reset_timeout
  │                                  ▼
  └──────────────────────────── Half-Open
                                probe fails → Open
Default parameters:
| Parameter | Default | Env var |
|---|---|---|
| failure_threshold | 5 | CORESDK_CB_FAILURE_THRESHOLD |
| window_secs | 30 | (none) |
| reset_timeout_secs | 60 | CORESDK_CB_RESET_TIMEOUT_SECS |
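The state machine above can be sketched in a few lines of Python. This is not the `coresdk-resilience` implementation: the class name `SketchBreaker` is hypothetical, and treating `window_secs` as a rolling window over which failures are counted is an assumption. The sketch also lets every call through while half-open, whereas a production breaker would usually admit a single probe.

```python
import time
from typing import Callable, List


class SketchBreaker:
    """Illustrative Closed / Open / Half-Open breaker (assumed semantics)."""

    def __init__(self, failure_threshold: int = 5, window_secs: float = 30.0,
                 reset_timeout_secs: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window_secs = window_secs
        self.reset_timeout_secs = reset_timeout_secs
        self._clock = clock
        self.state = "closed"
        self._failures: List[float] = []  # timestamps of recent failures
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            # After reset_timeout, move to half-open and let a probe through.
            if self._clock() - self._opened_at >= self.reset_timeout_secs:
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            # Probe succeeded: close the breaker and forget old failures.
            self.state = "closed"
            self._failures.clear()

    def record_failure(self) -> None:
        now = self._clock()
        if self.state == "half_open":
            self._trip(now)  # probe failed: back to open
            return
        # Keep only failures inside the rolling window, then add this one.
        self._failures = [t for t in self._failures if now - t < self.window_secs]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

Injecting `clock` keeps the sketch testable without sleeping: a caller checks `allow_request()` before each call and reports the outcome with `record_success()` or `record_failure()`.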
Timeout tiers
| Tier | Default | Env var |
|---|---|---|
| connect_timeout_ms | 500 | CORESDK_CONNECT_TIMEOUT_MS |
| request_timeout_ms | 1000 | CORESDK_REQUEST_TIMEOUT_MS |
| total_timeout_ms | 3000 | CORESDK_TOTAL_TIMEOUT_MS |
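One way to read the three tiers together is as a per-attempt limit bounded by an overall budget. The sketch below is an assumption about how the tiers interact, not documented SDK behavior: it treats the connect and request timeouts as sequential within one attempt and the total timeout as a deadline spanning all retries. The function name is hypothetical.

```python
def attempt_deadline_ms(elapsed_ms: int, connect_timeout_ms: int = 500,
                        request_timeout_ms: int = 1000,
                        total_timeout_ms: int = 3000) -> int:
    """Time the next attempt may take, given what has already elapsed.

    Assumes one attempt costs at most connect + request, and that the
    total timeout caps the sum of all attempts and backoff delays.
    """
    remaining = max(0, total_timeout_ms - elapsed_ms)
    return min(connect_timeout_ms + request_timeout_ms, remaining)


# With the defaults, three worst-case attempts would need 3 * 1500 = 4500 ms,
# so under these assumptions the 3000 ms total is the limit that binds.
worst_case_three_attempts_ms = 3 * (500 + 1000)
```

Under this reading, raising CORESDK_RETRY_MAX_ATTEMPTS without also raising CORESDK_TOTAL_TIMEOUT_MS may not yield more completed attempts when the dependency is slow rather than down.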
Phase note. Python retry and circuit-breaker configuration (equivalent to the Rust coresdk-resilience knobs) ships in Phase 2. In Phase 1b, Python resilience is controlled exclusively by fail_mode in SDKConfig and the CORESDK_FAIL_MODE environment variable.
Observing resilience events
Retry attempts, circuit breaker state transitions, and cache hits/misses are recorded as OpenTelemetry (OTel) span events on the active span.
| Event name | When emitted |
|---|---|
| coresdk.retry.attempt | Each retry attempt, with attempt_number and delay_ms |
| coresdk.circuit_breaker.opened | Breaker transitions Closed → Open |
| coresdk.circuit_breaker.closed | Breaker transitions Half-Open → Closed |
| coresdk.cache.hit | Cached decision used, with hmac_verified=true |
| coresdk.cache.miss | No valid cache entry; fail mode applied |
| coresdk.partition | Control plane unreachable; partition recorded |
Environment variable reference
| Variable | Default | Description |
|---|---|---|
| CORESDK_FAIL_MODE | open | open or closed |
| CORESDK_RETRY_MAX_ATTEMPTS | 3 | Maximum retry attempts |
| CORESDK_RETRY_BASE_DELAY_MS | 100 | Base retry delay |
| CORESDK_RETRY_MAX_DELAY_MS | 2000 | Maximum retry delay |
| CORESDK_CB_FAILURE_THRESHOLD | 5 | Circuit breaker failure threshold |
| CORESDK_CB_RESET_TIMEOUT_SECS | 60 | Circuit breaker reset timeout |
| CORESDK_CONNECT_TIMEOUT_MS | 500 | mTLS connect timeout |
| CORESDK_REQUEST_TIMEOUT_MS | 1000 | RPC request timeout |
| CORESDK_TOTAL_TIMEOUT_MS | 3000 | Total timeout across retries |
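A deployment script could sanity-check these variables before a service starts. The following is a hedged sketch, not part of CoreSDK: `load_resilience_env` is a hypothetical helper, and the validation rules (exact lowercase mode names, strictly positive integers, base delay not above the maximum delay) are assumptions about what counts as a sensible configuration.

```python
import os
from typing import Dict, Mapping, Union

# Defaults copied from the reference table above.
INT_DEFAULTS: Dict[str, int] = {
    "CORESDK_RETRY_MAX_ATTEMPTS": 3,
    "CORESDK_RETRY_BASE_DELAY_MS": 100,
    "CORESDK_RETRY_MAX_DELAY_MS": 2000,
    "CORESDK_CB_FAILURE_THRESHOLD": 5,
    "CORESDK_CB_RESET_TIMEOUT_SECS": 60,
    "CORESDK_CONNECT_TIMEOUT_MS": 500,
    "CORESDK_REQUEST_TIMEOUT_MS": 1000,
    "CORESDK_TOTAL_TIMEOUT_MS": 3000,
}


def load_resilience_env(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int]]:
    """Resolve each variable to its value or documented default, validating as we go."""
    mode = env.get("CORESDK_FAIL_MODE", "open")
    if mode not in ("open", "closed"):
        raise ValueError(f"CORESDK_FAIL_MODE must be 'open' or 'closed', got {mode!r}")
    cfg: Dict[str, Union[str, int]] = {"CORESDK_FAIL_MODE": mode}
    for name, default in INT_DEFAULTS.items():
        raw = env.get(name)
        value = default if raw is None else int(raw)  # int() raises ValueError on junk
        if value <= 0:
            # Assumed rule: zero or negative values are treated as misconfiguration.
            raise ValueError(f"{name} must be a positive integer, got {value}")
        cfg[name] = value
    if cfg["CORESDK_RETRY_BASE_DELAY_MS"] > cfg["CORESDK_RETRY_MAX_DELAY_MS"]:
        raise ValueError("base retry delay exceeds maximum retry delay")
    return cfg
```

Calling `load_resilience_env()` with no arguments reads the process environment; passing a plain dictionary makes the check easy to exercise in tests.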
Next steps
- TLS & mTLS — how HMAC keys are distributed and the mTLS channel that protects them
- Error Handling — RFC 9457 error shape for partition and timeout failures
- Observability — how resilience span events appear in your traces