Resilience Primitives

Fail-open vs. fail-closed behavior on sidecar partition, the HMAC-verified cache, and the Phase 1 Rust crate coresdk-resilience. Python retry and circuit-breaker configuration ships in Phase 2.

CoreSDK wraps every call to the Sidecar and control plane in a resilience layer. The default configuration handles transient failures transparently. When failures are sustained, the SDK falls back to a local HMAC-verified cache rather than blocking requests or crashing.

All failures surface as RFC 9457 Problem Details when they reach your handler.

Fail-open vs fail-closed

When the sidecar is unreachable and the local cache is absent or expired, the SDK applies the CORESDK_FAIL_MODE policy.

| Mode | Behavior | Use case |
| --- | --- | --- |
| open (default) | Allow the request; record the partition in telemetry | Consumer services where availability outweighs security |
| closed | Reject with 503 Service Unavailable RFC 9457 error | Enterprise and regulated environments |
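
The policy in the table can be sketched as a small decision function. This is an illustration only, not the SDK's implementation: the name `apply_fail_mode` and the `detail` text are hypothetical, while `type`, `title`, `status`, and `detail` are the standard RFC 9457 members.

```python
from typing import Optional


def apply_fail_mode(mode: str = "open") -> tuple[bool, Optional[dict]]:
    """Decide what happens when the sidecar is unreachable and no usable
    cache entry exists. Returns (allowed, problem); `problem` is an
    RFC 9457-style body when the request is rejected.
    """
    if mode == "open":
        # Fail-open: let the request through. The caller is expected to
        # record the partition in telemetry.
        return True, None
    if mode == "closed":
        # Fail-closed: reject with 503. "about:blank" is the RFC 9457
        # default type; the detail text here is a made-up example.
        problem = {
            "type": "about:blank",
            "title": "Service Unavailable",
            "status": 503,
            "detail": "Sidecar unreachable and no valid cached decision.",
        }
        return False, problem
    raise ValueError(f"unknown fail mode: {mode!r}")
```

Raising on an unknown mode, rather than silently defaulting to open, is a design choice of this sketch; the source does not say how the SDK treats invalid values.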

Python: configuring fail mode

Pass fail_mode in SDKConfig:

from coresdk import CoreSDKClient, SDKConfig

_sdk = CoreSDKClient(SDKConfig(
    sidecar_addr="127.0.0.1:50051",
    tenant_id="acme",
    service_name="orders-api",
    fail_mode="closed",   # "open" (default) or "closed"
))

Or via environment variable (no code change required):

export CORESDK_FAIL_MODE=closed

Rust: configuring fail mode

use coresdk_engine::Engine;

// Engine::from_env() reads CORESDK_FAIL_MODE from the process environment,
// for example CORESDK_FAIL_MODE=closed.
let engine = Engine::from_env()?;

// The fail mode can also be set in EngineConfig (see the coresdk-resilience crate).

HMAC-verified cache

When the SDK cannot reach the Sidecar, it serves policy and auth decisions from a local in-process cache. Each cache entry is signed with an HMAC key delivered over the mTLS channel at startup.

On a cache hit the SDK:

  1. Verifies the HMAC signature of the cached decision
  2. If the signature is valid and the entry is not expired, uses the cached result
  3. If the signature is valid but the entry is expired, treats the lookup as a miss and applies CORESDK_FAIL_MODE
  4. If HMAC verification fails for any reason, rejects the cache entry and fails closed

HMAC mismatch always fails closed, regardless of CORESDK_FAIL_MODE. A tampered cache entry is treated as a hard failure.

Cache hit
    ├── HMAC valid + not expired ───► use cached decision
    ├── HMAC valid + expired ───────► miss; apply fail mode
    └── HMAC invalid ───────────────► REJECT; fail closed (always)

Rust crate: coresdk-resilience

The coresdk-resilience crate provides the circuit-breaker, retry, and timeout primitives used internally by coresdk-engine. It is also available as a standalone dependency for Rust services that want these primitives without the full engine.

use coresdk_resilience::{CircuitBreaker, RetryPolicy, TimeoutConfig};

// All three are configured via CORESDK_* env vars or EngineConfig.
// The engine applies them automatically — no manual wiring required.
let engine = coresdk_engine::Engine::from_env()?;

Retry: exponential backoff with jitter

Transient errors (connection reset, timeout, 503) are retried automatically. The retry delay uses full jitter, where base is base_delay_ms and cap is max_delay_ms:

delay = random(0, min(cap, base * 2^attempt))

Default parameters:

| Parameter | Default | Env var |
| --- | --- | --- |
| max_attempts | 3 | CORESDK_RETRY_MAX_ATTEMPTS |
| base_delay_ms | 100 | CORESDK_RETRY_BASE_DELAY_MS |
| max_delay_ms | 2000 | CORESDK_RETRY_MAX_DELAY_MS |
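
As a worked illustration of the formula with these defaults, the sketch below computes the delay ceiling and a jittered delay. The function names are hypothetical, and `attempt` is assumed to be zero-based, which the source does not state.

```python
import random


def delay_ceiling_ms(attempt: int, base_delay_ms: int = 100,
                     max_delay_ms: int = 2000) -> int:
    """Upper bound of the jitter range: min(cap, base * 2^attempt)."""
    return min(max_delay_ms, base_delay_ms * 2 ** attempt)


def full_jitter_delay_ms(attempt: int, base_delay_ms: int = 100,
                         max_delay_ms: int = 2000) -> float:
    """Pick a delay uniformly between 0 and the ceiling (full jitter)."""
    return random.uniform(0, delay_ceiling_ms(attempt, base_delay_ms, max_delay_ms))
```

Under that zero-based assumption, the ceilings for attempts 0, 1, and 2 are 100 ms, 200 ms, and 400 ms. The 2000 ms cap only takes effect from attempt 5 onward (100 * 2^5 = 3200), so with the default max_attempts of 3 it is never reached.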

Circuit breaker

The circuit breaker prevents repeated calls to a failing dependency.

          failures >= threshold
Closed ──────────────────────────► Open
  ▲                                  │
  │         probe succeeds           │ after reset_timeout
  │                                  ▼
  └──────────────────────────── Half-Open
         probe fails → Open

Default parameters:

| Parameter | Default | Env var |
| --- | --- | --- |
| failure_threshold | 5 | CORESDK_CB_FAILURE_THRESHOLD |
| window_secs | 30 | |
| reset_timeout_secs | 60 | CORESDK_CB_RESET_TIMEOUT_SECS |
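
The state machine above can be sketched in a few lines of Python. This is not the coresdk-resilience implementation: `SketchBreaker` is a hypothetical class, it is not thread-safe, and it assumes that window_secs is a sliding window over which failures are counted, which the source does not spell out.

```python
import time
from collections import deque


class SketchBreaker:
    def __init__(self, failure_threshold=5, window_secs=30.0,
                 reset_timeout_secs=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_secs = window_secs
        self.reset_timeout_secs = reset_timeout_secs
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = deque()     # timestamps of recent failures
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout_secs:
                self.state = "half_open"   # reset timeout elapsed: allow a probe
                return True
            return False
        return True                         # closed or half-open

    def record_success(self) -> None:
        if self.state == "half_open":       # probe succeeded
            self.state = "closed"
            self.failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "half_open":       # probe failed: reopen
            self._open(now)
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_secs:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

A caller would check `allow()` before each call and report the outcome with `record_success()` or `record_failure()`. In this sketch the half-open state admits calls until the first result is recorded; a production breaker would typically limit it to a single in-flight probe.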

Timeout tiers

| Tier | Default | Env var |
| --- | --- | --- |
| connect_timeout_ms | 500 | CORESDK_CONNECT_TIMEOUT_MS |
| request_timeout_ms | 1000 | CORESDK_REQUEST_TIMEOUT_MS |
| total_timeout_ms | 3000 | CORESDK_TOTAL_TIMEOUT_MS |
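
One plausible way the request and total tiers compose is that each attempt gets the smaller of the per-request timeout and whatever remains of the total budget. The sketch below shows that arithmetic; the function name is hypothetical, and the composition rule itself is an assumption rather than documented SDK behavior.

```python
def attempt_timeout_ms(elapsed_ms: float, request_timeout_ms: float = 1000,
                       total_timeout_ms: float = 3000) -> float:
    """Timeout for the next attempt; 0 means the total budget is spent."""
    remaining = max(0, total_timeout_ms - elapsed_ms)
    return min(request_timeout_ms, remaining)
```

Under this assumption and the defaults, an attempt starting 2500 ms into the call is given only 500 ms, and three attempts that each run to the full 1000 ms request timeout consume the whole 3000 ms budget before any backoff delay is counted.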

Phase note. Python retry and circuit-breaker configuration (equivalent to the Rust coresdk-resilience knobs) ships in Phase 2. In Phase 1b, Python resilience is controlled exclusively by fail_mode in SDKConfig and the CORESDK_FAIL_MODE environment variable.

Observing resilience events

Retry attempts, circuit breaker state transitions, and cache hits/misses are recorded as OpenTelemetry (OTEL) span events on the active span.

| Event name | When emitted |
| --- | --- |
| coresdk.retry.attempt | Each retry attempt, with attempt_number and delay_ms |
| coresdk.circuit_breaker.opened | Breaker transitions Closed → Open |
| coresdk.circuit_breaker.closed | Breaker transitions Half-Open → Closed |
| coresdk.cache.hit | Cached decision used, with hmac_verified=true |
| coresdk.cache.miss | No valid cache entry; fail mode applied |
| coresdk.partition | Control plane unreachable; partition recorded |

Environment variable reference

| Variable | Default | Description |
| --- | --- | --- |
| CORESDK_FAIL_MODE | open | open or closed |
| CORESDK_RETRY_MAX_ATTEMPTS | 3 | Maximum retry attempts |
| CORESDK_RETRY_BASE_DELAY_MS | 100 | Base retry delay |
| CORESDK_RETRY_MAX_DELAY_MS | 2000 | Maximum retry delay |
| CORESDK_CB_FAILURE_THRESHOLD | 5 | Circuit breaker failure threshold |
| CORESDK_CB_RESET_TIMEOUT_SECS | 60 | Circuit breaker reset timeout |
| CORESDK_CONNECT_TIMEOUT_MS | 500 | mTLS connect timeout |
| CORESDK_REQUEST_TIMEOUT_MS | 1000 | RPC request timeout |
| CORESDK_TOTAL_TIMEOUT_MS | 3000 | Total timeout across retries |
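
For deployment checks or tests, it can help to resolve these variables the same way in your own tooling. The sketch below is a stand-alone helper, not part of the SDK: `ResilienceSettings` and `load_settings` are hypothetical names, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceSettings:
    fail_mode: str
    retry_max_attempts: int
    retry_base_delay_ms: int
    retry_max_delay_ms: int
    cb_failure_threshold: int
    cb_reset_timeout_secs: int
    connect_timeout_ms: int
    request_timeout_ms: int
    total_timeout_ms: int


def load_settings(env=os.environ) -> ResilienceSettings:
    """Read CORESDK_* variables from a mapping, applying documented defaults."""
    def _int(name: str, default: int) -> int:
        return int(env.get(name, default))

    fail_mode = env.get("CORESDK_FAIL_MODE", "open")
    if fail_mode not in ("open", "closed"):
        raise ValueError(f"CORESDK_FAIL_MODE must be open or closed, got {fail_mode!r}")

    return ResilienceSettings(
        fail_mode=fail_mode,
        retry_max_attempts=_int("CORESDK_RETRY_MAX_ATTEMPTS", 3),
        retry_base_delay_ms=_int("CORESDK_RETRY_BASE_DELAY_MS", 100),
        retry_max_delay_ms=_int("CORESDK_RETRY_MAX_DELAY_MS", 2000),
        cb_failure_threshold=_int("CORESDK_CB_FAILURE_THRESHOLD", 5),
        cb_reset_timeout_secs=_int("CORESDK_CB_RESET_TIMEOUT_SECS", 60),
        connect_timeout_ms=_int("CORESDK_CONNECT_TIMEOUT_MS", 500),
        request_timeout_ms=_int("CORESDK_REQUEST_TIMEOUT_MS", 1000),
        total_timeout_ms=_int("CORESDK_TOTAL_TIMEOUT_MS", 3000),
    )
```

Passing a plain dict, as in `load_settings({"CORESDK_FAIL_MODE": "closed"})`, makes the helper easy to exercise without touching the real process environment.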

Next steps

  • TLS & mTLS — how HMAC keys are distributed and the mTLS channel that protects them
  • Error Handling — RFC 9457 error shape for partition and timeout failures
  • Observability — how resilience span events appear in your traces
