Fail-open vs fail-closed behavior on sidecar partition. HMAC-verified cache. Phase 1 Rust crate coresdk-resilience. Python retry/circuit-breaker config ships in Phase 2.

Resilience Primitives

CoreSDK wraps every call to the Sidecar and control plane in a resilience layer. The default configuration handles transient failures transparently. When failures are sustained, the SDK falls back to a local HMAC-verified cache rather than blocking requests or crashing.

All failures surface as RFC 9457 Problem Details when they reach your handler.

Fail-open vs fail-closed

When the Sidecar is unreachable and the local cache entry is absent or expired, the SDK applies the policy set by CORESDK_FAIL_MODE.

Mode	Behavior	Use case
`open` (default)	Allow the request; record the partition in telemetry	Consumer services where availability outweighs security
`closed`	Reject with `503 Service Unavailable` RFC 9457 error	Enterprise and regulated environments
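
The two modes can be pictured as a small decision function. The sketch below is illustrative only: `PartitionOutcome` and `apply_fail_mode` are hypothetical names, not part of the CoreSDK API, and the `type` and `detail` values in the problem body are assumptions (`about:blank` is the default type defined by RFC 9457).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionOutcome:
    """Result of applying the fail mode during a partition (hypothetical type)."""
    allowed: bool
    problem: Optional[dict] = None  # RFC 9457 body, set only when rejected


def apply_fail_mode(fail_mode: str = "open") -> PartitionOutcome:
    """Decide what happens when the Sidecar is unreachable and no usable
    cache entry exists. Illustrative helper, not the SDK's implementation."""
    if fail_mode == "open":
        # Allow the request; the SDK would also record the partition in telemetry.
        return PartitionOutcome(allowed=True)
    if fail_mode == "closed":
        # Reject with a 503 expressed as RFC 9457 Problem Details.
        return PartitionOutcome(
            allowed=False,
            problem={
                "type": "about:blank",  # assumed; the real type URI may differ
                "title": "Service Unavailable",
                "status": 503,
                "detail": "Sidecar unreachable and no valid cached decision",
            },
        )
    raise ValueError(f"unknown fail mode: {fail_mode!r}")
```

A fail-closed service would return the `problem` dictionary to the caller with status 503 and the `application/problem+json` media type.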

Python: configuring fail mode

Pass fail_mode in SDKConfig:

from coresdk import CoreSDKClient, SDKConfig

_sdk = CoreSDKClient(SDKConfig(
    sidecar_addr="127.0.0.1:50051",
    tenant_id="acme",
    service_name="orders-api",
    fail_mode="closed",   # "open" (default) or "closed"
))

Or via environment variable (no code change required):

export CORESDK_FAIL_MODE=closed

Rust: configuring fail mode

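use coresdk_engine::Engine;

// Set the mode in the environment before the process starts:
//   CORESDK_FAIL_MODE=closed
// Engine::from_env() picks up CORESDK_* variables when it builds the engine.
// The mode can also be set in EngineConfig (see the coresdk-resilience crate).
let engine = Engine::from_env()?;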

HMAC-verified cache

When the SDK cannot reach the Sidecar, it serves policy and auth decisions from a local in-process cache. Each cache entry is signed with an HMAC key delivered over the mTLS channel at startup.

On a cache hit the SDK:

Verifies the HMAC signature of the cached decision
If the signature is valid and the entry is not expired, uses the cached result
If the signature is valid but the entry has expired, treats the lookup as a miss and applies CORESDK_FAIL_MODE
If HMAC verification fails for any reason, rejects the cache entry and fails closed

HMAC mismatch always fails closed, regardless of CORESDK_FAIL_MODE. A tampered cache entry is treated as a hard failure.

Cache hit
    │
    ├── HMAC valid + not expired ──► use cached decision
    │
    ├── HMAC valid + expired ───────► miss; apply fail mode
    │
    └── HMAC invalid ───────────────► REJECT; fail closed (always)
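
The decision tree above can be sketched in Python with the standard `hmac` module. Everything here is an assumption made for illustration: the `CacheEntry` layout, the `sign` and `lookup` helpers, the choice of SHA-256, and the message encoding are hypothetical and do not describe the SDK's actual cache format.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    decision: bytes    # serialized decision (hypothetical format)
    expires_at: float  # Unix timestamp
    tag: bytes         # HMAC computed over decision and expiry


def sign(key: bytes, decision: bytes, expires_at: float) -> bytes:
    """Compute the tag for an entry. SHA-256 is an assumed algorithm."""
    message = decision + b"|" + repr(expires_at).encode()
    return hmac.new(key, message, hashlib.sha256).digest()


def lookup(key: bytes, entry: CacheEntry, fail_mode: str,
           now: Optional[float] = None) -> str:
    """Return "use_cached", "allow", or "reject" following the tree above."""
    now = time.time() if now is None else now
    expected = sign(key, entry.decision, entry.expires_at)
    # Constant-time comparison avoids leaking tag bytes through timing.
    if not hmac.compare_digest(expected, entry.tag):
        return "reject"  # tampered entry: fail closed, whatever the fail mode
    if entry.expires_at <= now:
        # Valid but expired counts as a miss, so the fail mode decides.
        return "allow" if fail_mode == "open" else "reject"
    return "use_cached"
```

Covering the expiry in the signed message means an attacker cannot extend the lifetime of an entry without invalidating its tag.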

Rust crate: coresdk-resilience

The coresdk-resilience crate provides the circuit-breaker, retry, and timeout primitives used internally by coresdk-engine. It is also available as a standalone dependency for Rust services that want these primitives without the full engine.

use coresdk_resilience::{CircuitBreaker, RetryPolicy, TimeoutConfig};

// All three are configured via CORESDK_* env vars or EngineConfig.
// The engine applies them automatically — no manual wiring required.
let engine = coresdk_engine::Engine::from_env()?;

Retry: exponential backoff with jitter

Transient errors (connection reset, timeout, 503) are retried automatically. The retry delay uses full jitter, where base is base_delay_ms and cap is max_delay_ms:

delay = random(0, min(cap, base * 2^attempt))

Default parameters:

Parameter	Default	Env var
`max_attempts`	`3`	`CORESDK_RETRY_MAX_ATTEMPTS`
`base_delay_ms`	`100`	`CORESDK_RETRY_BASE_DELAY_MS`
`max_delay_ms`	`2000`	`CORESDK_RETRY_MAX_DELAY_MS`
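
As a rough illustration of the formula with the defaults above, here is a minimal Python sketch. The function names are hypothetical, and two details are assumptions the page does not state: attempts are numbered from 0, and `max_attempts` counts the initial call.

```python
import random
import time


def full_jitter_delay_ms(attempt: int, base_delay_ms: int = 100,
                         max_delay_ms: int = 2000) -> float:
    """delay = random(0, min(cap, base * 2^attempt)), in milliseconds."""
    ceiling = min(max_delay_ms, base_delay_ms * 2 ** attempt)
    return random.uniform(0, ceiling)


def call_with_retry(call, max_attempts: int = 3, sleep=time.sleep):
    """Retry `call` on transient errors, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(full_jitter_delay_ms(attempt) / 1000)
```

With the defaults, the upper bound on the delay grows as 100, 200, 400, 800, 1600 ms and is capped at 2000 ms from the sixth attempt on. Drawing from the whole range, rather than adding a small random offset, spreads out clients that failed at the same moment.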

Circuit breaker

The circuit breaker prevents repeated calls to a failing dependency.

          failures >= threshold
Closed ──────────────────────────► Open
  ▲                                  │
  │         probe succeeds           │ after reset_timeout
  │                                  ▼
  └──────────────────────────── Half-Open
         probe fails → Open

Default parameters:

Parameter	Default	Env var
`failure_threshold`	`5`	`CORESDK_CB_FAILURE_THRESHOLD`
`window_secs`	`30`	—
`reset_timeout_secs`	`60`	`CORESDK_CB_RESET_TIMEOUT_SECS`
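
The state machine can be sketched as a small single-threaded Python class. This is not the coresdk-resilience implementation: the class shape, method names, and the reading of `window_secs` as a sliding window over recent failures are assumptions for illustration.

```python
import time
from collections import deque


class CircuitBreaker:
    """Minimal Closed / Open / Half-Open breaker (illustrative sketch)."""

    def __init__(self, failure_threshold=5, window_secs=30,
                 reset_timeout_secs=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_secs = window_secs
        self.reset_timeout_secs = reset_timeout_secs
        self._clock = clock
        self.state = "closed"
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed right now."""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout_secs:
                self.state = "half_open"  # let a probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.state = "closed"  # probe succeeded
            self._failures.clear()

    def record_failure(self) -> None:
        now = self._clock()
        if self.state == "half_open":
            self._open(now)  # probe failed: back to Open
            return
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_secs:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

A caller checks `allow()` before each call and reports the result with `record_success()` or `record_failure()`. A production breaker would also need locking and a limit of one in-flight probe while half-open, both of which this sketch omits.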

Timeout tiers

Tier	Default	Env var
`connect_timeout_ms`	`500`	`CORESDK_CONNECT_TIMEOUT_MS`
`request_timeout_ms`	`1000`	`CORESDK_REQUEST_TIMEOUT_MS`
`total_timeout_ms`	`3000`	`CORESDK_TOTAL_TIMEOUT_MS`
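
One way to read the tiers is that the total timeout acts as a budget shared by all retry attempts, while the request timeout bounds each individual attempt. The sketch below shows that interpretation; it is an assumption about how the tiers compose, and `attempt_timeout_ms` is a hypothetical helper rather than an SDK function.

```python
def attempt_timeout_ms(elapsed_ms: float, request_timeout_ms: int = 1000,
                       total_timeout_ms: int = 3000) -> float:
    """Timeout to give the next attempt, given time already spent.

    Each attempt gets at most request_timeout_ms, and never more than what
    is left of the total budget. Returns 0 when the budget is exhausted.
    """
    remaining = total_timeout_ms - elapsed_ms
    if remaining <= 0:
        return 0
    return min(request_timeout_ms, remaining)
```

Under this reading, an attempt that starts 2500 ms into the call is given only the remaining 500 ms, and no attempt starts once 3000 ms have elapsed.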

Phase note. Python retry and circuit-breaker configuration (equivalent to the Rust coresdk-resilience knobs) ships in Phase 2. In Phase 1b, Python resilience is controlled exclusively by fail_mode in SDKConfig and the CORESDK_FAIL_MODE environment variable.

Observing resilience events

Retry attempts, circuit breaker state transitions, and cache hits/misses are recorded as OTEL span events on the active span.

Event name	When emitted
`coresdk.retry.attempt`	Each retry attempt, with `attempt_number` and `delay_ms`
`coresdk.circuit_breaker.opened`	Breaker transitions Closed → Open
`coresdk.circuit_breaker.closed`	Breaker transitions Half-Open → Closed
`coresdk.cache.hit`	Cached decision used, with `hmac_verified=true`
`coresdk.cache.miss`	No valid cache entry; fail mode applied
`coresdk.partition`	Control plane unreachable; partition recorded

Environment variable reference

Variable	Default	Description
`CORESDK_FAIL_MODE`	`open`	`open` or `closed`
`CORESDK_RETRY_MAX_ATTEMPTS`	`3`	Maximum retry attempts
`CORESDK_RETRY_BASE_DELAY_MS`	`100`	Base retry delay
`CORESDK_RETRY_MAX_DELAY_MS`	`2000`	Maximum retry delay
`CORESDK_CB_FAILURE_THRESHOLD`	`5`	Circuit breaker failure threshold
`CORESDK_CB_RESET_TIMEOUT_SECS`	`60`	Circuit breaker reset timeout
`CORESDK_CONNECT_TIMEOUT_MS`	`500`	mTLS connect timeout
`CORESDK_REQUEST_TIMEOUT_MS`	`1000`	RPC request timeout
`CORESDK_TOTAL_TIMEOUT_MS`	`3000`	Total timeout across retries
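
For tooling that needs to inspect the effective settings, the table above can be mirrored in a few lines of Python. This is a sketch under assumptions: `load_resilience_env` is a hypothetical helper, and the validation shown (rejecting unknown fail modes, parsing values as integers) is not necessarily how the SDK treats bad input.

```python
import os

# Integer-valued variables and their documented defaults.
INT_DEFAULTS = {
    "CORESDK_RETRY_MAX_ATTEMPTS": 3,
    "CORESDK_RETRY_BASE_DELAY_MS": 100,
    "CORESDK_RETRY_MAX_DELAY_MS": 2000,
    "CORESDK_CB_FAILURE_THRESHOLD": 5,
    "CORESDK_CB_RESET_TIMEOUT_SECS": 60,
    "CORESDK_CONNECT_TIMEOUT_MS": 500,
    "CORESDK_REQUEST_TIMEOUT_MS": 1000,
    "CORESDK_TOTAL_TIMEOUT_MS": 3000,
}


def load_resilience_env(env=os.environ) -> dict:
    """Read CORESDK_* variables from `env`, falling back to the defaults."""
    fail_mode = env.get("CORESDK_FAIL_MODE", "open")
    if fail_mode not in ("open", "closed"):
        raise ValueError(f"CORESDK_FAIL_MODE must be 'open' or 'closed', got {fail_mode!r}")
    config = {"CORESDK_FAIL_MODE": fail_mode}
    for name, default in INT_DEFAULTS.items():
        config[name] = int(env.get(name, default))
    return config
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests, for example `load_resilience_env({"CORESDK_FAIL_MODE": "closed"})`.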

Next steps

TLS & mTLS — how HMAC keys are distributed and the mTLS channel that protects them
Error Handling — RFC 9457 error shape for partition and timeout failures
Observability — how resilience span events appear in your traces
