# Offline Mode
How CoreSDK behaves when the control plane or sidecar is unreachable: `CORESDK_FAIL_MODE=open|closed`, an HMAC-verified local cache, and no application code changes required.
Available in Phase 1b. Offline mode relies on the sidecar daemon's local cache; Phase 1a (Rust crate only) users access this functionality via `coresdk-resilience` directly.
CoreSDK's sidecar daemon maintains a local HMAC-verified cache of policies, JWT signing keys, and configuration. When the control plane becomes unreachable — due to a network partition, a rolling deploy, or a cloud outage — the sidecar switches to this cache automatically. Your application continues to authenticate requests and evaluate policies without any code changes.
## How it works
```text
Normal operation
──────────────────────────────────────────────────
Application → Sidecar → Control plane
                 ↑
        writes to local cache
        (HMAC-SHA256 signed)

Offline / partitioned
──────────────────────────────────────────────────
Application → Sidecar → ✗ Control plane (unreachable)
                 ↑
        reads from local cache
        (signature verified on every read)
        logs warning every sync interval
```

The sidecar detects a partition when a sync attempt times out or returns a non-2xx response. From that point it operates entirely from the local cache until the control plane becomes reachable again, at which point it re-syncs automatically without a restart.
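The detection rule (a timeout or non-2xx response marks a partition; a later success refreshes the cache and ends it) can be sketched as a single sync step. The names `fetch_bundle` and the plain-dict cache are illustrative stand-ins, not the sidecar's actual internals:

```python
def sync_once(fetch_bundle, cache: dict) -> bool:
    """One sync attempt against the control plane.

    `fetch_bundle` returns (http_status, blob) on a completed request
    and raises TimeoutError when the attempt times out. Returns True
    when the control plane is reachable, False when the sidecar should
    keep operating from the local cache.
    """
    try:
        status, blob = fetch_bundle()
    except TimeoutError:
        return False                  # sync timed out → partitioned
    if not 200 <= status < 300:
        return False                  # non-2xx response → partitioned
    cache["bundle"] = blob            # success → refresh cache, re-synced
    return True
```

Note that a failed sync never touches the cache: the last known-good bundle keeps serving until a successful sync replaces it.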
## What the cache contains
| Data | Used for |
|---|---|
| JWT public keys (JWKS) | Verifying inbound JWTs |
| Rego policy bundle | Policy evaluation |
| SDK configuration | Feature flags, rate limits, tenant config |
| Tenant roster | Multi-tenancy isolation |
All four categories continue to work in offline mode.
## Cache integrity
HMAC keys are distributed to the sidecar via the mTLS-authenticated channel — never written to config files or environment variables. Every cache read verifies the HMAC-SHA256 signature of the stored blob. A tampered or corrupted entry is rejected and the fail mode is applied.
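The per-read check can be illustrated with Python's standard `hmac` module. The key handling and blob layout here are simplified assumptions, not the sidecar's actual on-disk format:

```python
import hashlib
import hmac

def sign_entry(key: bytes, blob: bytes) -> bytes:
    """HMAC-SHA256 tag computed when a cache entry is written."""
    return hmac.new(key, blob, hashlib.sha256).digest()

def verify_entry(key: bytes, blob: bytes, tag: bytes) -> bool:
    """Recompute the tag on every read; the constant-time comparison
    rejects tampered or corrupted entries."""
    return hmac.compare_digest(sign_entry(key, blob), tag)
```

A failed verification rejects the entry, and the configured fail mode is applied.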
## Configuring fail mode
`CORESDK_FAIL_MODE` controls what happens when the sidecar itself is unreachable from the application process (as distinct from the control plane being partitioned).
| Mode | Behavior |
|---|---|
| `open` (default) | Requests pass through; the partition is recorded in telemetry |
| `closed` | Requests are rejected with `503 Service Unavailable` |
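The two modes reduce to one branch in the request path. A minimal sketch — the name `check_with_sidecar` and the exception types are illustrative, not CoreSDK's API:

```python
def authorize(check_with_sidecar, fail_mode: str = "open") -> bool:
    """Apply fail-mode semantics when the sidecar is unreachable.

    `check_with_sidecar` returns the normal True/False auth decision
    and raises ConnectionError when the sidecar cannot be reached.
    """
    try:
        return check_with_sidecar()
    except ConnectionError:
        if fail_mode == "closed":
            # surfaces to the caller as 503 Service Unavailable
            raise PermissionError("sidecar unreachable; failing closed")
        return True  # fail open: pass through, record in telemetry
```

Note that fail mode only governs the sidecar-unreachable case; a normal deny from the sidecar is rejected in either mode.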
Set via environment variable (no code change required):

```shell
export CORESDK_FAIL_MODE=closed
```

Or in `SDKConfig`:
```python
from coresdk import CoreSDKClient, SDKConfig

_sdk = CoreSDKClient(SDKConfig(
    sidecar_addr="127.0.0.1:50051",
    tenant_id="acme",
    service_name="orders-api",
    fail_mode="closed",  # "open" (default) or "closed"
))
```

Or via `SDKConfig.from_env()`, which reads `CORESDK_FAIL_MODE` automatically:

```python
_sdk = CoreSDKClient(SDKConfig.from_env())
```

## Choosing the right fail mode
Use open (the default) for services where availability outweighs strict security enforcement — public read APIs, health checks, internal tooling. Use closed for surfaces that process financial transactions, modify sensitive data, or are subject to compliance requirements where unauthenticated access is never acceptable.
## Cache persistence across restarts
The cache is written to disk on every successful sync and survives sidecar restarts. If the sidecar starts while the control plane is unreachable, it loads the last known-good cache and begins serving immediately.
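Crash-safe persistence of this kind is typically done with a write-to-temp-then-rename pattern. The sketch below assumes a JSON file layout purely for illustration; it is not the sidecar's real on-disk format:

```python
import json
import os
import tempfile

def persist_cache(path: str, entries: dict) -> None:
    """Write the cache atomically: a temp file in the same directory,
    then a rename, so a crash mid-write never leaves a truncated
    cache on disk."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(entries, tmp_file)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The rename is what guarantees a restart always finds either the previous cache or the new one, never a half-written file.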
## Sidecar warning logs during a partition
Every sync interval (default 30 seconds, configurable via `CORESDK_SYNC_INTERVAL_SECONDS`) the sidecar emits a structured warning:
```text
level=warn msg="control plane unreachable — operating from cache"
  partition_duration_seconds=142
  cache_age_seconds=142
  cache_valid=true
  policies_cached=4
  jwks_cached=2
  next_retry_in_seconds=30
```

These are emitted at WARN level. Configure your log aggregator to alert on `control plane unreachable` for extended partitions.
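For alerting, the `key=value` fields parse with a few lines of Python. This parser is a convenience sketch, not part of CoreSDK:

```python
import shlex

def parse_kv_log(line: str) -> dict:
    """Split a structured key=value log line; shlex honors quoted values."""
    fields = {}
    for token in shlex.split(line):
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields
```

An aggregator rule could then fire when `partition_duration_seconds` exceeds a chosen threshold rather than on the first transient blip.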
## Testing offline behavior locally
**Step 1 — seed the cache**

```shell
coresdk-sidecar start --log-level=debug
# Wait for: level=info msg="sync complete" policies=4 jwks=2
```

**Step 2 — simulate a partition**
```shell
# macOS
echo "block drop out proto tcp from any to api.coresdk.io" \
  | sudo pfctl -ef -

# Linux
sudo iptables -A OUTPUT -d api.coresdk.io -j DROP

# or stop the local control plane
docker compose stop control-plane
```

**Step 3 — verify auth still works from cache**
```shell
curl -H "Authorization: Bearer $VALID_JWT" http://localhost:8080/api/orders
# → 200 OK (JWT verified from cached JWKS)

curl http://localhost:7700/status | jq .
# → { "partitioned": true, "cache_valid": true, ... }
```

**Step 4 — restore and confirm re-sync**
```shell
sudo pfctl -d                                       # macOS
# or
sudo iptables -D OUTPUT -d api.coresdk.io -j DROP   # Linux

curl http://localhost:7700/status | jq .partitioned
# → false
```

### Testing closed fail mode
```shell
CORESDK_FAIL_MODE=closed python -m your_app &

# Stop the sidecar process (not the control plane)
coresdk-sidecar stop

curl http://localhost:8080/api/orders
# → 503 Service Unavailable
```

## Environment variable reference
| Variable | Default | Description |
|---|---|---|
| `CORESDK_FAIL_MODE` | `open` | `open` or `closed` |
| `CORESDK_SIDECAR_ADDR` | `127.0.0.1:50051` | Sidecar address |
| `CORESDK_SYNC_INTERVAL_SECONDS` | `30` | Control plane sync interval |
| `CORESDK_SIDECAR_PORT` | `7700` | Sidecar status HTTP port |
## Next steps
- **Resilience Primitives** — HMAC cache details, circuit breaker, retry
- **Error Handling** — how `503` and auth failures surface to callers
- **TLS & mTLS** — HMAC keys distributed over the mTLS channel