Engineering Decisions

Key architectural and tooling decisions in CoreSDK — what we chose, why we chose it, and what trade-offs we accepted.

This page documents the significant technical choices behind CoreSDK: the alternatives we considered, the rationale for each decision, and the constraints those decisions impose. Understanding these helps you predict how CoreSDK will behave in edge cases and how to extend it correctly.


Guiding principles

  1. Core ships first, alone. Nothing else — no sidecar, no SDK — until the Rust engine API is stable and validated by a design partner.
  2. One import, one decision. coresdk-engine is the only crate users import. Internal decomposition is invisible.
  3. Policy evaluation is pooled from day one. regorus::Engine::eval_rule takes &mut self. A single Mutex<Engine> will not survive production load.
  4. PII masking is zero-tolerance. A secret in an exported trace is a security incident, not a bug. Masking is tested on every output path with fuzz coverage.
  5. No generated code in CI. Proto stubs are checked in. Downstream builds must not require protoc or buf.
  6. Fail-open on partition, fail-closed on cache integrity. Network unreachable → continue on cached data. HMAC mismatch → reject the cache entirely.
  7. Lock-free reads on the hot path. arc-swap for config snapshots and JWK sets. No RwLock where reads dominate.

Language and tooling decisions

| Concern | Decision | Rationale |
| --- | --- | --- |
| Rust edition | 2024 | Latest stable; async improvements |
| Async runtime | Tokio (multi-thread) | tonic requires it |
| Policy engine | regorus (Rust-native Rego) | Sub-2ms p99; in-process; no OPA process dependency |
| regorus concurrency | Pool of N engines via spawn_blocking | eval_rule is &mut self + synchronous |
| Config hot-reload | arc-swap + notify (file watcher) | Lock-free reads; no restart required |
| JWK cache | arc-swap<JwkSet> + background refresh | Stale-while-revalidate; no auth cliff-edge on refresh failure |
| TLS | rustls (TLS 1.3 only) | Memory-safe; no OpenSSL CVE exposure by default |
| mTLS certs | rcgen for issuance; rustls for termination | No external CA dependency for dev |
| gRPC server | tonic | De-facto Rust gRPC; native TLS support |
| Python gRPC | grpcio | Widest compatibility |
| Python OTel masking | Custom SpanProcessor (not SpanExporter) | Masking before export queue, not after |
| TypeScript proto codegen | Buf + connect-es | ESM-first; streaming support; active maintenance |
| Go proto codegen | Buf + protoc-gen-go-grpc | Standard toolchain |
| All-language codegen | Single buf.gen.yaml | One tool, consistent output across all SDKs |
| mTLS rotation (Go) | tls.Config.GetClientCertificate callback | Called per-handshake; no process restart |
| mTLS rotation (Node.js) | File watcher + channel recreation | Node SecureContext is immutable; must recreate channel |
| OTel dependencies (SDKs) | Peer dependencies, not bundled | Prevent version skew and duplicate singletons |

Key decisions and rationale

Why regorus over embedded OPA?

OPA requires a separate process or Wasm compilation. regorus runs in-process in Rust, eliminating a process boundary and IPC overhead. The tradeoff is a subset of Rego — this is acceptable and is documented in the Rego compatibility reference.

Why pool regorus engines?

regorus::Engine::eval_rule is &mut self and synchronous. A single Mutex<Engine> serialises all policy evaluations. Under realistic load (100+ concurrent requests), this is a throughput bottleneck. Pooling, with one engine per blocking thread and every engine loaded identically, lets evaluations run in parallel, so throughput scales with pool size up to the number of available cores.

Hot-reload must update all N engines atomically. This is a hard constraint on the pool design; retrofitting it is expensive. The pool was designed this way from day one.
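The pooling and reload constraint above can be sketched with the standard library alone. This is a minimal illustration, not CoreSDK's implementation: `PolicyEngine`, `PolicyBundle`, and `EnginePool` are hypothetical names, `PolicyEngine` is a stand-in for regorus::Engine that only mirrors its `&mut self` evaluation shape, and the sketch uses per-engine mutexes with round-robin selection instead of Tokio's spawn_blocking. The idea it shows is one way to make reload atomic from the caller's point of view: publish the new bundle with a single pointer swap, and have each engine catch up when it is next checked out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical policy bundle; a real one would carry Rego modules and data.
pub struct PolicyBundle {
    pub version: u64,
    pub allow_role: String,
}

/// Hypothetical stand-in for `regorus::Engine`: evaluation needs `&mut self`.
pub struct PolicyEngine {
    loaded_version: u64,
    allow_role: String,
}

impl PolicyEngine {
    fn from_bundle(bundle: &PolicyBundle) -> Self {
        PolicyEngine {
            loaded_version: bundle.version,
            allow_role: bundle.allow_role.clone(),
        }
    }

    /// Synchronous and `&mut self`, like the real call described above.
    pub fn eval_rule(&mut self, role: &str) -> bool {
        role == self.allow_role
    }
}

pub struct EnginePool {
    engines: Vec<Mutex<PolicyEngine>>,
    current: Mutex<Arc<PolicyBundle>>,
    next: AtomicUsize,
}

impl EnginePool {
    pub fn new(size: usize, bundle: PolicyBundle) -> Self {
        assert!(size > 0, "pool needs at least one engine");
        let engines: Vec<Mutex<PolicyEngine>> = (0..size)
            .map(|_| Mutex::new(PolicyEngine::from_bundle(&bundle)))
            .collect();
        EnginePool {
            engines,
            current: Mutex::new(Arc::new(bundle)),
            next: AtomicUsize::new(0),
        }
    }

    /// Publish a new bundle with one pointer swap. Evaluations that start
    /// after this call never run against the old bundle.
    pub fn reload(&self, bundle: PolicyBundle) {
        *self.current.lock().unwrap() = Arc::new(bundle);
    }

    /// Evaluate on one engine (round-robin). Returns the decision and the
    /// bundle version that produced it.
    pub fn eval(&self, role: &str) -> (bool, u64) {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.engines.len();
        let mut engine = self.engines[idx].lock().unwrap();
        // Read the published bundle only after the engine is locked, so an
        // engine can never be rolled back to an older bundle.
        let bundle = Arc::clone(&self.current.lock().unwrap());
        if engine.loaded_version != bundle.version {
            *engine = PolicyEngine::from_bundle(&bundle);
        }
        (engine.eval_rule(role), engine.loaded_version)
    }
}
```

In this sketch, evaluations already in flight when `reload` is called finish on the bundle they started with; whether that matches CoreSDK's definition of "atomically" is an assumption.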

Why SpanProcessor for PII masking, not SpanExporter?

SpanProcessor fires before the span enters the export queue. SpanExporter fires after. The export queue holds spans in memory briefly — masking at the exporter layer means PII exists in memory buffers, which is insufficient for a zero-tolerance commitment. SpanProcessor masking is the only correct architecture.

Consequence: processors run in registration order, so a third-party span processor registered before CoreSDK's processor inspects attribute maps before masking has run and may see unmasked data. Always register CoreSDK's processor first.
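The ordering constraint can be illustrated with a small model of a processor chain. This is a sketch under stated assumptions: `Span`, `SpanProcessor`, `MaskingProcessor`, `RecordingProcessor`, and `run_pipeline` are hypothetical types invented here, not the OpenTelemetry SDK API or CoreSDK's processor, and the masking rule (a key deny list) is a placeholder.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified span: just a map of string attributes.
#[derive(Debug, Clone, Default)]
pub struct Span {
    pub attributes: BTreeMap<String, String>,
}

/// Hypothetical processor hook; real OpenTelemetry SDKs define their own.
pub trait SpanProcessor {
    fn on_end(&mut self, span: &mut Span);
}

/// Overwrites the value of every attribute whose key is on the deny list.
pub struct MaskingProcessor {
    pub sensitive_keys: Vec<String>,
}

impl SpanProcessor for MaskingProcessor {
    fn on_end(&mut self, span: &mut Span) {
        for key in &self.sensitive_keys {
            if let Some(value) = span.attributes.get_mut(key) {
                *value = "[REDACTED]".to_string();
            }
        }
    }
}

/// Stand-in for a third-party processor that records whatever it is shown.
#[derive(Default)]
pub struct RecordingProcessor {
    pub seen: Vec<Span>,
}

impl SpanProcessor for RecordingProcessor {
    fn on_end(&mut self, span: &mut Span) {
        self.seen.push(span.clone());
    }
}

/// Runs processors strictly in registration order.
pub fn run_pipeline(processors: &mut [&mut dyn SpanProcessor], span: &mut Span) {
    for processor in processors.iter_mut() {
        processor.on_end(span);
    }
}
```

With the masking processor registered first, the recording processor only ever receives `[REDACTED]` for a denied key; with the order reversed, it captures the raw value before masking runs.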

Why connect-es over ts-proto?

connect-es is ESM-native, generates strict TypeScript, and supports server-streaming RPCs (required for WatchConfig and WatchPolicyUpdates). ts-proto has better DX for simple unary RPCs but streaming support is immature. The streaming requirement is not optional — config and policy hot-reload depend on it.

Why fail-open on control plane partition?

Failing closed on network partition would make the sidecar a single point of failure for every application it protects. Enterprise users in regulated environments that require fail-closed behaviour can configure it explicitly — but the default must be fail-open to prevent mass outages from transient network issues.

See Offline Mode for fail_mode configuration.
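The interaction between the fail-open default and the fail-closed rule for cache integrity can be written as a small decision function. This is a sketch only: the enum names, the variants, and the `decide` function are hypothetical, and the actual values accepted by fail_mode are documented in Offline Mode, not here. It also assumes that a reachable control plane always supplies fresh data, so the cache is not consulted in that case.

```rust
/// Hypothetical representation of the `fail_mode` setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailMode {
    Open,
    Closed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ControlPlane {
    Reachable,
    Unreachable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheIntegrity {
    Valid,
    HmacMismatch,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    UseFresh,
    UseCached,
    Reject,
}

/// Decide what to serve from. An HMAC mismatch rejects the cache regardless
/// of `fail_mode`; `fail_mode` only governs the partition case.
pub fn decide(mode: FailMode, control_plane: ControlPlane, cache: CacheIntegrity) -> Action {
    match (control_plane, cache, mode) {
        // Assumption: fresh data is preferred whenever it is available.
        (ControlPlane::Reachable, _, _) => Action::UseFresh,
        // Fail-closed on cache integrity, even when configured fail-open.
        (ControlPlane::Unreachable, CacheIntegrity::HmacMismatch, _) => Action::Reject,
        // Default: continue on cached data during a partition.
        (ControlPlane::Unreachable, CacheIntegrity::Valid, FailMode::Open) => Action::UseCached,
        // Explicit opt-in for regulated environments.
        (ControlPlane::Unreachable, CacheIntegrity::Valid, FailMode::Closed) => Action::Reject,
    }
}
```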

Why ring + aes-gcm over josekit for JWE?

josekit pulls in OpenSSL as a transitive dependency, defeating the pure-Rust / no-OpenSSL-CVE goal. JWE decryption (RSA-OAEP + A256GCM) is two well-scoped operations implementable from ring (RSA-OAEP) + aes-gcm (A256GCM). Both are audited, and neither depends on OpenSSL. Of the two, only aes-gcm is a RustCrypto crate; ring is maintained as a separate project.

Why a KeyProvider trait?

HSM (PKCS#11) integration is a Phase 3 feature. Designing a KeyProvider trait now (sign(), verify(), hmac(), decrypt()) isolates all key material operations behind one boundary. The software keystore is the default implementation; HSM is a swap-in with no business logic changes required.
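A rough shape for such a boundary is sketched here. Only the four operation names come from the text above; the method signatures, `ToyKeyProvider`, and `cache_entry_is_authentic` are assumptions made for illustration. The toy provider is deliberately NOT cryptography (a keyed checksum and an XOR), so that the example stays dependency-free; it exists only to show that calling code depends on the trait and not on where the key lives.

```rust
/// Sketch of a key-operations boundary. Signatures are assumptions; a real
/// trait would return `Result` and likely take key identifiers.
pub trait KeyProvider {
    fn sign(&self, message: &[u8]) -> Vec<u8>;
    fn verify(&self, message: &[u8], signature: &[u8]) -> bool;
    fn hmac(&self, message: &[u8]) -> Vec<u8>;
    fn decrypt(&self, ciphertext: &[u8]) -> Vec<u8>;
}

/// Placeholder provider for illustration ONLY. Nothing here is secure.
pub struct ToyKeyProvider {
    key: u8,
}

impl ToyKeyProvider {
    pub fn new(key: u8) -> Self {
        ToyKeyProvider { key }
    }

    /// Keyed rolling checksum standing in for a real signature or MAC.
    fn tag(&self, message: &[u8]) -> Vec<u8> {
        let mut acc = u32::from(self.key);
        for &byte in message {
            acc = acc.wrapping_mul(31).wrapping_add(u32::from(byte));
        }
        acc.to_be_bytes().to_vec()
    }
}

impl KeyProvider for ToyKeyProvider {
    fn sign(&self, message: &[u8]) -> Vec<u8> {
        self.tag(message)
    }

    fn verify(&self, message: &[u8], signature: &[u8]) -> bool {
        self.tag(message).as_slice() == signature
    }

    fn hmac(&self, message: &[u8]) -> Vec<u8> {
        self.tag(message)
    }

    fn decrypt(&self, ciphertext: &[u8]) -> Vec<u8> {
        // XOR stands in for real decryption.
        ciphertext.iter().map(|byte| byte ^ self.key).collect()
    }
}

/// Hypothetical business logic: it sees only the trait, so a software
/// keystore and an HSM-backed provider are interchangeable here.
pub fn cache_entry_is_authentic(keys: &dyn KeyProvider, payload: &[u8], stored_tag: &[u8]) -> bool {
    keys.hmac(payload).as_slice() == stored_tag
}
```

A production comparison of MAC tags should be constant-time; the plain `==` above is another simplification of the sketch.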

Why is HMAC key distribution over mTLS only?

If HMAC keys appear in config files or environment variables, any process with read access to the environment can forge cache entries. The mTLS channel is the only channel where both ends are authenticated. This is a hard architectural constraint, not a preference.


What CoreSDK is not doing

  • No rayon alongside Tokio. spawn_blocking is sufficient.
  • No premature crate splitting for Phase 2/3 features. Crates are added when features ship.
  • No dynamic hook registration at runtime. Security hooks are boot-time only.
  • No TLS 1.2 automatic fallback. Explicit opt-in only.
  • No variable values in exported traces or LLM payloads. The local terminal viewer is the only place variable values are visible.
  • No grpcio replacement in Python v1. betterproto and grpclib are not mature enough for production streaming use.

Testing strategy

| Test type | Scope | Tooling |
| --- | --- | --- |
| Unit tests | Each crate in isolation | cargo test + proptest for PII masking |
| Integration tests | Engine with real regorus + real JWTs | cargo test against fixtures |
| gRPC contract tests | Proto API stability | buf breaking in CI |
| PII masking fuzz | Masking engine against generated payloads | cargo fuzz (libFuzzer) |
| Python SDK tests | Middleware against real sidecar | pytest + testcontainers |
| Cross-language parity | Same JWT + policy fixture, all SDKs | Shared fixture file in repo |
| OTel span assertions | No PII in exported spans | assert_no_pii(spans) in all SDK test suites |
| Load / latency | Policy eval p99 <2ms at 1,000 rps | k6 + custom gRPC script |

Policy eval latency is a CI gate: p99 >2ms on the standard fixture set blocks merge.
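The gate itself reduces to a percentile check over recorded latencies. The following is a minimal sketch, assuming the nearest-rank definition of a percentile; the function names are hypothetical, and how CoreSDK's CI actually collects samples from k6 is not shown here.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
pub fn percentile(samples: &[Duration], pct: u32) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(pct / 100 * n), computed in integer arithmetic.
    let rank = (sorted.len() * pct as usize + 99) / 100;
    Some(sorted[rank - 1])
}

/// Passes when p99 is within the budget. An empty sample set fails the
/// gate, on the assumption that "no data" should never look like "fast".
pub fn latency_gate_passes(samples: &[Duration], budget: Duration) -> bool {
    match percentile(samples, 99) {
        Some(p99) => p99 <= budget,
        None => false,
    }
}
```

For the gate described above, the budget would be `Duration::from_millis(2)`, and a failing result would block the merge.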


Release strategy

  • Crate versioning: Start at 1.0.0-rc.1. Gate 1.0.0 on design partner validation.
  • Proto stability: v1 namespace is frozen from first public release. Additive changes only (new fields, new RPCs) without a version bump.
  • SDK versioning: Each language SDK has its own version. They do not need to be in lockstep with the engine crate version.
  • Docker: Multi-arch (linux/amd64, linux/arm64) from day one. FROM scratch + static musl binary.
