Engineering Decisions
Key architectural and tooling decisions in CoreSDK — what we chose, why we chose it, and what trade-offs we accepted.
This page documents the significant technical choices behind CoreSDK: the alternatives we considered, the rationale for each decision, and the constraints those decisions impose. Understanding these helps you predict how CoreSDK will behave in edge cases and how to extend it correctly.
Guiding principles
- Core ships first, alone. Nothing else — no sidecar, no SDK — until the Rust engine API is stable and validated by a design partner.
- One import, one decision. `coresdk-engine` is the only crate users import. Internal decomposition is invisible.
- Policy evaluation is pooled from day one. `regorus::Engine::eval_rule` takes `&mut self`. A single `Mutex<Engine>` will not survive production load.
- PII masking is zero-tolerance. A secret in an exported trace is a security incident, not a bug. Masking is tested on every output path with fuzz coverage.
- No generated code in CI. Proto stubs are checked in. Downstream builds must not require `protoc` or `buf`.
- Fail-open on partition, fail-closed on cache integrity. Network unreachable → continue on cached data. HMAC mismatch → reject the cache entirely.
- Lock-free reads on the hot path. `arc-swap` for config snapshots and JWK sets. No `RwLock` where reads dominate.
Language and tooling decisions
| Concern | Decision | Rationale |
|---|---|---|
| Rust edition | 2024 | Latest stable; async improvements |
| Async runtime | Tokio (multi-thread) | tonic requires it |
| Policy engine | regorus (Rust-native Rego) | Sub-2ms p99; in-process; no OPA process dependency |
| regorus concurrency | Pool of N engines via spawn_blocking | eval_rule is &mut self + synchronous |
| Config hot-reload | arc-swap + notify (file watcher) | Lock-free reads; no restart required |
| JWK cache | arc-swap<JwkSet> + background refresh | Stale-while-revalidate; no auth cliff-edge on refresh failure |
| TLS | rustls (TLS 1.3 only) | Memory-safe; no OpenSSL CVE exposure by default |
| mTLS certs | rcgen for issuance; rustls for termination | No external CA dependency for dev |
| gRPC server | tonic | De-facto Rust gRPC; native TLS support |
| Python gRPC | grpcio | Widest compatibility |
| Python OTel masking | Custom SpanProcessor (not SpanExporter) | Masking before export queue, not after |
| TypeScript proto codegen | Buf + connect-es | ESM-first; streaming support; active maintenance |
| Go proto codegen | Buf + protoc-gen-go-grpc | Standard toolchain |
| All-language codegen | Single buf.gen.yaml | One tool, consistent output across all SDKs |
| mTLS rotation (Go) | tls.GetClientCertificate callback | Called per-handshake; no process restart |
| mTLS rotation (Node.js) | File watcher + channel recreation | Node SecureContext is immutable; must recreate channel |
| OTel dependencies (SDKs) | Peer dependencies, not bundled | Prevent version skew and duplicate singletons |
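A single `buf.gen.yaml` of that shape might look like the sketch below. The plugin identifiers and output paths are plausible assumptions to be checked against the Buf Schema Registry, not the repository's actual config:

```yaml
# buf.gen.yaml: one codegen config for every SDK (v1 config format).
version: v1
plugins:
  # TypeScript: message types + Connect services (ESM-first, streaming)
  - plugin: buf.build/bufbuild/es
    out: sdk/typescript/src/gen
  - plugin: buf.build/connectrpc/es
    out: sdk/typescript/src/gen
  # Go: standard protoc-gen-go + grpc toolchain
  - plugin: buf.build/protocolbuffers/go
    out: sdk/go/gen
  - plugin: buf.build/grpc/go
    out: sdk/go/gen
  # Python: grpcio-compatible stubs
  - plugin: buf.build/protocolbuffers/python
    out: sdk/python/coresdk/gen
  - plugin: buf.build/grpc/python
    out: sdk/python/coresdk/gen
```

Because the stubs are checked in, this file runs on a maintainer's machine (`buf generate`), never in CI or downstream builds.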
Key decisions and rationale
Why regorus over embedded OPA?
OPA requires a separate process or Wasm compilation. regorus runs in-process in Rust, eliminating a process boundary and its IPC overhead. The trade-off is that regorus implements a subset of Rego — acceptable for our use, and documented in the Rego compatibility reference.
Why pool regorus engines?
`regorus::Engine::eval_rule` is `&mut self` and synchronous. A single `Mutex<Engine>` serialises all policy evaluations. Under realistic load (100+ concurrent requests), this is a throughput bottleneck. Pooling — one engine per blocking thread, each loaded with identical policies — gives near-linear scaling with pool size.
Hot-reload must update all N engines atomically. This is a hard constraint on the pool design; retrofitting it is expensive. The pool was designed this way from day one.
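The pool shape can be sketched without the real engine: a bounded channel hands engines out and takes them back, so checkout blocks when all N are busy. The `Engine` stand-in below is illustrative (the real pool wraps `regorus::Engine`, and checkout runs inside Tokio's `spawn_blocking`); a hot reload would drain the channel and refill it with N freshly loaded engines in one step.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Mutex;

// Stand-in for regorus::Engine: eval_rule takes &mut self, like the real API.
struct Engine {
    policy_version: u64,
}

impl Engine {
    fn eval_rule(&mut self, _rule: &str) -> bool {
        // The real engine evaluates Rego here.
        self.policy_version > 0
    }
}

/// Fixed-size pool: engines circulate through a bounded channel.
/// Checkout blocks when all N engines are in use, bounding concurrency.
struct EnginePool {
    tx: SyncSender<Engine>,
    rx: Mutex<Receiver<Engine>>,
}

impl EnginePool {
    fn new(size: usize) -> Self {
        let (tx, rx) = sync_channel(size);
        for _ in 0..size {
            tx.send(Engine { policy_version: 1 }).unwrap();
        }
        EnginePool { tx, rx: Mutex::new(rx) }
    }

    fn eval(&self, rule: &str) -> bool {
        // Check out an engine (blocks if all are busy)...
        let mut engine = self.rx.lock().unwrap().recv().unwrap();
        let result = engine.eval_rule(rule);
        // ...and return it to the pool.
        self.tx.send(engine).unwrap();
        result
    }
}

fn main() {
    let pool = EnginePool::new(4);
    println!("{}", pool.eval("data.authz.allow"));
}
```

The `Mutex` here guards only the channel receiver for the duration of a `recv`, not a policy evaluation, which is the difference from the single `Mutex<Engine>` design this decision rejects.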
Why SpanProcessor for PII masking, not SpanExporter?
SpanProcessor fires before the span enters the export queue. SpanExporter fires after. The export queue holds spans in memory briefly — masking at the exporter layer means PII exists in memory buffers, which is insufficient for a zero-tolerance commitment. SpanProcessor masking is the only correct architecture.
Consequence: third-party span processors that run after CoreSDK's processor may still see unmasked data if they inspect attribute maps before CoreSDK's processor runs. Always register CoreSDK's processor first.
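The masking rule the processor applies is conceptually small. A minimal, std-only Rust sketch of the idea follows; the denylist, the bearer-token check, and the marker string are illustrative assumptions (the shipped masker is broader and fuzz-tested):

```rust
use std::collections::HashMap;

// Illustrative denylist: attribute keys that always carry secrets.
const DENY_KEYS: &[&str] = &["authorization", "password", "api_key", "set-cookie"];

/// Mask span attributes in place, before they reach the export queue.
/// Key match is case-insensitive; values that look like bearer tokens
/// are masked even under innocuous keys.
fn mask_attributes(attrs: &mut HashMap<String, String>) {
    for (key, value) in attrs.iter_mut() {
        let k = key.to_ascii_lowercase();
        if DENY_KEYS.iter().any(|d| k.contains(d)) || value.starts_with("Bearer ") {
            *value = "***MASKED***".to_string();
        }
    }
}

fn main() {
    let mut attrs = HashMap::from([
        ("http.request.header.authorization".to_string(), "Bearer abc123".to_string()),
        ("http.method".to_string(), "GET".to_string()),
    ]);
    mask_attributes(&mut attrs);
    println!("{}", attrs["http.request.header.authorization"]); // ***MASKED***
    println!("{}", attrs["http.method"]); // GET
}
```

Running this in the processor's end-of-span hook is what keeps the unmasked value out of the export queue's memory buffers.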
Why connect-es over ts-proto?
connect-es is ESM-native, generates strict TypeScript, and supports server-streaming RPCs (required for WatchConfig and WatchPolicyUpdates). ts-proto has better DX for simple unary RPCs but streaming support is immature. The streaming requirement is not optional — config and policy hot-reload depend on it.
Why fail-open on control plane partition?
Failing closed on network partition would make the sidecar a single point of failure for every application it protects. Enterprise users in regulated environments that require fail-closed behaviour can configure it explicitly — but the default must be fail-open to prevent mass outages from transient network issues.
See Offline Mode for fail_mode configuration.
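The two failure axes compose into one small decision: cache-integrity failures always deny, and only the partition behaviour is configurable. A hedged sketch with illustrative type and field names (the real knob is the `fail_mode` setting documented in Offline Mode):

```rust
/// What to do when the control plane is unreachable (configurable).
/// Cache-integrity failures are never configurable: always rejected.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FailMode {
    Open,   // default: continue on cached data
    Closed, // regulated environments: deny
}

#[derive(Debug, PartialEq)]
enum Decision {
    ServeFromCache,
    Deny,
}

fn on_control_plane_error(fail_mode: FailMode, cache_hmac_valid: bool) -> Decision {
    // Fail-closed on cache integrity: a bad HMAC means the cache cannot
    // be trusted, regardless of fail_mode.
    if !cache_hmac_valid {
        return Decision::Deny;
    }
    // Fail-open on partition: network errors fall back to cached data
    // unless the operator explicitly opted into fail-closed.
    match fail_mode {
        FailMode::Open => Decision::ServeFromCache,
        FailMode::Closed => Decision::Deny,
    }
}

fn main() {
    assert_eq!(on_control_plane_error(FailMode::Open, true), Decision::ServeFromCache);
    assert_eq!(on_control_plane_error(FailMode::Open, false), Decision::Deny);
    assert_eq!(on_control_plane_error(FailMode::Closed, true), Decision::Deny);
    println!("ok");
}
```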
Why ring + aes-gcm over josekit for JWE?
josekit pulls in OpenSSL as a transitive dependency, defeating the pure-Rust / no-OpenSSL-CVE goal. JWE decryption (RSA-OAEP + A256GCM) reduces to two well-scoped operations, implemented with ring (RSA-OAEP) and aes-gcm (A256GCM). Both are widely used, independently audited crates.
Why a KeyProvider trait?
HSM (PKCS#11) integration is a Phase 3 feature. Designing a KeyProvider trait now (sign(), verify(), hmac(), decrypt()) isolates all key material operations behind one boundary. The software keystore is the default implementation; HSM is a swap-in with no business logic changes required.
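As a sketch, the trait boundary might look like the following. The signatures and the toy keystore are illustrative assumptions, not the shipped API; the XOR "crypto" exists only to show the shape:

```rust
/// Boundary for all key-material operations. The software keystore is the
/// default implementation; an HSM (PKCS#11) implementation slots in behind
/// the same trait with no business-logic changes.
trait KeyProvider: Send + Sync {
    fn sign(&self, data: &[u8]) -> Vec<u8>;
    fn verify(&self, data: &[u8], sig: &[u8]) -> bool;
    fn hmac(&self, data: &[u8]) -> Vec<u8>;
    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy software keystore. XOR is NOT real cryptography; it stands in for
/// the audited primitives the real keystore would call.
struct SoftwareKeystore {
    key: u8,
}

impl KeyProvider for SoftwareKeystore {
    fn sign(&self, data: &[u8]) -> Vec<u8> {
        self.hmac(data)
    }
    fn verify(&self, data: &[u8], sig: &[u8]) -> bool {
        self.sign(data) == sig
    }
    fn hmac(&self, data: &[u8]) -> Vec<u8> {
        data.iter().map(|b| b ^ self.key).collect()
    }
    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, String> {
        Ok(ciphertext.iter().map(|b| b ^ self.key).collect())
    }
}

fn main() {
    // Business logic sees only the trait object; swapping in an HSM
    // changes construction, not call sites.
    let kp: Box<dyn KeyProvider> = Box::new(SoftwareKeystore { key: 0x5a });
    let sig = kp.sign(b"cache-entry");
    assert!(kp.verify(b"cache-entry", &sig));
    println!("ok");
}
```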
Why is HMAC key distribution over mTLS only?
If HMAC keys appear in config files or environment variables, any process with read access to the environment can forge cache entries. The mTLS channel is the only channel where both ends are authenticated. This is a hard architectural constraint, not a preference.
What CoreSDK is not doing
- No rayon alongside Tokio. `spawn_blocking` is sufficient.
- No premature crate splitting for Phase 2/3 features. Crates are added when features ship.
- No dynamic hook registration at runtime. Security hooks are boot-time only.
- No TLS 1.2 automatic fallback. Explicit opt-in only.
- No variable values in exported traces or LLM payloads. The local terminal viewer is the only place variable values are visible.
- No grpcio replacement in Python v1. betterproto and grpclib are not mature enough for production streaming use.
Testing strategy
| Test type | Scope | Tooling |
|---|---|---|
| Unit tests | Each crate in isolation | cargo test + proptest for PII masking |
| Integration tests | Engine with real regorus + real JWTs | cargo test against fixtures |
| gRPC contract tests | Proto API stability | buf breaking in CI |
| PII masking fuzz | Masking engine against generated payloads | cargo fuzz (libfuzzer) |
| Python SDK tests | Middleware against real sidecar | pytest + testcontainers |
| Cross-language parity | Same JWT + policy fixture, all SDKs | Shared fixture file in repo |
| OTel span assertions | No PII in exported spans | assert_no_pii(spans) in all SDK test suites |
| Load / latency | Policy eval p99 <2ms at 1,000 rps | k6 + custom gRPC script |
Policy eval latency is a CI gate: p99 >2ms on the standard fixture set blocks merge.
Release strategy
- Crate versioning: Start at `1.0.0-rc.1`. Gate `1.0.0` on design partner validation.
- Proto stability: The `v1` namespace is frozen from first public release. Additive changes only (new fields, new RPCs) without a version bump.
- SDK versioning: Each language SDK has its own version. They do not need to be in lockstep with the engine crate version.
- Docker: Multi-arch (`linux/amd64`, `linux/arm64`) from day one. `FROM scratch` + static musl binary.
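A multi-stage build along those lines could look like this. The base image, binary name, and flags are assumptions for illustration, not the shipped Dockerfile:

```dockerfile
# Stage 1: build a fully static binary against musl.
FROM rust:alpine AS build
RUN apk add --no-cache musl-dev
# Force static linking so the binary runs FROM scratch.
ENV RUSTFLAGS="-C target-feature=+crt-static"
WORKDIR /src
COPY . .
RUN cargo build --release

# Stage 2: the image contains only the binary.
FROM scratch
COPY --from=build /src/target/release/coresdk-engine /coresdk-engine
ENTRYPOINT ["/coresdk-engine"]
```

Multi-arch output would come from `docker buildx build --platform linux/amd64,linux/arm64`, with each platform compiling against its native musl target.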
Next steps
- Roadmap — phased feature delivery plan
- Configuration Reference — all configurable knobs and defaults
- Offline Mode — fail mode and cache integrity details
- Authorization with Rego — Rego subset and compatibility notes