Engineering Decisions
Key architectural and tooling decisions in CoreSDK — what we chose, why we chose it, and what trade-offs we accepted.
This page documents the significant technical choices behind CoreSDK: the alternatives we considered, the rationale for each decision, and the constraints those decisions impose. Understanding these helps you predict how CoreSDK will behave in edge cases and how to extend it correctly.
Guiding principles
- Core ships first, alone. Nothing else — no sidecar, no SDK — until the Rust engine API is stable and validated by a design partner.
- One import, one decision. `coresdk-engine` is the only crate users import. Internal decomposition is invisible.
- Policy evaluation is pooled from day one. `regorus::Engine::eval_rule` takes `&mut self`. A single `Mutex<Engine>` will not survive production load.
- PII masking is zero-tolerance. A secret in an exported trace is a security incident, not a bug. Masking is tested on every output path with fuzz coverage.
- No generated code in CI. Proto stubs are checked in. Downstream builds must not require `protoc` or `buf`.
- Fail-open on partition, fail-closed on cache integrity. Network unreachable → continue on cached data. HMAC mismatch → reject the cache entirely.
- Lock-free reads on the hot path. `arc-swap` for config snapshots and JWK sets. No `RwLock` where reads dominate.
Language and tooling decisions
| Concern | Decision | Rationale |
|---|---|---|
| Rust edition | 2024 | Latest stable; async improvements |
| Async runtime | Tokio (multi-thread) | tonic requires it |
| Policy engine | regorus (Rust-native Rego) | Sub-2ms p99; in-process; no OPA process dependency |
| regorus concurrency | Pool of N engines via spawn_blocking | eval_rule is &mut self + synchronous |
| Config hot-reload | arc-swap + notify (file watcher) | Lock-free reads; no restart required |
| JWK cache | arc-swap<JwkSet> + background refresh | Stale-while-revalidate; no auth cliff-edge on refresh failure |
| TLS | rustls (TLS 1.3 only) | Memory-safe; no OpenSSL CVE exposure by default |
| mTLS certs | rcgen for issuance; rustls for termination | No external CA dependency for dev |
| gRPC server | tonic | De-facto Rust gRPC; native TLS support |
| Python gRPC | grpcio | Widest compatibility |
| Python OTel masking | Custom SpanProcessor (not SpanExporter) | Masking before export queue, not after |
| TypeScript proto codegen | Buf + connect-es | ESM-first; streaming support; active maintenance |
| Go proto codegen | Buf + protoc-gen-go-grpc | Standard toolchain |
| All-language codegen | Single buf.gen.yaml | One tool, consistent output across all SDKs |
| mTLS rotation (Go) | tls.GetClientCertificate callback | Called per-handshake; no process restart |
| mTLS rotation (Node.js) | File watcher + channel recreation | Node SecureContext is immutable; must recreate channel |
| OTel dependencies (SDKs) | Peer dependencies, not bundled | Prevent version skew and duplicate singletons |
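A single `buf.gen.yaml` of that shape might look like the sketch below. The plugin identifiers and output paths are plausible assumptions to be checked against the Buf Schema Registry, not the repository's actual config:

```yaml
# buf.gen.yaml: one codegen config for every SDK (v1 config format).
version: v1
plugins:
  # TypeScript: message types + Connect services (ESM-first, streaming)
  - plugin: buf.build/bufbuild/es
    out: sdk/typescript/src/gen
  - plugin: buf.build/connectrpc/es
    out: sdk/typescript/src/gen
  # Go: standard protoc-gen-go + grpc toolchain
  - plugin: buf.build/protocolbuffers/go
    out: sdk/go/gen
  - plugin: buf.build/grpc/go
    out: sdk/go/gen
  # Python: grpcio-compatible stubs
  - plugin: buf.build/protocolbuffers/python
    out: sdk/python/coresdk/gen
  - plugin: buf.build/grpc/python
    out: sdk/python/coresdk/gen
```

Because the stubs are checked in, this file runs on a maintainer's machine (`buf generate`), never in CI or downstream builds.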
Key decisions and rationale
Why regorus over embedded OPA?
OPA requires a separate process or Wasm compilation. regorus runs in-process in Rust, eliminating a process boundary and its IPC overhead. The trade-off is that regorus implements a subset of Rego — acceptable for our use, and documented in the Rego compatibility reference.
Why pool regorus engines?
`regorus::Engine::eval_rule` is `&mut self` and synchronous. A single `Mutex<Engine>` serialises all policy evaluations. Under realistic load (100+ concurrent requests), this is a throughput bottleneck. Pooling — one engine per blocking thread, each loaded with identical policies — gives near-linear scaling with pool size.
Hot-reload must update all N engines atomically. This is a hard constraint on the pool design; retrofitting it is expensive. The pool was designed this way from day one.
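The pool shape can be sketched without the real engine: a bounded channel hands engines out and takes them back, so checkout blocks when all N are busy. The `Engine` stand-in below is illustrative (the real pool wraps `regorus::Engine`, and checkout runs inside Tokio's `spawn_blocking`); a hot reload would drain the channel and refill it with N freshly loaded engines in one step.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::sync::Mutex;

// Stand-in for regorus::Engine: eval_rule takes &mut self, like the real API.
struct Engine {
    policy_version: u64,
}

impl Engine {
    fn eval_rule(&mut self, _rule: &str) -> bool {
        // The real engine evaluates Rego here.
        self.policy_version > 0
    }
}

/// Fixed-size pool: engines circulate through a bounded channel.
/// Checkout blocks when all N engines are in use, bounding concurrency.
struct EnginePool {
    tx: SyncSender<Engine>,
    rx: Mutex<Receiver<Engine>>,
}

impl EnginePool {
    fn new(size: usize) -> Self {
        let (tx, rx) = sync_channel(size);
        for _ in 0..size {
            tx.send(Engine { policy_version: 1 }).unwrap();
        }
        EnginePool { tx, rx: Mutex::new(rx) }
    }

    fn eval(&self, rule: &str) -> bool {
        // Check out an engine (blocks if all are busy)...
        let mut engine = self.rx.lock().unwrap().recv().unwrap();
        let result = engine.eval_rule(rule);
        // ...and return it to the pool.
        self.tx.send(engine).unwrap();
        result
    }
}

fn main() {
    let pool = EnginePool::new(4);
    println!("{}", pool.eval("data.authz.allow"));
}
```

The `Mutex` here guards only the channel receiver for the duration of a `recv`, not a policy evaluation, which is the difference from the single `Mutex<Engine>` design this decision rejects.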
Why SpanProcessor for PII masking, not SpanExporter?
SpanProcessor fires before the span enters the export queue. SpanExporter fires after. The export queue holds spans in memory briefly — masking at the exporter layer means PII exists in memory buffers, which is insufficient for a zero-tolerance commitment. SpanProcessor masking is the only correct architecture.
Consequence: third-party span processors that run after CoreSDK's processor may still see unmasked data if they inspect attribute maps before CoreSDK's processor runs. Always register CoreSDK's processor first.
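The masking rule the processor applies is conceptually small. A minimal, std-only Rust sketch of the idea follows; the denylist, the bearer-token check, and the marker string are illustrative assumptions (the shipped masker is broader and fuzz-tested):

```rust
use std::collections::HashMap;

// Illustrative denylist: attribute keys that always carry secrets.
const DENY_KEYS: &[&str] = &["authorization", "password", "api_key", "set-cookie"];

/// Mask span attributes in place, before they reach the export queue.
/// Key match is case-insensitive; values that look like bearer tokens
/// are masked even under innocuous keys.
fn mask_attributes(attrs: &mut HashMap<String, String>) {
    for (key, value) in attrs.iter_mut() {
        let k = key.to_ascii_lowercase();
        if DENY_KEYS.iter().any(|d| k.contains(d)) || value.starts_with("Bearer ") {
            *value = "***MASKED***".to_string();
        }
    }
}

fn main() {
    let mut attrs = HashMap::from([
        ("http.request.header.authorization".to_string(), "Bearer abc123".to_string()),
        ("http.method".to_string(), "GET".to_string()),
    ]);
    mask_attributes(&mut attrs);
    println!("{}", attrs["http.request.header.authorization"]); // ***MASKED***
    println!("{}", attrs["http.method"]); // GET
}
```

Running this in the processor's end-of-span hook is what keeps the unmasked value out of the export queue's memory buffers.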
Why connect-es over ts-proto?
connect-es is ESM-native, generates strict TypeScript, and supports server-streaming RPCs (required for WatchConfig and WatchPolicyUpdates). ts-proto has better DX for simple unary RPCs but streaming support is immature. The streaming requirement is not optional — config and policy hot-reload depend on it.
Why fail-open on control plane partition?
Failing closed on network partition would make the sidecar a single point of failure for every application it protects. Enterprise users in regulated environments that require fail-closed behaviour can configure it explicitly — but the default must be fail-open to prevent mass outages from transient network issues.
See Offline Mode for fail_mode configuration.
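The two failure axes compose into one small decision: cache-integrity failures always deny, and only the partition behaviour is configurable. A hedged sketch with illustrative type and field names (the real knob is the `fail_mode` setting documented in Offline Mode):

```rust
/// What to do when the control plane is unreachable (configurable).
/// Cache-integrity failures are never configurable: always rejected.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FailMode {
    Open,   // default: continue on cached data
    Closed, // regulated environments: deny
}

#[derive(Debug, PartialEq)]
enum Decision {
    ServeFromCache,
    Deny,
}

fn on_control_plane_error(fail_mode: FailMode, cache_hmac_valid: bool) -> Decision {
    // Fail-closed on cache integrity: a bad HMAC means the cache cannot
    // be trusted, regardless of fail_mode.
    if !cache_hmac_valid {
        return Decision::Deny;
    }
    // Fail-open on partition: network errors fall back to cached data
    // unless the operator explicitly opted into fail-closed.
    match fail_mode {
        FailMode::Open => Decision::ServeFromCache,
        FailMode::Closed => Decision::Deny,
    }
}

fn main() {
    assert_eq!(on_control_plane_error(FailMode::Open, true), Decision::ServeFromCache);
    assert_eq!(on_control_plane_error(FailMode::Open, false), Decision::Deny);
    assert_eq!(on_control_plane_error(FailMode::Closed, true), Decision::Deny);
    println!("ok");
}
```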
Why ring + aes-gcm over josekit for JWE?
josekit pulls in OpenSSL as a transitive dependency, defeating the pure-Rust / no-OpenSSL-CVE goal. JWE decryption (RSA-OAEP + A256GCM) reduces to two well-scoped operations, implemented with ring (RSA-OAEP) and aes-gcm (A256GCM). Both are widely used, independently audited crates.
Why a KeyProvider trait?
HSM (PKCS#11) integration is a Phase 3 feature. Designing a KeyProvider trait now (sign(), verify(), hmac(), decrypt()) isolates all key material operations behind one boundary. The software keystore is the default implementation; HSM is a swap-in with no business logic changes required.
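As a sketch, the trait boundary might look like the following. The signatures and the toy keystore are illustrative assumptions, not the shipped API; the XOR "crypto" exists only to show the shape:

```rust
/// Boundary for all key-material operations. The software keystore is the
/// default implementation; an HSM (PKCS#11) implementation slots in behind
/// the same trait with no business-logic changes.
trait KeyProvider: Send + Sync {
    fn sign(&self, data: &[u8]) -> Vec<u8>;
    fn verify(&self, data: &[u8], sig: &[u8]) -> bool;
    fn hmac(&self, data: &[u8]) -> Vec<u8>;
    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy software keystore. XOR is NOT real cryptography; it stands in for
/// the audited primitives the real keystore would call.
struct SoftwareKeystore {
    key: u8,
}

impl KeyProvider for SoftwareKeystore {
    fn sign(&self, data: &[u8]) -> Vec<u8> {
        self.hmac(data)
    }
    fn verify(&self, data: &[u8], sig: &[u8]) -> bool {
        self.sign(data) == sig
    }
    fn hmac(&self, data: &[u8]) -> Vec<u8> {
        data.iter().map(|b| b ^ self.key).collect()
    }
    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, String> {
        Ok(ciphertext.iter().map(|b| b ^ self.key).collect())
    }
}

fn main() {
    // Business logic sees only the trait object; swapping in an HSM
    // changes construction, not call sites.
    let kp: Box<dyn KeyProvider> = Box::new(SoftwareKeystore { key: 0x5a });
    let sig = kp.sign(b"cache-entry");
    assert!(kp.verify(b"cache-entry", &sig));
    println!("ok");
}
```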
Why is HMAC key distribution over mTLS only?
If HMAC keys appear in config files or environment variables, any process with read access to the environment can forge cache entries. The mTLS channel is the only channel where both ends are authenticated. This is a hard architectural constraint, not a preference.
What CoreSDK is not doing
- No rayon alongside Tokio. `spawn_blocking` is sufficient.
- No premature crate splitting for Phase 2/3 features. Crates are added when features ship.
- No dynamic hook registration at runtime. Security hooks are boot-time only.
- No TLS 1.2 automatic fallback. Explicit opt-in only.
- No variable values in exported traces or LLM payloads. The local terminal viewer is the only place variable values are visible.
- No grpcio replacement in Python v1. betterproto and grpclib are not mature enough for production streaming use.
Testing strategy
| Test type | Scope | Tooling |
|---|---|---|
| Unit tests | Each crate in isolation | cargo test + proptest for PII masking |
| Integration tests | Engine with real regorus + real JWTs | cargo test against fixtures |
| gRPC contract tests | Proto API stability | buf breaking in CI |
| PII masking fuzz | Masking engine against generated payloads | cargo fuzz (libfuzzer) |
| Python SDK tests | Middleware against real sidecar | pytest + testcontainers |
| Cross-language parity | Same JWT + policy fixture, all SDKs | Shared fixture file in repo |
| OTel span assertions | No PII in exported spans | assert_no_pii(spans) in all SDK test suites |
| Load / latency | Policy eval p99 <2ms at 1,000 rps | k6 + custom gRPC script |
Policy eval latency is a CI gate: p99 >2ms on the standard fixture set blocks merge.
Release strategy
- Crate versioning: Start at `1.0.0-rc.1`. Gate `1.0.0` on design partner validation.
- Proto stability: The `v1` namespace is frozen from first public release. Additive changes only (new fields, new RPCs) without a version bump.
- SDK versioning: Each language SDK has its own version. They do not need to be in lockstep with the engine crate version.
- Docker: Multi-arch (`linux/amd64`, `linux/arm64`) from day one. `FROM scratch` + static musl binary.
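A multi-stage build along those lines could look like this. The base image, binary name, and flags are assumptions for illustration, not the shipped Dockerfile:

```dockerfile
# Stage 1: build a fully static binary against musl.
FROM rust:alpine AS build
RUN apk add --no-cache musl-dev
# Force static linking so the binary runs FROM scratch.
ENV RUSTFLAGS="-C target-feature=+crt-static"
WORKDIR /src
COPY . .
RUN cargo build --release

# Stage 2: the image contains only the binary.
FROM scratch
COPY --from=build /src/target/release/coresdk-engine /coresdk-engine
ENTRYPOINT ["/coresdk-engine"]
```

Multi-arch output would come from `docker buildx build --platform linux/amd64,linux/arm64`, with each platform compiling against its native musl target.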
Next steps
- Roadmap — phased feature delivery plan
- Configuration Reference — all configurable knobs and defaults
- Offline Mode — fail mode and cache integrity details
- Authorization with Rego — Rego subset and compatibility notes