Operations

Eight operational disciplines turn the architecture into a system reviewers can trust. Five of them are platform-wide and arrived with the Foundation v1 refactor: AlloyDB pause-by-default, the per-route BACKEND_ROUTING cutover, the PAM-mediator service that injects per-grant elevation delays, the audit-alerter Cloud Run job, and the EU residency boundary. Three are calibration-pipeline disciplines that pre-dated the foundation and ride along: idempotency under retry, calibration auto-pause, and the iteration namespace convention paired with the cohort-keyed calibration profile registry.

This page is the unvarnished status of each. Where a discipline has not yet been drilled end-to-end, it says so explicitly with a ⏳.

1. AlloyDB pause-by-default — the cost-control plane

The AlloyDB cluster is paused by default. It resumes on demand when an authorised request needs it, stays awake while load continues, and pauses again after a configurable idle window (default 15 min). The 3–5 minute cold-start latency is an explicit trade-off: always-on AlloyDB at ~€450/mo is unjustified for a workload that needs the cluster awake only a fraction of each day; pause-by-default lands at ~€48–50/mo and re-tightens to always-on automatically when sustained traffic continuously fires the heartbeat. Foundation v1 §F-3.13 commits this; the implementation lives at services/nexus-alloydb-controller/, services/nexus-alloydb-auto-pause/, and src/lib/alloydb-warming.ts.

Five components

nexus-alloydb-controller Cloud Run service. Single source of truth for cluster state. Holds a Firestore document per cluster (system_status/alloydb_prod) with fields state (PAUSED | RESUMING | READY | PAUSING), last_resume_at, last_heartbeat_at, pause_after_idle_sec (default 900), pinned_until, pin_reason. Endpoints: POST /resume, POST /pause, POST /heartbeat, GET /status, POST /pin {duration, reason}, POST /unpin. The service runs paused-by-default itself (min-instances=0; no resources held when not invoked).
nexus-alloydb-auto-pause Cloud Run Job. Triggered by Cloud Scheduler every 2 minutes. Per cluster: if pinned_until > now, skip; if now - last_heartbeat_at > pause_after_idle_sec AND state == READY, pause. Logs every decision; emits an alloydb_pause_decision Cloud Monitoring metric for the cost dashboard.
Wake-up middleware library in every AlloyDB-touching Cloud Run service. Every successful DB transaction posts an async heartbeat: one line in the data-access path, no performance impact. Two modes, chosen per endpoint:
- Block-and-wait for background / admin / batch endpoints: call /resume, poll status every 10 s, proceed when READY (up to 5 min).
- Fail-fast 503 for user-facing synchronous endpoints: if state != READY, return 503 with Retry-After: 180 header and {error: "warming_up", eta_seconds: N} body; trigger /resume async so the next retry succeeds.
Frontend warming UX in src/lib/alloydb-warming.ts (208 LOC). Central handler for 503 + warming_up: toast "Database is warming up — this takes ~3 minutes. Refreshing automatically when ready."; polls /api/system/alloydb every 15 s; auto-retries the original action when READY. Initial page load fires a "warm" probe in layout.tsx so a user clicking through finds the DB ready by the time they reach an AlloyDB-backed view.
Operator CLI at scripts/nexus-alloydb.ps1 and scripts/nexus-alloydb-pin.ps1. Single-purpose commands: wake prod, pin prod --duration 4h --reason active-dev, sleep prod, status. Used by the operator at session start and by the smoke-test runner.
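
The auto-pause job's per-cluster decision can be sketched in a few lines. This is a minimal illustration, assuming the Firestore fields listed above are already loaded into a plain object; `ClusterStatus` and `pause_decision` are hypothetical names, not the job's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterStatus:
    """Subset of the fields held in system_status/alloydb_prod (illustrative)."""
    state: str                        # PAUSED | RESUMING | READY | PAUSING
    last_heartbeat_at: datetime
    pause_after_idle_sec: int = 900   # default 15-minute idle window
    pinned_until: Optional[datetime] = None


def pause_decision(status: ClusterStatus, now: datetime) -> str:
    """Return "skip_pinned", "pause", or "keep" for one scheduler tick."""
    # A live pin always wins: the job skips the cluster entirely.
    if status.pinned_until is not None and status.pinned_until > now:
        return "skip_pinned"
    idle_sec = (now - status.last_heartbeat_at).total_seconds()
    # Only a READY cluster that has been idle past the window is paused.
    if idle_sec > status.pause_after_idle_sec and status.state == "READY":
        return "pause"
    return "keep"


now = datetime(2026, 7, 9, 12, 0, tzinfo=timezone.utc)
idle_cluster = ClusterStatus(state="READY", last_heartbeat_at=now - timedelta(minutes=20))
pinned_cluster = ClusterStatus(
    state="READY",
    last_heartbeat_at=now - timedelta(minutes=20),
    pinned_until=now + timedelta(hours=4),
)
print(pause_decision(idle_cluster, now))    # pause
print(pause_decision(pinned_cluster, now))  # skip_pinned
```

Keeping the decision a pure function of the status document and the clock makes each logged decision reproducible from the stored fields alone.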

Operator pinning patterns

Situation	Recommended pin
Active development session	`pin prod --duration 4h --reason active-dev`
Smoke test run	Auto-pinned 30 min by the smoke-test runner
External review window	`pin prod --duration 8h --reason review-window` (business hours of each review day)
Phase-2 calibration corpus run	`pin prod --duration 7d --reason phase2-corpus`
Otherwise	Nothing pinned; auto-pause after 15 min idle

The full operator playbook lives at docs/operations/alloydb_lifecycle.md.

Edge-case discipline (built in by-design, not retrofitted)

Datastream CDC during pause. Datastream replication slot stays valid; catches up automatically on resume. AlloyDB's max safe pause is ~7 days; a "max-idle-pause" safety override re-resumes briefly every 5 days to keep the slot fresh.
Connection-pool reconnect. Every service uses pgbouncer in transaction-pooling mode with server_reset_query; first query after resume retries-once on connection_reset.
Scheduled jobs that need DB. Every scheduled job starts with a block-and-wait /resume call. Schedule includes a 5-minute wake-buffer.
Catastrophic-fail safety. If /resume fails 3 times in 10 min, the controller emits a Pub/Sub alert AND falls back to "stay-awake" mode for 1 hr to prevent cascading outage. Better to overspend €0.50 than to drop traffic.
Concurrent resume calls. Controller serialises resume / pause requests per cluster (Firestore transaction on the state doc); a flood of simultaneous calls from 10 services results in exactly one resume.
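
The catastrophic-fail rule (3 failed /resume calls within 10 minutes, then 1 hour of stay-awake) amounts to a sliding-window counter. The sketch below is illustrative only: `ResumeFailureGuard` is a hypothetical name, timestamps are plain seconds, and the Pub/Sub alert is left to the caller when `record_failure` returns True.

```python
from collections import deque
from typing import Deque, Optional

FAILURE_THRESHOLD = 3          # failed /resume calls that trip the fallback
FAILURE_WINDOW_SEC = 10 * 60   # counted over a 10-minute window
STAY_AWAKE_SEC = 60 * 60       # fallback holds the cluster awake for 1 hour


class ResumeFailureGuard:
    def __init__(self) -> None:
        self._failures: Deque[float] = deque()
        self.stay_awake_until: Optional[float] = None

    def record_failure(self, now: float) -> bool:
        """Record one failed /resume; return True if it trips stay-awake mode."""
        self._failures.append(now)
        # Drop failures that have aged out of the 10-minute window.
        while self._failures and now - self._failures[0] > FAILURE_WINDOW_SEC:
            self._failures.popleft()
        if len(self._failures) >= FAILURE_THRESHOLD:
            self.stay_awake_until = now + STAY_AWAKE_SEC
            self._failures.clear()
            return True
        return False

    def may_pause(self, now: float) -> bool:
        """Auto-pause is allowed only outside the stay-awake period."""
        return self.stay_awake_until is None or now >= self.stay_awake_until
```

In this sketch the auto-pause job would consult `may_pause` before issuing a pause, so the fallback overrides the idle window for the full hour.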

Drilled evidence

The auto-pause loop has demonstrated successful idle-pause and successful resume cycles in prod. The fail-fast-503 → frontend retry → success path has been integration-tested against the prod cluster via scripts/nexus-alloydb.ps1 sleep followed by an AlloyDB-backed page load.
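
The fail-fast branch exercised by that test can be sketched as a small guard that runs before any database work. This is an assumption-laden illustration: `fail_fast_guard` and the tuple-shaped response are hypothetical, and the real middleware's framework integration is not shown.

```python
from typing import Callable, Dict, Optional, Tuple

RETRY_AFTER_SEC = 180  # value of the Retry-After header described above

# (status code, headers, JSON body)
Response = Tuple[int, Dict[str, str], Dict[str, object]]


def fail_fast_guard(
    state: str,
    eta_seconds: int,
    trigger_resume: Callable[[], None],
) -> Optional[Response]:
    """Return a 503 response when the cluster is not READY, else None."""
    if state == "READY":
        return None  # caller proceeds with the database call
    # Kick off the resume so the client's next retry can succeed.
    trigger_resume()
    return (
        503,
        {"Retry-After": str(RETRY_AFTER_SEC)},
        {"error": "warming_up", "eta_seconds": eta_seconds},
    )
```

A handler would call the guard first and return its result unchanged when it is not None; the frontend handler then recognises the warming_up body and schedules the retry.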

The cost-control plane is structural, not optional. AlloyDB at €450/mo always-on would buy hours of idle compute for every hour of real work. ~€50/mo with the same architecture is the correct posture at current scale; the same code re-tightens to always-on automatically when sustained traffic justifies it.

2. Per-route backend cutover — `BACKEND_ROUTING`

The Foundation v1 compute commitment (§F-3.4) is to decompose the 1,461-line functions/main.py monolith into per-domain Cloud Run services. The cutover is per-route and per-deploy, not big-bang. The single source of truth lives in src/lib/api-helpers.ts as the BACKEND_ROUTING map: keys are <METHOD> <path> exactly as the Next.js route is shaped; values are "new" (per-domain Cloud Run service in neumatics-prod, europe-west4) or "legacy" (historical: the old-org Cloud Functions, decommissioned 2026-06; all routes are now "new").

Frontend React → /api/<route> → Next.js handler → callBackend(...)
                                                     │
                          ┌──────────────────────────┴──────────────────┐
                          │                                             │
                   BACKEND_ROUTING[key] = "new"            BACKEND_ROUTING[key] = "legacy"
                          │                                             │
                   API_BASE_URL_NEW                          API_BASE_URL_LEGACY
                   (per-service Cloud Run                    (Cloud Functions URLs;
                    URL on neumatics-prod)                decommissioned)

Every callBackend response is asserted against the X-Backend-Origin header — new- prefix for new-infra, legacy- prefix for legacy. Mismatches log to the server console so routing bugs surface loudly instead of silently corrupting data.
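
The origin assertion is a prefix check on one response header. The sketch below is a Python analogue of what the TypeScript helper is described as doing; `check_backend_origin` is a hypothetical name, and the header values in the usage lines are made up for illustration.

```python
import logging
from typing import Mapping

logger = logging.getLogger("backend_routing")


def check_backend_origin(
    route_key: str,
    expected_backend: str,          # "new" or "legacy", as in BACKEND_ROUTING
    headers: Mapping[str, str],
) -> bool:
    """Return True when X-Backend-Origin carries the expected prefix."""
    origin = headers.get("X-Backend-Origin", "")
    ok = origin.startswith(f"{expected_backend}-")
    if not ok:
        # Log loudly rather than fail silently; a missing header also counts.
        logger.error(
            "backend origin mismatch for %s: expected prefix %r, got %r",
            route_key, f"{expected_backend}-", origin,
        )
    return ok


print(check_backend_origin(
    "POST /api/translate", "new", {"X-Backend-Origin": "new-nexus-translate"}
))  # True
print(check_backend_origin(
    "POST /api/translate", "new", {"X-Backend-Origin": "legacy-translate"}
))  # False
```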

Per-route status (as of 2026-07-09)

Route	Backend	Service / function
`POST /api/consent/subject`	`new`	`nexus-consent`
`GET /api/consent/subject`	`new`	`nexus-consent`
`GET /api/consent/catalog`	`new`	`nexus-consent`
`POST /api/budget/poll`	`new`	`nexus-budget`
`GET /api/audit/events`	`new`	`nexus-audit-alerter`
`POST /api/translate`	`new`	`nexus-translate`
`GET /api/vocabulary`	`new`	`nexus-vocabulary`
`POST /api/embedding`	`new`	`embedding`
`GET /api/feature-registry`	`new`	`nexus-feature-registry`
`POST /api/echo/start`	`new`	`nexus-echo`
`POST /api/echo/turn`	`new`	`nexus-echo`
`POST /api/echo/discard`	`new`	`nexus-echo` session lifecycle
`POST /api/echo/pause`	`new`	`nexus-echo` session lifecycle
`POST /api/echo/resume`	`new`	`nexus-echo` session lifecycle
`GET /api/profile/current`	`new`	`/v1/subject_current` query path (AlloyDB)
`POST /api/profile/film`	`new`	`nexus-substrate-api` `/v1/query`
`POST /api/cohort/drift`	`new`	`nexus-substrate-api` `/v1/query`

Every new endpoint MUST add an entry; silent fall-through to legacy is exactly the bug the constant is supposed to prevent (routeBackend() throws on an unrouted key). Flipping a route from "legacy" to "new" requires no apphosting redeploy if both env vars are already set. The per-service graduation gate is iteration-B plus a 24-hour dual-stack soak, per docs/operations/apphosting_cutover.md.
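
The throw-on-unrouted behaviour can be illustrated with a Python analogue of the map lookup. The real map lives in TypeScript; `route_backend` and `UnroutedEndpointError` below are hypothetical names, and only three of the routes from the table are reproduced.

```python
from typing import Dict

# Keys are "<METHOD> <path>", values are "new" or "legacy".
BACKEND_ROUTING: Dict[str, str] = {
    "POST /api/consent/subject": "new",
    "GET /api/consent/subject": "new",
    "POST /api/translate": "new",
}


class UnroutedEndpointError(KeyError):
    """Raised when a route has no BACKEND_ROUTING entry."""


def route_backend(method: str, path: str) -> str:
    """Look up the backend for a route; never fall through to a default."""
    key = f"{method.upper()} {path}"
    try:
        return BACKEND_ROUTING[key]
    except KeyError:
        raise UnroutedEndpointError(f"no BACKEND_ROUTING entry for {key!r}") from None


print(route_backend("POST", "/api/translate"))  # new
```

The design point is the absence of a `.get(key, "legacy")` style default: a missing entry fails at the first request instead of quietly reaching the wrong backend.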

3. PAM-mediator — per-grant elevation delays

Foundation v1 §S-4.10 + OD-19 commit to a Privileged Access Manager (PAM) policy with per-grant delays before activation: deployer 5 min, DBA 30 min, KMS-admin 30 min + mandatory justification + per-action notification email. Native PAM does not document a per-grant delay knob; the spec resolves this with a thin Cloud Run mediator that intermediates between the elevation request and the actual PAM grant.

The mediator is deployed at services/nexus-pam-mediator/ (~318 LOC of Terraform plus the service code). The flow:

The operator (or Claude Code, or any tool) requests an elevation through the mediator endpoint.
The mediator records the request to audit, sleeps the configured delay (5 / 30 / 30 min depending on the grant class), then issues the actual PAM grant request via the PAM API.
The operator receives the activation when the delay completes.
For KMS-admin: a per-action notification email fires inside the delay window, so the operator can cancel the request if they didn't intend it.
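
The flow above can be sketched with the side effects injected as callables, which keeps the ordering visible and testable without a real 30-minute sleep. Everything here is illustrative: `mediate_elevation`, the grant-class keys, and the callable parameters are assumptions, the PAM API call itself is not shown, and cancellation during the delay is not modelled.

```python
from typing import Callable, Dict

# Per-grant delays in minutes, as committed in §S-4.10 + OD-19.
ELEVATION_DELAY_MIN: Dict[str, int] = {"deployer": 5, "dba": 30, "kms-admin": 30}


def mediate_elevation(
    grant_class: str,
    justification: str,
    *,
    audit: Callable[[str], None],
    notify: Callable[[str], None],
    sleep: Callable[[float], None],
    request_grant: Callable[[str], None],
) -> None:
    """Record, delay, then issue the grant; KMS-admin adds two extra checks."""
    if grant_class not in ELEVATION_DELAY_MIN:
        raise ValueError(f"unknown grant class: {grant_class!r}")
    if grant_class == "kms-admin" and not justification.strip():
        raise ValueError("kms-admin elevation requires a justification")

    audit(f"elevation requested: {grant_class}")
    if grant_class == "kms-admin":
        # Sent before the delay starts, so it lands inside the delay window.
        notify(f"kms-admin elevation pending: {justification}")
    sleep(ELEVATION_DELAY_MIN[grant_class] * 60)
    request_grant(grant_class)
```

In production the `sleep` argument would be a real timer or a scheduled task; in a test it can be a recorder, which is how the ordering is checked without waiting.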

The PAM entitlements themselves (per-class grant scopes, justification requirements, audit-log routing) are configured outside Terraform via the GCP console; the mediator service exists to bridge the spec's delay commitment to PAM's missing knob. The elevation flow is documented in full at docs/security/access_patterns.md and the Claude-Code-specific snippet at docs/security/claude_code_access.md.

4. Audit alerter — curated security event routing

Foundation v1 §S-4.5 + S-4.9 commit to a curated event list that routes to Pub/Sub topic nexus-security-alert.v1, fanned out to the operator. The operator does not read raw audit logs; the event list is what surfaces. Most weeks none fire.

The nexus-audit-alerter Cloud Run job is deployed at services/nexus-audit-alerter/ and scheduled by Cloud Scheduler. It subscribes to the Pub/Sub topic and processes the following event types out of the aggregated org-level audit sink in neumatics-audit-logs:

Service-account-key creation attempts (denied by iam.disableServiceAccountKeyCreation org policy, but log the attempt).
KMS key access from a service account not on the expected list.
VPC Service Controls perimeter violations.
IAM grants to principals outside the neumatics.eu domain.
setIamPolicy on neumatics-prod.
bigquery.dataAccess on tables tagged art9_status:true|inferred ⏳ (Knowledge Catalog tagging baseline not yet applied).
storage.buckets.setIamPolicy on the audit-log archive bucket.
cloudkms.cryptoKeyVersions.destroy outside the rotation flow.
Failed Binary Authorization deploys (more than 3 per day → alert).
Severity-HIGH or CRITICAL Security Command Center findings ⏳ (SCC Premium not yet enabled).
Datastream replication slot lag > 30 minutes (data-loss risk).
Apphosting deploy failures (operational, not strictly security).
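
Two entries in the list are threshold rules rather than single-event matches. A minimal sketch of how such thresholds might be evaluated follows; the function name, the alert labels, and the assumption that both counts arrive pre-aggregated are all hypothetical.

```python
from typing import List

BINAUTHZ_FAILURES_PER_DAY_LIMIT = 3   # alert on more than 3 failed deploys per day
DATASTREAM_LAG_LIMIT_MIN = 30         # alert on lag above 30 minutes


def threshold_alerts(binauthz_failures_today: int, datastream_lag_min: float) -> List[str]:
    """Return the threshold-based alerts that should fire for these readings."""
    alerts: List[str] = []
    if binauthz_failures_today > BINAUTHZ_FAILURES_PER_DAY_LIMIT:
        alerts.append("binauthz_failed_deploys")
    if datastream_lag_min > DATASTREAM_LAG_LIMIT_MIN:
        alerts.append("datastream_slot_lag")
    return alerts


print(threshold_alerts(3, 12.0))   # []
print(threshold_alerts(4, 45.0))   # ['binauthz_failed_deploys', 'datastream_slot_lag']
```

Both comparisons are strict, matching the "more than 3" and "> 30 minutes" wording in the list.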

Routes have read-only access to the audit dataset; only the nexus-audit-readers workforce group can query the dataset directly. The full operator weekly / monthly / quarterly cadence is in docs/security/operator_playbook.md.

5. EU residency boundary

The hard line: every BigQuery dataset, Cloud Storage bucket, Cloud Run service, Cloud Run Job, Cloud Workflow, AlloyDB cluster, KMS keyring, and Vertex AI Custom Training Job runs in europe-west4 or in the EU BigQuery multi-region. One documented exception:

The Vertex publisher model gemini-3-flash-preview is /global/-only for the Phase-2 calibration corpus generation. The corpus is fully synthetic: every persona is sampled from the copula on gs://neumatics-prod-corpus/personas/library_v2_copula.jsonl, and every scenario is from the locked library gs://neumatics-prod-corpus/scenarios/library_v1.jsonl. The Vertex Batch input that crosses the residency boundary contains only synthetic personas, synthetic scenarios, and a structured JSON response schema: no personal data, no consumer text, no API keys. Smoke #8 surfaced the /global/-only constraint when a europe-west1-pinned submission failed with "400 location in model name doesn't match".

The org-policy bundle (infra/org-policies/) enforces gcp.resourceLocations in:europe-west4-locations with a BigQuery-EU exception. New resources outside this constraint cannot be created — the API call is denied org-wide.

6. Idempotency under retry (calibration pipeline)

Every operation in the calibration pipeline is keyed by a stable dedupe identifier; resubmitting the same key produces no duplicate records and no double-counted spend.

Operation	Dedupe key	Mechanism
Vertex AI Batch shard	`(iteration, shard_index)`	Resubmit produces identical input JSONL + identical output GCS prefix; output overwrite is benign.
BigQuery `sessions` row	`(iteration, persona_id, session_id, attempt)`	Per-shard staging table → `MERGE` on key → drop staging.
BigQuery `cost_ledger` row	`vertex_call_uid`	Per-shard staging → `MERGE` on Vertex's unique request ID → drop staging.
Cloud Run Job invocation	Cloud Run Job execution name	Job-level dedup via Workflow execution ID.
Vertex Custom Training	`displayName = calibration-iter{N}-fam{F}`	Re-submission picks up where a failed job left off.
Workflow execution	`workflow_execution_id`	Logs / heartbeats / Firestore docs all keyed off it.

The mechanism on the BigQuery side: the shard worker writes a per-shard JSONL file to GCS, runs a BigQuery load job into a per-shard staging table, runs a MERGE from staging into the canonical table on the dedupe key, then drops staging. Duplicate rows in staging (say, because the worker retried a partial batch) collapse into one canonical row at the MERGE step. Load jobs are free, and BigQuery caps them at 1,500 per table per day; we are three orders of magnitude under that limit.
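
The collapse-on-key behaviour of the staging-then-MERGE step can be shown with an in-memory analogue. This is not the pipeline's SQL: a dict stands in for the canonical `sessions` table, `merge_staging` is a hypothetical name, and the row contents are invented.

```python
from typing import Dict, Iterable, Tuple

# Dedupe key for sessions rows: (iteration, persona_id, session_id, attempt)
Key = Tuple[int, str, str, int]


def merge_staging(canonical: Dict[Key, dict], staging: Iterable[dict]) -> None:
    """Upsert staging rows into the canonical table, one row per dedupe key."""
    for row in staging:
        key = (row["iteration"], row["persona_id"], row["session_id"], row["attempt"])
        # Matched key: overwrite. Unmatched key: insert. Never append.
        canonical[key] = row


sessions: Dict[Key, dict] = {}
shard_rows = [
    {"iteration": 12, "persona_id": "p1", "session_id": "s1", "attempt": 0, "turns": 8},
    {"iteration": 12, "persona_id": "p1", "session_id": "s1", "attempt": 0, "turns": 8},  # retried write
    {"iteration": 12, "persona_id": "p2", "session_id": "s7", "attempt": 0, "turns": 5},
]
merge_staging(sessions, shard_rows)
merge_staging(sessions, shard_rows)   # whole shard resubmitted
print(len(sessions))  # 2
```

One caveat when translating this to SQL: a BigQuery MERGE does not collapse duplicate source rows by itself, so the staging side would normally be deduplicated on the key inside the USING clause for the same guarantee to hold.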

Drilled evidence (idempotency)

Smoke #11 produced this drill incidentally. A Vertex Batch SUCCEEDED, the downstream MERGE failed on a STRING-vs-FLOAT64 schema drift, and the Cloud Run Job auto-retry kicked a fresh batch. The retry-and-MERGE path absorbed the duplicate work cleanly: zero duplicate cost_ledger rows, zero duplicate sessions rows, the cost being one extra Vertex Batch's worth of compute (~$25, attributed in the cost-engineering page).

Smoke #16 hit the same path with a different upstream cause — flatten_session_for_bq was writing a json.dumps()'d string into a JSON-typed column. After the fix, the worker retried; the MERGE absorbed the duplicates cleanly.

A deliberate idempotency stress drill — re-fire the same (iteration, shard_index) six times in parallel and assert one MERGE'd row per vertex_call_uid across all six attempts — has not been run. The infrastructure to drill it is in place. The drill is on the Phase-2 hardening backlog (R5 ⏳ #4).

7. Calibration auto-pause — two pathways

Two pathways guard against calibration runaway spend (in addition to the platform-wide AlloyDB pause-by-default and the per-project budget alerts described in cost engineering). One is live and exercised by every iteration; one is shipped infrastructure but not yet drilled end-to-end.

7.1 Live: cost-ledger-driven gate

workflows/iteration_runner.yaml step check_budget_pre calls Cloud Function budget_check at iteration start. The function:

Reads the Firestore flag calibration_runtime/budget_state.paused.
Queries cost_ledger for current iteration spend and lifetime spend.
Returns {paused: true} if iteration spend ≥ 90 % of iteration_budget_usd or lifetime spend ≥ 90 % of HARD_CAP_USD (default $25,000).

If paused == true, the workflow short-circuits to pause_iteration without submitting LLM calls. This pathway is live. Every shakedown smoke called check_budget_pre. The cost-ledger query is the binding safety net.
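
The gate's decision rule is small enough to state as a pure function. The sketch below assumes the Firestore flag and the two cost_ledger totals have already been read; the function signature and the `reason` field are hypothetical, as is the choice to treat an already-set flag as an immediate pause.

```python
from typing import Dict

HARD_CAP_USD = 25_000.0   # default lifetime cap
PAUSE_FRACTION = 0.90     # pause at 90 % of either budget


def budget_check(
    flag_paused: bool,
    iteration_spend_usd: float,
    iteration_budget_usd: float,
    lifetime_spend_usd: float,
    hard_cap_usd: float = HARD_CAP_USD,
) -> Dict[str, object]:
    """Decide whether the iteration may submit LLM calls."""
    if flag_paused:
        return {"paused": True, "reason": "flag_set"}
    if iteration_spend_usd >= PAUSE_FRACTION * iteration_budget_usd:
        return {"paused": True, "reason": "iteration_budget"}
    if lifetime_spend_usd >= PAUSE_FRACTION * hard_cap_usd:
        return {"paused": True, "reason": "hard_cap"}
    return {"paused": False, "reason": "ok"}


print(budget_check(False, 950.0, 1_000.0, 4_000.0)["paused"])   # True
print(budget_check(False, 500.0, 1_000.0, 4_000.0)["paused"])   # False
```

Because the check runs once at iteration start, it bounds spend at iteration boundaries only.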

7.2 Shipped: Cloud Billing alert pathway

In the auto-pause control flow, soft alerts (< 90 %) emit operator notifications without pausing; hard alerts (≥ 90 %) set the Firestore flag.

The Cloud Billing alert pathway has not been driven end-to-end. Configuring Cloud Billing budgets requires roles/billing.user, which the project's operator account currently lacks. The deferral is recorded in the shakedown ledger; the drill is on the Phase-2 hardening backlog (R5 ⏳ #1).

What this means in practice: until the role is granted, the cost-ledger gate (§7.1) is the binding safety net. It catches every iteration-boundary case and is exercised by every smoke. What it does not catch is mid-iteration runaway — a single shard going hot inside an already-started iteration, blowing the per-iteration budget before check_budget_pre runs again. The Cloud Billing pathway exists specifically to close that gap.

Operator override

Top up the budget first, then clear the pause flag: gcloud firestore documents delete calibration_runtime/budget_state. The next iteration re-runs the spend check with fresh data.

8. Iteration namespace + cohort-keyed profile registry (calibration pipeline)

Two post-shakedown disciplines that prevent Phase-1 smoke from colliding with Phase-2 production calibration in BigQuery, and prevent smoke configs from silently inheriting into production runs.

Iteration namespace

Iteration range	Use	Enforced where
0–9	Phase-1 smoke + ad-hoc debugging	`scripts/forge_persona_library.py` validators; workflow input validators
10+	Phase-2 production calibration	Same validators

Every BigQuery row in sessions, cost_ledger, iteration_summary, calibration_metrics, and invariance is keyed by iteration as part of its MERGE clause. If a Phase-1 retry tried to MERGE into a row keyed at iteration 7, and Phase-2 happened to have a production row keyed at iteration 7, the MERGE would silently update production data with smoke data. The convention closes the class of bug; the validators enforce it; the local QA harness test_workflow_safety.py checks that workflow inputs honour it.
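
A validator for the namespace convention might look like the sketch below. The phase labels and the function name are assumptions made for illustration; the actual validators in scripts/forge_persona_library.py and the workflow inputs are not reproduced here.

```python
SMOKE_MAX_ITERATION = 9        # 0-9: Phase-1 smoke and ad-hoc debugging
PRODUCTION_MIN_ITERATION = 10  # 10+: Phase-2 production calibration


def validate_iteration(iteration: int, phase: str) -> int:
    """Return the iteration if it sits in the phase's namespace, else raise."""
    if phase == "phase1_smoke":
        allowed = 0 <= iteration <= SMOKE_MAX_ITERATION
    elif phase == "phase2_production":
        allowed = iteration >= PRODUCTION_MIN_ITERATION
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    if not allowed:
        raise ValueError(f"iteration {iteration} is outside the {phase} namespace")
    return iteration


print(validate_iteration(7, "phase1_smoke"))        # 7
print(validate_iteration(12, "phase2_production"))  # 12
```

With disjoint ranges, a smoke retry and a production run can never share an iteration value, so their MERGE keys cannot collide.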

Cohort-keyed profile registry

functions/persona_factory/calibration_profile.py exposes get_profile(cohort: str). Phase-1 smoke uses an explicit cohort named phase1_smoke with permissive defaults — reduced thinking budget, abbreviated scenario library, looser convergence gates. Phase-2 cohorts must be explicitly registered by name. An unregistered cohort raises ValueError rather than silently falling back to defaults.

This closes a class of bug we had not anticipated before shakedown: a Phase-1 smoke configuration silently inheriting into a Phase-2 production calibration run via default-argument fall-through. The smoke had a reduced thinking budget (T1 = 2048, T2/T3 = 1024 — tuned down from production); the default fallback would have shipped that thinking budget into a production iteration, which would have produced low-quality LLM outputs and invalidated the calibration fit.

The fix is to make defaults raise. Every workflow input that drives a calibration parameter set carries a cohort field; get_profile(cohort) either returns the registered profile or raises. There is no silent fallback to "the default smoke config." Smoke configs are only reachable by passing cohort="phase1_smoke" explicitly.
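
A registry with no fallback can be sketched as follows. Only `get_profile(cohort: str)`, the `phase1_smoke` cohort name, the ValueError, and the smoke thinking budgets come from the text above; the dataclass fields, `register_profile`, and the registry variable are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CalibrationProfile:
    thinking_budget_t1: int
    thinking_budget_t2_t3: int


# Smoke profile uses the reduced thinking budgets quoted above.
_REGISTRY: Dict[str, CalibrationProfile] = {
    "phase1_smoke": CalibrationProfile(thinking_budget_t1=2048, thinking_budget_t2_t3=1024),
}


def register_profile(cohort: str, profile: CalibrationProfile) -> None:
    """Register a cohort explicitly; re-registration is rejected."""
    if cohort in _REGISTRY:
        raise ValueError(f"cohort already registered: {cohort!r}")
    _REGISTRY[cohort] = profile


def get_profile(cohort: str) -> CalibrationProfile:
    """Return the registered profile or raise; there is no default."""
    try:
        return _REGISTRY[cohort]
    except KeyError:
        raise ValueError(f"unregistered cohort: {cohort!r}") from None


print(get_profile("phase1_smoke").thinking_budget_t1)  # 2048
```

Note that `cohort` has no default value in the signature, so a caller cannot reach the smoke profile by omission.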

The R1 dataset design report (§8.1) carries the methodology framing for this discipline; R3 §3.2 carries the operational narrative; R7 §13.1 logs it as an OK post-shakedown addition that closes one of the original audit's flagged gaps.

What is observable

Every operation above produces an observable trace:

Surface	What you can read
Cloud Logging (per project)	Cloud Run service stdout/stderr; Cloud Run Job stdout/stderr; Custom Job stdout/stderr; Workflow execution log per step.
Aggregated org log sink → `neumatics-audit-logs` BigQuery	Every Data Access log on Firestore / AlloyDB / BigQuery / Secret Manager / KMS. Read-only to humans. 7y retention.
Cloud Monitoring	Custom metrics under `corpus.*`, `calibration.*`, `alloydb.*` namespaces. `alloydb_pause_decision` for cost-control.
Firestore `system_status/alloydb_prod`	Live AlloyDB cluster state + pin reason + last heartbeat.
Firestore `factory_loop_runs/{workflow_exec_id}/shards/{shard_index}`	Per-shard heartbeat with timestamps + status.
Firestore `calibration_runtime/budget_state`	Current pause flag + reason + percent + setter.
BigQuery `nexus_calibration_corpus.iteration_summary`	One row per (iteration, component) with spend, sessions, gate verdicts, calibration deltas.
BigQuery `nexus_calibration_corpus.cost_ledger`	Per-Vertex-call cost line with `iteration` / `component` / `cohort` labels.
`gcloud workflows executions list iteration_runner --location=europe-west4`	Every iteration's execution log + status.
`scripts/nexus-alloydb.ps1 status`	One-liner cluster state.

The combination is what makes the operational discipline verifiable end-to-end: every dollar in cost_ledger is attributed to an iteration × component × cohort triple; every iteration's outcome lands in iteration_summary; every operator action that clears a pause flag leaves a Firestore audit trail; every AlloyDB resume / pause / pin is in the controller's structured logs and surfaces to Cloud Monitoring.

The Gate-10 measurement-invariance check that lands in invariance follows the Cheung & Rensvold ΔCFI ≤ 0.01 threshold; methodology details are in /about/science/methodology and the gated reports tier (available on request).

What is not drilled yet

To be honest about the surface:

Auto-pause end-to-end — Cloud Billing pathway. ⏳ blocked on roles/billing.user.
Mid-shard worker crash with CHAOS_FAULT injection — chaos image not built. ⏳ Phase-2 hardening sprint.
Vertex 429 backoff under live quota pressure — _RateLimiter fully built, but shakedown never produced a 429. ⏳ Phase-2 corpus may incidentally exercise.
Idempotency stress with six concurrent re-fires of the same (iteration, shard_index) — smokes #11 and #16 are supportive but not equivalent. ⏳ Phase-2 hardening sprint.
Knowledge Catalog tag-coverage report — depends on a Knowledge Catalog tagging baseline that is ⏳ not yet applied across the BigQuery + AlloyDB + Firestore surfaces.
Firebase Stream-Firestore-to-BQ extension — replaces the legacy nightly Firestore-to-BQ export. ⏳ extension installation pending; docs/operations/firestore_extension.md is the runbook.
Workflows: erasure-cascade.yaml, calibration-promote.yaml, cohort-freeze.yaml — ⏳ reserved real-estate per F-3.9 / S-4.8; not yet implemented.

The reviewer-grade R5 robustness report (available on request) carries the full per-drill execution status with infrastructure citations and the Phase-2 hardening checklist. None of the calibration-pipeline gaps blocks the next hardening phase; the platform-level gaps are scheduled work as the foundation refactor's later stages land.