Operations
Eight operational disciplines turn the architecture into a system reviewers can trust. Five of them are platform-wide and arrived with the Foundation v1 refactor: AlloyDB pause-by-default, the per-route BACKEND_ROUTING cutover, the PAM-mediator service that injects per-grant elevation delays, the audit-alerter Cloud Run job, and the EU residency boundary. Three are calibration-pipeline disciplines that pre-dated the foundation and ride along: idempotency under retry, the iteration namespace convention, and the cohort-keyed calibration profile registry.
This page is the unvarnished status of each. Where a discipline has not yet been drilled end-to-end, it says so explicitly with a ⏳.
1. AlloyDB pause-by-default — the cost-control plane
Both AlloyDB clusters (prod live; staging deferred) are paused by default. They resume on demand when an authorised request needs them, stay awake while load continues, and pause again after a configurable idle window (default 15 min). The 3–5 minute cold-start latency is an explicit trade-off: at our pre-launch scale (10 test users, no live external traffic), always-on AlloyDB at ~€450/mo is unjustified; pause-by-default lands at ~€95–100/mo combined and re-tightens to always-on automatically when real traffic continuously fires the heartbeat. Foundation v1 §F-3.13 commits this; the implementation lives at services/nexus-alloydb-controller/, services/nexus-alloydb-auto-pause/, and src/lib/alloydb-warming.ts.
Five components
- nexus-alloydb-controller Cloud Run service. Single source of truth for cluster state. Holds a Firestore document per cluster (system_status/alloydb_{prod,staging}) with fields state (PAUSED | RESUMING | READY | PAUSING), last_resume_at, last_heartbeat_at, pause_after_idle_sec (default 900), pinned_until, pin_reason. Endpoints: POST /resume, POST /pause, POST /heartbeat, GET /status, POST /pin {duration, reason}, POST /unpin. The service runs paused-by-default itself (min-instances=0; no resources held when not invoked).
- nexus-alloydb-auto-pause Cloud Run Job. Triggered by Cloud Scheduler every 2 minutes. Per cluster: if pinned_until > now, skip; if now - last_heartbeat_at > pause_after_idle_sec AND state == READY, pause. Logs every decision; emits an alloydb_pause_decision Cloud Monitoring metric for the cost dashboard.
- Wake-up middleware library in every AlloyDB-touching Cloud Run service. Two modes, chosen per endpoint:
  - Block-and-wait for background / admin / batch endpoints: call /resume, poll status every 10 s, proceed when READY (up to 5 min).
  - Fail-fast 503 for user-facing synchronous endpoints: if state != READY, return 503 with Retry-After: 180 header and {error: "warming_up", eta_seconds: N} body; trigger /resume async so the next retry succeeds.

  Every successful DB transaction posts an async heartbeat: one line in the data-access path, no perf impact.
- Frontend warming UX in src/lib/alloydb-warming.ts (208 LOC). Central handler for 503 + warming_up: toast "Database is warming up — this takes ~3 minutes. Refreshing automatically when ready."; polls /api/system/alloydb every 15 s; auto-retries the original action when READY. Initial page load fires a "warm" probe in layout.tsx so a user clicking through finds the DB ready by the time they reach an AlloyDB-backed view.
- Operator CLI at scripts/nexus-alloydb.ps1 and scripts/nexus-alloydb-pin.ps1. Single-purpose commands: wake [staging|prod], pin [staging|prod] --duration 4h --reason active-dev, sleep [staging|prod], status. Used by the operator at session start and by the smoke-test runner.
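The auto-pause rule described for nexus-alloydb-auto-pause can be sketched as a pure function. This is a minimal Python illustration, not the deployed job: the ClusterStatus class and the return labels ("skip_pinned", "pause", "keep") are hypothetical names, and only the field names and the 900-second default come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterStatus:
    """Subset of the fields listed for system_status/alloydb_{prod,staging}."""
    state: str                         # PAUSED | RESUMING | READY | PAUSING
    last_heartbeat_at: datetime
    pause_after_idle_sec: int = 900    # default idle window (15 min)
    pinned_until: Optional[datetime] = None


def pause_decision(status: ClusterStatus, now: datetime) -> str:
    """Return "skip_pinned", "pause", or "keep" for one cluster."""
    # A pin that has not expired always wins over the idle check.
    if status.pinned_until is not None and status.pinned_until > now:
        return "skip_pinned"
    idle_sec = (now - status.last_heartbeat_at).total_seconds()
    # Pause only a READY cluster whose last heartbeat is older than the window.
    if status.state == "READY" and idle_sec > status.pause_after_idle_sec:
        return "pause"
    return "keep"


now = datetime(2026, 5, 10, 12, 0, tzinfo=timezone.utc)
idle = ClusterStatus(state="READY", last_heartbeat_at=now - timedelta(minutes=20))
pinned = ClusterStatus(
    state="READY",
    last_heartbeat_at=now - timedelta(minutes=20),
    pinned_until=now + timedelta(hours=4),
)
busy = ClusterStatus(state="READY", last_heartbeat_at=now - timedelta(minutes=2))
```

Keeping the decision free of I/O means the same function can be exercised in a unit test and logged verbatim as the per-cluster decision.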
Operator pinning patterns
| Situation | Recommended pin |
|---|---|
| Active development session | pin staging --duration 4h --reason active-dev |
| Smoke test run | Auto-pinned 30 min by the smoke-test runner |
| Funding-review window | pin prod --duration 8h --reason funding-review (business hours of each review day) |
| Phase-2 calibration corpus run | pin prod --duration 7d --reason phase2-corpus |
| Otherwise | Nothing pinned; auto-pause after 15 min idle |
The full operator playbook lives at docs/operations/alloydb_lifecycle.md.
Edge-case discipline (built in by-design, not retrofitted)
- Datastream CDC during pause. Datastream replication slot stays valid; catches up automatically on resume. AlloyDB's max safe pause is ~7 days; a "max-idle-pause" safety override re-resumes briefly every 5 days to keep the slot fresh.
- Connection-pool reconnect. Every service uses pgbouncer in transaction-pooling mode with server_reset_query; first query after resume retries-once on connection_reset.
- Scheduled jobs that need DB. Every scheduled job starts with a block-and-wait /resume call. Schedule includes a 5-minute wake-buffer.
- Catastrophic-fail safety. If /resume fails 3 times in 10 min, the controller emits a Pub/Sub alert AND falls back to "stay-awake" mode for 1 hr to prevent cascading outage. Better to overspend €0.50 than to drop traffic.
- Concurrent resume calls. Controller serialises resume / pause requests per cluster (Firestore transaction on the state doc); a flood of simultaneous calls from 10 services results in exactly one resume.
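The catastrophic-fail rule (3 failed /resume calls in 10 minutes, then stay awake for 1 hour) is a sliding-window counter. The sketch below is one way to express it in Python under stated assumptions: ResumeFailureGuard and its method names are hypothetical, timestamps are plain epoch seconds, and publishing the Pub/Sub alert is left to the caller.

```python
from collections import deque
from typing import Deque, Optional


class ResumeFailureGuard:
    """Tracks /resume failures in a sliding window (times are epoch seconds)."""

    def __init__(self, max_failures: int = 3, window_sec: int = 600,
                 stay_awake_sec: int = 3600) -> None:
        self.max_failures = max_failures
        self.window_sec = window_sec
        self.stay_awake_sec = stay_awake_sec
        self._failures: Deque[float] = deque()
        self.stay_awake_until: Optional[float] = None

    def record_failure(self, now: float) -> bool:
        """Record one failure; return True when the fallback should trigger."""
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_sec:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self.stay_awake_until = now + self.stay_awake_sec
            self._failures.clear()
            return True  # the caller would also publish the alert here
        return False

    def pause_allowed(self, now: float) -> bool:
        """Auto-pause is suppressed while stay-awake mode is active."""
        return self.stay_awake_until is None or now >= self.stay_awake_until
```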
Drilled evidence
The auto-pause loop has demonstrated successful idle-pause and successful resume cycles in prod (the only live cluster as of 2026-05-10). The fail-fast-503 → frontend retry → success path has been integration-tested against the prod cluster via scripts/nexus-alloydb.ps1 sleep followed by an AlloyDB-backed page load. Staging-mirror drill is ⏳ deferred until the staging project is provisioned.
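The fail-fast path exercised by that drill reduces to a small guard in front of a user-facing handler. The following is a Python sketch only: fail_fast_guard, the tuple-shaped response, and the trigger_resume callback are hypothetical, while the status code, the Retry-After value, and the body keys follow the component description above.

```python
from typing import Callable, Dict, Optional, Tuple

# (status code, headers, JSON body); a stand-in for a framework response object.
Response = Tuple[int, Dict[str, str], Dict[str, object]]


def fail_fast_guard(state: str, trigger_resume: Callable[[], None],
                    eta_seconds: int = 180) -> Optional[Response]:
    """Return a 503 warming response when the cluster is not READY, else None."""
    if state == "READY":
        return None  # the handler proceeds to the database as normal
    # A real service would fire this asynchronously; it is synchronous here.
    trigger_resume()
    headers = {"Retry-After": "180"}
    body = {"error": "warming_up", "eta_seconds": eta_seconds}
    return 503, headers, body
```

A handler would call the guard first and return its result when it is not None, which is what lets the frontend poll and retry.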
2. Per-route backend cutover — BACKEND_ROUTING
The Foundation v1 compute commitment (§F-3.4) is to decompose the 1,461-line functions/main.py monolith into per-domain Cloud Run services. The cutover is per-route and per-deploy, not big-bang. The single source of truth lives in src/lib/api-helpers.ts as the BACKEND_ROUTING map: keys are <METHOD> <path> exactly as the Next.js route is shaped; values are "new" (per-domain Cloud Run service in neumatics-prod, europe-west4) or "legacy" (historical: the old-org Cloud Functions, decommissioned 2026-06; all routes are now "new").
Frontend React → /api/<route> → Next.js handler → callBackend(...)
│
┌──────────────────────────┴──────────────────┐
│ │
BACKEND_ROUTING[key] = "new" BACKEND_ROUTING[key] = "legacy"
│ │
API_BASE_URL_NEW API_BASE_URL_LEGACY
(per-service Cloud Run (Cloud Functions URLs;
URL on neumatics-prod) decommissioned)
Every callBackend response is asserted against the X-Backend-Origin header — new- prefix for new-infra, legacy- prefix for legacy. Mismatches log to the server console so routing bugs surface loudly instead of silently corrupting data.
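The origin assertion amounts to a prefix check on one response header. The real check lives in the TypeScript helper; the Python sketch below only illustrates the rule, and origin_matches and the logger name are hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("backend_routing")


def origin_matches(expected_backend: str, origin_header: Optional[str]) -> bool:
    """True when X-Backend-Origin carries the prefix of the routed backend."""
    if expected_backend not in ("new", "legacy"):
        raise ValueError(f"unknown backend: {expected_backend!r}")
    ok = origin_header is not None and origin_header.startswith(expected_backend + "-")
    if not ok:
        # Log loudly rather than fail silently, as the routing contract requires.
        logger.error("X-Backend-Origin mismatch: expected %s- prefix, got %r",
                     expected_backend, origin_header)
    return ok
```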
Per-route status (as of 2026-05-10)
| Route | Backend | Service / function |
|---|---|---|
| POST /api/consent/subject | new | nexus-consent |
| GET /api/consent/subject | new | nexus-consent |
| GET /api/consent/catalog | new | nexus-consent |
| POST /api/budget/poll | new | nexus-budget |
| GET /api/audit/events | new | nexus-audit-alerter |
| POST /api/translate | new | nexus-translate |
| GET /api/vocabulary | new | nexus-vocabulary |
| POST /api/embedding | new | embedding |
| GET /api/feature-registry | new | nexus-feature-registry |
| POST /api/echo/start | new | nexus-echo + Vertex Echo Reasoning Engine |
| POST /api/echo/turn | new | nexus-echo |
| GET /api/profile/current | new | nexus-analyst /v1/subject_current (AlloyDB) |
| POST /api/profile/film | new | nexus-analyst → nexus-substrate-api /v1/query |
| POST /api/cohort/drift | new | nexus-analyst → nexus-substrate-api /v1/query |
| POST /api/echo/discard | legacy | legacy session cleanup |
| POST /api/echo/pause | legacy | |
| POST /api/echo/resume | legacy | |
| POST /api/marketplace/quote | legacy | out of demo scope |
| POST /api/wallet/payment | legacy | out of demo scope |
| POST /api/compensation/score | legacy | |
| POST /api/analyst/query | legacy | |
Every new endpoint MUST come with a BACKEND_ROUTING entry; silent fall-through to legacy is exactly the bug the constant is supposed to prevent (routeBackend() throws on an unrouted key). Flipping a route from "legacy" to "new" requires no apphosting redeploy if both env vars (API_BASE_URL_NEW and API_BASE_URL_LEGACY) are already set. The per-service graduation gate is iteration-B + 24-hour dual-stack soak per docs/operations/apphosting_cutover.md.
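The throw-on-unrouted-key behaviour is easy to picture with a lookup that has no default. The real map and routeBackend() are TypeScript in src/lib/api-helpers.ts; this Python sketch uses a three-entry subset of the table above, and route_backend is a hypothetical stand-in.

```python
from typing import Dict

# Illustrative subset only; the authoritative map is in src/lib/api-helpers.ts.
BACKEND_ROUTING: Dict[str, str] = {
    "POST /api/consent/subject": "new",
    "GET /api/consent/catalog": "new",
    "POST /api/echo/discard": "legacy",
}


def route_backend(method: str, path: str) -> str:
    """Return "new" or "legacy" for a route, raising on any unrouted key."""
    key = f"{method.upper()} {path}"
    try:
        return BACKEND_ROUTING[key]
    except KeyError:
        # No default branch: an unrouted endpoint must fail, never fall through.
        raise KeyError(f"no BACKEND_ROUTING entry for {key!r}") from None
```

The design point is the missing default: a dictionary get with a "legacy" fallback would reintroduce the silent fall-through.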
One known caveat — Vertex AI Agent Engine
The three Vertex Reasoning Engine resource IDs in apphosting.yaml (NEXUS_REASONING_AGENT_RESOURCE_NAME, NEXUS_ECHO_AGENT_RESOURCE_NAME, NEXUS_SYNTHESIS_AGENT_RESOURCE_NAME) point at the europe-west4 deployment inside neumatics-prod (✅ redeployed at the 2026-06 org migration); the legacy europe-west1 engines are gone with the old org.
3. PAM-mediator — per-grant elevation delays
Foundation v1 §S-4.10 + OD-19 commit to a Privileged Access Manager (PAM) policy with per-grant delays before activation: deployer 5 min, DBA 30 min, KMS-admin 30 min + mandatory justification + per-action notification email. Native PAM does not document a per-grant delay knob; the spec resolves this with a thin Cloud Run mediator that intermediates between the elevation request and the actual PAM grant.
The mediator is deployed at services/nexus-pam-mediator/ (~318 LOC of Terraform plus the service code). The flow:
- The operator (or Claude Code, or any tool) requests an elevation through the mediator endpoint.
- The mediator records the request to audit, sleeps the configured delay (5 / 30 / 30 min depending on the grant class), then issues the actual PAM grant request via the PAM API.
- The operator receives the activation when the delay completes.
- For KMS-admin: a per-action notification email fires inside the delay window, so the operator can cancel the request if they didn't intend it.
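The flow above can be sketched as a single function with the delay, the grant call, and the notification injected. This is a minimal Python illustration under assumptions: mediate_elevation, the grant-class keys, and the callback signatures are hypothetical, cancellation during the delay is omitted, and the PAM API call is represented by the issue_pam_grant callback rather than a real client.

```python
import time
from typing import Callable, Dict, List, Optional

# Per-class delays in seconds, from the policy stated above (5 / 30 / 30 min).
GRANT_DELAY_SEC: Dict[str, int] = {
    "deployer": 5 * 60,
    "dba": 30 * 60,
    "kms-admin": 30 * 60,
}


def mediate_elevation(
    grant_class: str,
    requester: str,
    justification: str,
    audit_log: List[dict],
    issue_pam_grant: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
    notify: Optional[Callable[[str, str], None]] = None,
) -> None:
    """Record the request, wait out the class delay, then issue the grant."""
    if grant_class not in GRANT_DELAY_SEC:
        raise ValueError(f"unknown grant class: {grant_class!r}")
    if grant_class == "kms-admin" and not justification.strip():
        raise ValueError("kms-admin elevation requires a justification")
    audit_log.append({"grant_class": grant_class, "requester": requester,
                      "justification": justification})
    if grant_class == "kms-admin" and notify is not None:
        notify(requester, grant_class)  # sent before the delay elapses
    sleep(GRANT_DELAY_SEC[grant_class])
    issue_pam_grant(grant_class, requester)
```

Injecting sleep keeps the sketch testable without waiting 30 minutes; a deployed mediator would more plausibly schedule the grant than hold a request open.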
The PAM entitlements themselves (per-class grant scopes, justification requirements, audit-log routing) are configured outside Terraform via the GCP console; the mediator service exists to bridge the spec's delay commitment to PAM's missing knob. The elevation flow is documented in full at docs/security/access_patterns.md and the Claude-Code-specific snippet at docs/security/claude_code_access.md.
4. Audit alerter — curated security event routing
Foundation v1 §S-4.5 + S-4.9 commit to a curated event list that routes to Pub/Sub topic nexus-security-alert.v1, fanned out to the operator. The operator does not read raw audit logs; the event list is what surfaces. Most weeks none fire.
The nexus-audit-alerter Cloud Run job is deployed at services/nexus-audit-alerter/ and scheduled by Cloud Scheduler. It subscribes to the Pub/Sub topic and processes the following event types out of the aggregated org-level audit sink in neumatics-audit-logs:
- Service-account-key creation attempts (denied by iam.disableServiceAccountKeyCreation org policy, but log the attempt).
- KMS key access from a service account not on the expected list.
- VPC Service Controls perimeter violations.
- IAM grants to principals outside the neumatics.eu domain.
- setIamPolicy on neumatics-prod.
- bigquery.dataAccess on tables tagged art9_status:true|inferred ⏳ (Knowledge Catalog tagging baseline not yet applied).
- storage.buckets.setIamPolicy on the audit-log archive bucket.
- cloudkms.cryptoKeyVersions.destroy outside the rotation flow.
- Failed Binary Authorization deploys (more than 3 per day → alert).
- Severity-HIGH or CRITICAL Security Command Center findings ⏳ (SCC Premium not yet enabled).
- Datastream replication slot lag > 30 minutes (data-loss risk).
- Apphosting deploy failures (operational, not strictly security).
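Most entries in the list alert on a single occurrence; the Binary Authorization rule is the one threshold rule (more than 3 failures per day). A minimal Python sketch of that counting step, assuming the failures have already been extracted from the audit sink as calendar dates; days_over_threshold is a hypothetical name.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List


def days_over_threshold(failure_dates: Iterable[date],
                        max_per_day: int = 3) -> List[date]:
    """Return the days on which failed deploys exceeded the daily allowance."""
    counts = Counter(failure_dates)
    # Strictly more than max_per_day, matching "more than 3 per day".
    return sorted(day for day, n in counts.items() if n > max_per_day)
```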
Routes are read-only against the audit dataset; only the nexus-audit-readers workforce group can query the dataset directly. The full operator weekly / monthly / quarterly cadence is in docs/security/operator_playbook.md.
5. EU residency boundary
The hard line: every BigQuery dataset, Cloud Storage bucket, Cloud Run service, Cloud Run Job, Cloud Workflow, AlloyDB cluster, KMS keyring, and Vertex AI Custom Training Job runs in europe-west4 or in the EU BigQuery multi-region. Two documented exceptions:
- The Vertex publisher model gemini-3-flash-preview is /global/-only for the Phase-2 calibration corpus generation. The corpus is fully synthetic: every persona is sampled from the copula on gs://neumatics-prod-corpus/personas/library_v2_copula.jsonl, every scenario is from the locked library gs://neumatics-prod-corpus/scenarios/library_v1.jsonl. The Vertex Batch input that crosses the residency boundary contains only synthetic personas + synthetic scenarios + a structured JSON response schema: no personal data, no consumer text, no API keys. Smoke #8 surfaced the /global/-only constraint when a europe-west1-pinned submission failed with 400 location in model name doesn't match.
- Vertex AI Agent Engine resource IDs: ✅ redeployed to europe-west4 inside neumatics-prod at the 2026-06 org migration; apphosting.yaml carries the new IDs.
The org-policy bundle (infra/org-policies/) enforces gcp.resourceLocations in:europe-west4-locations with a BigQuery-EU exception. New resources outside this constraint cannot be created — the API call is denied org-wide.
6. Idempotency under retry (calibration pipeline)
Every operation in the calibration pipeline is keyed by a stable dedupe identifier; resubmission of the same key produces neither duplicate work nor duplicate spend.
| Operation | Dedupe key | Mechanism |
|---|---|---|
| Vertex AI Batch shard | (iteration, shard_index) | Resubmit produces identical input JSONL + identical output GCS prefix; output overwrite is benign. |
| BigQuery sessions row | (iteration, persona_id, session_id, attempt) | Per-shard staging table → MERGE on key → drop staging. |
| BigQuery cost_ledger row | vertex_call_uid | Per-shard staging → MERGE on Vertex's unique request ID → drop staging. |
| Cloud Run Job invocation | Cloud Run Job execution name | Job-level dedup via Workflow execution ID. |
| Vertex Custom Training | displayName = calibration-iter{N}-fam{F} | Re-submission picks up where a failed job left off. |
| Workflow execution | workflow_execution_id | Logs / heartbeats / Firestore docs all keyed off it. |
The mechanism on the BigQuery side: the shard worker writes a per-shard JSONL file to GCS, runs a BigQuery load job into a per-shard staging table, runs a MERGE from staging into the canonical table on the dedupe key, drops staging. Duplicate rows in staging — say, because the worker retried a partial batch — collapse into one canonical row at the MERGE step. Free up to 1,500 load jobs per day per project; we are three orders of magnitude under that limit.
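The collapse-on-key behaviour is easiest to see in an in-memory analogue. The Python sketch below is not the BigQuery MERGE itself: merge_staging is a hypothetical helper that models the canonical table as a dict keyed on the sessions dedupe key from the table above, with last-write-wins for duplicate staging rows.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Key = Tuple[object, ...]

# Dedupe key for the sessions table, as listed in the table above.
SESSIONS_KEY = ("iteration", "persona_id", "session_id", "attempt")


def merge_staging(canonical: Dict[Key, Row], staging: Iterable[Row],
                  key_fields: Tuple[str, ...] = SESSIONS_KEY) -> int:
    """Upsert staging rows into canonical by dedupe key; return rows inserted."""
    inserted = 0
    for row in staging:
        key = tuple(row[f] for f in key_fields)
        if key not in canonical:
            inserted += 1
        canonical[key] = row  # duplicates and retries overwrite the same slot
    return inserted
```

Running the same staging batch twice inserts nothing the second time, which is the property a retried shard relies on.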
Drilled evidence (idempotency)
Smoke #11 produced this drill incidentally. A Vertex Batch SUCCEEDED, the downstream MERGE failed on a STRING-vs-FLOAT64 schema drift, and the Cloud Run Job auto-retry kicked a fresh batch. The retry-and-MERGE path absorbed the duplicate work cleanly: zero duplicate cost_ledger rows, zero duplicate sessions rows, the cost being one extra Vertex Batch's worth of compute (~$25, attributed in the cost-engineering page).
Smoke #16 hit the same path with a different upstream cause — flatten_session_for_bq was writing a json.dumps()'d string into a JSON-typed column. After the fix, the worker retried; the MERGE absorbed the duplicates cleanly.
A deliberate idempotency stress drill — re-fire the same (iteration, shard_index) six times in parallel and assert one MERGE'd row per vertex_call_uid across all six attempts — has not been run. The infrastructure to drill it is in place. The drill is on the Phase-2 hardening backlog (R5 ⏳ #4).
7. Calibration auto-pause — two pathways
Two pathways guard against calibration runaway spend (in addition to the platform-wide AlloyDB pause-by-default and the per-project budget alerts described in cost engineering). One is live and exercised by every iteration; one is shipped infrastructure but not yet drilled end-to-end.
7.1 Live: cost-ledger-driven gate
workflows/iteration_runner.yaml step check_budget_pre calls Cloud Function budget_check at iteration start. The function:
- Reads the Firestore flag calibration_runtime/budget_state.paused.
- Queries cost_ledger for current iteration spend and lifetime spend.
- Returns {paused: true} if iteration spend ≥ 90 % of iteration_budget_usd or lifetime spend ≥ 90 % of HARD_CAP_USD (default $25,000).
If paused == true, the workflow short-circuits to pause_iteration without submitting LLM calls. This pathway is live. Every shakedown smoke called check_budget_pre. The cost-ledger query is the binding safety net.
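The gate's decision can be sketched as a pure function of the flag and the two spend figures. This is a Python illustration under assumptions: the signature and the reason field are hypothetical, the Firestore read and the cost_ledger query are replaced by plain arguments, and treating an already-set flag as an immediate pause is our reading of the first step rather than something the function's source is quoted as doing.

```python
from typing import Dict

HARD_CAP_USD = 25_000.0   # default lifetime cap stated above
THRESHOLD = 0.90          # pause at 90 % of either budget


def budget_check(flag_paused: bool, iteration_spend_usd: float,
                 iteration_budget_usd: float, lifetime_spend_usd: float,
                 hard_cap_usd: float = HARD_CAP_USD) -> Dict[str, object]:
    """Return {"paused": bool, "reason": ...} for the pre-iteration gate."""
    if flag_paused:
        return {"paused": True, "reason": "flag"}  # assumption: flag is honoured as-is
    if iteration_spend_usd >= THRESHOLD * iteration_budget_usd:
        return {"paused": True, "reason": "iteration_budget"}
    if lifetime_spend_usd >= THRESHOLD * hard_cap_usd:
        return {"paused": True, "reason": "hard_cap"}
    return {"paused": False, "reason": None}
```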
7.2 Shipped: Cloud Billing alert pathway
The Cloud Billing alert pathway has not been driven end-to-end. Configuring Cloud Billing budgets requires roles/billing.user, which the project's operator account currently lacks. The deferral is recorded in the shakedown ledger; the drill is on the Phase-2 hardening backlog (R5 ⏳ #1).
What this means in practice: until the role is granted, the cost-ledger gate (§7.1) is the binding safety net. It catches every iteration-boundary case and is exercised by every smoke. What it does not catch is mid-iteration runaway — a single shard going hot inside an already-started iteration, blowing the per-iteration budget before check_budget_pre runs again. The Cloud Billing pathway exists specifically to close that gap.
Operator override
gcloud firestore documents delete calibration_runtime/budget_state clears the pause flag; the next iteration re-runs the spend check with fresh data. Top up the budget first; then clear the flag.
8. Iteration namespace + cohort-keyed profile registry (calibration pipeline)
Two post-shakedown disciplines that prevent Phase-1 smoke from colliding with Phase-2 production calibration in BigQuery, and prevent smoke configs from silently inheriting into production runs.
Iteration namespace
| Iteration range | Use | Enforced where |
|---|---|---|
| 0–9 | Phase-1 smoke + ad-hoc debugging | scripts/forge_persona_library.py validators; workflow input validators |
| 10+ | Phase-2 production calibration | Same validators |
Every BigQuery row in sessions, cost_ledger, iteration_summary, calibration_metrics, and invariance is keyed by iteration as part of its MERGE clause. If a Phase-1 retry tried to MERGE into a row keyed at iteration 7, and Phase-2 happened to have a production row keyed at iteration 7, the MERGE would silently update production data with smoke data. The convention closes the class of bug; the validators enforce it; the local QA harness test_workflow_safety.py checks that workflow inputs honour it.
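A validator for the namespace convention can be very small. The Python sketch below is illustrative only: validate_iteration and the phase labels "phase1_smoke" and "phase2_production" are hypothetical, and the boundary (0 to 9 for smoke, 10 and up for production) comes from the table above.

```python
SMOKE_MAX_ITERATION = 9  # iterations 0-9 are reserved for Phase-1 smoke


def validate_iteration(iteration: int, phase: str) -> None:
    """Raise ValueError when an iteration number is outside its phase's range."""
    if iteration < 0:
        raise ValueError(f"iteration must be non-negative, got {iteration}")
    if phase == "phase1_smoke":
        if iteration > SMOKE_MAX_ITERATION:
            raise ValueError(f"smoke runs must use iterations 0-9, got {iteration}")
    elif phase == "phase2_production":
        if iteration <= SMOKE_MAX_ITERATION:
            raise ValueError(f"production runs must use iterations 10+, got {iteration}")
    else:
        raise ValueError(f"unknown phase: {phase!r}")
```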
Cohort-keyed profile registry
functions/persona_factory/calibration_profile.py exposes get_profile(cohort: str). Phase-1 smoke uses an explicit cohort named phase1_smoke with permissive defaults — reduced thinking budget, abbreviated scenario library, looser convergence gates. Phase-2 cohorts must be explicitly registered by name. An unregistered cohort raises ValueError rather than silently falling back to defaults.
This closes a class of bug we had not anticipated before shakedown: a Phase-1 smoke configuration silently inheriting into a Phase-2 production calibration run via default-argument fall-through. The smoke had a reduced thinking budget (T1 = 2048, T2/T3 = 1024 — tuned down from production); the default fallback would have shipped that thinking budget into a production iteration, which would have produced low-quality LLM outputs and invalidated the calibration fit.
The fix is to make defaults raise. Every workflow input that drives a calibration parameter set carries a cohort field; get_profile(cohort) either returns the registered profile or raises. There is no silent fallback to "the default smoke config." Smoke configs are only reachable by passing cohort="phase1_smoke" explicitly.
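A registry that raises on unknown cohorts might look like the following Python sketch. It is not the contents of calibration_profile.py: the CalibrationProfile fields, _REGISTRY, and register_profile are hypothetical, while get_profile(cohort), the phase1_smoke cohort name, the ValueError, and the smoke thinking budgets are taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CalibrationProfile:
    cohort: str
    thinking_budget: Dict[str, int]  # per-tier thinking budget


_REGISTRY: Dict[str, CalibrationProfile] = {
    # Smoke profile with the reduced budgets quoted above.
    "phase1_smoke": CalibrationProfile(
        "phase1_smoke", {"T1": 2048, "T2": 1024, "T3": 1024}
    ),
}


def register_profile(profile: CalibrationProfile) -> None:
    """Register a cohort explicitly; re-registering a name is an error."""
    if profile.cohort in _REGISTRY:
        raise ValueError(f"cohort already registered: {profile.cohort!r}")
    _REGISTRY[profile.cohort] = profile


def get_profile(cohort: str) -> CalibrationProfile:
    """Return the registered profile; no default argument, no fallback."""
    try:
        return _REGISTRY[cohort]
    except KeyError:
        raise ValueError(f"unregistered cohort: {cohort!r}") from None
```

Because cohort is a required positional argument, a caller cannot reach the smoke profile by omission; it has to name it.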
The R1 dataset design report (§8.1) carries the methodology framing for this discipline; R3 §3.2 carries the operational narrative; R7 §13.1 logs it as an OK post-shakedown addition that closes one of the original audit's flagged gaps.
What is observable
Every operation above produces an observable trace:
| Surface | What you can read |
|---|---|
| Cloud Logging (per project) | Cloud Run service stdout/stderr; Cloud Run Job stdout/stderr; Custom Job stdout/stderr; Workflow execution log per step. |
| Aggregated org log sink → neumatics-audit-logs BigQuery | Every Data Access log on Firestore / AlloyDB / BigQuery / Secret Manager / KMS. Read-only to humans. 7y retention. |
| Cloud Monitoring | Custom metrics under corpus.*, calibration.*, alloydb.* namespaces. alloydb_pause_decision for cost-control. |
| Firestore system_status/alloydb_{prod,staging} | Live AlloyDB cluster state + pin reason + last heartbeat. |
| Firestore factory_loop_runs/{workflow_exec_id}/shards/{shard_index} | Per-shard heartbeat with timestamps + status. |
| Firestore calibration_runtime/budget_state | Current pause flag + reason + percent + setter. |
| BigQuery nexus_calibration_corpus.iteration_summary | One row per (iteration, component) with spend, sessions, gate verdicts, calibration deltas. |
| BigQuery nexus_calibration_corpus.cost_ledger | Per-Vertex-call cost line with iteration / component / tier / cohort labels. |
| gcloud workflows executions list iteration_runner --location=europe-west4 | Every iteration's execution log + status. |
| scripts/nexus-alloydb.ps1 status | One-liner cluster state across both clusters. |
The combination is what makes the operational discipline verifiable end-to-end: every dollar in cost_ledger is attributed to an iteration × component × cohort triple; every iteration's outcome lands in iteration_summary; every operator action that clears a pause flag leaves a Firestore audit trail; every AlloyDB resume / pause / pin is in the controller's structured logs and surfaces to Cloud Monitoring.
The Gate-10 measurement-invariance check that lands in invariance follows the Cheung & Rensvold ΔCFI ≤ 0.01 threshold; methodology details are in /about/science/methodology and the gated reports tier.
What is not drilled yet
To be honest about the surface:
- Auto-pause end-to-end, Cloud Billing pathway. ⏳ blocked on roles/billing.user.
- Mid-shard worker crash with CHAOS_FAULT injection: chaos image not built. ⏳ Phase-2 hardening sprint.
- Vertex 429 backoff under live quota pressure: _RateLimiter fully built, but shakedown never produced a 429. ⏳ Phase-2 corpus may incidentally exercise.
- Idempotency stress with six concurrent re-fires of the same (iteration, shard_index): smokes #11 and #16 are supportive but not equivalent. ⏳ Phase-2 hardening sprint.
- Knowledge Catalog tag-coverage report: depends on a Knowledge Catalog tagging baseline that is ⏳ not yet applied across the BigQuery + AlloyDB + Firestore surfaces.
- AlloyDB pause/resume cycle on neumatics-staging: ⏳ deferred until the staging project is provisioned (billing-quota approval pending).
- Vertex Reasoning Engine redeploy in europe-west4: ⏳ still pointing at europe-west1 IDs in the legacy project.
- Firebase Stream-Firestore-to-BQ extension: replaces the legacy nightly Firestore-to-BQ export. ⏳ extension installation pending; docs/operations/firestore_extension.md is the runbook.
- Workflows erasure-cascade.yaml, calibration-promote.yaml, cohort-freeze.yaml: ⏳ reserved real-estate per F-3.9 / S-4.8; not yet implemented.
The robustness report carries the full per-drill execution status with infrastructure citations and the Phase-2 hardening checklist. None of the calibration-pipeline gaps blocks Phase-2 milestone-review approval; the platform-level gaps are scheduled work as the foundation refactor's later stages land.