Operations

Eight operational disciplines turn the architecture into a system reviewers can trust. Five of them are platform-wide and arrived with the Foundation v1 refactor: AlloyDB pause-by-default, the per-route BACKEND_ROUTING cutover, the PAM-mediator service that injects per-grant elevation delays, the audit-alerter Cloud Run job, and the EU residency boundary. Three are calibration-pipeline disciplines that pre-dated the foundation and ride along: idempotency under retry, calibration auto-pause, and the iteration namespace convention with its cohort-keyed calibration profile registry.

This page is the unvarnished status of each. Where a discipline has not yet been drilled end-to-end, it says so explicitly with a ⏳.


1. AlloyDB pause-by-default — the cost-control plane

Both AlloyDB clusters (prod live; staging deferred) are paused by default. They resume on demand when an authorised request needs them, stay awake while load continues, and pause again after a configurable idle window (default 15 min). The 3–5 minute cold-start latency is an explicit trade-off: at our pre-launch scale (10 test users, no live external traffic), always-on AlloyDB at ~€450/mo is unjustified; pause-by-default lands at ~€95–100/mo combined and re-tightens to always-on automatically when real traffic continuously fires the heartbeat. Foundation v1 §F-3.13 commits this; the implementation lives at services/nexus-alloydb-controller/, services/nexus-alloydb-auto-pause/, and src/lib/alloydb-warming.ts.

Five components

  1. nexus-alloydb-controller Cloud Run service. Single source of truth for cluster state. Holds a Firestore document per cluster (system_status/alloydb_{prod,staging}) with fields state (PAUSED | RESUMING | READY | PAUSING), last_resume_at, last_heartbeat_at, pause_after_idle_sec (default 900), pinned_until, pin_reason. Endpoints: POST /resume, POST /pause, POST /heartbeat, GET /status, POST /pin {duration, reason}, POST /unpin. The service runs paused-by-default itself (min-instances=0; no resources held when not invoked).
  2. nexus-alloydb-auto-pause Cloud Run Job. Triggered by Cloud Scheduler every 2 minutes. Per cluster: if pinned_until > now, skip; if now - last_heartbeat_at > pause_after_idle_sec AND state == READY, pause. Logs every decision; emits an alloydb_pause_decision Cloud Monitoring metric for the cost dashboard.
  3. Wake-up middleware library in every AlloyDB-touching Cloud Run service. Two modes — chosen per-endpoint:
    • Block-and-wait for background / admin / batch endpoints: call /resume, poll status every 10 s, proceed when READY (up to 5 min).
    • Fail-fast 503 for user-facing synchronous endpoints: if state != READY, return 503 with Retry-After: 180 header and {error: "warming_up", eta_seconds: N} body; trigger /resume async so the next retry succeeds.
     Every successful DB transaction posts an async heartbeat: one line in the data-access path, no perf impact.
  4. Frontend warming UX in src/lib/alloydb-warming.ts (208 LOC). Central handler for 503 + warming_up: toast "Database is warming up — this takes ~3 minutes. Refreshing automatically when ready."; polls /api/system/alloydb every 15 s; auto-retries the original action when READY. Initial page load fires a "warm" probe in layout.tsx so a user clicking through finds the DB ready by the time they reach an AlloyDB-backed view.
  5. Operator CLI at scripts/nexus-alloydb.ps1 and scripts/nexus-alloydb-pin.ps1. Single-purpose commands: wake [staging|prod], pin [staging|prod] --duration 4h --reason active-dev, sleep [staging|prod], status. Used by the operator at session start and by the smoke-test runner.
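
The auto-pause rule described for the nexus-alloydb-auto-pause job can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the deployed code: the ClusterStatus dataclass, the pause_decision name, and the three return labels are hypothetical; only the field names and the 900-second default come from the component list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterStatus:
    """Subset of the per-cluster fields the controller keeps (hypothetical mirror)."""
    state: str                              # PAUSED | RESUMING | READY | PAUSING
    last_heartbeat_at: datetime
    pause_after_idle_sec: int = 900         # default idle window from the text
    pinned_until: Optional[datetime] = None


def pause_decision(status: ClusterStatus, now: datetime) -> str:
    """Return 'skip_pinned', 'pause', or 'keep' for one cluster."""
    # A live pin always wins: the job skips the cluster entirely.
    if status.pinned_until is not None and status.pinned_until > now:
        return "skip_pinned"
    idle_sec = (now - status.last_heartbeat_at).total_seconds()
    # Pause only a READY cluster that has been idle longer than the window.
    if status.state == "READY" and idle_sec > status.pause_after_idle_sec:
        return "pause"
    return "keep"
```

Keeping the decision free of I/O means the same function can be unit-tested with fixed timestamps and then wrapped by whatever code reads the Firestore document and calls /pause.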

Operator pinning patterns

| Situation | Recommended pin |
|---|---|
| Active development session | pin staging --duration 4h --reason active-dev |
| Smoke test run | Auto-pinned 30 min by the smoke-test runner |
| Funding-review window | pin prod --duration 8h --reason funding-review (business hours of each review day) |
| Phase-2 calibration corpus run | pin prod --duration 7d --reason phase2-corpus |
| Otherwise | Nothing pinned; auto-pause after 15 min idle |

The full operator playbook lives at docs/operations/alloydb_lifecycle.md.
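
As an illustration of how a pin duration such as 4h or 7d could translate into the pinned_until timestamp, here is a small sketch. It is not the CLI's actual parser: the function names are hypothetical, and the "m" suffix for minutes is an assumption (the text only shows h and d values).

```python
import re
from datetime import datetime, timedelta, timezone

# Suffix -> timedelta keyword. "h" and "d" appear in the pinning patterns;
# "m" (minutes) is an assumed extension for the 30-minute smoke-test pin.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings like '4h', '7d', or '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def pinned_until(now: datetime, duration: str) -> datetime:
    """Timestamp until which the auto-pause job should skip the cluster."""
    return now + parse_duration(duration)
```

Rejecting anything that does not match the pattern keeps a typo such as "4 hours" from silently producing a zero-length pin.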

Edge-case discipline (built in by design, not retrofitted)

  • Datastream CDC during pause. Datastream replication slot stays valid; catches up automatically on resume. AlloyDB's max safe pause is ~7 days; a "max-idle-pause" safety override re-resumes briefly every 5 days to keep the slot fresh.
  • Connection-pool reconnect. Every service uses pgbouncer in transaction-pooling mode with server_reset_query; the first query after resume retries once on connection_reset.
  • Scheduled jobs that need DB. Every scheduled job starts with a block-and-wait /resume call. Schedule includes a 5-minute wake-buffer.
  • Catastrophic-fail safety. If /resume fails 3 times in 10 min, the controller emits a Pub/Sub alert AND falls back to "stay-awake" mode for 1 hr to prevent cascading outage. Better to overspend €0.50 than to drop traffic.
  • Concurrent resume calls. Controller serialises resume / pause requests per cluster (Firestore transaction on the state doc); a flood of simultaneous calls from 10 services results in exactly one resume.
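
The catastrophic-fail rule in the list above (three failed /resume calls within 10 minutes trips a one-hour stay-awake fallback) can be sketched as a sliding-window counter. This is an illustrative sketch only; the ResumeFailureGuard class and its method names are hypothetical, and the real controller also emits a Pub/Sub alert, which is omitted here.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque, Optional


class ResumeFailureGuard:
    """Tracks failed /resume attempts and trips a stay-awake window."""

    def __init__(self,
                 max_failures: int = 3,
                 window: timedelta = timedelta(minutes=10),
                 stay_awake: timedelta = timedelta(hours=1)) -> None:
        self.max_failures = max_failures
        self.window = window
        self.stay_awake = stay_awake
        self._failures: Deque[datetime] = deque()
        self.stay_awake_until: Optional[datetime] = None

    def record_failure(self, at: datetime) -> bool:
        """Record one failure; return True if stay-awake mode was just tripped."""
        self._failures.append(at)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and at - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self.stay_awake_until = at + self.stay_awake
            self._failures.clear()
            return True
        return False

    def pause_allowed(self, now: datetime) -> bool:
        """Auto-pause is suppressed while the stay-awake window is open."""
        return self.stay_awake_until is None or now >= self.stay_awake_until
```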

Drilled evidence

The auto-pause loop has demonstrated successful idle-pause and successful resume cycles in prod (the only live cluster as of 2026-05-10). The fail-fast-503 → frontend retry → success path has been integration-tested against the prod cluster via scripts/nexus-alloydb.ps1 sleep followed by an AlloyDB-backed page load. Staging-mirror drill is ⏳ deferred until the staging project is provisioned.
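
The fail-fast contract exercised by that drill can be sketched as follows. This is a minimal server-side illustration, assuming a framework-agnostic (status, headers, body) tuple; the function name and the Content-Type header are assumptions, while the 503 status, the Retry-After: 180 header, and the warming_up body shape come from the middleware description.

```python
import json
from typing import Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], str]


def fail_fast_response(state: str, eta_seconds: int) -> Optional[Response]:
    """Return a 503 'warming_up' response, or None when the cluster is READY.

    The caller is expected to trigger /resume asynchronously whenever a
    non-None response is returned, so the client's next retry can succeed.
    """
    if state == "READY":
        return None
    headers = {"Retry-After": "180", "Content-Type": "application/json"}
    body = json.dumps({"error": "warming_up", "eta_seconds": eta_seconds})
    return 503, headers, body
```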

The cost-control plane is structural, not optional. AlloyDB at €450/mo always-on with no live traffic would have cut runway in half. €95–100/mo with the same architecture is the correct pre-launch posture; the same code re-tightens to always-on automatically when real traffic justifies it.

2. Per-route backend cutover — BACKEND_ROUTING

The Foundation v1 compute commitment (§F-3.4) is to decompose the 1,461-line functions/main.py monolith into per-domain Cloud Run services. The cutover is per-route and per-deploy, not big-bang. The single source of truth lives in src/lib/api-helpers.ts as the BACKEND_ROUTING map: keys are <METHOD> <path> exactly as the Next.js route is shaped; values are "new" (per-domain Cloud Run service in neumatics-prod, europe-west4) or "legacy" (historical: the old-org Cloud Functions, decommissioned 2026-06, after which all routes are "new").

Frontend React → /api/<route> → Next.js handler → callBackend(...)
                                                     │
                          ┌──────────────────────────┴──────────────────┐
                          │                                             │
                   BACKEND_ROUTING[key] = "new"            BACKEND_ROUTING[key] = "legacy"
                          │                                             │
                   API_BASE_URL_NEW                          API_BASE_URL_LEGACY
                   (per-service Cloud Run                    (Cloud Functions URLs;
                    URL on neumatics-prod)                decommissioned)

Every callBackend response is asserted against the X-Backend-Origin header — new- prefix for new-infra, legacy- prefix for legacy. Mismatches log to the server console so routing bugs surface loudly instead of silently corrupting data.
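
The origin assertion can be illustrated with a short sketch. The real check lives in the TypeScript helper; this Python rendering is only a model of the logic, and the function name, logger name, and example header values are hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("backend_routing")

# Expected X-Backend-Origin prefix for each routing value.
_EXPECTED_PREFIX = {"new": "new-", "legacy": "legacy-"}


def check_backend_origin(route_key: str, routing: str,
                         origin_header: Optional[str]) -> bool:
    """Return True when the response header agrees with the routing map."""
    prefix = _EXPECTED_PREFIX[routing]
    ok = origin_header is not None and origin_header.startswith(prefix)
    if not ok:
        # Log loudly instead of raising, mirroring the behaviour in the text.
        logger.error("X-Backend-Origin mismatch for %s: expected prefix %r, got %r",
                     route_key, prefix, origin_header)
    return ok
```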

Per-route status (as of 2026-05-10)

| Route | Backend | Service / function |
|---|---|---|
| POST /api/consent/subject | new | nexus-consent |
| GET /api/consent/subject | new | nexus-consent |
| GET /api/consent/catalog | new | nexus-consent |
| POST /api/budget/poll | new | nexus-budget |
| GET /api/audit/events | new | nexus-audit-alerter |
| POST /api/translate | new | nexus-translate |
| GET /api/vocabulary | new | nexus-vocabulary |
| POST /api/embedding | new | embedding |
| GET /api/feature-registry | new | nexus-feature-registry |
| POST /api/echo/start | new | nexus-echo + Vertex Echo Reasoning Engine |
| POST /api/echo/turn | new | nexus-echo |
| GET /api/profile/current | new | nexus-analyst /v1/subject_current (AlloyDB) |
| POST /api/profile/film | new | nexus-analyst, nexus-substrate-api /v1/query |
| POST /api/cohort/drift | new | nexus-analyst, nexus-substrate-api /v1/query |
| POST /api/echo/discard | legacy | legacy session cleanup |
| POST /api/echo/pause | legacy | |
| POST /api/echo/resume | legacy | |
| POST /api/marketplace/quote | legacy | out of demo scope |
| POST /api/wallet/payment | legacy | out of demo scope |
| POST /api/compensation/score | legacy | |
| POST /api/analyst/query | legacy | |

Adding a new endpoint MUST add an entry; silent fall-through to legacy is exactly the bug the constant is supposed to prevent (routeBackend() throws on an unrouted key). Flipping a route from "legacy" to "new" requires no apphosting redeploy if both env vars are already set — the per-service graduation gate is iteration-B + 24-hour dual-stack soak per docs/operations/apphosting_cutover.md.
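
The throw-on-unrouted-key behaviour can be sketched like this. It is a Python model of the idea rather than the TypeScript routeBackend() itself: the route_backend name, the UnroutedKeyError class, and the three sample entries (taken from the status table) are for illustration only.

```python
from typing import Dict

# A small excerpt of the map; keys are "<METHOD> <path>".
BACKEND_ROUTING: Dict[str, str] = {
    "POST /api/consent/subject": "new",
    "GET /api/consent/catalog": "new",
    "POST /api/echo/discard": "legacy",
}


class UnroutedKeyError(KeyError):
    """Raised when a route has no explicit BACKEND_ROUTING entry."""


def route_backend(method: str, path: str,
                  routing: Dict[str, str] = BACKEND_ROUTING) -> str:
    """Resolve a route to 'new' or 'legacy'; never fall through silently."""
    key = f"{method.upper()} {path}"
    if key not in routing:
        raise UnroutedKeyError(f"no BACKEND_ROUTING entry for {key!r}")
    target = routing[key]
    if target not in ("new", "legacy"):
        raise ValueError(f"invalid routing value {target!r} for {key!r}")
    return target
```

Raising on a missing key is the design choice that matters: a default of "legacy" would make every forgotten entry look like a working route.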

One known caveat — Vertex AI Agent Engine

The three Vertex Reasoning Engine resource IDs in apphosting.yaml (NEXUS_REASONING_AGENT_RESOURCE_NAME, NEXUS_ECHO_AGENT_RESOURCE_NAME, NEXUS_SYNTHESIS_AGENT_RESOURCE_NAME) point at the europe-west4 deployment inside neumatics-prod. ✅ They were redeployed at the 2026-06 org migration; the legacy europe-west1 engines are gone with the old org.


3. PAM-mediator — per-grant elevation delays

Foundation v1 §S-4.10 + OD-19 commit to a Privileged Access Manager (PAM) policy with per-grant delays before activation: deployer 5 min, DBA 30 min, KMS-admin 30 min + mandatory justification + per-action notification email. Native PAM does not document a per-grant delay knob; the spec resolves this with a thin Cloud Run mediator that intermediates between the elevation request and the actual PAM grant.

The mediator is deployed at services/nexus-pam-mediator/ (~318 LOC of Terraform plus the service code). The flow:

  1. The operator (or Claude Code, or any tool) requests an elevation through the mediator endpoint.
  2. The mediator records the request to audit, sleeps the configured delay (5 / 30 / 30 min depending on the grant class), then issues the actual PAM grant request via the PAM API.
  3. The operator receives the activation when the delay completes.
  4. For KMS-admin: a per-action notification email fires inside the delay window, so the operator can cancel the request if they didn't intend it.
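
The per-class delay rule in the flow above can be sketched as a planning function. This is an illustration under assumptions, not the mediator's code: the class identifiers ("deployer", "dba", "kms-admin"), the ElevationPlan type, and the plan_elevation name are hypothetical; the delays and the KMS-admin justification and email requirements come from the spec summary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Delay before the mediator issues the real PAM grant, per grant class.
GRANT_DELAYS = {
    "deployer": timedelta(minutes=5),
    "dba": timedelta(minutes=30),
    "kms-admin": timedelta(minutes=30),
}
JUSTIFICATION_REQUIRED = frozenset({"kms-admin"})
NOTIFY_BY_EMAIL = frozenset({"kms-admin"})


@dataclass(frozen=True)
class ElevationPlan:
    grant_class: str
    requested_at: datetime
    activate_at: datetime
    notify_by_email: bool


def plan_elevation(grant_class: str, requested_at: datetime,
                   justification: str = "") -> ElevationPlan:
    """Validate a request and compute when the PAM grant may be issued."""
    if grant_class not in GRANT_DELAYS:
        raise ValueError(f"unknown grant class: {grant_class!r}")
    if grant_class in JUSTIFICATION_REQUIRED and not justification.strip():
        raise ValueError(f"{grant_class} elevation requires a justification")
    return ElevationPlan(
        grant_class=grant_class,
        requested_at=requested_at,
        activate_at=requested_at + GRANT_DELAYS[grant_class],
        notify_by_email=grant_class in NOTIFY_BY_EMAIL,
    )
```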

The PAM entitlements themselves (per-class grant scopes, justification requirements, audit-log routing) are configured outside Terraform via the GCP console; the mediator service exists to bridge the spec's delay commitment to PAM's missing knob. The elevation flow is documented in full at docs/security/access_patterns.md and the Claude-Code-specific snippet at docs/security/claude_code_access.md.


4. Audit alerter — curated security event routing

Foundation v1 §S-4.5 + S-4.9 commit to a curated event list that routes to Pub/Sub topic nexus-security-alert.v1, fanned out to the operator. The operator does not read raw audit logs; the event list is what surfaces. Most weeks none fire.

The nexus-audit-alerter Cloud Run job is deployed at services/nexus-audit-alerter/ and scheduled by Cloud Scheduler. It subscribes to the Pub/Sub topic and processes the following event types out of the aggregated org-level audit sink in neumatics-audit-logs:

  • Service-account-key creation attempts (denied by iam.disableServiceAccountKeyCreation org policy, but log the attempt).
  • KMS key access from a service account not on the expected list.
  • VPC Service Controls perimeter violations.
  • IAM grants to principals outside the neumatics.eu domain.
  • setIamPolicy on neumatics-prod.
  • bigquery.dataAccess on tables tagged art9_status:true|inferred ⏳ (Knowledge Catalog tagging baseline not yet applied).
  • storage.buckets.setIamPolicy on the audit-log archive bucket.
  • cloudkms.cryptoKeyVersions.destroy outside the rotation flow.
  • Failed Binary Authorization deploys (more than 3 per day → alert).
  • Severity-HIGH or CRITICAL Security Command Center findings ⏳ (SCC Premium not yet enabled).
  • Datastream replication slot lag > 30 minutes (data-loss risk).
  • Apphosting deploy failures (operational, not strictly security).
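
Two of the events in the list above carry numeric thresholds. A minimal sketch of those two rules, assuming the counts and lag values have already been extracted from the audit sink (the function names are hypothetical):

```python
from datetime import timedelta


def binauthz_alert(failed_deploys_today: int, threshold: int = 3) -> bool:
    """Alert only when failures exceed the threshold (more than 3 per day)."""
    return failed_deploys_today > threshold


def datastream_lag_alert(lag: timedelta,
                         limit: timedelta = timedelta(minutes=30)) -> bool:
    """Alert when replication slot lag is strictly greater than 30 minutes."""
    return lag > limit
```

Both comparisons are strict, matching the wording "more than 3 per day" and "lag > 30 minutes".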

Routes are read-only to the audit dataset; only the nexus-audit-readers workforce group can query the dataset directly. The full operator weekly / monthly / quarterly cadence is in docs/security/operator_playbook.md.


5. EU residency boundary

The hard line: every BigQuery dataset, Cloud Storage bucket, Cloud Run service, Cloud Run Job, Cloud Workflow, AlloyDB cluster, KMS keyring, and Vertex AI Custom Training Job runs in europe-west4 or in the EU BigQuery multi-region. Two documented exceptions:

  1. The Vertex publisher model gemini-3-flash-preview is /global/-only for the Phase-2 calibration corpus generation. The corpus is fully synthetic: every persona is sampled from the copula on gs://neumatics-prod-corpus/personas/library_v2_copula.jsonl, and every scenario is from the locked library gs://neumatics-prod-corpus/scenarios/library_v1.jsonl. The Vertex Batch input that crosses the residency boundary contains only synthetic personas + synthetic scenarios + a structured JSON response schema: no personal data, no consumer text, no API keys. Smoke #8 surfaced the /global/-only constraint when a europe-west1-pinned submission failed with 400 location in model name doesn't match.
  2. Vertex AI Agent Engine resource IDs — ✅ redeployed to europe-west4 inside neumatics-prod at the 2026-06 org migration; apphosting.yaml carries the new IDs.

The org-policy bundle (infra/org-policies/) enforces gcp.resourceLocations in:europe-west4-locations with a BigQuery-EU exception. New resources outside this constraint cannot be created — the API call is denied org-wide.
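
The residency rule can be modelled as a small predicate, for example in a pre-deploy lint. This is a sketch of the rule as stated on this page, not of how the org policy evaluates requests; the resource-type strings and the function name are hypothetical, and the documented gemini-3-flash-preview exception is deliberately not encoded.

```python
ALLOWED_REGION = "europe-west4"
BIGQUERY_MULTI_REGION = "EU"


def location_allowed(resource_type: str, location: str) -> bool:
    """True if the location satisfies the residency boundary.

    Every resource may live in europe-west4; only BigQuery datasets may
    additionally use the EU multi-region.
    """
    loc = location.strip()
    if loc.lower() == ALLOWED_REGION:
        return True
    return resource_type == "bigquery_dataset" and loc.upper() == BIGQUERY_MULTI_REGION
```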


6. Idempotency under retry (calibration pipeline)

Every operation in the calibration pipeline is keyed by a stable dedupe identifier; resubmitting the same key produces zero duplicate work and zero duplicate spend.

| Operation | Dedupe key | Mechanism |
|---|---|---|
| Vertex AI Batch shard | (iteration, shard_index) | Resubmit produces identical input JSONL + identical output GCS prefix; output overwrite is benign. |
| BigQuery sessions row | (iteration, persona_id, session_id, attempt) | Per-shard staging table → MERGE on key → drop staging. |
| BigQuery cost_ledger row | vertex_call_uid | Per-shard staging → MERGE on Vertex's unique request ID → drop staging. |
| Cloud Run Job invocation | Cloud Run Job execution name | Job-level dedup via Workflow execution ID. |
| Vertex Custom Training | displayName = calibration-iter{N}-fam{F} | Re-submission picks up where a failed job left off. |
| Workflow execution | workflow_execution_id | Logs / heartbeats / Firestore docs all keyed off it. |

The mechanism on the BigQuery side: the shard worker writes a per-shard JSONL file to GCS, runs a BigQuery load job into a per-shard staging table, runs a MERGE from staging into the canonical table on the dedupe key, then drops staging. Duplicate rows in staging (say, because the worker retried a partial batch) collapse into one canonical row at the MERGE step. Load jobs are free of charge, and BigQuery's quota is 1,500 load jobs per table per day; we are three orders of magnitude under that limit.
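
The intended outcome of the staging-then-MERGE step can be modelled in plain Python. This sketch only simulates the semantics (one canonical row per dedupe key, last staged row wins); it is not the SQL the worker runs, and the merge_on_key helper is hypothetical. In BigQuery itself the MERGE source generally has to be deduplicated on the key for this outcome to hold, which the sketch assumes.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Key = Tuple[object, ...]


def merge_on_key(canonical: Dict[Key, Row], staging: Iterable[Row],
                 key_fields: Tuple[str, ...]) -> Dict[Key, Row]:
    """Upsert staged rows into a copy of the canonical table, keyed by key_fields."""
    merged = dict(canonical)
    for row in staging:
        key = tuple(row[field] for field in key_fields)
        # Matched key: update in place. Unmatched key: insert.
        merged[key] = row
    return merged


SESSIONS_KEY = ("iteration", "persona_id", "session_id", "attempt")
staged = [
    {"iteration": 12, "persona_id": "p1", "session_id": "s1", "attempt": 1, "turns": 8},
    {"iteration": 12, "persona_id": "p1", "session_id": "s1", "attempt": 1, "turns": 8},
    {"iteration": 12, "persona_id": "p2", "session_id": "s9", "attempt": 1, "turns": 5},
]
once = merge_on_key({}, staged, SESSIONS_KEY)
twice = merge_on_key(once, staged, SESSIONS_KEY)  # a retried shard re-merges the same rows
```

Here the duplicate staged row and the full retry both leave the canonical table with two rows, which is the property the dedupe key is meant to guarantee.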

Drilled evidence (idempotency)

Smoke #11 produced this drill incidentally. A Vertex Batch SUCCEEDED, the downstream MERGE failed on a STRING-vs-FLOAT64 schema drift, and the Cloud Run Job auto-retry kicked a fresh batch. The retry-and-MERGE path absorbed the duplicate work cleanly: zero duplicate cost_ledger rows, zero duplicate sessions rows, the cost being one extra Vertex Batch's worth of compute (~$25, attributed in the cost-engineering page).

Smoke #16 hit the same path with a different upstream cause — flatten_session_for_bq was writing a json.dumps()'d string into a JSON-typed column. After the fix, the worker retried; the MERGE absorbed the duplicates cleanly.

A deliberate idempotency stress drill — re-fire the same (iteration, shard_index) six times in parallel and assert one MERGE'd row per vertex_call_uid across all six attempts — has not been run. The infrastructure to drill it is in place. The drill is on the Phase-2 hardening backlog (R5 ⏳ #4).


7. Calibration auto-pause — two pathways

Two pathways guard against calibration runaway spend (in addition to the platform-wide AlloyDB pause-by-default and the per-project budget alerts described in cost engineering). One is live and exercised by every iteration; one is shipped infrastructure but not yet drilled end-to-end.

7.1 Live: cost-ledger-driven gate

workflows/iteration_runner.yaml step check_budget_pre calls Cloud Function budget_check at iteration start. The function:

  1. Reads the Firestore flag calibration_runtime/budget_state.paused.
  2. Queries cost_ledger for current iteration spend and lifetime spend.
  3. Returns {paused: true} if iteration spend ≥ 90 % of iteration_budget_usd or lifetime spend ≥ 90 % of HARD_CAP_USD (default $25,000).

If paused == true, the workflow short-circuits to pause_iteration without submitting LLM calls. This pathway is live. Every shakedown smoke called check_budget_pre. The cost-ledger query is the binding safety net.
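
The gate's decision can be sketched as a pure function over the three inputs it reads. This is an illustration, not the budget_check Cloud Function: the parameter names and the "reason" field are hypothetical, and treating an already-set Firestore flag as an immediate pause is an assumption; the 90 % thresholds and the $25,000 default cap come from the steps above.

```python
from typing import Dict, Optional


def budget_check(flag_paused: bool,
                 iteration_spend_usd: float,
                 iteration_budget_usd: float,
                 lifetime_spend_usd: float,
                 hard_cap_usd: float = 25_000.0,
                 threshold: float = 0.90) -> Dict[str, Optional[object]]:
    """Decide whether the iteration must pause before any LLM call is submitted."""
    if flag_paused:
        return {"paused": True, "reason": "flag_set"}
    if iteration_spend_usd >= threshold * iteration_budget_usd:
        return {"paused": True, "reason": "iteration_budget"}
    if lifetime_spend_usd >= threshold * hard_cap_usd:
        return {"paused": True, "reason": "hard_cap"}
    return {"paused": False, "reason": None}
```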

7.2 Shipped: Cloud Billing alert pathway

Figure: auto-pause control flow. Soft alerts (< 90 %) emit operator notifications without pausing; hard alerts (≥ 90 %) set the Firestore flag.

The Cloud Billing alert pathway has not been driven end-to-end. Configuring Cloud Billing budgets requires roles/billing.user, which the project's operator account currently lacks. The deferral is recorded in the shakedown ledger; the drill is on the Phase-2 hardening backlog (R5 ⏳ #1).

What this means in practice: until the role is granted, the cost-ledger gate (§7.1) is the binding safety net. It catches every iteration-boundary case and is exercised by every smoke. What it does not catch is mid-iteration runaway — a single shard going hot inside an already-started iteration, blowing the per-iteration budget before check_budget_pre runs again. The Cloud Billing pathway exists specifically to close that gap.

Operator override

To override, top up the budget first, then clear the pause flag with gcloud firestore documents delete calibration_runtime/budget_state. The next iteration re-runs the spend check with fresh data.


8. Iteration namespace + cohort-keyed profile registry (calibration pipeline)

Two post-shakedown disciplines that prevent Phase-1 smoke from colliding with Phase-2 production calibration in BigQuery, and prevent smoke configs from silently inheriting into production runs.

Iteration namespace

| Iteration range | Use | Enforced where |
|---|---|---|
| 0–9 | Phase-1 smoke + ad-hoc debugging | scripts/forge_persona_library.py validators; workflow input validators |
| 10+ | Phase-2 production calibration | Same validators |

Every BigQuery row in sessions, cost_ledger, iteration_summary, calibration_metrics, and invariance is keyed by iteration as part of its MERGE clause. If a Phase-1 retry tried to MERGE into a row keyed at iteration 7, and Phase-2 happened to have a production row keyed at iteration 7, the MERGE would silently update production data with smoke data. The convention closes the class of bug; the validators enforce it; the local QA harness test_workflow_safety.py checks that workflow inputs honour it.
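
A validator for the namespace convention might look like the sketch below. It is not the code in scripts/forge_persona_library.py; the validate_iteration name and the "phase1" / "phase2" labels are hypothetical, and only the 0–9 and 10+ ranges come from the table above.

```python
SMOKE_MAX = 9          # iterations 0-9: Phase-1 smoke + ad-hoc debugging
PRODUCTION_MIN = 10    # iterations 10+: Phase-2 production calibration


def validate_iteration(iteration: int, phase: str) -> int:
    """Return the iteration unchanged if it sits in the range reserved for its phase."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(iteration, bool) or not isinstance(iteration, int) or iteration < 0:
        raise ValueError(f"iteration must be a non-negative integer, got {iteration!r}")
    if phase == "phase1":
        if iteration > SMOKE_MAX:
            raise ValueError(f"Phase-1 smoke must use iterations 0-9, got {iteration}")
    elif phase == "phase2":
        if iteration < PRODUCTION_MIN:
            raise ValueError(f"Phase-2 production must use iterations 10+, got {iteration}")
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    return iteration
```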

Cohort-keyed profile registry

functions/persona_factory/calibration_profile.py exposes get_profile(cohort: str). Phase-1 smoke uses an explicit cohort named phase1_smoke with permissive defaults — reduced thinking budget, abbreviated scenario library, looser convergence gates. Phase-2 cohorts must be explicitly registered by name. An unregistered cohort raises ValueError rather than silently falling back to defaults.

This closes a class of bug we had not anticipated before shakedown: a Phase-1 smoke configuration silently inheriting into a Phase-2 production calibration run via default-argument fall-through. The smoke had a reduced thinking budget (T1 = 2048, T2/T3 = 1024 — tuned down from production); the default fallback would have shipped that thinking budget into a production iteration, which would have produced low-quality LLM outputs and invalidated the calibration fit.

The fix is to make defaults raise. Every workflow input that drives a calibration parameter set carries a cohort field; get_profile(cohort) either returns the registered profile or raises. There is no silent fallback to "the default smoke config." Smoke configs are only reachable by passing cohort="phase1_smoke" explicitly.
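
The "defaults raise" pattern can be sketched as a registry with no fallback. This is a minimal illustration, not the contents of functions/persona_factory/calibration_profile.py: the CalibrationProfile fields and register_profile helper are hypothetical, and only get_profile(cohort), the phase1_smoke cohort name, its thinking budgets, and the ValueError come from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CalibrationProfile:
    cohort: str
    thinking_budget: Dict[str, int]   # tier -> thinking-token budget


_REGISTRY: Dict[str, CalibrationProfile] = {}


def register_profile(profile: CalibrationProfile) -> None:
    """Register a cohort by name; re-registering the same name is an error."""
    if profile.cohort in _REGISTRY:
        raise ValueError(f"cohort already registered: {profile.cohort!r}")
    _REGISTRY[profile.cohort] = profile


def get_profile(cohort: str) -> CalibrationProfile:
    """Return the registered profile or raise; there is no default."""
    try:
        return _REGISTRY[cohort]
    except KeyError:
        raise ValueError(f"unregistered cohort: {cohort!r}") from None


# Smoke settings are reachable only by naming the cohort explicitly.
register_profile(CalibrationProfile(
    cohort="phase1_smoke",
    thinking_budget={"T1": 2048, "T2": 1024, "T3": 1024},
))
```

Because cohort is a required positional argument with no default value, a caller cannot reach the smoke profile by omission.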

The R1 dataset design report (§8.1) carries the methodology framing for this discipline; R3 §3.2 carries the operational narrative; R7 §13.1 logs it as an OK post-shakedown addition that closes one of the original audit's flagged gaps.


What is observable

Every operation above produces an observable trace:

| Surface | What you can read |
|---|---|
| Cloud Logging (per project) | Cloud Run service stdout/stderr; Cloud Run Job stdout/stderr; Custom Job stdout/stderr; Workflow execution log per step. |
| Aggregated org log sink → neumatics-audit-logs BigQuery | Every Data Access log on Firestore / AlloyDB / BigQuery / Secret Manager / KMS. Read-only to humans. 7y retention. |
| Cloud Monitoring | Custom metrics under corpus.*, calibration.*, alloydb.* namespaces. alloydb_pause_decision for cost-control. |
| Firestore system_status/alloydb_{prod,staging} | Live AlloyDB cluster state + pin reason + last heartbeat. |
| Firestore factory_loop_runs/{workflow_exec_id}/shards/{shard_index} | Per-shard heartbeat with timestamps + status. |
| Firestore calibration_runtime/budget_state | Current pause flag + reason + percent + setter. |
| BigQuery nexus_calibration_corpus.iteration_summary | One row per (iteration, component) with spend, sessions, gate verdicts, calibration deltas. |
| BigQuery nexus_calibration_corpus.cost_ledger | Per-Vertex-call cost line with iteration / component / tier / cohort labels. |
| gcloud workflows executions list iteration_runner --location=europe-west4 | Every iteration's execution log + status. |
| scripts/nexus-alloydb.ps1 status | One-liner cluster state across both clusters. |

The combination is what makes the operational discipline verifiable end-to-end: every dollar in cost_ledger is attributed to an iteration × component × cohort triple; every iteration's outcome lands in iteration_summary; every operator action that clears a pause flag leaves a Firestore audit trail; every AlloyDB resume / pause / pin is in the controller's structured logs and surfaces to Cloud Monitoring.

The Gate-10 measurement-invariance check that lands in invariance follows the Cheung & Rensvold ΔCFI ≤ 0.01 threshold; methodology details are in /about/science/methodology and the gated reports tier.
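
As a rough illustration of the ΔCFI ≤ 0.01 rule, the check reduces to comparing the CFI of a less-constrained model with that of the more-constrained one. The sketch below assumes ΔCFI is measured as the drop in CFI when constraints are added; the function name and sign convention are assumptions, and the actual Gate-10 implementation may differ.

```python
def invariance_holds(cfi_baseline: float, cfi_constrained: float,
                     max_drop: float = 0.01) -> bool:
    """True when adding invariance constraints lowers CFI by at most max_drop.

    cfi_baseline: CFI of the less-constrained model.
    cfi_constrained: CFI of the model with the added equality constraints.
    """
    drop = cfi_baseline - cfi_constrained
    return drop <= max_drop
```

For example, a baseline CFI of 0.962 against a constrained CFI of 0.957 passes, while a constrained CFI of 0.940 fails.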


What is not drilled yet

To be honest about the surface:

  • Auto-pause end-to-end — Cloud Billing pathway. ⏳ blocked on roles/billing.user.
  • Mid-shard worker crash with CHAOS_FAULT injection — chaos image not built. ⏳ Phase-2 hardening sprint.
  • Vertex 429 backoff under live quota pressure: _RateLimiter is fully built, but shakedown never produced a 429. ⏳ Phase-2 corpus may incidentally exercise it.
  • Idempotency stress with six concurrent re-fires of the same (iteration, shard_index) — smokes #11 and #16 are supportive but not equivalent. ⏳ Phase-2 hardening sprint.
  • Knowledge Catalog tag-coverage report — depends on a Knowledge Catalog tagging baseline that is ⏳ not yet applied across the BigQuery + AlloyDB + Firestore surfaces.
  • AlloyDB pause/resume cycle on neumatics-staging — ⏳ deferred until the staging project is provisioned (billing-quota approval pending).
  • Vertex Reasoning Engine redeploy in europe-west4: ✅ completed at the 2026-06 org migration (see §2 and §5); apphosting.yaml now carries the europe-west4 IDs, so this item is closed.
  • Firebase Stream-Firestore-to-BQ extension — replaces the legacy nightly Firestore-to-BQ export. ⏳ extension installation pending; docs/operations/firestore_extension.md is the runbook.
  • Workflows: erasure-cascade.yaml, calibration-promote.yaml, cohort-freeze.yaml — ⏳ reserved real-estate per F-3.9 / S-4.8; not yet implemented.

The robustness report carries the full per-drill execution status with infrastructure citations and the Phase-2 hardening checklist. None of the calibration-pipeline gaps blocks Phase-2 milestone-review approval; the platform-level gaps are scheduled work as the foundation refactor's later stages land.