Architecture

There are two layers in the picture. The platform is what Foundation v1 built: a Workspace-rooted Google Cloud organisation with three live projects (one workload, one audit, one network host), a single CMEK keyring, an AlloyDB cluster that pauses by default, BigQuery datasets that the substrate writes to, a Shared VPC with a VPC-SC perimeter on prod, an aggregated audit-log sink in a sealed project, and per-domain Cloud Run services that replace the old functions/main.py monolith. The calibration pipeline is the original tenant: a Cloud Workflows DAG that fans out to Vertex AI Batch Prediction for corpus generation and to Vertex AI Custom Training for the hierarchical Bayesian GRM fit, with BigQuery as the canonical store. Probe 5, the final shakedown probe, converged this pipeline at R-hat = 1.002, n_eff_min = 1,318, zero divergences in 9.0 seconds on real PSL-derived data.

This page walks both layers. Operational guarantees (idempotency, EU residency, AlloyDB pause-by-default, the BACKEND_ROUTING cutover) are covered on the operations page. Cost engineering has its own page. The shakedown narrative and the foundation receipts are on the receipts page.

The platform layer (Foundation v1)

Workspace org: neumatics.eu
├── Folder: platform/
│   ├── Project: neumatics-prod      ← live workload, customer-facing
│   └── Project: neumatics-audit-logs        ← sealed sink for org audit logs
└── Folder: shared-services/
    └── Project: neumatics-network-host       ← Shared VPC host project

Org policies enforced (infra/org-policies/):
  gcp.resourceLocations in:europe-west4-locations (BQ-EU exception)
  iam.disableServiceAccountKeyCreation
  cloudkms.minimumKeyRotationPeriod = 90d
  storage.uniformBucketLevelAccess required
  essentialcontacts.allowedContactDomains in:neumatics.eu
  cloudbuild.allowedIntegrations in:GitHub

The folder structure, projects, org-policy bundle, and essential-contacts wiring all live as Terraform under infra/projects/, infra/org-policies/, and infra/bootstrap/. The legacy consumer-Gmail-owned project has been decommissioned (2026-06): the old org is gone and every route in BACKEND_ROUTING now targets the new estate.

Workload project — `neumatics-prod`

neumatics-prod (europe-west4)
├── KMS keyring: nexus-foundation
│   ├── alloydb-key, bigquery-key, gcs-key, pubsub-key
│   ├── secret-manager-key, firestore-key, artifact-registry-key
│   └── HSM keys: evidence-hsm-key, tier3-demographics-hsm-key, audit-hsm-key
│
├── Firestore: soulmap-v4 (database name unchanged)
│   - users / sessions (transactional)
│   - calibration governance plane (admin-only via firestore.rules)
│   - system_status/alloydb_prod (cluster state for the cost-control plane)
│
├── AlloyDB: nexus-prod cluster (regional, primary, paused-by-default)
│   - CMEK via alloydb-key
│   - Continuous backups 14d
│   - Columnar engine flag enabled
│   - pgvector extension enabled
│   - Private IP only (PSC, no public endpoint)
│
├── BigQuery (EU multi-region, all CMEK-encrypted):
│   - nexus_calibration_corpus     ← Phase-2 corpus + cost ledger
│   - nexus_warehouse              ← modelled views (analytics surface)
│   - nexus_substrate              ← reserved for the F-3.9 profile film
│   - nexus_synth_substrate        ← synthetic-realm parity (F-3.10)
│   - nexus_alloydb_cdc            ← Datastream landing
│   - nexus_firestore_changelog    ⏳ Firebase Stream-Firestore-to-BQ pending
│
├── Cloud Run services (europe-west4); see service catalogue below
│
└── Cloud Workflows: iteration_runner, bq_inspect, iam_probe
   ⏳ erasure-cascade.yaml, calibration-promote.yaml, cohort-freeze.yaml
      reserved real-estate per F-3.9 / S-4.8; not yet implemented

The workload project replaces the legacy Gmail-owned project for everything new. Project IDs that were already taken globally (e.g. nexus-prod) were renamed for uniqueness: the actual GCP project ID is neumatics-prod, while the Terraform local name remains nexus-prod.

Audit project — `neumatics-audit-logs`

neumatics-audit-logs (europe-west4)
├── Aggregated org log sink → BigQuery: org_audit_logs (CMEK; 7y retention)
├── GCS object-lock archive (10y retention, immutable bucket lock)
└── Pub/Sub topic: nexus-security-alerts → nexus-audit-alerter Cloud Run job

Sealed: no application service has read access to the audit BigQuery dataset. Only nexus-audit-readers workforce group can query it; only the org-level sink can write.

Network host — `neumatics-network-host`

Shared VPC host project; service projects (neumatics-prod) consume the network. VPC Service Controls perimeter wraps neumatics-prod plus neumatics-network-host; the perimeter restricts BigQuery, Storage, Pub/Sub, Secret Manager, AlloyDB, KMS, Vertex AI, Logging, and Run. Datastream and AlloyDB connect via Private Service Connect, not public IP.

The Cloud Run service catalogue

The 1,461-line functions/main.py monolith is being decomposed into per-domain Cloud Run services under services/:

Domain	Service	Replaces / purpose
Substrate	`nexus-substrate-api`	`nexus_api_v1` HTTPS CF; the substrate query surface
Echo session	`nexus-echo`	`echo_start`, `echo_turn` HTTPS CFs + the EchoPipeline
Synthesis	`nexus-synthesis`	`synthesize` HTTPS CF + `on_session_completed` trigger
Translate	`nexus-translate`	`translate` HTTPS CF
Quality	`nexus-quality`	`quality_score_v1` HTTPS CF
Budget	`nexus-budget`	`functions/budget_check` independent CF
Reasoning	`nexus-reasoning`	PyReason rule engine wrapper
Embedding	`embedding`	Affective + intent detection (already its own Cloud Run service; rebuilt in europe-west4)
Vocabulary	`nexus-vocabulary`	LinkML-backed atom registry surface
Feature registry	`nexus-feature-registry`	Feature flag / capability surface
Consent	`nexus-consent`	Per-user consent catalogue + write path
Catalog	`nexus-catalog` ⏳	Knowledge Catalog wrapper (infra deployed; surface pending)
Journey monitor	`nexus-journey-monitor`	End-to-end probe of the user journey
Corpus shard worker	`nexus-corpus-shard-worker`	Existing Cloud Run Job for persona-corpus shard processing
Synth-substrate ingest	`nexus-synth-substrate-ingest`	NEW; post-hoc ETL from `calibration_corpus.sessions` to `nexus_synth.*` (per F-3.10 two-phase pattern)
Foundation tests	`nexus-foundation-tests`	Smoke surface that exercises the foundation control plane
AlloyDB controller	`nexus-alloydb-controller`	NEW; per F-3.13 cost-control plane (resume / pause / pin / status)
AlloyDB auto-pause	`nexus-alloydb-auto-pause`	NEW; Cloud Run Job, Cloud Scheduler 2-min cadence
AlloyDB bootstrap	`nexus-alloydb-bootstrap`	One-shot schema/role bootstrap for the cluster
Audit alerter	`nexus-audit-alerter`	Pub/Sub-triggered notifier on the curated security event list
PAM mediator	`nexus-pam-mediator`	NEW per OD-19; injects deployer 5min / DBA 30min / KMS-admin 30min delays before issuing PAM elevation

Each service has its own Dockerfile, its own service account, its own deploy pipeline, and its own minimum-instance count tuned to its load profile. The cutover from legacy Cloud Functions to per-domain services is per-route via BACKEND_ROUTING in src/lib/api-helpers.ts — the operations page walks the cutover discipline.

The calibration pipeline (the original tenant)

The calibration pipeline is what the science tier describes — a single Cloud Workflows DAG (workflows/iteration_runner.yaml) that fans out persona shards to Vertex AI Batch Prediction, joins the corpus through BigQuery, fits ten parallel Bayesian GRM trainers on Vertex Custom Training, and gates the result through three measurement-quality verdicts before deciding whether to schedule the next iteration. It pre-dated the foundation refactor and survived it unchanged; it now reads from and writes to the foundation's BigQuery datasets (nexus_calibration_corpus) inside the foundation's prod project (neumatics-prod).

The architecture is shipped, deployed, and shaken down. Probe 5 converged R-hat = 1.002, n_eff_min = 1,318, zero divergences in 9.0 seconds on real PSL-derived data. Eighteen workflow smokes and five trainer probes uncovered twenty-seven bugs across eight categories during shakedown; every fix is either in the production code path or in the eight-check local QA harness that gates Phase-2 readiness.

Diagram: the Phase-2 calibration pipeline, which runs inside neumatics-prod alongside the foundation services in europe-west4, except where the publisher model forces /global/ routing.

The diagram has three loops:

The corpus loop. iteration_runner.yaml step shard_personas fans out N Cloud Run Jobs (default 12, capped by concurrency_limit: 8); each job submits one Vertex Batch input JSONL keyed by (iteration, shard_index) and writes results to BigQuery sessions + cost_ledger via load-job + MERGE.
The calibration loop. Step calibrate fans out ten Vertex AI Custom Training jobs (one per construct family, capped by concurrency_limit: 10); each fits a hierarchical Bayesian GRM via NumPyro NUTS on n2-highmem-16, plus a conformal recalibration, plus a Bayesian belief network CPT fit, plus a Gate-10 measurement-invariance verdict. Calibration metrics MERGE back into BigQuery.
The gate loop. Step score_drift calls a BigQuery procedure that aggregates per-iteration spend, sessions generated, calibration deltas, and Gate 8 / 9 / 10 Boolean verdicts into one iteration_summary row. The gate step reads that row and returns green / red. The orchestrator scheduling the workflow decides whether to schedule the next iteration; convergence requires green for two consecutive iterations.
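The stopping rule of the gate loop can be sketched in a few lines. This is a minimal illustration, not the orchestrator's code: the function names `gate_verdict` and `has_converged` are hypothetical, and it assumes that an iteration is green only when all three gates pass and that "two consecutive" means the two most recent iterations.

```python
from typing import Sequence


def gate_verdict(gate8: bool, gate9: bool, gate10: bool) -> str:
    """Collapse the three Boolean gate verdicts into green / red.

    Assumption: an iteration is green only if every gate passes.
    """
    return "green" if (gate8 and gate9 and gate10) else "red"


def has_converged(verdicts: Sequence[str], required_streak: int = 2) -> bool:
    """Return True when the latest `required_streak` verdicts are all green.

    `verdicts` is ordered oldest to newest, one entry per iteration.
    """
    if len(verdicts) < required_streak:
        return False
    return all(v == "green" for v in verdicts[-required_streak:])
```

Under these assumptions a red iteration resets the streak, so a history of green, red, green has not converged even though two of its three iterations were green.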

Out-of-band, Cloud Billing budget alerts at 25 / 50 / 75 / 90 / 100 % of the iteration budget publish to Pub/Sub topic calibration-budget-alerts. A Cloud Function budget_alert_handler receives the message and writes a paused: true flag to Firestore document calibration_runtime/budget_state; the next iteration's check_budget_pre step reads this flag and short-circuits to pause_iteration without submitting LLM calls.

Six service-choice rationales (calibration pipeline)

Six choices on the calibration diagram are non-default; each is documented below with the alternative we considered and why we rejected it.

1. Cloud Workflows over Composer, Vertex AI Pipelines, or Cloud Tasks

The orchestration layer fans out shard workers, polls long-running Vertex Batch operations, parallelises the calibration trainers, and gates the next iteration. The candidates were Cloud Composer (managed Airflow), Vertex AI Pipelines (KFP-based ML pipelines), Cloud Tasks (a queue with retry semantics), and Cloud Workflows (a serverless YAML DAG).

We picked Cloud Workflows for three reasons:

No control-plane cost. Composer runs three GKE worker nodes plus a Cloud SQL backend twenty-four-seven, billing roughly $400/month before any DAG runs. Cloud Workflows bills per internal step at $0.01 per thousand steps with five thousand free per month — at our duty cycle (one workflow execution per iteration, plus a small constant of pre-flight smokes), the orchestration layer costs less than a dollar per month. For a project whose largest steady-state line item is GCS storage at the same scale, paying $400/month for orchestration would be a category error.
Native Vertex AI + BigQuery + Cloud Run connectors. The googleapis.aiplatform.v1.projects.locations.customJobs.create connector handles the Vertex Custom Training submission natively; the googleapis.run.v2.projects.locations.jobs.run connector handles Cloud Run Jobs; the googleapis.bigquery.v2.jobs.query connector handles the BigQuery procedure call. We do not need to manage HTTP retry logic, OAuth tokens, or polling loops in worker code; the workflow does all of that.
Vertex AI Pipelines is a poor fit. KFP is built for the case where each step produces a typed artifact consumed by the next step, with the artifacts living in a metadata store that the UI surfaces. Our pipeline has a small number of long-running steps with side effects (Vertex Batch, Vertex Custom Training, BigQuery MERGE) rather than a typed-artifact graph. KFP's metadata layer would be dead weight.
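The cost argument in the first reason reduces to simple arithmetic. The sketch below uses only the two prices quoted above ($0.01 per thousand internal steps, five thousand free per month); the function name and the execution and step counts in the usage note are hypothetical, chosen only to show the order of magnitude.

```python
PRICE_PER_1000_INTERNAL_STEPS_USD = 0.01  # figure quoted in the text above
FREE_INTERNAL_STEPS_PER_MONTH = 5_000     # figure quoted in the text above


def monthly_workflows_cost_usd(executions: int, steps_per_execution: int) -> float:
    """Estimate one month's Cloud Workflows bill for internal steps only."""
    total_steps = executions * steps_per_execution
    # Only steps beyond the monthly free allowance are billed.
    billable_steps = max(0, total_steps - FREE_INTERNAL_STEPS_PER_MONTH)
    return billable_steps / 1000 * PRICE_PER_1000_INTERNAL_STEPS_USD
```

For instance, thirty executions of two thousand internal steps each (illustrative numbers, not measurements) come to 60,000 steps, of which 55,000 are billable, or roughly $0.55 for the month.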

The trade-off Cloud Workflows imposes is YAML expression syntax. Smoke #12 surfaced an expression colon-in-string parse error that needed single-quote wrapping; smoke #10 surfaced a connector default 1,800 s timeout that was too short for a 25 min Vertex Batch (we now set connector_params.timeout: 7200); smoke #13 surfaced the BigQuery connector's body-wrapper drift (corpus_count.body.rows vs corpus_count.rows). All three are now flagged by test_workflow_safety.py in the local QA harness.

2. Vertex AI Batch Prediction at `/global/` with regional BigQuery

The corpus generation submits one Vertex Batch row per full Echo session: twelve conversational turns inside a single structured-output response. This was the largest cost-engineering decision in the project: running each session as one Vertex Batch row instead of twelve sequential per-turn calls means the input prefix is sent once and amortised across all twelve turns rather than re-sent per turn. The seventy-percent implicit-cache hit rate on the stable persona prefix (≥ 4,096 tokens, placed first in the prompt) further reduces input-token cost.

Vertex Batch Prediction over the online endpoint buys two things:

Fifty percent discount on every billable token (input, cached input, output) versus the online list price. At our duty cycle this saves approximately $5,440 across a Phase-2 base case (ten iterations, two thousand personas × fifty sessions each). The cost-engineering page carries the full sensitivity tornado.
No preset RPM / TPM quotas. The online endpoint enforces project-level requests-per-minute and tokens-per-minute caps; Batch enforces neither, because the underlying job runs against a long-lived dedicated SKU rather than against the shared online inference pool. We have a _RateLimiter in the corpus shard worker (functions/persona_factory/runner_llm.py:140-241) that adapts to 429 ResourceExhausted errors with halve-on-first-429 then 25 % shrinkage every fifth 429 thereafter, but the entire eighteen-smoke shakedown never produced a single 429 — Batch did its job.
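The back-off policy described for the shard worker's rate limiter can be sketched as follows. This is not the `_RateLimiter` from runner_llm.py: the class name, the requests-per-minute unit, and the floor are assumptions, and "every fifth 429 thereafter" is read here as the 6th, 11th, 16th, ... error.

```python
class AdaptiveRateLimiter:
    """Shrinks an allowed request rate in response to 429 errors.

    Policy (one reading of the text above): halve on the first 429,
    then shrink by 25 % on every fifth 429 after that.
    """

    def __init__(self, initial_rpm: float, floor_rpm: float = 1.0) -> None:
        self.rpm = initial_rpm
        self.floor_rpm = floor_rpm  # hypothetical lower bound so the rate never hits zero
        self.count_429 = 0

    def on_429(self) -> float:
        """Record one 429 ResourceExhausted error and return the new rate."""
        self.count_429 += 1
        if self.count_429 == 1:
            self.rpm *= 0.5
        elif (self.count_429 - 1) % 5 == 0:
            self.rpm *= 0.75
        self.rpm = max(self.rpm, self.floor_rpm)
        return self.rpm
```

Starting from 100 requests per minute, the first error drops the rate to 50, errors two through five leave it unchanged, and the sixth lowers it to 37.5.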

The publisher model gemini-3-flash-preview is /global/-only; smoke #8 surfaced this when the original europe-west1-pinned submission failed with 400 location in model name doesn't match. The corpus is fully synthetic (no PII; every persona is sampled from the copula, every scenario is from the locked library), so /global/ routing for the inference call is acceptable. The EU residency boundary is preserved by keeping every BigQuery dataset, Cloud Storage bucket, Cloud Run Job, Cloud Workflow, and Vertex Custom Training Job inside the prod project's region. The live Echo product (the consumer-facing surface, where a real user types into a real conversation) is europe-west1-only with no /global/ traffic; the calibration pipeline is the only /global/ consumer in the project.

The block-t copula on the Dark-Triad triplet at ν = 5 — the source of our joint-extremes recovery — runs entirely in the persona-forging step before any Vertex call; the LLM never sees the latent layer.

3. Cloud Run Jobs for shard workers (idempotency + ephemeral cost)

The shard worker is a containerised process: read N personas from a GCS-hosted persona library, build a Vertex Batch input JSONL keyed by (iteration, persona_id, session_id, attempt), submit the batch job, poll for completion, parse outputs, write to BigQuery. The candidates were Cloud Run Jobs (containerised batch tasks), Cloud Run services (HTTP-triggered containers), Cloud Functions (function-as-a-service), and a long-lived Compute Engine VM.

We picked Cloud Run Jobs for three reasons:

No always-on cost. Cloud Run Jobs bill per vCPU-second only while running; one shard worker invocation costs cents at our duty cycle. A long-lived VM would bill twenty-four-seven for compute we use for thirty minutes per iteration.
Idempotency at the job level. The Cloud Run Job execution name doubles as a per-invocation correlation key; combined with the deterministic shard slicing (SHARD_INDEX/N_SHARDS) and the deterministic dedupe key (iter={N}/persona={P}/sess={S}/attempt={A}), every retry of a failed job restarts from the top of worker.py:main() and produces the same Vertex Batch input file, the same output GCS prefix, and the same BigQuery MERGE keys. Failures absorb cleanly: in smoke #11 the Vertex Batch job succeeded, the downstream MERGE failed on a STRING-vs-FLOAT64 drift, and the Cloud Run Job auto-retry kicked off a fresh batch. The dedupe path produced zero duplicate rows.
Container image discipline matches the trainer. The Vertex Custom Training trainer is also a container image; using Cloud Run Jobs for the shard worker keeps both production code paths on Docker, which means one Dockerfile pattern and one Cloud Build YAML per worker, and one local QA harness check that walks the container's import graph for missing pip packages.
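The two deterministic ingredients of the idempotency argument, the dedupe key and the shard slice, can be illustrated as below. The key format is the one quoted above; the function names and the slicing strategy (sort, then take every N-th persona) are assumptions for illustration, since the actual slicing in worker.py is not shown on this page.

```python
from typing import List


def dedupe_key(iteration: int, persona_id: str, session_id: str, attempt: int) -> str:
    """Build the dedupe key in the iter=/persona=/sess=/attempt= format."""
    return f"iter={iteration}/persona={persona_id}/sess={session_id}/attempt={attempt}"


def shard_slice(persona_ids: List[str], shard_index: int, n_shards: int) -> List[str]:
    """Return the personas owned by one shard.

    Sorting first makes the result independent of input order, so a
    retried job with the same SHARD_INDEX / N_SHARDS sees the same slice.
    """
    if not 0 <= shard_index < n_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < n_shards")
    ordered = sorted(persona_ids)
    # Strided slice: shard i takes positions i, i + n_shards, i + 2 * n_shards, ...
    return ordered[shard_index::n_shards]
```

Because both functions are pure, a retry that receives the same environment variables regenerates the same keys, which is what lets the downstream MERGE absorb the duplicate attempt.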

The trade-off is Cloud Run Jobs' 2 GiB memory ceiling and one-vCPU default. Our shard worker fits comfortably (it is Vertex-bound, not compute-bound), but a CPU-heavy shard worker would have to either bump the CPU/memory limits or move to Vertex Custom Training. We could move there if needed; we have not.

4. Vertex AI Custom Training with NumPyro/JAX on `n2-highmem-16` (CPU, not GPU)

The calibration trainer fits a hierarchical Bayesian graded-response model via NumPyro NUTS, plus a conformal recalibration, plus a Bayesian belief network CPT fit, plus a Gate-10 measurement-invariance verdict. The candidates for the compute substrate were:

GPU (T4 or A100) on Vertex Custom Training. The natural reach for "Bayesian deep" workflows.
CPU n2-highmem-16 on Vertex Custom Training. The boring choice.
Cloud Run Job with high CPU + memory. Same container as the shard worker.

We picked CPU n2-highmem-16 for one reason: at our problem size, NumPyro/JAX is CPU-saturating, not GPU-bound. The hierarchical GRM has on the order of thousands of latent variables (per-person θ plus per-item discrimination and threshold parameters), the production target is four chains of 1,000 warmup and 1,000 sampling iterations each, and the JAX trace fits in CPU L2/L3 cache without paging. A T4 GPU would idle most of the trace cycles waiting for the host to feed it parameters; we measured the trade-off and CPU came out 60 % cheaper at this scale, with comparable wallclock.

This is the most counterintuitive choice on the diagram, and the one most likely to draw the question "but Bayesian inference at scale runs on GPU." The answer is "at some scale," and we are below it. If Phase-3 expands the corpus to N = 10,000 with 113-construct fits, we will revisit; until then, CPU is the right tool.

The graded-response model itself is Samejima's classic, fitted with the standard hierarchical-Bayesian extensions (per-family pooling on log-discrimination, sorted thresholds via cumulative softplus, non-centered parameterisation on per-item log-discrimination to sidestep the funnel pathology that this model class can exhibit at small N).
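To make the "sorted thresholds via cumulative softplus" construction concrete, here is a dependency-free sketch of it together with the graded-response category probabilities. It is not the NumPyro trainer: the function names are hypothetical, and it assumes the first threshold is unconstrained with each later threshold equal to the previous one plus a softplus-transformed gap.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); positive for any moderate input."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sorted_thresholds(first: float, raw_gaps: List[float]) -> List[float]:
    """Map unconstrained gaps to increasing thresholds b_1 < b_2 < ..."""
    thresholds = [first]
    for gap in raw_gaps:
        thresholds.append(thresholds[-1] + softplus(gap))
    return thresholds


def grm_category_probs(theta: float, log_a: float, thresholds: List[float]) -> List[float]:
    """Graded-response model: P(Y = k) from cumulative P(Y >= k) = sigmoid(a * (theta - b_k))."""
    a = math.exp(log_a)  # discrimination is kept positive via its log
    cumulative = [1.0] + [sigmoid(a * (theta - b)) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]
```

Ordered thresholds guarantee that the cumulative probabilities are decreasing, so every category probability is positive and the set sums to one; that is the property the softplus construction is there to protect during sampling.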

Probe 5 — the final shakedown probe — converged this trainer in 9.0 seconds wallclock on ten personas × six openness constructs at R-hat = 1.002, n_eff_min = 1,318, zero divergences, on real PSL-derived data. The full Phase-2 fit (two thousand personas × 113 constructs across ten families parallel) is targeted at approximately thirty minutes per family per iteration, comfortably inside the iteration budget.

5. Cloud Run Jobs for the calibration trainer? No — Vertex Custom Training

A reasonable question is whether the calibration trainer should run on Cloud Run Jobs alongside the shard worker. We tried this. The answer is no:

Cloud Run Jobs caps memory at 2 GiB and vCPU at 8. The hierarchical GRM trainer wants more. n2-highmem-16 is 16 vCPU + 128 GiB RAM, which fits the largest production fits we have benchmarked; a 2 GiB ceiling forces aggressive in-memory chunking that is not free in development cost.
Vertex Custom Training has native checkpoint/resume. A trainer that crashes mid-NUTS-warmup can resume from the last checkpoint; Cloud Run Jobs does not have a first-class checkpoint primitive.
Vertex billing labels flow into Cloud Billing detailed export. Each Custom Job carries iteration, component, family_index labels via the labels block; these flow to Cloud Billing detailed export and let us cross-validate per-iteration spend against the official invoice. Cloud Run Jobs labelling is supported but the integration is less tight.

The result is two production code paths on two different compute services, both containerised. This is one of the clean architectural decisions: each service runs the workload it is good at; the orchestration layer (Cloud Workflows) does not care which is which.

6. BigQuery load-job + MERGE for ingest (not Storage Write API, not `tabledata.insertAll`)

Every persona session and every Vertex API call lands in BigQuery: sessions carries the per-session conversational record, cost_ledger carries the per-Vertex-call cost line. The candidates for ingest were:

tabledata.insertAll — the legacy single-row streaming API. Free up to a quota; trivially simple to call from anywhere; used by the pre-shakedown draft of the workflow (and explicitly rejected by the methodology plan as an anti-pattern).
Storage Write API — the modern committed-stream replacement. Exactly-once semantics; proto-schema-based; throughput-priced.
Load-job + MERGE — write a per-shard staging table from a JSONL or CSV file, then MERGE into the canonical table on the dedupe key, then drop the staging table.

We picked load-job + MERGE for three reasons:

Free up to 1,500 load jobs per day per project. Our duty cycle is roughly twelve Cloud Run Jobs per iteration × ten iterations per Phase-2 campaign = 120 load jobs total. Even if the whole campaign ran in a single day, we would be more than an order of magnitude under the limit.
No proto-schema complexity tax. Storage Write API requires a proto descriptor for every table you write to; the descriptors must be synchronised with the table DDL or the writes silently misbehave. Load-job + MERGE accepts JSON natively (with the canonical schema explicitly specified, after smoke #11 surfaced the autodetect drift on a staging table).
Idempotent on the MERGE key. The MERGE clause keys on (iteration, persona_id, session_id, attempt) for sessions and vertex_call_uid for cost_ledger; duplicate ingest produces zero duplicate canonical rows. This is the same dedupe semantic Storage Write API offers, at none of the proto-schema cost.
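The shape of the MERGE step can be sketched by rendering the statement from its dedupe keys. This is an illustration under stated assumptions, not the repository's SQL: the helper name is hypothetical, the statement is insert-only (the source does not say whether matched rows are also updated), and it assumes the staging table has the same schema as the target and at most one row per key.

```python
from typing import Sequence

# Dedupe keys for the sessions table, as listed in the text above.
SESSIONS_KEYS = ("iteration", "persona_id", "session_id", "attempt")


def render_insert_only_merge(target: str, staging: str, keys: Sequence[str]) -> str:
    """Render a BigQuery MERGE that inserts only rows whose key is not yet in the target.

    Re-running it against the same staging table inserts nothing the second
    time, because every staging row then matches an existing target row.
    """
    on_clause = " AND ".join(f"T.{k} = S.{k}" for k in keys)
    return (
        f"MERGE `{target}` AS T\n"
        f"USING `{staging}` AS S\n"
        f"ON {on_clause}\n"
        "WHEN NOT MATCHED THEN\n"
        "  INSERT ROW"
    )
```

A call such as `render_insert_only_merge("neumatics-prod.nexus_calibration_corpus.sessions", "neumatics-prod.nexus_calibration_corpus.sessions_staging_shard_3", SESSIONS_KEYS)` produces the statement for one shard; the staging table name there is invented for the example.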

The pre-shakedown plan called Storage Write API the mandated path. Implementation found load-job + MERGE strictly cheaper, simpler, and equivalent on the dedupe semantics; the deviation is documented at docs/shakedown_ledger.md:91 ("Load-job + MERGE for BQ ingest" lever, "NOT Storage Write API, NOT tabledata.insertAll") and in R7 §13.4 of the implementation audit (which tracks the three remaining legacy tabledata.insertAll references in the Phase-2 hardening backlog).

The BigQuery datasets — nexus_calibration_corpus for the corpus + cost-ledger + iteration-summary tables, nexus_warehouse for the modelled views, nexus_alloydb_cdc for the Datastream landing — are partitioned + clustered + MERGE-keyed. Aggregate description: partitioning on iteration date for time-series queries, clustering on (iteration, persona_id) or (iteration, family_index) depending on table, MERGE keys as enumerated above. The full DDL excerpts are in R3 §5 of the gated reports tier.

Out-of-band: the budget guardrail (calibration pipeline)

The auto-pause guardrail is wired but not yet drilled end-to-end (R5 ⏳ #1). Cloud Billing alerts at 25 / 50 / 75 / 90 / 100 % of the per-iteration budget publish to Pub/Sub topic calibration-budget-alerts; a Cloud Function budget_alert_handler receives the message and writes paused: true to Firestore document calibration_runtime/budget_state; the workflow's check_budget_pre step reads the flag at iteration start and short-circuits to pause_iteration.

A safer interim posture is in place: the workflow's check_budget_pre step also queries cost_ledger directly on every iteration boundary. So even without the Cloud Billing pathway live, the workflow refuses to start an iteration whose cumulative cost_ledger spend already exceeds 90 % of iteration_budget_usd. This was exercised by every shakedown smoke (every smoke called check_budget_pre and read the cost ledger); the path is live and warm. What it does not yet cover is mid-iteration runaway — a single shard going hot inside an already-started iteration. That gap is what the Cloud Billing pathway will close once roles/billing.user is granted and the drill is run.
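The interim pre-flight check amounts to a single comparison plus the Firestore flag. A minimal sketch, assuming the two inputs have already been read from cost_ledger and from calibration_runtime/budget_state; the function name and signature are hypothetical and the real step lives in the workflow YAML, not in Python.

```python
def should_start_iteration(
    cumulative_spend_usd: float,
    iteration_budget_usd: float,
    paused_flag: bool = False,
    threshold: float = 0.90,
) -> bool:
    """Decide whether check_budget_pre lets the iteration proceed.

    Refuses when the paused flag is set, or when spend recorded in
    cost_ledger already exceeds `threshold` of the iteration budget.
    """
    if iteration_budget_usd <= 0:
        raise ValueError("iteration_budget_usd must be positive")
    if paused_flag:
        return False
    return cumulative_spend_usd <= threshold * iteration_budget_usd
```

Because the check runs only at an iteration boundary, it cannot stop a shard that overspends after the iteration has started, which is the mid-iteration gap described above.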

Operations carries the full state-machine for the auto-pause path, plus the broader platform-level cost guardrails (project budgets, AlloyDB pause-by-default, the operator pinning patterns).

What is not on this diagram

Two production code paths sit outside the calibration loop and the foundation control plane, and are deliberately excluded from this view:

The live Echo product. The consumer-facing surface, where a real user types into a real conversation, is the Next.js app on Firebase App Hosting. It is the consumer of the calibration model that this pipeline produces, not part of the calibration pipeline itself. The two systems share a code substrate (the functions/synthesis/ package fits the calibration parameters in this pipeline and reads them in the live product) but no runtime traffic.
The persona forge and the scenario library. Both are built once, locked into versioned files in GCS (personas/library_v2_copula.jsonl, scenarios/library_v1.jsonl, correlations/sigma_v1.yaml), and consumed read-only by the calibration loop. The forge is a numpy/scipy script that runs in seconds on a developer laptop; the scenario library was built once via the existing scenario generator infrastructure at zero LLM cost. Both are documented at /about/science.

Verifiable

Every architectural claim on this page corresponds to a file in the repository (infra/<module>/*.tf for the platform layer, services/<service>/ for the Cloud Run services, workflows/iteration_runner.yaml and functions/persona_factory/ for the calibration pipeline) or a row in BigQuery. The reviewer-grade reports tier — passkey-gated, available on request — carries the full per-component receipts including the IAM bindings, the BigQuery DDL excerpts, the deploy script, and the per-smoke ledger. The science tier at /about/science carries the methodology that this infrastructure exists to serve.

The Phase-1 shakedown converged. Foundation v1 has substantially landed. Phase-2 awaits a roles/billing.user grant; nothing else blocks iteration 10.