Cost engineering

Two cost stories ride on this infrastructure. The platform is what Foundation v1 stands up: a Workspace org with three live workload projects, AlloyDB pause-by-default, per-domain Cloud Run services on min-instances=0, BigQuery physical-billing tables, KMS keys, and a sealed audit project. At pre-launch scale (10 test users, no live external traffic) this lands at roughly €100–150/mo combined burn — dominated by AlloyDB awake-hours and a small constant of always-on services like the audit alerter. The calibration pipeline is the original tenant; it spent ~$140 of the $500 Start Tier budget across the Phase-1 shakedown and forecasts $4,060–$5,800 across the ten Phase-2 iterations, comfortably under the $25,000 AI Tier cap.

This page covers both. The platform side is short — the cost-control plane (F-3.13) does the work and the discipline lives in operations. The calibration-pipeline side is the eight engineering levers, the verified Phase-1 actuals, the re-derived Phase-2 forecast, and the sensitivity tornado at the base case, plus the local QA harness as the unconventional ninth lever.


Part 1 — The platform run-rate after Foundation v1

The pre-foundation worry was that AlloyDB at full sizing would burn ~€450/mo continuously — over half the Tier-A funding window — to serve no live traffic. The Foundation v1 cost-control posture (F-3.13 / OD-20) commits to pause-by-default with operator pinning. Concrete numbers:

| Wake driver | Hours awake/month |
| --- | --- |
| Active dev sessions (4 hr × 5 days × 4 weeks) | 80 |
| Scheduled jobs (~30 min/day × 3 jobs) | 45 |
| Pre-launch user traffic (sporadic) | ~10 |
| Smoke tests + reviewer-click buffer | 20 |
| Total awake hours/month per cluster | ~155 |
| × ~€0.31/hr per 2-vCPU cluster | ~€48/mo per cluster |

Only the prod cluster is live today; staging is deferred. Once both clusters exist, AlloyDB comes to ~€95–100/mo combined, and storage adds <€2/mo. That saves about €350/mo (€4,200/year) versus always-on at full sizing.
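The awake-hours arithmetic can be reproduced in a few lines. This is a minimal sketch: the 30-day month used for scheduled jobs and all variable names are assumptions, while the hourly rate and the ~€450/mo always-on figure come from the text above.

```python
# Sketch of the awake-hours estimate. Names and the 30-day month are assumptions.
WAKE_HOURS = {
    "dev_sessions": 4 * 5 * 4,        # 4 hr x 5 days x 4 weeks
    "scheduled_jobs": 0.5 * 3 * 30,   # ~30 min/day x 3 jobs, 30-day month assumed
    "prelaunch_traffic": 10,
    "smoke_and_review_buffer": 20,
}
RATE_EUR_PER_HOUR = 0.31   # per 2-vCPU cluster, from the table
ALWAYS_ON_EUR = 450        # pre-foundation always-on estimate


def monthly_cluster_cost(wake_hours=WAKE_HOURS, rate=RATE_EUR_PER_HOUR):
    """Return (total awake hours, monthly cost in EUR) for one cluster."""
    total_hours = sum(wake_hours.values())
    return total_hours, total_hours * rate


hours, per_cluster = monthly_cluster_cost()
savings_two_clusters = ALWAYS_ON_EUR - 2 * per_cluster
print(f"{hours:.0f} h awake, EUR {per_cluster:.2f} per cluster, "
      f"saves about EUR {savings_two_clusters:.0f}/mo with two clusters")
```

Changing any wake driver in the dictionary shows how quickly the saving shrinks as the cluster approaches always-on.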

The other platform-level lines, also at our pre-launch scale:

| Component | Run-rate | Notes |
| --- | --- | --- |
| AlloyDB (prod, paused-by-default) | ~€48/mo | One cluster live; staging adds ~€48/mo when provisioned |
| AlloyDB storage + backups | <€2/mo | Continuous backups 14d |
| Per-domain Cloud Run services (24×) | <€5/mo combined | All min-instances=0; cold-start is fine for pre-launch |
| BigQuery storage (7 datasets, physical billing) | <€2/mo | Calibration corpus is the largest dataset; long-term-storage tier kicks in after 90d |
| BigQuery compute | $0–$5/mo | dryRun + maximumBytesBilled ceilings prevent runaway |
| KMS key operations | <€2/mo | One CMEK keyring; auto-rotation every 90d |
| Cloud Workflows | <$1/mo | Per-step billing with 5,000 free/month |
| Cloud Logging (org-aggregated to audit-logs) | ~€10–20/mo | 7-year BigQuery retention; volume bounded by curated event filter |
| Datastream | ~€5–10/mo | One stream live; idle most of the time at pre-launch |
| GCS (corpus + audit archive) | ~€2/mo | Object-locked archive grows slowly |
| Vertex Agent Engine (3× Reasoning Engines) | ~€20–30/mo | Three managed agents; pay-per-call |
| Apphosting (Next.js frontend) | ~€10–15/mo | Standard tier |
| Translate API | <€5/mo | Cached; rare-use |
| Cloud Build | <€2/mo | Free tier covers our build volume |
| Total platform run-rate | ~€100–150/mo | At 10 test users, no live external traffic |

These are sustained-state numbers. Phase-2 calibration runs (when they fire) are an additive line item handled in Part 2. Once real user traffic arrives at GA, AlloyDB pause-by-default automatically becomes effectively always-on (the heartbeat fires continuously); a single Firestore document update can pin prod always-awake explicitly post-launch.

Per-project budget alerts

Every workload project has a Cloud Billing budget configured via Terraform (infra/projects/main.tf) with monthly thresholds at 50 / 75 / 90 / 100 %. Alerts at 90 % and above publish to a Pub/Sub topic that fans out to the operator. The same roles/billing.user deferral that blocks the calibration-pipeline auto-pause drill (see Part 2) also blocks full end-to-end testing of these alerts; until that grant lands, the budgets are configured and monitored manually via the Cloud Console.
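The threshold behaviour described here can be sketched as a pure function. This is an illustration only: the function names are hypothetical, and the real thresholds live in the Terraform budget resource rather than in application code.

```python
# Sketch of the 50 / 75 / 90 / 100 % threshold logic. Function names are hypothetical.
THRESHOLDS = (0.50, 0.75, 0.90, 1.00)
PUBSUB_FLOOR = 0.90  # alerts at or above this ratio fan out to the operator


def crossed_thresholds(spend, budget):
    """Return every configured threshold that month-to-date spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in THRESHOLDS if ratio >= t]


def alerts_for_operator(spend, budget):
    """Return only the crossed thresholds that publish to the Pub/Sub topic."""
    return [t for t in crossed_thresholds(spend, budget) if t >= PUBSUB_FLOOR]


print(crossed_thresholds(80, 100))
print(alerts_for_operator(95, 100))
```

A check like this is also a cheap way to rehearse the alert path while the end-to-end test remains blocked.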

The €350/mo savings from pause-by-default isn't a coupon — it's the binary difference between "the platform consumes half the funding window with zero traffic" and "the platform burns one-quarter of one funding tranche each month while we build." That's the point.

Part 2 — The calibration pipeline (Phase-1 actuals, Phase-2 forecast)

The Neumatics Soulmap calibration pipeline runs on Google for Startups Cloud credits. Phase-1 — the wire-test smoke plus the shakedown that hardened it — spent approximately $140 of the $500 Start Tier budget, ~28 %. Phase-2 — the full corpus generation plus ten calibration iterations — forecasts at $4,060–$5,800 across the campaign, comfortably under the $8,000 expected ceiling and 16–23 % of the $25,000 AI Tier cap.

This part covers the eight engineering levers that produce that envelope, the verified Phase-1 actuals, the re-derived Phase-2 forecast, the sensitivity tornado at the base case, and the local QA harness as the unconventional ninth lever. All of it is sourced from the cost-engine memo and from docs/shakedown_ledger.md (canonical Phase-1 end-state); every number agrees with cost_ledger rows visible to anyone with project access.

The eight cost levers (calibration pipeline)

Eight engineering choices control the campaign envelope. All eight survived shakedown unchanged; three were sharpened by shakedown evidence.

| # | Lever | Decision | Effect on Phase-2 base case |
| --- | --- | --- | --- |
| 1 | Vertex Batch vs online | Batch | −$5,440 vs online for ten-iteration Phase-2 |
| 2 | Implicit cache, 70 % hit on stable prefix ≥ 4,096 tokens FIRST | Tuned prompt structure | −$3,200 vs uncached |
| 3 | Reduced thinking budget (tuned down vs production) | Plan §6.1; per-cohort overridable | −$2,000 vs production thinking |
| 4 | maxOutputTokens hard cap | Per shakedown smoke #1 (80 % MAX_TOKENS rate uncapped) | 8× reduction per MAX_TOKENS failure |
| 5 | Family-parallel calibration (concurrency 10) | Workflow parallel-for | 5 h → 40 min wallclock, same compute cost |
| 6 | EU regional pinning | EU residency hard line | +$10 vs us-central1, residency-compliant |
| 7 | BQ load-job + MERGE for ingest | Per smoke #11 schema-drift work | Free up to 1,500 load jobs / day |
| 8 | n2-highmem-16 CPU not GPU | NumPyro / JAX is CPU-saturating | −60 % vs T4 GPU at this batch size |

Three of these (Levers 4, 5, 7) were sharpened by shakedown evidence. The maxOutputTokens cap was set after smoke #1 hit an 80 % MAX_TOKENS rate from runaway thinking on a smoke scenario. The family-parallel concurrency_limit: 10 was tuned by Probe 5's wallclock measurement. The load-job + MERGE pattern replaced an earlier Storage Write API plan after smoke #11 surfaced schema-drift on the staging table.

The other five (Levers 1, 2, 3, 6, 8) shipped from the cost-engine memo as designed and required no in-flight tuning.

What the levers do, plainly

Vertex Batch over online inference is the largest single cost lever — fifty percent off the list price across the entire campaign, with no preset RPM or TPM quota.

Lever 1 — Batch over online. Vertex AI Batch Prediction takes fifty percent off every billable token (input, cached input, output) versus the online endpoint. The trade-off is latency: batch jobs run on a long-lived dedicated SKU and complete in roughly 25 minutes rather than seconds; the corpus generation does not need real-time responses. At our duty cycle, this saves approximately $5,440 across ten Phase-2 iterations.

Lever 2 — Implicit cache. Cached input is billed at ten percent of the regular input rate. The implicit cache hits on the stable prefix of a request — the persona description, the scenario context, the response schema — that is identical across many calls. We tuned the prompt structure so that all stable content sits in the first ≥ 4,096 tokens of the request, which is the cache-eligibility threshold; observed hit rate across shakedown smokes was 60–80 %. At the Scenario B base case this saves approximately $3,200 versus an uncached run.
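How Levers 1 and 2 stack on input tokens can be written as a single multiplier. This is a sketch under one stated assumption: the hit rate is treated as the fraction of input tokens billed at the cached rate. The function name is hypothetical; the 50 % and 10 % rates are the ones quoted above.

```python
# Sketch of the combined batch + implicit-cache effect on input-token cost.
# Assumption: cache_hit_fraction is the share of input tokens billed at the cached rate.
BATCH_DISCOUNT = 0.50      # Lever 1: batch bills at 50 % of the online price
CACHED_INPUT_RATE = 0.10   # Lever 2: cached input at 10 % of the regular input rate


def input_cost_multiplier(cache_hit_fraction, batch=True):
    """Fraction of the uncached online input price that is actually paid."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cache_factor = (1.0 - cache_hit_fraction) + cache_hit_fraction * CACHED_INPUT_RATE
    return cache_factor * (BATCH_DISCOUNT if batch else 1.0)


for hit in (0.6, 0.7, 0.8):  # observed range across shakedown smokes
    print(f"hit={hit:.0%} multiplier={input_cost_multiplier(hit):.3f}")
```

The multiplier covers input tokens only; output and thinking tokens receive the batch discount but gain nothing from the cache.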

Lever 3 — Reduced thinking budget. Gemini bills thinking tokens as output tokens, so thinking is expensive. The cost-engine memo specifies a tuned-down thinking budget for offline corpus generation (T1, T2/T3 values are gated to the reports tier). The trade-off is per-call output quality; we A/B'd this in Phase-1 and the calibrator-substrate quality stayed within tolerance. Saves approximately $2,000 versus full production thinking. Per-cohort overridable: a calibration cohort that needs full thinking can request it explicitly via the cohort registry.

Lever 4 — maxOutputTokens hard cap. Smoke #1 produced an 80 % MAX_TOKENS rate from runaway thinking on a small persona × scenario combination. The fix is a hard ceiling on output tokens (8,192) that prevents a single bad combination from blowing eight times its expected cost. This is an asymmetric lever: under normal operating conditions, the cap is never hit; under pathological conditions, it caps the damage at 8× the expected cost rather than letting a runaway eat the iteration budget.

Lever 5 — Family-parallel calibration. The hierarchical Bayesian GRM trainer fits ten construct families. Running them sequentially would take roughly 5 hours per iteration on n2-highmem-16; the workflow's concurrency_limit: 10 runs all ten in parallel as separate Vertex Custom Training jobs, completing in roughly 40 minutes. Same total compute, dramatically faster wallclock. The constraint is the project's quota for concurrent Custom Training jobs; we have headroom.
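The fan-out pattern can be illustrated locally with a bounded thread pool. This is a stand-in sketch, not the workflow itself: fit_family and the family names are hypothetical, and the sleep replaces submitting a Vertex Custom Training job and waiting on it.

```python
# Local sketch of bounded fan-out. fit_family and the family names are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY_LIMIT = 10
FAMILIES = [f"family_{i:02d}" for i in range(10)]


def fit_family(name, seconds=0.05):
    """Stand-in for launching one training job and blocking until it finishes."""
    time.sleep(seconds)
    return name, "ok"


def run_parallel(families=FAMILIES, limit=CONCURRENCY_LIMIT):
    """Run every family fit with at most `limit` in flight; return results and wallclock."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = dict(pool.map(fit_family, families))
    return results, time.perf_counter() - start


results, elapsed = run_parallel()
print(len(results), "families fitted in", round(elapsed, 2), "s")
```

With the limit equal to the number of families, wallclock tracks the slowest single fit plus overhead, while total compute is unchanged.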

Lever 6 — EU regional pinning. Everything except the /global/-only publisher model runs in EU regions (europe-west4 for foundation services, europe-west1 for the legacy CFs that haven't yet flipped, EU multi-region for BigQuery). The premium over us-central1 is approximately $10 per Phase-2 iteration. EU data residency is a hard line for the consumer-facing Echo product; we run the calibration pipeline in the same EU posture for operational simplicity, even though the synthetic corpus contains no PII. The architecture page carries the residency rationale.

Lever 7 — BQ load-job + MERGE for ingest. Batch load jobs carry no charge, and the BigQuery quota is 1,500 load jobs per table per day. Our duty cycle is on the order of 12 Cloud Run Jobs per iteration × 10 iterations = 120 load jobs per Phase-2 campaign, so a single iteration sits about two orders of magnitude under the daily limit. The pre-shakedown plan called for Storage Write API; load-job + MERGE turned out strictly cheaper, simpler, and equivalent on the dedupe semantics. The architecture page carries the trade-off.
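The MERGE half of the pattern can be sketched as a small statement builder. The table and column names in the example are hypothetical, and the upsert semantics shown (update on key match, insert otherwise) are an assumption about the dedupe behaviour rather than the pipeline's actual statement.

```python
# Sketch of a BigQuery MERGE builder for load-then-merge ingest.
# Table and column names are hypothetical; the dedupe semantics are assumed.
def build_merge_sql(target, staging, key_cols, value_cols):
    """Build a MERGE that upserts staging rows into target on the key columns."""
    if not key_cols:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_cols)
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    lines = [
        f"MERGE `{target}` AS T",
        f"USING `{staging}` AS S",
        f"ON {on_clause}",
    ]
    if value_cols:
        updates = ", ".join(f"{c} = S.{c}" for c in value_cols)
        lines.append(f"WHEN MATCHED THEN UPDATE SET {updates}")
    lines.append(
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
    return "\n".join(lines)


sql = build_merge_sql(
    "proj.ds.sessions", "proj.ds.sessions_staging", ["session_id"], ["payload"]
)
print(sql)
```

The staging table needs at most one row per key, because BigQuery rejects a MERGE in which several source rows match the same target row.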

Lever 8 — n2-highmem-16 CPU. At our problem size, NumPyro/JAX is CPU-saturating; a T4 GPU would idle most of the trace cycles waiting for the host to feed it parameters. CPU is approximately 60 % cheaper at this scale with comparable wallclock. The most counterintuitive choice on the diagram, and the one most likely to draw "but Bayesian inference at scale runs on GPU." The answer is "at some scale," and we are below it.


The ninth lever: the local QA harness

scripts/local_test.sh runs eight static checks in under 60 seconds on a developer laptop for $0. The checks cover every bug class hit during Phase-1 shakedown:

| Check | Catches | Smokes / probes covered |
| --- | --- | --- |
| Container transitive-import | Missing pip packages reachable from worker / trainer | #5, #14, #18, Probes 1 & 3 |
| BQ row-shape | json.dumps() into JSON-typed columns; STRING-vs-FLOAT64 drift | #11, #16, #17 |
| Attribute-existence safety | Dataclass attribute access on non-existent fields | Probe 2 |
| Workflow safety | Env var coverage, expression syntax, COPY paths, LRO timeouts | #5, #10, #12 |
| Silent-except scanner | except: pass patterns swallowing real errors | (defensive) |
| Workflow YAML deploy | gcloud workflows deploy --validate-only for three workflow files | (defensive) |
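As an illustration of how cheap such a static check can be, a silent-except scanner fits in a dozen lines of standard-library Python. This is a sketch of the idea, not the harness's actual implementation; the function name and the sample source are hypothetical.

```python
# Sketch of a silent-except scanner built on the standard-library ast module.
# Not the harness's implementation; the function name and sample are hypothetical.
import ast


def find_silent_excepts(source, filename="<string>"):
    """Return line numbers of except handlers whose body is only `pass`."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''
try:
    load()
except:
    pass
try:
    load()
except ValueError as exc:
    log(exc)
'''

print(find_silent_excepts(SAMPLE))
```

Because the source is parsed rather than executed, the scan has no side effects and needs none of the scanned module's dependencies installed.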

The harness is the Phase-2 readiness gate: smoke does not fire unless the harness exits 0. Two consequences follow:

Counterfactual cost. Without the harness, Phase-2 would re-discover bugs from this list with live Vertex Batch retries. We estimate $50–$150 per iteration in retry spend that the static checks now prevent, or $500–$1,500 at ten iterations. That is roughly 10 % to 30 % of the ~$4,930 Phase-2 base case (ten iterations at $493), recovered as a cost lever.

Bug-class reachability. Each test file is bound to a specific shakedown smoke (annotated in source). Adding a new bug class is a tight loop: write a failing test against the production module, fix the production module, watch the test pass, watch the harness exit code drop back to 0.

The harness costs zero dollars to operate. It is the durable engineering output of the shakedown, in the same way pytest-benchmark is the durable output of a perf-tuning sprint. It will protect Phase-2 spend even if no other artifact from Phase-1 is reused.


Phase-1 actuals — what every dollar bought (calibration pipeline)

Phase-1 spent approximately $140 of the $500 Start Tier budget. The wire-test itself was forecast at $22 and came in at about $7; the remaining ~$133 was bug-discovery spend across 18 workflow smokes and 5 trainer probes.

| Spend bucket | Forecast | Actual |
| --- | --- | --- |
| Open-data download (network egress to laptop) | $0 | $0 |
| Σ assembly + Higham PD projection (local CPU) | $0 | $0 |
| Persona library v2 forge (N=2,000, copula sampler, no LLM) | $0 | $0 |
| K = 200 Lewis-Linzer sensitivity (local CPU, ~30 min) | $0 | $0 |
| Phase-1 wire-test smoke | $22 | ~$3 |
| Custom Training stub / Probe 5 (n2-highmem-16, 9.0 s) | $0.10 | ~$2 |
| GCS + BQ + Workflows + Cloud Run | <$5 (free tier) | <$2 |
| Wire-test subtotal | ~$22 | ~$7 |
| Shakedown debugging (18 smokes + 5 probes) | not forecast | ~$133 |
| Total Phase-1 spend | ~$22 | ~$140 |

The shakedown debugging line is non-recurring. Every bug class that surfaced has either a fix in the production code path or a static check in the eight-check local QA harness, so Phase-2 should not re-pay this ~$133. The harness is the binding mechanism.

The bug-class breakdown of the ~$133 debugging spend, sourced from docs/shakedown_ledger.md:31-55:

| Bug class | Count of bugs | Cost |
| --- | --- | --- |
| Local toolchain | 1 | $0 |
| GCP IAM bindings (4 bugs across smokes #2–#4) | 4 | $0 |
| Container build / requirements (5 bugs across #5–#7, #14, Probe 3) | 5 | ~$0 (build-time, no LLM) |
| Vertex AI quirks (/global/-only, cross-location reject, metadata field reject, Cloud Run auto-retry) | 4 | ~$50 |
| Cloud Workflows YAML (timeout, off-by-one, expression colon, body wrapper drift) | 4 | ~$3 |
| BigQuery schemas (5 bugs across #11, #15–#17 + namespace) | 5 | ~$70 |
| Trainer logic (3 bugs across Probes 1, 2, 4) | 3 | ~$10 |
| LLM output quality (1 bug, smoke #1: 80 % MAX_TOKENS rate) | 1 | ~$25 |
| Probe 5 (converged trainer run) | n/a | ~$2 |
| Other (workflow expression debugging) | n/a | ~$5 |

The largest bug-class cost was BigQuery schemas at ~$70 — the iteration of STRING-vs-FLOAT64 autodetect drift, JSON dict-vs-string mismatch, JSON_EXTRACT ambiguity on JSON-typed columns, LIKE on JSON column, and namespace collision risk. All five sub-classes are now caught by test_bq_row_shape.py in 60 seconds for $0.

The second-largest was Vertex AI quirks at ~$50 — the discovery that gemini-3-flash-preview is /global/-only, that Vertex Batch enforces same-location for job + model, that the metadata field is rejected by the validator, and that Cloud Run Job auto-retries can spawn duplicate batches. All four are now baked into the worker config and workflow YAML.

The buffer at the Phase-2 ask: ~$360 of the $500 Start Tier remaining, and ~$1,860 of the $2,000 Start Tier window after accounting for the $140 spent.


Phase-2 forecast — re-derived post-shakedown

The pre-shakedown forecast for Phase-2 (Scenario B at $5,560 per iteration) was derived from a per-turn LLM-call model that assumed each conversational turn re-sent the persona prefix and incurred a fresh round of input tokens. The actual data path landed differently: a single Vertex Batch row carries one full Echo session — twelve conversational turns inside one structured-output response. The input prefix amortises once across all twelve turns instead of being re-sent per turn. The seventy-percent implicit-cache hit on the stable prefix further reduces input-token cost.

Re-derived Phase-2 forecast: $406–$580 per iteration; $4,060–$5,800 across ten iterations. Roughly 10× more favourable than the per-turn model assumed.

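| Scenario | n personas | sessions / persona | turns / session | thinking | cache | iters | Total USD | vs cap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A: Phase-1 wire-test | 10 | 5 | 12 | tuned-down | 70 % | 1 | $22 forecast / ~$140 actual incl. shakedown | 28 % of $500 |
| B: Phase-2 base case | 2,000 | 50 | 12 | tuned-down | 70 % | 10 | $4,060–$5,800 | 51–73 % of $8,000 |
| C: Pessimistic | 2,000 | 50 | 12 | full prod | 30 % | 15 | ~$8,000 | 32 % of $25,000 |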

The Scenario B re-derivation is the binding update. The new range encompasses uncertainty in the actual implicit-cache hit rate (60–80 % observed across shakedown smokes) and in the per-iteration calibration trainer cost (which depends on chain count and target_accept tuning). Inference still dominates ≥ 85 % of every scenario; the binding cost line is output tokens, including thinking tokens.

Scenario C — full production thinking, 30 % cache, 15 iterations, with retry overhead — lands at approximately $8,000, comfortably under the $25,000 AI Tier cap with $17,000 of buffer.
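The headline percentages follow directly from the per-iteration range and the two caps, as this short check shows. Variable and function names are illustrative; every figure is taken from the forecast above.

```python
# Sketch reproducing the Phase-2 headline figures. Names are illustrative.
PER_ITERATION_USD = (406, 580)   # re-derived Scenario B range
ITERATIONS = 10
EXPECTED_CEILING_USD = 8_000
AI_TIER_CAP_USD = 25_000
SCENARIO_C_USD = 8_000


def campaign_range(per_iter=PER_ITERATION_USD, iterations=ITERATIONS):
    """Return the (low, high) campaign total in USD."""
    low, high = per_iter
    return low * iterations, high * iterations


def share(amount, cap):
    """Return amount as a percentage of cap."""
    return 100.0 * amount / cap


low, high = campaign_range()
buffer_c = AI_TIER_CAP_USD - SCENARIO_C_USD
print(f"Scenario B: ${low:,} to ${high:,}")
print(f"  of ceiling: {share(low, EXPECTED_CEILING_USD):.1f} % to "
      f"{share(high, EXPECTED_CEILING_USD):.1f} %")
print(f"  of AI Tier cap: {share(low, AI_TIER_CAP_USD):.1f} % to "
      f"{share(high, AI_TIER_CAP_USD):.1f} %")
print(f"Scenario C: {share(SCENARIO_C_USD, AI_TIER_CAP_USD):.0f} % of cap, "
      f"${buffer_c:,} buffer")
```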


Sensitivity tornado at Scenario B (calibration pipeline)

Per-iteration delta vs the Scenario B base case ($493). Mode batch → online and Provisioned Throughput are the dominant levers; thinking budget is the second-largest controllable lever; everything else is sub-$200.

The most plausible large swing is mode batch → online, which doubles the per-iteration cost. If Vertex Batch availability ever degrades for the locked publisher model, the fallback to online roughly doubles inference, taking Scenario B from approximately $5,000 to approximately $10,000 across the campaign. That is still under the $25,000 cap, but it consumes the buffer.

The largest bar in raw dollars is Provisioned Throughput at +$3,000 per iteration. Provisioned Throughput is approximately seven times more expensive than batch at our duty cycle and remains the wrong tool for bursty calibration runs.

The third is thinking budget. A 50 % increase from the tuned-down setting adds $190 per iteration; a 50 % decrease saves $102. The asymmetry comes from output-token billing: thinking tokens are output, so adding thinking adds output cost faster than removing it saves. Per-cohort overridable: a calibration cohort that needs full thinking can request it explicitly.

The remaining levers are all sub-$200 per iteration: cache hit rate ±10 pp, residency violation (which we reject), cache TTL drift from one mid-iteration prefix change.
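The ordering of the tornado can be reproduced by sorting the quoted per-iteration deltas by magnitude. This sketch makes one assumption: the batch → online delta is taken as +$493 because the text says the switch doubles the $493 base case. The sub-$200 levers are omitted because no individual figures are given for them.

```python
# Sketch ranking the quoted per-iteration deltas by absolute size.
# Assumption: batch -> online is +$493, derived from "doubles the per-iteration cost".
BASE_CASE_USD = 493
DELTAS_USD = {
    "mode batch -> online": 493,
    "Provisioned Throughput": 3000,
    "thinking budget +50%": 190,
    "thinking budget -50%": -102,
}


def tornado_order(deltas):
    """Return (lever, delta) pairs sorted from largest to smallest magnitude."""
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


order = tornado_order(DELTAS_USD)
for name, delta in order:
    print(f"{name:<24} {delta:+6d}  total ${BASE_CASE_USD + delta:,}/iteration")
```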


What every label tells us

Every Vertex AI call, Cloud Run Job, and Custom Job is labelled with iteration={N}, component={corpus_gen | calibration | scoring | forge}, tier={start | ai}, cohort={phase1_smoke | full_corpus | test_retest | ablation}. These labels flow into Cloud Billing detailed export to BigQuery; cross-validation joins cost_ledger.cost_total_usd against the official invoice grouped by these labels, producing a per-iteration audit trail.
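A small builder keeps the four labels consistent across call sites. This is a sketch: the function name is hypothetical, the allowed values are the ones listed above, and the character check reflects Google Cloud's general label rules (lowercase letters, digits, underscores, and dashes, at most 63 characters).

```python
# Sketch of a label builder. The function name is hypothetical.
import re

ALLOWED = {
    "component": {"corpus_gen", "calibration", "scoring", "forge"},
    "tier": {"start", "ai"},
    "cohort": {"phase1_smoke", "full_corpus", "test_retest", "ablation"},
}
# Google Cloud label values: lowercase letters, digits, "_" and "-", max 63 chars.
_LABEL_VALUE = re.compile(r"^[a-z0-9_-]{0,63}$")


def build_labels(iteration, component, tier, cohort):
    """Return the four cost-attribution labels, rejecting unknown values."""
    if not isinstance(iteration, int) or iteration < 0:
        raise ValueError("iteration must be a non-negative integer")
    labels = {
        "iteration": str(iteration),
        "component": component,
        "tier": tier,
        "cohort": cohort,
    }
    for key, allowed in ALLOWED.items():
        if labels[key] not in allowed:
            raise ValueError(f"unknown {key}: {labels[key]!r}")
    for key, value in labels.items():
        if not _LABEL_VALUE.match(value):
            raise ValueError(f"invalid label value for {key}: {value!r}")
    return labels


print(build_labels(3, "corpus_gen", "ai", "full_corpus"))
```

Rejecting unknown values at build time keeps typos out of the billing export, where a misspelled label would silently split one iteration's spend across two groups.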

The Cloud Billing export half of this is gated on roles/billing.user (deferred to operator); the cost_ledger half is live and was exercised across all 18 shakedown smokes. Operations carries the full observability surface; this page closes with the fiscal slice.

The receipts tier — gated at /about/reports — carries the full per-call rate table snapshot, the per-iteration ledger schema, the K=200 Lewis-Linzer sensitivity over Σ uncertainty, and the per-bug spend reconciliation.


Bottom line. The platform run-rate after Foundation v1 is ~€100–150/mo at pre-launch scale, with AlloyDB pause-by-default saving ~€350/mo vs always-on. Phase-1 calibration spent $140 / $500. Phase-2 forecasts $4,060–$5,800 across ten iterations, 51–73 % of the $8,000 expected ceiling. Pessimistic Scenario C remains under the $25,000 AI Tier cap with $17,000 of buffer. The local QA harness is worth $500–$1,500 across the Phase-2 window as a retry-prevention cost lever, while costing zero dollars to operate. Eight engineering levers — three sharpened by shakedown evidence, all eight surviving shakedown unchanged — control the calibration envelope; the foundation cost-control plane controls the platform envelope; neither requires further architectural change before Phase-2.