Cost engineering

Two cost stories ride on this infrastructure. The platform is what Foundation v1 stands up: a Workspace org with three live workload projects, AlloyDB pause-by-default, per-domain Cloud Run services on min-instances=0, BigQuery physical-billing tables, KMS keys, and a sealed audit project. At current scale this lands at roughly €100–150/mo combined burn — dominated by AlloyDB awake-hours and a small constant of always-on services like the audit alerter. The calibration pipeline is the original tenant; it spent ~$140 of its $500 Phase-1 budget cap across the shakedown and forecasts $4,060–$5,800 across the ten Phase-2 iterations, comfortably under the $25,000 hard cap.

This page covers both. The platform side is short — the cost-control plane (F-3.13) does the work and the discipline lives in operations. The calibration-pipeline side is the eight engineering levers, the verified Phase-1 actuals, the re-derived Phase-2 forecast, and the sensitivity tornado at the base case, plus the local QA harness as the unconventional ninth lever.

Part 1 — The platform run-rate after Foundation v1

The pre-foundation worry was that AlloyDB at full sizing would burn ~€450/mo continuously to serve a workload that needs the cluster awake only a fraction of each day. The Foundation v1 cost-control posture (F-3.13 / OD-20) commits to pause-by-default with operator pinning. Concrete numbers:

Wake driver	Hours awake/month
Active dev sessions (4 hr × 5 days × 4 weeks)	80
Scheduled jobs (~30 min/day × 3 jobs)	45
User traffic outside dev + job windows	~10
Smoke tests + reviewer-click buffer	20
Total awake hours/month per cluster	~155
× ~€0.31/hr per 2-vCPU cluster	~€48/mo per cluster

That lands AlloyDB at ~€48/mo. Storage adds <€2/mo. Saves ~~€350/mo (~~€4,200/year) vs always-on at full sizing.

The other platform-level lines:

Component	Run-rate	Notes
AlloyDB (prod, paused-by-default)	~€48/mo	~155 awake-hours/mo at ~€0.31/hr
AlloyDB storage + backups	<€2/mo	Continuous backups 14d
Per-domain Cloud Run services	<€5/mo combined	All `min-instances=0`; cold-start fits each service's load profile
BigQuery storage (7 datasets, physical billing)	<€2/mo	Calibration corpus is the largest dataset; long-term-storage tier kicks in after 90d
BigQuery compute	$0–$5/mo	dryRun + `maximumBytesBilled` ceilings prevent runaway
KMS key operations	<€2/mo	One CMEK keyring; auto-rotation every 90d
Cloud Workflows	<$1/mo	Per-step billing with 5,000 free/month
Cloud Logging (org-aggregated to audit-logs)	~€10–20/mo	7-year BigQuery retention; volume bounded by curated event filter
Datastream	~€5–10/mo	One stream live; idle most of the time
GCS (corpus + audit archive)	~€2/mo	Object-locked archive grows slowly
Apphosting (Next.js frontend)	~€10–15/mo	Standard tier
Translate API	<€5/mo	Cached; rare-use
Cloud Build	<€2/mo	Free tier covers our build volume
Total platform run-rate	~€100–150/mo	Sustained state at current scale

These are sustained-state numbers. Phase-2 calibration runs (when they fire) are an additive line item handled in Part 2. As sustained traffic grows, AlloyDB pause-by-default automatically becomes effectively always-on (the heartbeat fires continuously); a single Firestore document update can pin prod always-awake explicitly.

Per-project budget alerts

Every workload project has a Cloud Billing budget configured via Terraform (infra/projects/main.tf) with monthly thresholds at 50 / 75 / 90 / 100 %. The 90 %+ alerts publish to a Pub/Sub topic that fans out to the operator. The same roles/billing.user deferral that blocks the calibration-pipeline auto-pause drill (Part 2 §below) blocks fully end-to-end testing of these alerts; until that grant lands, the budgets are configured and monitored manually via the Cloud Console.

The €350/mo savings from pause-by-default isn't a coupon — it's the binary difference between paying for an always-on database around the clock and paying only for the hours the workload actually needs it awake. That's the point.

Part 2 — The calibration pipeline (Phase-1 actuals, Phase-2 forecast)

The calibration pipeline runs against explicit per-phase budget caps. Phase-1 — the wire-test smoke plus the shakedown that hardened it — spent approximately $140 of its $500 cap, ~28 %. Phase-2 — the full corpus generation plus ten calibration iterations — forecasts at $4,060–$5,800 across the campaign, comfortably under the $8,000 expected ceiling and 16–23 % of the $25,000 hard cap.

The eight engineering levers that produce that envelope, the verified Phase-1 actuals, the re-derived Phase-2 forecast, the sensitivity tornado at the base case, and the local QA harness as the unconventional ninth lever — all sourced from the cost-engine memo and from docs/shakedown_ledger.md (canonical Phase-1 end-state); every number agrees with cost_ledger rows visible to anyone with project access.

The eight cost levers (calibration pipeline)

Eight engineering choices control the campaign envelope. All eight survived shakedown unchanged; three were sharpened by shakedown evidence.

#	Lever	Decision	Effect on Phase-2 base case
1	Vertex Batch vs online	Batch	−$5,440 vs online for ten-iteration Phase-2
2	Implicit cache, 70 % hit on stable prefix ≥ 4,096 tokens FIRST	Tuned prompt structure	−$3,200 vs uncached
3	Reduced thinking budget (tuned down vs production)	Plan §6.1; per-cohort overridable	−$2,000 vs production thinking
4	`maxOutputTokens` hard cap	Per shakedown smoke #1 (80 % MAX_TOKENS rate uncapped)	8× reduction per MAX_TOKENS failure
5	Family-parallel calibration (concurrency 10)	Workflow `parallel-for`	5 h → 40 min wallclock, same compute cost
6	EU regional pinning	EU residency hard line	+$10 vs `us-central1`, residency-compliant
7	BQ load-job + MERGE for ingest	Per smoke #11 schema-drift work	Free up to 1,500 load jobs / day
8	`n2-highmem-16` CPU not GPU	NumPyro / JAX is CPU-saturating	−60 % vs T4 GPU at this batch size

Three of these (Levers 4, 5, 7) were sharpened by shakedown evidence. The maxOutputTokens cap was set after smoke #1 hit an 80 % MAX_TOKENS rate from runaway thinking on a smoke scenario. The family-parallel concurrency_limit: 10 was tuned by Probe 5's wallclock measurement. The load-job + MERGE pattern replaced an earlier Storage Write API plan after smoke #11 surfaced schema-drift on the staging table.

The other five (Levers 1, 2, 3, 6, 8) shipped from the cost-engine memo as designed and required no in-flight tuning.

What the levers do, plainly

Vertex Batch over online inference is the largest single cost lever — fifty percent off the list price across the entire campaign, with no preset RPM or TPM quota.

Lever 1 — Batch over online. Vertex AI Batch Prediction takes fifty percent off every billable token (input, cached input, output) versus the online endpoint. The trade-off is latency: batch jobs run on a long-lived dedicated SKU and complete in roughly 25 minutes rather than seconds; the corpus generation does not need real-time responses. At our duty cycle, this saves approximately $5,440 across ten Phase-2 iterations.

Lever 2 — Implicit cache. Cached input is billed at ten percent of the regular input rate. The implicit cache hits on the stable prefix of a request — the persona description, the scenario context, the response schema — that is identical across many calls. We tuned the prompt structure so that all stable content sits in the first ≥ 4,096 tokens of the request, which is the cache-eligibility threshold; observed hit rate across shakedown smokes was 60–80 %. At the Scenario B base case this saves approximately $3,200 versus an uncached run.

Lever 3 — Reduced thinking budget. Gemini bills thinking tokens as output tokens, so thinking is expensive. The cost-engine memo specifies a tuned-down thinking budget for offline corpus generation (T1, T2/T3 values are gated to the reports tier, available on request). The trade-off is per-call output quality; we A/B'd this in Phase-1 and the calibrator-substrate quality stayed within tolerance. Saves approximately $2,000 versus full production thinking. Per-cohort overridable: a calibration cohort that needs full thinking can request it explicitly via the cohort registry.

Lever 4 — maxOutputTokens hard cap. Smoke #1 produced an 80 % MAX_TOKENS rate from runaway thinking on a small persona × scenario combination. The fix is a hard ceiling on output tokens (8,192) that prevents a single bad combination from blowing eight times its expected cost. This is an asymmetric lever: under normal operating conditions, the cap is never hit; under pathological conditions, it caps the damage at 8× the expected cost rather than letting a runaway eat the iteration budget.

Lever 5 — Family-parallel calibration. The hierarchical Bayesian GRM trainer fits ten construct families. Running them sequentially would take roughly 5 hours per iteration on n2-highmem-16; the workflow's concurrency_limit: 10 runs all ten in parallel as separate Vertex Custom Training jobs, completing in roughly 40 minutes. Same total compute, dramatically faster wallclock. The constraint is the project's quota for concurrent Custom Training jobs; we have headroom.

Lever 6 — EU regional pinning. Everything except the /global/-only publisher model runs in EU regions (eu-west4 for foundation services, eu-west1 for the legacy CFs that haven't yet flipped, EU multi-region for BigQuery). The premium over us-central1 is approximately $10 per Phase-2 iteration. EU data residency is a hard line for the consumer-facing Echo product; we run the calibration pipeline in the same EU posture for operational simplicity, even though the synthetic corpus contains no PII. The architecture page carries the residency rationale.

Lever 7 — BQ load-job + MERGE for ingest. Free up to 1,500 load jobs per day per project. Our duty cycle is on the order of 12 Cloud Run Jobs per iteration × 10 iterations = 120 load jobs per Phase-2 campaign, three orders of magnitude under the limit. The pre-shakedown plan called for Storage Write API; load-job + MERGE turned out strictly cheaper, simpler, and equivalent on the dedupe semantics. The architecture page carries the trade-off.

Lever 8 — n2-highmem-16 CPU. At our problem size, NumPyro/JAX is CPU-saturating; a T4 GPU would idle most of the trace cycles waiting for the host to feed it parameters. CPU is approximately 60 % cheaper at this scale with comparable wallclock. The most counterintuitive choice on the diagram, and the one most likely to draw "but Bayesian inference at scale runs on GPU." The answer is "at some scale," and we are below it.

The ninth lever: the local QA harness

scripts/local_test.sh runs eight static checks in under 60 seconds on a developer laptop for $0. The checks cover every bug class hit during Phase-1 shakedown:

Check	Catches	Smokes / probes covered
Container transitive-import	Missing pip packages reachable from worker / trainer	#5, #14, #18, Probes 1 & 3
BQ row-shape	`json.dumps()` into JSON-typed columns; `STRING`-vs-`FLOAT64` drift	#11, #16, #17
Attribute-existence safety	Dataclass attribute access on non-existent fields	Probe 2
Workflow safety	Env var coverage, expression syntax, COPY paths, LRO timeouts	#5, #10, #12
Silent-except scanner	`except: pass` patterns swallowing real errors	(defensive)
Workflow YAML deploy	`gcloud workflows deploy --validate-only` for three workflow files	(defensive)

The harness is the Phase-2 readiness gate: smoke does not fire unless the harness exits 0. Two consequences follow:

Counterfactual cost. Without the harness, Phase-2 would re-discover bugs from this list with live Vertex Batch retries. We estimate $50–$150 per iteration in retry spend the static checks now prevent. At ten iterations: $500–$1,500. That is between 9 % and 30 % of the entire Phase-2 base-case budget, recovered as a cost lever.

Bug-class reachability. Each test file is bound to a specific shakedown smoke (annotated in source). Adding a new bug class is a tight loop: write a failing test against the production module, fix the production module, watch the test pass, watch the harness exit code drop back to 0.

The harness costs zero dollars to operate. It is the durable engineering output of the shakedown, in the same way pytest-benchmark is the durable output of a perf-tuning sprint. It will protect Phase-2 spend even if no other artifact from Phase-1 is reused.

Phase-1 actuals — what every dollar bought (calibration pipeline)

Phase-1 spent approximately $140 of its $500 budget cap. The wire-test corpus itself was $22 as forecast; the remaining $118 was bug-discovery spend across 18 workflow smokes and 5 trainer probes.

Spend bucket	Forecast	Actual
Open-data download (network egress to laptop)	$0	$0
Σ assembly + Higham PD projection (local CPU)	$0	$0
Persona library v2 forge (N=2,000, copula sampler, no LLM)	$0	$0
K = 200 Lewis-Linzer sensitivity (local CPU, ~30 min)	$0	$0
Phase-1 wire-test smoke	$22	~$3
Custom Training stub / Probe 5 (`n2-highmem-16`, 9.0 s)	$0.10	~$2
GCS + BQ + Workflows + Cloud Run	<$5 (free tier)	<$2
Wire-test subtotal	~$22	~$7
Shakedown debugging (18 smokes + 5 probes)	not forecast	~$133
Total Phase-1 spend	~$22	~$140

The shakedown debugging line is non-recurring. Every bug class that surfaced has either a fix in the production code path or a static check in the eight-check local QA harness, so Phase-2 should not re-pay this $140. The harness is the binding mechanism.

The bug-class breakdown of the ~$133 debugging spend, sourced from docs/shakedown_ledger.md:31-55:

Bug class	Count of bugs	Cost
Local toolchain	1	$0
GCP IAM bindings (4 bugs across smokes #2–#4)	4	$0
Container build / requirements (5 bugs across #5–#7, #14, Probe 3)	5	~$0 (build-time, no LLM)
Vertex AI quirks (`/global/`-only, cross-location reject, metadata field reject, Cloud Run auto-retry)	4	~$50
Cloud Workflows YAML (timeout, off-by-one, expression colon, body wrapper drift)	4	~$3
BigQuery schemas (5 bugs across #11, #15–#17 + namespace)	5	~$70
Trainer logic (3 bugs across Probes 1, 2, 4)	3	~$10
LLM output quality (1 bug, smoke #1: 80 % MAX_TOKENS rate)	1	~$25
Probe 5 (converged trainer run)	n/a	~$2
Other (workflow expression debugging)	n/a	~$5

The largest bug-class cost was BigQuery schemas at ~$70 — the iteration of STRING-vs-FLOAT64 autodetect drift, JSON dict-vs-string mismatch, JSON_EXTRACT ambiguity on JSON-typed columns, LIKE on JSON column, and namespace collision risk. All five sub-classes are now caught by test_bq_row_shape.py in 60 seconds for $0.

The second-largest was Vertex AI quirks at ~$50 — the discovery that gemini-3-flash-preview is /global/-only, that Vertex Batch enforces same-location for job + model, that the metadata field is rejected by the validator, and that Cloud Run Job auto-retries can spawn duplicate batches. All four are now baked into the worker config and workflow YAML.

The buffer at the Phase-2 ask: ~$360 of the $500 Phase-1 envelope remaining after accounting for the $140 spent.

Phase-2 forecast — re-derived post-shakedown

The pre-shakedown forecast for Phase-2 (Scenario B at $5,560 per iteration) was derived from a per-turn LLM-call model that assumed each conversational turn re-sent the persona prefix and incurred a fresh round of input tokens. The actual data path landed differently: a single Vertex Batch row carries one full Echo session — twelve conversational turns inside one structured-output response. The input prefix amortises once across all twelve turns instead of being re-sent per turn. The seventy-percent implicit-cache hit on the stable prefix further reduces input-token cost.

Re-derived Phase-2 forecast: $406–$580 per iteration; $4,060–$5,800 across ten iterations. Roughly 10× more favourable than the per-turn model assumed.

Scenario	n personas	sessions / persona	turns / session	thinking	cache	iters	Total USD	vs cap
A — Phase-1 wire-test	10	5	12	tuned-down	70 %	1	$22 forecast / ~$140 actual incl. shakedown	28 % of $500
B — Phase-2 base case	2,000	50	12	tuned-down	70 %	10	$4,060–$5,800	51–73 % of $8,000
C — Pessimistic	2,000	50	12	full prod	30 %	15	~$8,000	32 % of $25,000

The Scenario B re-derivation is the binding update. The new range encompasses uncertainty in the actual implicit-cache hit rate (60–80 % observed across shakedown smokes) and in the per-iteration calibration trainer cost (which depends on chain count and target_accept tuning). Inference still dominates ≥ 85 % of every scenario; the binding cost line is output tokens, including thinking tokens.

Scenario C — full production thinking, 30 % cache, 15 iterations, with retry overhead — lands at approximately $8,000, comfortably under the $25,000 hard cap with $17,000 of buffer.

Sensitivity tornado at Scenario B (calibration pipeline)

Per-iteration delta vs the Scenario B base case ($493). Mode batch → online and Provisioned Throughput are the dominant levers; thinking budget is the second-largest controllable lever; everything else is sub-$200.

The tornado has one obvious top: mode batch → online doubles the per-iteration cost. If Vertex Batch availability ever degrades for the locked publisher model, the fallback to online roughly doubles inference, taking Scenario B from approximately $5,000 to approximately $10,000 across the campaign. Still under the $25,000 cap, but consumes the buffer.

The second-largest is Provisioned Throughput at +$3,000 per iteration. Provisioned Throughput is approximately seven times more expensive than batch at our duty cycle and remains the wrong tool for bursty calibration runs.

The third is thinking budget. A 50 % increase from the tuned-down setting adds $190 per iteration; a 50 % decrease saves $102. The asymmetry comes from output-token billing: thinking tokens are output, so adding thinking adds output cost faster than removing it saves. Per-cohort overridable: a calibration cohort that needs full thinking can request it explicitly.

The remaining levers are all sub-$200 per iteration: cache hit rate ±10 pp, residency violation (which we reject), cache TTL drift from one mid-iteration prefix change.

What every label tells us

Every Vertex AI call, Cloud Run Job, and Custom Job is labelled with iteration={N}, component={corpus_gen | calibration | scoring | forge}, cohort={phase1_smoke | full_corpus | test_retest | ablation}. These labels flow into Cloud Billing detailed export to BigQuery; cross-validation joins cost_ledger.cost_total_usd against the official invoice grouped by these labels, producing a per-iteration audit trail.

The Cloud Billing export half of this is gated on roles/billing.user (deferred to operator); the cost_ledger half is live and was exercised across all 18 shakedown smokes. Operations carries the full observability surface; this page closes with the fiscal slice.

The reviewer-grade receipts tier — gated, available on request — carries the full per-call rate table snapshot, the per-iteration ledger schema, the K=200 Lewis-Linzer sensitivity over Σ uncertainty, and the per-bug spend reconciliation.

Bottom line. The platform run-rate after Foundation v1 is ~€100–150/mo at current scale, with AlloyDB pause-by-default saving ~€350/mo vs always-on. Phase-1 calibration spent $140 / $500. Phase-2 forecasts $4,060–$5,800 across ten iterations, 51–73 % of the $8,000 expected ceiling. Pessimistic Scenario C remains under the $25,000 hard cap with $17,000 of buffer. The local QA harness is worth $500–$1,500 across the Phase-2 window as a retry-prevention cost lever, while costing zero dollars to operate. Eight engineering levers — three sharpened by shakedown evidence, all eight surviving shakedown unchanged — control the calibration envelope; the foundation cost-control plane controls the platform envelope; neither requires further architectural change before Phase-2.