Validation
The 2026-05 Phase-1 validation run is the receipt for what the methodology page commits to. Eighteen workflow smokes plus five trainer probes uncovered twenty-seven bugs across eight categories during the shakedown; every fix landed in either the production code path or a new eight-check local QA harness. The final probe, Probe 5, converged a NumPyro hierarchical NUTS GRM in 9.0 seconds at R-hat = 1.002 and n_eff_min = 1,318, with zero divergences, on real PSL-derived data. Six calibration_metrics rows and three invariance rows were MERGE'd into BigQuery as the canonical artifact.
This page is the public-facing summary. The full reviewer-grade narrative — per-smoke ledger, per-probe diagnostics, the validation battery output — sits behind a passkey at the reports tier. The infrastructure side of the story (what was deployed, what failed, what shipped) sits at /about/cloud-infrastructure.
What we promised, observed in the wild
The methodology page commits to four primary KPIs plus a multivariate Σ-fidelity check plus a targeted test-retest sub-corpus. The Phase-1 numbers, headline form:
- Achieved on every trait reported; the standard psychometric threshold (Nunnally 1978)
- Achieved across all item pairs in the production substrate
- Both within ±2 pp of target; the conformal layer is producing honest uncertainty bands
- Across age × sex × education subgroups in the synthetic corpus; Phase-3 will rerun against real users
- On natural-scale empirical R vs latent-Pearson Σ; tolerance is 0.20
All four KPIs green; Σ-fidelity comfortably inside tolerance. The targeted test-retest sub-corpus runs as a Phase-2 deliverable; Phase-1 produced the methodology + harness for it but did not yet run the paired-session corpus.
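As a rough illustration of what a Σ-fidelity check against a 0.20 tolerance could look like, the sketch below compares two correlation matrices entry by entry. It assumes the discrepancy is the largest absolute off-diagonal difference; this page does not state the actual metric, and the function names and the example matrices are hypothetical placeholders, not Phase-1 results.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]

SIGMA_TOLERANCE = 0.20  # tolerance quoted on this page


def max_abs_offdiag_gap(empirical_r: Matrix, latent_sigma: Matrix) -> float:
    """Largest absolute off-diagonal difference between two square matrices.

    ASSUMPTION: the real Sigma-fidelity metric may be defined differently.
    """
    n = len(empirical_r)
    same_shape = len(latent_sigma) == n and all(
        len(row) == n for row in list(empirical_r) + list(latent_sigma)
    )
    if not same_shape:
        raise ValueError("both matrices must be square and the same size")
    gaps = [
        abs(empirical_r[i][j] - latent_sigma[i][j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return max(gaps, default=0.0)


def sigma_fidelity_ok(empirical_r: Matrix, latent_sigma: Matrix,
                      tolerance: float = SIGMA_TOLERANCE) -> bool:
    """True when the gap is inside the tolerance."""
    return max_abs_offdiag_gap(empirical_r, latent_sigma) <= tolerance


# Illustrative matrices only; these are not values from the validation run.
example_r = [[1.0, 0.30], [0.30, 1.0]]
example_sigma = [[1.0, 0.42], [0.42, 1.0]]
example_gap = max_abs_offdiag_gap(example_r, example_sigma)
example_ok = sigma_fidelity_ok(example_r, example_sigma)
```

A maximum-gap rule is the strictest simple choice, since a single badly recovered pair fails the check; an averaged or norm-based metric would be more forgiving.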
What we promised in the methodology page, observed in the wild: four KPIs green, Σ-fidelity inside tolerance, hierarchical GRM converged on real data at R-hat = 1.002.
The strongest single receipt: Probe 5
Probe 5 was the final shakedown probe before Phase-2 readiness was declared. It ran the full pipeline — synthetic persona corpus → Vertex AI Batch Prediction → BigQuery MERGE → Vertex Custom Training → NumPyro NUTS GRM → BQ MERGE — on ten personas × six openness-family constructs scored from real PSL evaluator output. Not synthetic responses; real PSL output that the system had never seen during training.
The diagnostics:
| Diagnostic | Value | Verdict |
|---|---|---|
| R-hat (max across all parameters) | 1.002 | Convergence — well inside the R-hat < 1.01 threshold |
| n_eff_min | 1,318 | Healthy effective sample size; no parameter is starved |
| Divergent transitions | 0 | Non-centered parameterisation is doing its job |
| Wallclock | 9.0 s | On a single CPU node; full Phase-2 fit targets ~30 minutes per family per iteration |
| calibration_metrics rows MERGE'd | 6 | One per construct |
| invariance rows MERGE'd | 3 | Sex × age × education axes |
What this means in plain English: the Bayesian sampler converged cleanly on real data. The trait parameters it produced are well-conditioned, the uncertainty bands are honest, and the chain is ready to run at production scale. Probe 5 is the small-scale receipt that the pipeline runs end-to-end on data the model had not seen before.
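The diagnostics in the table can be read as a simple pass/fail gate. The sketch below is one hedged way to express that gate: the R-hat threshold of 1.01 comes from the table, while the effective-sample-size floor of 400 and the zero-divergence rule are assumptions made for illustration, as are all class and function names.

```python
from dataclasses import dataclass
from typing import List

RHAT_THRESHOLD = 1.01   # from the diagnostics table (R-hat < 1.01)
N_EFF_FLOOR = 400.0     # ASSUMPTION: this page does not state a floor
MAX_DIVERGENCES = 0     # ASSUMPTION: treat any divergence as a failure


@dataclass(frozen=True)
class SamplerDiagnostics:
    rhat_max: float    # worst R-hat across all parameters
    n_eff_min: float   # smallest effective sample size
    divergences: int   # divergent transitions after warm-up


def convergence_failures(d: SamplerDiagnostics) -> List[str]:
    """Return a human-readable list of failed checks (empty means pass)."""
    failures = []
    if not d.rhat_max < RHAT_THRESHOLD:
        failures.append(f"R-hat {d.rhat_max} is not below {RHAT_THRESHOLD}")
    if d.n_eff_min < N_EFF_FLOOR:
        failures.append(f"n_eff_min {d.n_eff_min} is below {N_EFF_FLOOR}")
    if d.divergences > MAX_DIVERGENCES:
        failures.append(f"{d.divergences} divergent transitions")
    return failures


# Probe 5 values as reported in the table above.
probe5 = SamplerDiagnostics(rhat_max=1.002, n_eff_min=1318, divergences=0)
probe5_failures = convergence_failures(probe5)
```

Returning the list of failures rather than a bare boolean keeps the verdict explainable when a larger fit fails only one of the checks.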
The full posterior diagnostics (HDI95 credible intervals on every parameter), the per-construct fits, and the JSON-roundtrippable grm.json artifact are in the gated R2 sampler validation report.
The shakedown narrative, compressed
The journey took eighteen workflow smokes plus five trainer probes. The bug taxonomy covered eight categories:
| Category | Count | Cost (est.) |
|---|---|---|
| Local toolchain | 1 | $0 |
| GCP IAM bindings | 4 | $0 |
| Container build / requirements | 5 | ~$0 |
| Vertex AI quirks | 4 | ~$50 |
| Cloud Workflows YAML | 4 | ~$3 |
| BigQuery schemas | 5 | ~$70 |
| Trainer logic | 3 | ~$10 |
| LLM output quality | 1 | ~$25 |
Every fix landed either in the production code path or in the eight-check local QA harness that gates Phase-2 readiness. Total Phase-1 spend: ~$140 of the $500 Start Tier budget (~28 %); ~$360 buffer remaining.
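The headline arithmetic can be reproduced in a few lines. The sketch below tallies the per-category bug counts from the table and recomputes the budget share from the stated totals; the variable names are illustrative, and the approximate per-category cost estimates are deliberately left out because the page gives them only as rough figures.

```python
# Bug counts per category, copied from the taxonomy table.
BUG_COUNTS = {
    "Local toolchain": 1,
    "GCP IAM bindings": 4,
    "Container build / requirements": 5,
    "Vertex AI quirks": 4,
    "Cloud Workflows YAML": 4,
    "BigQuery schemas": 5,
    "Trainer logic": 3,
    "LLM output quality": 1,
}

PHASE1_SPEND_USD = 140   # approximate total stated on this page
START_TIER_BUDGET_USD = 500

total_bugs = sum(BUG_COUNTS.values())
category_count = len(BUG_COUNTS)
spend_fraction = PHASE1_SPEND_USD / START_TIER_BUDGET_USD
buffer_usd = START_TIER_BUDGET_USD - PHASE1_SPEND_USD

print(f"{total_bugs} bugs across {category_count} categories")
print(f"spend: {spend_fraction:.0%} of budget, ${buffer_usd} remaining")
```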
The most permanent artifact of the shakedown is the local QA harness (scripts/local_test.sh + six Python tests). At 60 seconds wallclock and zero dollars per run, it catches every bug class hit during shakedown — container transitive imports, BQ column-shape drift, dataclass attribute drift, workflow YAML expression syntax, silent except-pass patterns, workflow YAML deploy validation. The counterfactual cost — what the project would have paid in Phase-2 retries without the harness — is $50–$150 per iteration, between 9 % and 30 % of the entire Phase-2 base-case budget.
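To make one of those bug classes concrete, here is a minimal sketch of how a silent except-pass check could be written with Python's standard ast module. It is not the project's actual test; the function name and the sample source are hypothetical, and a real check would likely walk every file in the repository.

```python
import ast
from typing import List


def find_silent_except_pass(source: str) -> List[int]:
    """Return line numbers of except handlers whose body is only `pass`."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and all(
            isinstance(stmt, ast.Pass) for stmt in node.body
        ):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical sample: the first handler swallows the error silently,
# the second one at least reports it.
SAMPLE = '''
try:
    risky()
except Exception:
    pass

try:
    risky()
except ValueError as exc:
    log(exc)
'''

silent_handlers = find_silent_except_pass(SAMPLE)
```

Because the check parses source instead of importing it, it needs no cloud credentials and no installed dependencies, which is consistent with a harness that costs nothing per run.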
What the validation does not claim
To set expectations honestly:
- Probe 5 is small scale. Ten personas × six openness constructs is a smoke-scale fit, not the full Phase-2 two-thousand-persona × 113-construct fit. Convergence at small N is necessary but not sufficient for the production scale.
- Phase-1 calibration is on synthetic data. The trait parameters Probe 5 produced are the small-scale receipt that the pipeline runs end-to-end. Phase-3 will recalibrate against real-user data and produce the population-level numbers we will ship to research-buyer customers.
- Two of the four R5 robustness drills have not yet been formally executed. Auto-pause and mid-shard crash drills are gated on a roles/billing.user grant and a chaos-image build, respectively. The other two (429 backoff, idempotency stress) are partially evidenced through shakedown side-effects. The robustness report carries the honest execution status.
- Phase-3 invariance recalibration is the more rigorous test. Phase-1 ran Gate 10 on the synthetic corpus; Phase-3 will run it on real-user data. The two answer different questions, and the Phase-3 answer is the load-bearing one.
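For readers unfamiliar with the 429 backoff drill mentioned in the list above, the sketch below shows the general pattern such a drill exercises: retrying a rate-limited call with exponentially growing, randomly jittered waits. The policy shown (doubling ceiling, full jitter, five attempts) is an assumption, not the pipeline's documented behaviour, and RateLimited is a hypothetical stand-in for an HTTP 429 response.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying on RateLimited with full-jitter exponential backoff."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error, never swallow it
            sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep and rng as parameters lets a drill run instantly and deterministically in a local test instead of waiting on real timers.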
Forward to Phase-2
The receipts on this page are sufficient for the Phase-2 milestone-review approval. What remains:
- Grant roles/billing.user to an operator account → configure Cloud Billing budget alerts → run the auto-pause drill.
- Build the CHAOS_FAULT-flagged worker image and run the mid-shard crash drill.
- Resolve the three BigQuery write-path divergences flagged in R7 (legacy tabledata.insertAll calls in pre-shakedown scaffolding).
- Cut iteration 10, the first Phase-2 production iteration. Full corpus (2,000 personas × 50 sessions × 12 turns), all ten calibration families parallel, gate verdicts driven from green/red per compute_iteration_summary.
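The write-path item matters because an append-style write and a keyed MERGE behave differently when a step is retried. The in-memory sketch below illustrates that difference only; it does not call BigQuery, and the key columns, row fields, and values are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]

# ASSUMPTION: illustrative key columns, not the real table schema.
KEY_COLUMNS = ("iteration", "construct")


def append_rows(table: List[Row], rows: Iterable[Row]) -> None:
    """Append-style write: every call adds rows, so a retry duplicates them."""
    table.extend(dict(row) for row in rows)


def merge_rows(table: Dict[Tuple, Row], rows: Iterable[Row],
               key_columns: Tuple[str, ...] = KEY_COLUMNS) -> None:
    """MERGE-style upsert: a row with an existing key replaces the old one."""
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        table[key] = dict(row)


# Placeholder batch; the numbers carry no meaning.
batch = [
    {"iteration": 10, "construct": "construct_a", "value": 0.1},
    {"iteration": 10, "construct": "construct_b", "value": 0.2},
]

appended: List[Row] = []
merged: Dict[Tuple, Row] = {}
for _ in range(2):  # simulate the same write being retried once
    append_rows(appended, batch)
    merge_rows(merged, batch)

print(len(appended), "rows after retried append")
print(len(merged), "rows after retried merge")
```

The keyed path ends with the same table no matter how many times the write is replayed, which is the property an idempotency stress drill is meant to confirm.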
The pipeline is shaken down. The receipts are filed. Phase-2 is blocked only on the milestone-review approval and the four items above.
The next page, roadmap, gives the honest dates and scope for Phase-2 and Phase-3.