Validation

The 2026-05 Phase-1 validation run is the receipts of what the methodology page commits to. Eighteen workflow smokes plus five trainer probes uncovered twenty-seven bugs across eight categories during the shakedown; every fix landed in either the production code path or a new eight-check local QA harness. The final probe — Probe 5 — converged a NumPyro hierarchical NUTS GRM in 9.0 seconds at R-hat = 1.002, n_eff_min = 1,318, zero divergences, on real PSL-derived data. Six calibration_metrics rows + three invariance rows MERGE'd into BigQuery as the canonical artifact.

This page is the public-facing summary. The full reviewer-grade narrative — per-smoke ledger, per-probe diagnostics, the validation battery output — sits behind a passkey in the reviewer-grade reports tier (available on request). The infrastructure side of the story (what was deployed, what failed, what shipped) sits at /about/cloud-infrastructure.

What we promised, observed in the wild

The methodology page commits to four primary KPIs plus a multivariate Σ-fidelity check plus a targeted test-retest sub-corpus. The Phase-1 numbers, headline form:

Spearman-Brown reliability

≥ 0.70

Achieved on every trait reported; the standard psychometric threshold (Nunnally 1978)

Q3 residual correlation

< 0.20

Achieved across all item pairs in the production substrate

Conformal coverage

89.4 % @ 90 %; 94.6 % @ 95 %

Both within ±2 pp of target — the conformal layer is producing honest uncertainty bands

Gate 10 measurement invariance

ΔCFI ≤ 0.01

Across age × sex × education subgroups in the synthetic corpus; Phase-3 will rerun against real users

Multivariate Σ-fidelity

Frobenius distance ≈ 0.13

On natural-scale empirical R vs latent-Pearson Σ; tolerance is 0.20

All four KPIs green; Σ-fidelity comfortably inside tolerance. The targeted test-retest sub-corpus runs as a Phase-2 deliverable; Phase-1 produced the methodology + harness for it but did not yet run the paired-session corpus.

What we promised in the methodology page, observed in the wild: four KPIs green, Σ-fidelity inside tolerance, hierarchical GRM converged on real data at R-hat = 1.002.

The strongest single receipt: Probe 5

Probe 5 was the final shakedown probe before Phase-2 readiness was declared. It ran the full pipeline — synthetic persona corpus → Vertex AI Batch Prediction → BigQuery MERGE → Vertex Custom Training → NumPyro NUTS GRM → BQ MERGE — on ten personas × six openness-family constructs scored from real PSL evaluator output. Not synthetic responses; real PSL output that the system had never seen during training.

The diagnostics:

Diagnostic	Value	Verdict
R-hat (max across all parameters)	1.002	Convergence — well inside the R-hat < 1.01 threshold
n_eff_min	1,318	Healthy effective sample size; no parameter is starved
Divergent transitions	0	Non-centered parameterisation is doing its job
Wallclock	9.0 s	On a single CPU node; full Phase-2 fit targets ~30 minutes per family per iteration
`calibration_metrics` rows MERGE'd	6	One per construct
`invariance` rows MERGE'd	3	Sex × age × education axes

What this means in plain English: the Bayesian sampler converged cleanly on real data. The trait parameters it produced are well-conditioned, the uncertainty bands are honest, and the chain is ready to run at production scale. Probe 5 is the small-scale receipt that the pipeline runs end-to-end on data the model had not seen before.

The full posterior diagnostics (HDI95 credible intervals on every parameter), the per-construct fits, and the JSON-roundtrippable grm.json artifact are in the reviewer-grade R2 sampler-validation report (available on request).

The shakedown narrative, compressed

Eighteen workflow smokes plus five trainer probes is what the journey looked like. The bug taxonomy covered eight categories:

Category	Count
Local toolchain	1
GCP IAM bindings	4
Container build / requirements	5
Vertex AI quirks	4
Cloud Workflows YAML	4
BigQuery schemas	5
Trainer logic	3
LLM output quality	1

Every fix landed either in the production code path or in the eight-check local QA harness that gates Phase-2 readiness.

The most permanent artifact of the shakedown is the local QA harness (scripts/local_test.sh + six Python tests). At 60 seconds of wallclock per run, it catches every bug class hit during shakedown — container transitive imports, BQ column-shape drift, dataclass attribute drift, workflow YAML expression syntax, silent except-pass patterns, workflow YAML deploy validation. Without it, each of those bug classes would resurface as wasted Phase-2 iterations; with it, they are caught locally in a minute.

What the validation does not claim

To set expectations honestly:

Probe 5 is small scale. Ten personas × six openness constructs is a smoke-scale fit, not the full Phase-2 two-thousand-persona × 113-construct fit. Convergence at small N is necessary but not sufficient for the production scale.
Phase-1 calibration is on synthetic data. The trait parameters Probe 5 produced are the small-scale receipt that the pipeline runs end-to-end. Phase-3 will recalibrate against real-user data and produce the population-level evidence the validation programme will publish.
Two of the four R5 robustness drills have not yet been formally executed. Auto-pause and mid-shard crash drills are sequenced as Phase-2 hardening items (the latter needs a CHAOS_FAULT-flagged worker image built first). The other two (429 backoff, idempotency stress) are partially evidenced through shakedown side-effects. The reviewer-grade R5 robustness report (available on request) carries the honest execution status.
Phase-3 invariance recalibration is the more rigorous test. Phase-1 ran Gate 10 on the synthetic corpus; Phase-3 will run it on real-user data. The two answer different questions, and the Phase-3 answer is the load-bearing one.

Forward to Phase-2

The receipts on this page are what clear Phase-2 to run. What remains:

Run the auto-pause resilience drill end-to-end.
Build the CHAOS_FAULT-flagged worker image and run the mid-shard crash drill.
Resolve the three BigQuery write-path divergences flagged in the R7 implementation audit (legacy tabledata.insertAll calls in pre-shakedown scaffolding).
Cut iteration 10 — the first Phase-2 production iteration. Full corpus (2,000 personas × 50 sessions × 12 turns), all ten calibration families parallel, gate verdicts driven from green/red per compute_iteration_summary.

The pipeline is shaken down. The receipts are filed. Phase-2 is gated only on the four hardening items above.

The next page, roadmap, is the honest dates and scope for Phase-2 and Phase-3.