Validation

The 2026-05 Phase-1 validation run is the receipts of what the methodology page commits to. Eighteen workflow smokes plus five trainer probes uncovered twenty-seven bugs across eight categories during the shakedown; every fix landed in either the production code path or a new eight-check local QA harness. The final probe — Probe 5 — converged a NumPyro hierarchical NUTS GRM in 9.0 seconds at R-hat = 1.002, n_eff_min = 1,318, zero divergences, on real PSL-derived data. Six calibration_metrics rows + three invariance rows MERGE'd into BigQuery as the canonical artifact.

This page is the public-facing summary. The full reviewer-grade narrative — per-smoke ledger, per-probe diagnostics, the validation battery output — sits behind a passkey at the reports tier. The infrastructure side of the story (what was deployed, what failed, what shipped) sits at /about/cloud-infrastructure.


What we promised, observed in the wild

The methodology page commits to four primary KPIs plus a multivariate Σ-fidelity check plus a targeted test-retest sub-corpus. The Phase-1 numbers, headline form:

Spearman-Brown reliability
≥ 0.70

Achieved on every trait reported; the standard psychometric threshold (Nunnally 1978)

Q3 residual correlation
< 0.20

Achieved across all item pairs in the production substrate

Conformal coverage
89.4 % @ 90 %; 94.6 % @ 95 %

Both within ±2 pp of target — the conformal layer is producing honest uncertainty bands

Gate 10 measurement invariance
ΔCFI ≤ 0.01

Across age × sex × education subgroups in the synthetic corpus; Phase-3 will rerun against real users

Multivariate Σ-fidelity
Frobenius distance ≈ 0.13

On natural-scale empirical R vs latent-Pearson Σ; tolerance is 0.20

All four KPIs green; Σ-fidelity comfortably inside tolerance. The targeted test-retest sub-corpus runs as a Phase-2 deliverable; Phase-1 produced the methodology + harness for it but did not yet run the paired-session corpus.

What we promised in the methodology page, observed in the wild: four KPIs green, Σ-fidelity inside tolerance, hierarchical GRM converged on real data at R-hat = 1.002.

The strongest single receipt: Probe 5

Probe 5 was the final shakedown probe before Phase-2 readiness was declared. It ran the full pipeline — synthetic persona corpus → Vertex AI Batch Prediction → BigQuery MERGE → Vertex Custom Training → NumPyro NUTS GRM → BQ MERGE — on ten personas × six openness-family constructs scored from real PSL evaluator output. Not synthetic responses; real PSL output that the system had never seen during training.

The diagnostics:

DiagnosticValueVerdict
R-hat (max across all parameters)1.002Convergence — well inside the R-hat < 1.01 threshold
n_eff_min1,318Healthy effective sample size; no parameter is starved
Divergent transitions0Non-centered parameterisation is doing its job
Wallclock9.0 sOn a single CPU node; full Phase-2 fit targets ~30 minutes per family per iteration
calibration_metrics rows MERGE'd6One per construct
invariance rows MERGE'd3Sex × age × education axes

What this means in plain English: the Bayesian sampler converged cleanly on real data. The trait parameters it produced are well-conditioned, the uncertainty bands are honest, and the chain is ready to run at production scale. Probe 5 is the small-scale receipt that the pipeline runs end-to-end on data the model had not seen before.

The full posterior diagnostics (HDI95 credible intervals on every parameter), the per-construct fits, and the JSON-roundtrippable grm.json artifact are in the gated R2 sampler validation report.


The shakedown narrative, compressed

Eighteen workflow smokes plus five trainer probes is what the journey looked like. The bug taxonomy covered eight categories:

CategoryCountCost (est.)
Local toolchain1$0
GCP IAM bindings4$0
Container build / requirements5~$0
Vertex AI quirks4~$50
Cloud Workflows YAML4~$3
BigQuery schemas5~$70
Trainer logic3~$10
LLM output quality1~$25

Every fix landed either in the production code path or in the eight-check local QA harness that gates Phase-2 readiness. Total Phase-1 spend: ~$140 of the $500 Start Tier budget (~28 %); ~$360 buffer remaining.

The most permanent artifact of the shakedown is the local QA harness (scripts/local_test.sh + six Python tests). At 60 seconds wallclock and zero dollars per run, it catches every bug class hit during shakedown — container transitive imports, BQ column-shape drift, dataclass attribute drift, workflow YAML expression syntax, silent except-pass patterns, workflow YAML deploy validation. The counterfactual cost — what the project would have paid in Phase-2 retries without the harness — is $50–$150 per iteration, between 9 % and 30 % of the entire Phase-2 base-case budget.


What the validation does not claim

To set expectations honestly:

  • Probe 5 is small scale. Ten personas × six openness constructs is a smoke-scale fit, not the full Phase-2 two-thousand-persona × 113-construct fit. Convergence at small N is necessary but not sufficient for the production scale.
  • Phase-1 calibration is on synthetic data. The trait parameters Probe 5 produced are the small-scale receipt that the pipeline runs end-to-end. Phase-3 will recalibrate against real-user data and produce the population-level numbers we will ship to research-buyer customers.
  • Two of the four R5 robustness drills have not yet been formally executed. Auto-pause and mid-shard crash drills are gated on a roles/billing.user grant and a chaos-image build, respectively. The other two (429 backoff, idempotency stress) are partially evidenced through shakedown side-effects. The robustness report carries the honest execution status.
  • Phase-3 invariance recalibration is the more rigorous test. Phase-1 ran Gate 10 on the synthetic corpus; Phase-3 will run it on real-user data. The two answer different questions, and the Phase-3 answer is the load-bearing one.

Forward to Phase-2

The receipts on this page are sufficient for the Phase-2 milestone-review approval. What remains:

  • Grant roles/billing.user to an operator account → configure Cloud Billing budget alerts → run the auto-pause drill.
  • Build the CHAOS_FAULT-flagged worker image and run the mid-shard crash drill.
  • Resolve the three BigQuery write-path divergences flagged in R7 (legacy tabledata.insertAll calls in pre-shakedown scaffolding).
  • Cut iteration 10 — the first Phase-2 production iteration. Full corpus (2,000 personas × 50 sessions × 12 turns), all ten calibration families parallel, gate verdicts driven from green/red per compute_iteration_summary.

The pipeline is shaken down. The receipts are filed. Phase-2 is unblocked on the milestone-review approval and the four hardening items above.

The next page, roadmap, is the honest dates and scope for Phase-2 and Phase-3.