Methodology

The calibration loop produces calibrated trait parameters. The methodology wrapped around it is what proves those parameters are doing what they claim. This page walks through the five disciplines that anchor what we will and will not ship: measurement invariance, the four primary KPIs, the multivariate Σ-fidelity check, the targeted test-retest sub-corpus, and the holdout discipline. The framing: this is the same statistical playbook that well-regarded academic teams use, applied to conversational evidence rather than self-report scales.


Measurement invariance, in plain English

Measurement invariance is the discipline that asks: does our model fit equivalently across demographic subgroups? If the same trait is being measured the same way for women and men, for younger and older adults, for different education levels — then the model is invariant. If the trait is being measured in a way that systematically advantages or disadvantages a subgroup, the model is not invariant, and the trait estimates are not comparable across subgroups.

This matters because the alternative, fitting a separate calibration per subgroup, would mean that the same evidence yields different "true" trait scores depending on the user's demographics. That kind of per-subgroup divergence goes wrong in many production ML systems, and it is what measurement-invariance testing exists to prevent.

The methodology we use is from Cheung & Rensvold (2002): test how the model's fit changes (specifically ΔCFI, the change in comparative fit index) as equality constraints are added across subgroups. The levels are nested: configural (same factor structure), metric (equal loadings), and scalar (equal intercepts). If CFI drops by no more than 0.01 when the constraints for a level are added, that is ΔCFI ≤ 0.01, the model is judged invariant at that level.
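The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's Gate 10 code: the function name, the dictionary layout, and the CFI values are hypothetical, and the sketch assumes the configural model serves as the baseline against which the metric and scalar models are compared.

```python
from typing import Dict, List, Tuple

DELTA_CFI_MAX = 0.01  # threshold quoted in the text (Cheung & Rensvold 2002)
LEVELS = ["configural", "metric", "scalar"]  # least to most constrained


def invariance_report(
    cfi: Dict[str, float], threshold: float = DELTA_CFI_MAX
) -> List[Tuple[str, float, bool]]:
    """Compare each level with the previous, less constrained one.

    Returns (level, CFI drop, passed) rows. The configural model is the
    baseline here, so it has no drop of its own (an assumption of this sketch).
    """
    rows = []
    for prev, curr in zip(LEVELS, LEVELS[1:]):
        drop = cfi[prev] - cfi[curr]  # positive when added constraints hurt fit
        rows.append((curr, round(drop, 4), drop <= threshold))
    return rows


# Hypothetical CFI values, for illustration only.
example_cfi = {"configural": 0.962, "metric": 0.957, "scalar": 0.951}
report = invariance_report(example_cfi)
gate_10_green = all(passed for _, _, passed in report)
```

Fitting the nested multi-group models that produce the CFI values is outside the scope of this sketch; only the threshold comparison is shown.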

We run this check as Gate 10 in the calibration pipeline. Gate 10 is one of three gates that have to be green for an iteration to be considered converged; the other two cover holdout-set predictive accuracy and Q3 residual correlations.

The post-shakedown discipline that closed the configuration-leak class of bugs — the cohort-keyed calibration profile registry (R1 §8.1) — is the operational counterpart to measurement-invariance testing: just as Gate 10 prevents subgroup drift in the parameters, the profile registry prevents subgroup drift in the cohort configurations that produce the parameters.

Measurement invariance is the discipline that prevents per-subgroup parameter sets. Demographic-blind scoring is the policy; Gate 10 is the proof.

The four primary KPIs

Calibration is a continuous discipline. Four KPIs run on every iteration; all four must pass for an iteration to ship.

KPI 1: Spearman-Brown reliability ≥ 0.70

This is the standard psychometric threshold for "this measurement is reliable enough to report" (Nunnally 1978). For each user with at least eight conversations, we compute split-half reliability: split the conversations into two halves, compute trait estimates from each half independently, measure how strongly the two halves agree, then apply the Spearman-Brown correction to project to the full conversation set. We require ≥ 0.70 on every trait we report; below 0.70, the trait estimate is marked provisional and not surfaced as an authoritative number.
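As a rough sketch of the arithmetic, the Spearman-Brown correction for two equal halves is r_full = 2r / (1 + r), where r is the correlation between the half-based estimates. The code below assumes the agreement is measured as a Pearson correlation across users for one trait; the function names and the sample estimates are hypothetical.

```python
import math
from typing import Sequence

RELIABILITY_MIN = 0.70  # threshold quoted in the text


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half: float) -> float:
    """Project a half-length correlation to full-length reliability."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(half_a: Sequence[float], half_b: Sequence[float]) -> float:
    """half_a[i] and half_b[i] are one user's trait estimates from each half."""
    return spearman_brown(pearson(half_a, half_b))


# Hypothetical per-user estimates of a single trait, one value per half.
half_a = [0.2, 0.5, 0.9, 0.4, 0.7]
half_b = [0.3, 0.4, 0.8, 0.5, 0.6]
reliability = split_half_reliability(half_a, half_b)
is_reportable = reliability >= RELIABILITY_MIN
```

Note that a half-length correlation of about 0.54 is already enough to clear the 0.70 bar after correction, since 2 × 0.54 / 1.54 ≈ 0.701.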

KPI 2: Q3 residual correlation < 0.20

The Q3 statistic is the residual correlation between item responses after the IRT model has accounted for the trait. If Q3 is large for some pair of items, those items share variance the IRT model is not capturing, typically because they tap the same construct plus something else (a method effect, a wording overlap, a specific behaviour the model is not yet measuring separately). We require |Q3| < 0.20 for every pair of items in the production substrate; any pair at or above the threshold is flagged, and its items go back for refinement.
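A minimal sketch of the pairwise check, assuming the fitted IRT model can supply an expected response for every respondent and item. The residual is the observed response minus that expectation, and Q3 for a pair of items is the Pearson correlation of their residual columns. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Q3_MAX = 0.20  # threshold quoted in the text


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def q3_flags(
    observed: List[List[float]],
    expected: List[List[float]],
    threshold: float = Q3_MAX,
) -> List[Tuple[int, int, float]]:
    """Rows are respondents, columns are items. Returns flagged item pairs."""
    residuals = [
        [o - e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    n_items = len(residuals[0])
    columns = [[row[j] for row in residuals] for j in range(n_items)]
    flagged = []
    for i, j in combinations(range(n_items), 2):
        q3 = pearson(columns[i], columns[j])
        if abs(q3) >= threshold:  # the requirement is |Q3| < threshold
            flagged.append((i, j, q3))
    return flagged


# Toy data: items 0 and 1 share residual variance, item 2 does not.
expected = [[0.5, 0.5, 0.5] for _ in range(6)]
observed = [
    [0.6, 0.6, 0.6],
    [0.3, 0.3, 0.7],
    [0.8, 0.8, 0.6],
    [0.4, 0.4, 0.4],
    [0.7, 0.7, 0.3],
    [0.2, 0.2, 0.4],
]
flagged_pairs = q3_flags(observed, expected)
```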

KPI 3: Conformal coverage within ±2 percentage points of target

The conformal layer guarantees the uncertainty band covers the truth at a target rate (90 % or 95 % depending on the use case). We measure observed coverage on a held-out validation set; if observed coverage drifts more than 2 percentage points from target, the calibration is rejected and the iteration does not ship. The Phase-1 shakedown produced 89.4 % observed coverage at 90 % target and 94.6 % at 95 % target — both within tolerance.
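The tolerance check itself is simple arithmetic. The sketch below assumes each held-out case comes with a (lower, upper) band and a known true value; the function names and the small interval example are hypothetical, while the coverage figures in the final lines are the Phase-1 numbers quoted above.

```python
from typing import Sequence, Tuple

COVERAGE_TOLERANCE = 0.02  # 2 percentage points, as stated in the text


def observed_coverage(
    intervals: Sequence[Tuple[float, float]], truths: Sequence[float]
) -> float:
    """Fraction of held-out cases whose true value falls inside its band."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


def coverage_ok(
    observed: float, target: float, tolerance: float = COVERAGE_TOLERANCE
) -> bool:
    """True when observed coverage is within the tolerance of the target."""
    return abs(observed - target) <= tolerance


# Hypothetical held-out bands and true values: 3 of 4 are covered.
bands = [(0.1, 0.5), (0.2, 0.6), (0.4, 0.9), (0.0, 0.3)]
truths = [0.3, 0.7, 0.5, 0.1]
toy_coverage = observed_coverage(bands, truths)

# Phase-1 figures from the text.
phase1_90 = coverage_ok(0.894, 0.90)
phase1_95 = coverage_ok(0.946, 0.95)
```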

KPI 4: Gate 10 measurement invariance — ΔCFI ≤ 0.01

The Cheung & Rensvold threshold above. Tested at configural, metric, and scalar invariance levels across age × sex × education subgroups. The Phase-1 shakedown ran this on the synthetic corpus and produced three invariance rows in BigQuery; Phase-3 will run it against real-user data, which is the more rigorous test and the source of any future recalibration.


Multivariate Σ-fidelity check

The four KPIs above are univariate — they check each trait dimension independently. There is also a multivariate check: does the joint correlation structure of the calibration corpus match the target Σ that we built into the persona forge?

This check is called Σ-fidelity. We compare the empirical correlation matrix of the persona corpus's actual trait values against the target Σ from the forge. The relevant statistic is the Frobenius distance: the square root of the sum of squared entry-wise differences between the empirical matrix and the target, which summarises how far apart the two matrices are across all entries.

We require relative Frobenius distance ≤ 0.20 on the natural-scale empirical correlation. The Phase-1 result: ≈ 0.13. Comfortably inside tolerance.
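A minimal sketch of the statistic, assuming "relative" means the Frobenius norm of the difference divided by the Frobenius norm of the target Σ, with the diagonal included. Both are assumptions of this example; the text does not spell out the normalisation. The matrices are hypothetical 3 × 3 examples.

```python
import math
from typing import List

Matrix = List[List[float]]
SIGMA_FIDELITY_MAX = 0.20  # threshold quoted in the text


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def relative_frobenius(empirical: Matrix, target: Matrix) -> float:
    """||empirical - target||_F / ||target||_F (normalisation is assumed)."""
    diff = [
        [e - t for e, t in zip(e_row, t_row)]
        for e_row, t_row in zip(empirical, target)
    ]
    return frobenius_norm(diff) / frobenius_norm(target)


# Hypothetical target and empirical correlation matrices.
target_sigma = [
    [1.0, 0.3, 0.1],
    [0.3, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
empirical_corr = [
    [1.0, 0.35, 0.05],
    [0.35, 1.0, 0.25],
    [0.05, 0.25, 1.0],
]
distance = relative_frobenius(empirical_corr, target_sigma)
sigma_fidelity_ok = distance <= SIGMA_FIDELITY_MAX
```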

The Σ-fidelity check is what catches drift between intent and reality at the corpus level. If the persona forge produced a corpus that did not match the target Σ — because the conditioning logic was wrong, or because the forge sampling missed a tail — the empirical correlation would diverge and Σ-fidelity would catch it.


Targeted test-retest sub-corpus

A separate validity check runs on a small targeted sub-corpus: 100 personas × 2 paired sessions, where the two sessions are drawn from a deliberately stratified slice of the scenario library that exercises the same constructs in different conversational frames.

The check: for the same persona across the two paired sessions, do the trait estimates correlate as well as we would expect from the underlying trait stability? This is a test-retest reliability check; it is the published gold standard for trait stability over short timescales. We require ≥ 0.65 test-retest correlation on every trait we report.

The test-retest sub-corpus is small enough to run inexpensively (~$25 of LLM cost per Phase-2 iteration), but informative enough to catch drift between scenarios that the bulk corpus cannot easily isolate.


Holdout discipline

The most important methodological discipline is what we don't train on. Two layers of holdout protect the calibration:

Cross-validation holdout. During each iteration's IRT fit, we hold out 20 % of the persona corpus from the parameter fit and use it to compute Gate-8 predictive accuracy. If a calibration update degrades held-out accuracy, the iteration is rejected and the previous calibration stays in production. This is the gate that prevents shipping regressions.
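A minimal sketch of the 20 % split and the accept/reject rule. It assumes personas are identified by sortable IDs, that the split is made reproducible with a fixed seed, and that "degrades" means the new held-out accuracy is strictly lower than the previous one. The function names and the accuracy values are hypothetical, and the Gate-8 accuracy metric itself is not defined here.

```python
import random
from typing import Iterable, List, Tuple


def split_holdout(
    persona_ids: Iterable[int], holdout_fraction: float = 0.20, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (fit_ids, holdout_ids) with a reproducible shuffle."""
    ids = sorted(persona_ids)
    random.Random(seed).shuffle(ids)  # local RNG, leaves global state untouched
    n_holdout = round(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]


def accept_iteration(new_accuracy: float, previous_accuracy: float) -> bool:
    """Reject any update that lowers held-out accuracy."""
    return new_accuracy >= previous_accuracy


fit_ids, holdout_ids = split_holdout(range(100))

# Hypothetical held-out accuracies for two candidate iterations.
keeps_previous = not accept_iteration(new_accuracy=0.81, previous_accuracy=0.83)
ships_update = accept_iteration(new_accuracy=0.85, previous_accuracy=0.83)
```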

Cross-iteration holdout. Some personas appear in multiple iterations' training corpora; others are reserved across iterations. The cross-iteration holdout lets us measure whether calibration updates are improving on stable benchmarks rather than overfitting to whichever personas happen to be in the current iteration.

Both holdouts run automatically. Both produce numbers visible in iteration_summary in BigQuery. Neither is a manual operator step that can be skipped under deadline pressure.

Where Phase-3 fits

Phase-1 ships a synthetic-corpus calibration as the first stable artifact. Phase-3 will recalibrate against real-user data with explicit measurement-invariance testing across the most consequential demographic boundaries.

A short honesty note here: demographic invariance is recalibrated against real-user data, not asserted from synthetic data. The synthetic corpus uses meta-analytic sex and age effects to produce a realistic distribution at the persona level; that distribution is the input to the calibration, not the validation. The Phase-3 recalibration is what proves the calibration is fair across demographics.

This is the standard psychometric playbook (Embretson & Reise 2000 chapter 12). The synthetic-data step is the warm-up; the real-user step is where measurement invariance graduates from a synthetic-data check to a population-level claim. We commit to that explicitly.


A KPI dashboard view

The four primary KPIs run on every iteration and produce a row in iteration_summary.

[Figure: the four primary KPIs as they appear after each iteration.]

All four green is the gate for shipping; if any is red, the iteration is rejected and the previous calibration stays in production.
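The all-green rule can be expressed as a single check over one iteration's results. The sketch below is illustrative only: the KpiRow class and its field names are hypothetical and do not describe the actual iteration_summary schema, and the values are made up. The thresholds are the ones stated on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KpiRow:
    """Hypothetical shape of one iteration's KPI results."""

    min_reliability: float  # lowest Spearman-Brown reliability across traits
    max_abs_q3: float       # largest |Q3| across item pairs
    coverage_gap: float     # |observed - target| conformal coverage
    max_delta_cfi: float    # largest CFI drop across invariance levels

    def failures(self) -> List[str]:
        """Names of the KPIs that are red for this iteration."""
        checks = {
            "reliability": self.min_reliability >= 0.70,
            "q3": self.max_abs_q3 < 0.20,
            "coverage": self.coverage_gap <= 0.02,
            "invariance": self.max_delta_cfi <= 0.01,
        }
        return [name for name, ok in checks.items() if not ok]

    def ships(self) -> bool:
        """All four green is the gate for shipping."""
        return not self.failures()


# Made-up values for two iterations.
green = KpiRow(min_reliability=0.74, max_abs_q3=0.12,
               coverage_gap=0.006, max_delta_cfi=0.004)
red = KpiRow(min_reliability=0.68, max_abs_q3=0.12,
             coverage_gap=0.006, max_delta_cfi=0.004)
```

Returning the list of failing KPIs, rather than a bare boolean, makes it easier to see which check rejected an iteration.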

What the methodology does not claim

To set expectations honestly:

  • The four KPIs are necessary, not sufficient. Passing all four is the floor, not the ceiling. Phase-3 will add real-user invariance as a more rigorous fifth gate.
  • The KPIs run on the synthetic corpus at Phase-1. Real-user numbers will replace these in Phase-3 once the recalibration runs.
  • The Σ-fidelity check is at the corpus level, not the per-conversation level. A single conversation is too thin to support a multivariate dependence claim; the corpus is the unit of analysis.
  • Test-retest is over short timescales. Roberts, Walton & Viechtbauer (2006) document that personality traits show systematic mean-level change across the life course; we are not measuring decades-long stability with a paired-session check. Long-timescale stability is a Phase-3 concern.

The next page, validation, presents the receipts from the 2026-05 Phase-1 validation run: what the methodology above produced when we ran it end-to-end on real PSL-derived data.