The calibration loop

Once we have a corpus of personas with conversational evidence, we need to turn that evidence into calibrated trait parameters with uncertainty bounds you can trust. Three statistical disciplines do this work, in sequence: an item-response theory model that estimates trait scores from rule firings; a conformal recalibration that produces uncertainty bands with guaranteed coverage; and a Bayesian belief network that handles cross-trait inference. Probe 5 — the final shakedown probe — converged this entire chain on real PSL data at R-hat = 1.002, n_eff_min = 1,318, zero divergences in 9.0 seconds.

This page walks each discipline in plain English, then explains how they fit together.


Item-response theory, in plain English

The foundational tool is item-response theory (IRT), specifically the graded-response model (GRM). The GRM belongs to the same family of statistical tools used to score the SAT and the LSAT.

The basic idea: every "item" — every PSL rule that can fire — has two properties we want to learn:

  1. How informative the rule is for the trait it measures. A rule that fires only when the speaker is genuinely high-conscientiousness (and rarely otherwise) is more informative than a rule that fires across a wide range of conscientiousness levels. Statisticians call this the rule's discrimination.
  2. Where on the trait scale the rule's threshold sits. A rule that fires only at very high conscientiousness levels has a different threshold than a rule that fires at moderate conscientiousness levels. We need to know each rule's threshold to interpret what its firing tells us.

The GRM fits both — discrimination and thresholds — to every rule simultaneously, given a corpus of rule firings across a population of personas with known trait values. The math is well-established and well-validated; the contribution we add is wiring it into the four-layer hierarchy described on the data layers page.
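
As a rough illustration of what the two fitted properties do, the sketch below computes graded-response category probabilities for a single rule. It assumes the common logistic form of the GRM with strictly increasing thresholds; the function name and the parameter values are hypothetical and are not taken from the fitted model.

```python
import numpy as np

def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities P(response = k | theta), k = 0..K, for one item.

    Assumes the logistic graded-response form with strictly increasing thresholds.
    """
    b = np.asarray(thresholds, dtype=float)
    # P(response >= k) for k = 1..K: one logistic curve per threshold.
    at_or_above = 1.0 / (1.0 + np.exp(-discrimination * (theta - b)))
    # Bracket with P(response >= 0) = 1 and P(response >= K + 1) = 0,
    # then difference adjacent curves to get per-category probabilities.
    upper = np.concatenate(([1.0], at_or_above))
    lower = np.concatenate((at_or_above, [0.0]))
    return upper - lower

# Hypothetical rule: three thresholds, so four ordered response categories.
params = dict(discrimination=1.8, thresholds=[-1.0, 0.0, 1.2])
probs = grm_category_probs(theta=0.5, **params)
low = grm_category_probs(theta=-2.0, **params)
high = grm_category_probs(theta=2.0, **params)
```

A larger discrimination makes the curves steeper, so the same firing pattern separates nearby trait levels more sharply; moving a threshold shifts where on the trait scale that separation happens.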

The output of the GRM fit, for each trait, is:

  • A set of per-rule discrimination + threshold parameters that anyone can verify.
  • A trait estimate function: given a new conversation's rule firings, compute the trait score with explicit uncertainty.

The trait estimate is not a single point. It is a posterior distribution — we report its centre (the most likely trait score) and its width (how confident we are in the centre). The width matters: it tells you whether the conversational evidence is thin or rich for the particular trait the conversation exercised.
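
To make "centre and width" concrete, here is a minimal grid-based sketch of a trait posterior for one conversation. It is an illustration under stated assumptions, not the production estimator: the rule parameters are hypothetical, the prior is standard normal on a standardised trait scale, the centre is taken as the posterior mean, and the width as the posterior standard deviation.

```python
import numpy as np

def grm_probs_on_grid(grid, discrimination, thresholds):
    """Category probabilities at every grid point; shape (len(grid), K + 1)."""
    b = np.asarray(thresholds, dtype=float)
    at_or_above = 1.0 / (1.0 + np.exp(-discrimination * (grid[:, None] - b[None, :])))
    ones = np.ones((len(grid), 1))
    zeros = np.zeros((len(grid), 1))
    return np.hstack([ones, at_or_above]) - np.hstack([at_or_above, zeros])

def trait_posterior(responses, items, grid=None):
    """Return (centre, width) of the trait posterior evaluated on a grid.

    responses: observed category per item.
    items: (discrimination, thresholds) pairs, assumed already calibrated.
    """
    grid = np.linspace(-4.0, 4.0, 401) if grid is None else np.asarray(grid, dtype=float)
    log_post = -0.5 * grid**2  # standard-normal prior (illustrative assumption)
    for category, (a, b) in zip(responses, items):
        log_post += np.log(grm_probs_on_grid(grid, a, b)[:, category])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    centre = float((grid * post).sum())
    width = float(np.sqrt(((grid - centre) ** 2 * post).sum()))
    return centre, width

item = (1.5, [-1.0, 0.0, 1.0])  # hypothetical rule parameters
thin_centre, thin_width = trait_posterior([2, 1], [item] * 2)        # thin evidence
rich_centre, rich_width = trait_posterior([2, 1] * 6, [item] * 12)   # rich evidence
```

With the same response pattern repeated over more rules, the centre stays put while the width shrinks, which is the thin-versus-rich distinction described above.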

Why hierarchical Bayesian

We fit the GRM with a hierarchical Bayesian approach using NumPyro and NUTS (the No-U-Turn Sampler). "Hierarchical" means the model shares strength across related constructs the way meta-analyses do: if the rules for "empathic concern" all behave similarly to each other, the model lets that similarity inform each individual rule's parameters, instead of fitting each in isolation. This is the modern equivalent of the random-effects approach common in meta-analysis.
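
The "sharing strength" idea can be sketched without a sampler. The snippet below is a closed-form normal-normal stand-in, not the NumPyro model: it assumes each rule has a raw estimate with a known standard error and that the between-rule spread (`group_sd`) is known, whereas a full hierarchical fit would estimate that spread from the data. All names and numbers are hypothetical.

```python
import numpy as np

def partial_pool(raw_estimates, std_errors, group_sd):
    """Shrink per-rule estimates toward a shared construct-level mean."""
    raw = np.asarray(raw_estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    # Precision-weighted estimate of the shared mean.
    weights = 1.0 / (se**2 + group_sd**2)
    group_mean = float((weights * raw).sum() / weights.sum())
    # Shrinkage factor: 0 keeps the raw estimate, 1 replaces it with the group mean.
    shrink = se**2 / (se**2 + group_sd**2)
    pooled = (1.0 - shrink) * raw + shrink * group_mean
    return group_mean, pooled, shrink

# Three well-measured rules and one noisy outlier in the same construct.
raw = [1.2, 1.4, 1.3, 2.6]
se = [0.1, 0.1, 0.1, 0.8]
group_mean, pooled, shrink = partial_pool(raw, se, group_sd=0.3)
```

The noisy rule is pulled strongly toward its siblings while the well-measured rules barely move, which is the behaviour the hierarchical structure is meant to provide.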

"Bayesian" means the output is a full posterior distribution (not just a point estimate), with explicit uncertainty on every parameter. The uncertainty is the load-bearing piece: it is what flows up the chain to produce the calibrated trait estimate the user sees.


Conformal coverage, in plain English

The IRT model gives us a posterior distribution. The next step is making sure the uncertainty band on the trait estimate is honest — that when we say "we are 90 % confident the trait score is between 47 and 56," the truth really is in that band 90 % of the time.

The discipline that provides this guarantee is conformal prediction. Conformal prediction is a method that wraps any underlying point estimator (like our IRT model) and produces calibrated prediction intervals: intervals that are guaranteed to cover the truth at least at the rate you promise, on average over new personas, with no asymptotic hand-waving. The guarantee holds with finite samples and needs no assumption about the shape of the error distribution. It requires only that the held-out personas and the new ones are exchangeable (drawn in the same way), and it needs no convergence-to-infinity arguments.

The way it works is simple to describe: take a held-out validation set, compute the conformal residuals (how wrong your point estimator was on each held-out persona), find the quantile of the residuals corresponding to your desired coverage rate, and use that quantile to build the prediction band. Because the validation set was held out, the residuals are an honest estimate of the model's true error distribution; because they are quantile-based, the resulting band has the coverage rate you asked for.
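
The recipe above can be sketched as split conformal prediction with absolute residuals. This is a minimal illustration on simulated data, assuming a symmetric band around the point estimate and the usual finite-sample rank correction; the function name and the simulated estimator are hypothetical.

```python
import math
import numpy as np

def conformal_halfwidth(cal_truth, cal_pred, coverage):
    """Half-width of a symmetric band from held-out absolute residuals."""
    residuals = np.sort(np.abs(np.asarray(cal_truth, dtype=float)
                               - np.asarray(cal_pred, dtype=float)))
    n = len(residuals)
    # Finite-sample correction: take the ceil((n + 1) * coverage)-th smallest residual.
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return math.inf  # too few held-out points to promise this coverage
    return float(residuals[rank - 1])

# Simulated stand-in for a point estimator with noisy predictions.
rng = np.random.default_rng(0)
truth = rng.normal(50.0, 10.0, size=22000)
pred = truth + rng.normal(0.0, 3.0, size=truth.size)

cal, new = slice(0, 2000), slice(2000, None)
q = conformal_halfwidth(truth[cal], pred[cal], coverage=0.90)
# Band for a new persona is [pred - q, pred + q]; check how often it covers.
observed = float(np.mean(np.abs(truth[new] - pred[new]) <= q))
```

When the held-out set is too small for the requested rate, the sketch returns an infinite half-width rather than a band it cannot back up.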

We run conformal recalibration as a layer on top of the IRT model. The IRT posterior gives us a centre; the conformal layer gives us an uncertainty band that is guaranteed to cover the truth at the rate we promise.

The Phase-1 validation produced conformal coverage rates within tolerance on every trait we report. The exact rates are in the gated R2 sampler validation report; the headlines are: at the 90 % target, observed coverage was 89.4 % (within tolerance); at the 95 % target, observed coverage was 94.6 %.


The Bayesian belief network for cross-trait inference

The IRT + conformal pair handles each trait independently. The third discipline is a Bayesian belief network (BBN) that handles cross-trait inference: knowing about a person's Conscientiousness does inform our estimate of their Honesty-Humility (the two are correlated in the population), and the BBN is the layer that uses that correlation honestly.

The BBN is fit from the same persona corpus as the IRT model. It learns conditional probability tables (CPTs) over the joint trait distribution — what Conscientiousness scores look like conditional on Honesty-Humility being above a threshold, etc. When a new conversation produces a strong signal on Conscientiousness and a weak signal on Honesty-Humility, the BBN can use the strong signal to refine the weak signal's posterior, without over-claiming.
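
For intuition, here is a two-node sketch of that refinement on discretised trait levels. It assumes a single directed edge from the strongly observed trait to the weakly observed one and a hypothetical CPT; the real network structure and tables are not specified on this page.

```python
import numpy as np

def refine_weak_trait(strong_posterior, cpt, weak_likelihood):
    """Posterior over the weak trait's levels, given a strong trait's posterior.

    strong_posterior[c] = P(C = c | evidence on C)
    cpt[c, h]           = P(H = h | C = c)
    weak_likelihood[h]  = P(evidence on H | H = h)
    """
    strong = np.asarray(strong_posterior, dtype=float)
    table = np.asarray(cpt, dtype=float)
    like = np.asarray(weak_likelihood, dtype=float)
    # Push the strong trait's posterior through the CPT to get an informed prior.
    informed_prior = strong @ table
    # Combine with whatever direct evidence exists on the weak trait.
    unnormalised = informed_prior * like
    return unnormalised / unnormalised.sum()

# Hypothetical two-level example (low, high).
strong = [0.1, 0.9]                 # strong signal: the first trait is probably high
cpt = [[0.7, 0.3], [0.4, 0.6]]      # modest dependence between the two traits
no_evidence = refine_weak_trait(strong, cpt, [1.0, 1.0])
weak_evidence = refine_weak_trait(strong, cpt, [0.45, 0.55])
```

Because the CPT rows are far from 0 or 1, even a confident reading on the first trait moves the second trait's posterior only moderately.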

This sounds like a place where things could go wrong (and many production ML systems do go wrong here). Two disciplines keep us honest:

  • The CPTs are Dirichlet-smoothed with a non-trivial prior, so they cannot collapse to "if A then certainly B" on a small sample. The prior expresses "we believe most cross-trait correlations are modest" and the data overrides the prior only when there is enough evidence to do so.
  • The cross-trait inference is ablation-tested: we run the validation battery with and without the BBN and confirm that the BBN improves predictions on held-out data. If the BBN ever degraded held-out predictions, we would not ship it.
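
The first safeguard can be illustrated with a small sketch of Dirichlet smoothing. It assumes the prior pseudo-counts are centred on the child trait's marginal distribution, which is one way to encode "no strong cross-trait dependence unless the data insists"; the prior strength of 10 and the counts are hypothetical.

```python
import numpy as np

def smoothed_cpt(counts, prior_strength=10.0):
    """Dirichlet-smoothed CPT; rows are parent levels, columns are child levels."""
    counts = np.asarray(counts, dtype=float)
    # Prior mean: the child's marginal distribution, i.e. no dependence on the parent.
    marginal = counts.sum(axis=0) / counts.sum()
    pseudo_counts = prior_strength * marginal
    row_totals = counts.sum(axis=1, keepdims=True)
    return (counts + pseudo_counts) / (row_totals + prior_strength)

# First parent level was seen only 3 times; second was seen 100 times.
counts = [[3, 0], [40, 60]]
cpt = smoothed_cpt(counts)
```

The sparsely observed row stays well away from "if A then certainly B", while the well-observed row remains close to its empirical frequencies.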

How they fit together

The full chain, in order, on a single new conversation:

  1. The four-layer hierarchy (data layers) extracts signals, fires rules, aggregates constructs.
  2. The IRT model produces per-trait posterior distributions from the construct firings.
  3. The conformal layer wraps the IRT posteriors with calibrated uncertainty bands.
  4. The BBN refines weak-evidence traits using cross-trait dependence learned from the calibration corpus.

The output is a per-trait estimate (centre + uncertainty band) that has been through three independent statistical safeguards. If any of them flags low confidence, the whole estimate is reported as low-confidence; we err on the side of wider bands rather than narrower ones.


Validation receipts

The Phase-1 calibration shakedown converged this chain on real PSL-derived data. The strongest single receipt is Probe 5 — the final shakedown probe before Phase-2 readiness:

  • R-hat (max): 1.002. The Bayesian convergence threshold is R-hat < 1.01; we are well inside it.
  • n_eff_min: 1,318. Effective sample size on the worst-converged parameter; comfortably healthy.
  • Divergent transitions: 0. No funnel-trap warnings; the non-centered parameterisation is doing its job.
  • Wallclock: 9.0 s. On a standard CPU node; the full Phase-2 fit targets ~30 minutes per family per iteration.
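
For readers unfamiliar with R-hat, the sketch below computes the classic split-chain version for a single parameter. It is an illustration only: production diagnostics are usually taken from the sampling library and may use a rank-normalised variant, and the simulated chains here are not Probe 5 output.

```python
import numpy as np

def split_rhat(draws):
    """Classic split R-hat for one parameter; draws has shape (chains, samples)."""
    draws = np.asarray(draws, dtype=float)
    half = draws.shape[1] // 2
    # Split every chain in two so within-chain drift also shows up as disagreement.
    halves = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()
    between = n * halves.mean(axis=1).var(ddof=1)
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))        # four chains sampling the same distribution
stuck = mixed + np.arange(4)[:, None]     # four chains stuck at different locations
rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
```

Chains that agree give a value very close to 1; chains that disagree push it well above the 1.01 threshold quoted above.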

What this means in plain English: the Bayesian sampler converged cleanly on real data. The trait parameters it produced are well-conditioned, the uncertainty bands are honest, and the chain is ready to run at production scale.

The exact convergence diagnostics, the per-construct fits, and the validation battery are all in the gated R2 sampler validation report. The headline, that Probe 5 converged at R-hat = 1.002, is the one we publish at this tier.


Four KPIs we report

Calibration is a continuous discipline, not a one-time event. Four KPIs run on every iteration and are the gates that decide whether a calibration update ships:

| KPI | What it measures | Threshold |
|---|---|---|
| Spearman-Brown reliability | Internal consistency of trait estimates across conversations | ≥ 0.70 |
| Q3 residual correlation | Whether the IRT model is capturing all the systematic variance in rule firings | < 0.20 |
| Conformal coverage | Whether the uncertainty bands cover the truth at the rate we promise | within ±2 pp of target |
| Measurement invariance (Gate 10) | Whether the model fits equivalently across age × sex × education subgroups | ΔCFI ≤ 0.01 |

All four ran during Phase-1 shakedown. All four passed at the design-target tolerances. The exact thresholds and the per-construct verdicts are in the gated R2 sampler validation report; the headlines are what this page commits to.
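
As a sketch of how the four thresholds could act as a ship/no-ship gate, the snippet below encodes the table's cut-offs. The function names are hypothetical, the Q3 threshold is read here as applying to the largest absolute residual correlation (an assumption), and every input value except the 89.4 % observed coverage is made up for illustration.

```python
def spearman_brown(split_half_r, length_factor=2.0):
    """Spearman-Brown prophecy formula: reliability after lengthening a test."""
    return length_factor * split_half_r / (1.0 + (length_factor - 1.0) * split_half_r)

def kpi_gates(reliability, max_abs_q3, observed_coverage, target_coverage, delta_cfi):
    """Evaluate each KPI against the thresholds listed in the table."""
    return {
        "spearman_brown": reliability >= 0.70,
        "q3": max_abs_q3 < 0.20,
        "coverage": abs(observed_coverage - target_coverage) <= 0.02,  # within 2 pp
        "invariance": delta_cfi <= 0.01,
    }

def ships(gates):
    """A calibration update ships only if every gate passes."""
    return all(gates.values())

gates = kpi_gates(
    reliability=spearman_brown(0.6),   # hypothetical split-half correlation
    max_abs_q3=0.12,                   # hypothetical
    observed_coverage=0.894,           # reported headline for the 90 % target
    target_coverage=0.90,
    delta_cfi=0.004,                   # hypothetical
)
blocked = kpi_gates(0.75, 0.12, 0.85, 0.90, 0.004)
```

A single failing gate, such as coverage drifting 5 pp below target in the second call, is enough to block the update.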


What the calibration loop does not try to do

To set expectations honestly:

  • Calibration does not personalise. The output is per-trait population-level parameters that anyone can use to score any new conversation. There is no per-user model.
  • Calibration is global, not per-subgroup. Demographic-blind scoring is the policy (personas page); per-subgroup parameter sets would mean different "true" trait scores for different demographics.
  • Calibration runs offline. When you have a conversation, the system uses the already-calibrated parameters; the calibration itself runs as a periodic batch job against the synthetic corpus. New calibrations replace old ones only if they pass all four KPIs.
  • Calibration does not learn from individual users. Phase-1 ships on synthetic data; Phase-3 will refresh against real-user data with explicit consent. We do not run online learning on individual user trait scores.

The next page, methodology, walks through the validity infrastructure that wraps this calibration loop — measurement invariance, the four primary KPIs in deeper detail, the multivariate Σ-fidelity check, and the targeted test-retest sub-corpus.