Synthetic personas + the scenarios engine
Phase-1 calibration runs on a corpus of two thousand synthetic personas — sampled from a statistical model of how traits cluster in the human population — having short reflective conversations across seventy-five scenarios drawn from a stratified library. This page explains why synthetic personas are scientifically defensible at this stage, how we model the trait dependence structure, and the demographic-blind scoring policy that is the strongest single pitch we make to research-buyer customers.
Why synthetic personas
Two reasons: ethics and scale.
Ethics. The conversational evidence we calibrate against is reflective and emotional. Building a Phase-1 calibration corpus from real users would mean asking thousands of people to share intimate reflections specifically so we can fit a statistical model — a use of their data they did not sign up for, and one we are not yet willing to ask for. Synthetic personas let us validate the inferential machinery before any real user data is involved. Phase-3 will recalibrate against real-user data with an explicit consent flow; Phase-1 ships on synthetic only.
Scale. A defensible IRT calibration needs enough personas across enough constructs to fit stable item parameters. The published thresholds are roughly N = 1,500–2,000 for a hierarchical graded-response model with our number of items. Synthetic personas are cheap to produce in this volume; real-user corpora are not — both because of cost (an IRB-approved real-user study at this scale takes a year or more) and because the Phase-1 ask is to demonstrate that the machinery works, before we spend the budget on the human study.
The synthetic-persona corpus is the warm-up, not the destination. The destination is real-user calibration with measurement-invariance testing across demographic subgroups; Phase-3 will run that. Phase-1 is the receipt that the inferential pipeline produces well-calibrated parameters on data the model has never seen.
We model how traits cluster realistically — not as independent dice rolls
The single largest decision in the persona forge is how the 29 trait dimensions correlate with each other.
A naïve approach would draw each trait independently — sample Conscientiousness from one normal distribution, Agreeableness from another, Extraversion from a third, all independent. The problem with the naïve approach is that real human personality does not work that way: people who score high on Conscientiousness also tend to score moderately high on Agreeableness; people who score high on Honesty-Humility tend to score low on the Dark Triad; people with anxious attachment styles tend to score lower on Emotional Stability. These patterns are not noise — they are the structure that any honest model of human personality has to capture.
We model this dependence structure with a Gaussian copula. In plain terms: draw the traits from a multivariate normal distribution that has whatever correlation structure you want, convert each dimension to a uniform percentile with the normal CDF, then map each percentile through the inverse CDF of that trait's own distribution (skewed for Honesty-Humility, slightly skewed-positive for Emotional Stability, roughly normal for Conscientiousness). The dependence (which traits cluster together) and the per-trait shape (the marginal distribution) are decoupled, so we can specify each independently.
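The copula construction can be sketched in a few lines. This is a minimal illustration, assuming NumPy and SciPy; the three-trait correlation matrix, the choice of marginal distributions, and the function name sample_gaussian_copula are placeholders for illustration, not the production persona forge.

```python
import numpy as np
from scipy import stats

def sample_gaussian_copula(corr, marginals, n, seed=0):
    """Draw n rows whose dependence comes from `corr` and whose per-column
    shapes come from `marginals` (frozen scipy.stats distributions)."""
    rng = np.random.default_rng(seed)
    # Step 1: correlated standard normals (the latent layer).
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n)
    # Step 2: the normal CDF turns each column into Uniform(0, 1) percentiles.
    u = stats.norm.cdf(z)
    # Step 3: each marginal's inverse CDF imposes the trait-specific shape.
    return np.column_stack([m.ppf(u[:, j]) for j, m in enumerate(marginals)])

# Illustrative 3-trait example; every number below is a placeholder.
corr = np.array([[1.0, 0.3, 0.2],
                 [0.3, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
marginals = [
    stats.norm(loc=0.0, scale=1.0),  # roughly normal trait
    stats.skewnorm(a=-3.0),          # skewed trait
    stats.beta(a=5.0, b=2.0),        # bounded, skewed trait
]
traits = sample_gaussian_copula(corr, marginals, n=20_000)

# Rank correlations survive the marginal transforms because each
# inverse CDF is monotone increasing.
rank_corr = stats.spearmanr(traits).correlation
```

Because step 3 is monotone, changing a marginal never changes which personas rank high or low together; that is the decoupling of dependence from shape.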
The Gaussian copula is the right starting point for the bulk of the trait dependence. It is not the right tool for the joint extremes of the Dark Triad — Machiavellianism, Narcissism, Psychopathy — which co-occur at higher rates than a pure Gaussian copula would predict. For the Dark Triad triplet specifically, we use a block-t copula at degrees-of-freedom ν = 5, which produces the heavier joint upper tails we observe in real Dark Triad data. The result: a dependence model that captures both the bulk-of-the-distribution clustering and the tail-clustering that the Dark Triad exhibits.
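The difference in joint extremes can be checked by simulation. The sketch below, assuming NumPy and SciPy, samples a standalone three-variable t copula at ν = 5 and a Gaussian copula with the same correlation matrix, then compares how often all three coordinates land above their 99th percentile. The 0.5 correlation and the helper names are illustrative assumptions, and the sketch does not show how the block is embedded in the full 29-dimensional model.

```python
import numpy as np
from scipy import stats

def copula_uniforms(corr, n, rng, nu=None):
    """Uniform samples from a Gaussian copula (nu=None) or a t copula (nu set)."""
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n)
    if nu is None:
        return stats.norm.cdf(z)
    # Dividing each row by one shared chi-square mixing variable yields a
    # multivariate t; the shared scale is what produces joint extremes.
    w = rng.chisquare(nu, size=(n, 1)) / nu
    return stats.t.cdf(z / np.sqrt(w), df=nu)

def joint_upper_rate(u, q=0.99):
    """Share of rows in which every coordinate exceeds its q-quantile."""
    return float(np.mean(np.all(u > q, axis=1)))

rng = np.random.default_rng(0)
# Placeholder correlation of 0.5 between each pair in the triplet.
corr = np.full((3, 3), 0.5)
np.fill_diagonal(corr, 1.0)

gauss_rate = joint_upper_rate(copula_uniforms(corr, 200_000, rng))
t_rate = joint_upper_rate(copula_uniforms(corr, 200_000, rng, nu=5))
print(f"Gaussian: {gauss_rate:.5f}   t(nu=5): {t_rate:.5f}")
```

With identical correlations, the t copula places more mass in the corner where all three traits are extreme at once, which is the behaviour the paragraph above describes.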
The four-tier Σ hierarchy, in one paragraph
The Gaussian copula needs a 29 × 29 correlation matrix Σ: the dependence structure we want to recover. Where does Σ come from? Where empirical data exists, we use it; where peer-reviewed meta-analyses exist, we use those; where neither exists, we build statistical bridges; where even the bridges are missing, we use explicit conservative uncertainty. The four tiers, in order of priority: Tier 0 is empirical correlations from five open multi-instrument datasets (Anglim 2017 + 2020 + 2024, the Open-Source Psychometrics SD3 dataset, and Revelle's SAPA Personality Inventory; 88 cells of Σ are filled this way); Tier A is direct meta-analytic estimates from the published literature (29 cells; Schmitt 2008, Lee & Ashton 2020, Roberts 2006, etc.); Tier B is Wright (1934) single-bridge path-implied estimates with a 0.7 attenuation hedge (120 cells; what we get when we have A↔B and B↔C but not A↔C directly); Tier C is a weakly-informative shrinkage prior, N(0, 0.10) truncated to (−0.30, 0.30) (169 cells; what we get when we have nothing else and the honest answer is "we believe the correlation is small"). Together the tiers fill all 406 unique off-diagonal cells of Σ (29 × 28 / 2 = 88 + 29 + 120 + 169). The assembled 29 × 29 matrix is then projected to the nearest positive-definite correlation matrix using the Higham (2002) algorithm, which guarantees a mathematically valid copula; the projected matrix sits at a relative Frobenius distance of ≈ 0.05 from the source-of-truth correlations. Every cell of Σ carries its tier and source in a provenance manifest; reviewers can trace any correlation back to the dataset or meta-analysis it came from.
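Two steps of this pipeline lend themselves to a short sketch: the Tier B bridge estimate and the final projection. The code below assumes NumPy, reads the "0.7 attenuation hedge" as a multiplicative factor on the product of the two known correlations (an interpretation, not confirmed by this page), and implements a plain alternating-projections routine in the style of Higham (2002) with a small eigenvalue floor. The four-trait matrix and all its numbers are toy placeholders chosen to be mutually inconsistent.

```python
import numpy as np

def bridge_estimate(r_ab, r_bc, attenuation=0.7):
    """Path-implied A-C correlation through one bridge B, shrunk toward zero.
    Assumes the attenuation hedge is a simple multiplier."""
    return attenuation * r_ab * r_bc

def nearest_correlation(a, eig_floor=1e-8, tol=1e-10, max_iter=1000):
    """Alternating projections with Dykstra's correction (after Higham 2002)."""
    y = (a + a.T) / 2.0
    correction = np.zeros_like(y)
    for _ in range(max_iter):
        r = y - correction
        # Project onto symmetric matrices with eigenvalues >= eig_floor.
        vals, vecs = np.linalg.eigh(r)
        x = (vecs * np.maximum(vals, eig_floor)) @ vecs.T
        correction = x - r
        # Project onto matrices with a unit diagonal.
        y_next = x.copy()
        np.fill_diagonal(y_next, 1.0)
        done = np.linalg.norm(y_next - y) <= tol * np.linalg.norm(y_next)
        y = y_next
        if done:
            break
    return y

def relative_frobenius(a, b):
    """Frobenius norm of (a - b) relative to the norm of a."""
    return float(np.linalg.norm(a - b) / np.linalg.norm(a))

# Toy traits A, B, C, D. The A-B, B-C, and A-C cells come from sources that
# disagree with each other; the A-D and C-D cells are bridged through B.
r_ab, r_bc, r_ac, r_bd = 0.9, 0.9, -0.3, 0.5
r_ad = bridge_estimate(r_ab, r_bd)
r_cd = bridge_estimate(r_bc, r_bd)
raw = np.array([[1.0,  r_ab, r_ac, r_ad],
                [r_ab, 1.0,  r_bc, r_bd],
                [r_ac, r_bc, 1.0,  r_cd],
                [r_ad, r_bd, r_cd, 1.0]])

sigma = nearest_correlation(raw)
distance = relative_frobenius(raw, sigma)
```

The raw matrix here has a negative eigenvalue, so it could not drive a copula as assembled; the projected matrix can, and the distance reports how far the repair moved it.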
Demographic conditioning: at persona generation, never at scoring
This is the section that matters most for B2B research-buyer customers and for legal compliance.
At persona generation, we condition on age and sex on the latent layer before transforming the multivariate normal draw into trait values. Why: real human personality varies systematically with age and sex. Women score slightly higher on Agreeableness on average; older adults score higher on Conscientiousness; the Dark Triad shows different patterns by sex. A persona corpus that ignored these patterns would not be a realistic distribution; the synthetic-of-synthetic validation would catch this immediately.
At scoring time, demographic information is never an input feature to the downstream scoring engine. The PSL → IRT → conformal → BBN chain that produces a trait estimate from a conversation does not see the speaker's age or sex. Calibration is global, with measurement-invariance testing as the discipline that proves the global model is fair across demographic boundaries.
This is the demographic-blind scoring policy. It is the standard psychometric playbook (Cheung & Rensvold 2002 for the invariance methodology; Embretson & Reise 2000 chapter 12 for the recalibration discipline) and it is the right answer for both scientific and commercial reasons:
- Scientifically: per-subgroup parameter sets would mean different "true" trait scores for different demographics, which is exactly what measurement-invariance testing exists to prevent.
- Commercially: the "same trait inferences regardless of who your user base is" guarantee is the strongest single pitch we make to enterprise customers. It is also what keeps the system off the high-risk-profiling list of the EU AI Act.
The Phase-3 recalibration against real-user data is what proves this discipline. We will run measurement-invariance testing across age × sex × education and report the results. The policy has three parts: we condition at generation to make the corpus realistic, we measure invariance at scoring to prove the fits are fair, and we recalibrate against real users to refine the parameters.
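The generation-side half of the policy can be sketched as a mean shift on the latent layer. This is a minimal illustration assuming NumPy and SciPy; the two-trait setup, the effect sizes (0.15 and 0.20 latent SD units), the 0.25 latent correlation, and the function names are hypothetical placeholders, not the values derived from the cited meta-analyses. Ages are drawn from the uniform 18–70 prior.

```python
import numpy as np
from scipy import stats

TRAITS = ("conscientiousness", "agreeableness")

def latent_means(ages, sexes):
    """Hypothetical latent-mean shifts in SD units (placeholder effect sizes)."""
    # Uniform(18, 70) has mean 44 and standard deviation of about 15.
    age_z = (ages - 44.0) / 15.0
    shift_c = 0.15 * age_z                            # older, higher Conscientiousness
    shift_a = np.where(sexes == "female", 0.20, 0.0)  # small Agreeableness shift
    return np.column_stack([shift_c, shift_a])

def generate_personas(n, corr, rng):
    """Condition on age and sex on the latent layer, then transform."""
    ages = rng.uniform(18.0, 70.0, size=n)
    sexes = rng.choice(["female", "male"], size=n)
    chol = np.linalg.cholesky(corr)
    noise = rng.standard_normal((n, corr.shape[0])) @ chol.T
    z = latent_means(ages, sexes) + noise  # demographics enter only here
    # Percentile scores relative to an unshifted standard-normal latent.
    return ages, sexes, stats.norm.cdf(z)

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.25],
                 [0.25, 1.0]])
ages, sexes, scores = generate_personas(20_000, corr, rng)

older_c = scores[ages >= 55, 0].mean()
younger_c = scores[ages <= 30, 0].mean()
female_a = scores[sexes == "female", 1].mean()
male_a = scores[sexes == "male", 1].mean()
```

Age and sex appear only inside generate_personas. Under the policy described above, nothing downstream of the generated conversation would take ages or sexes as an argument.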
The same logic applies to attachment-cluster modelling: we use the published Mickelson, Shaver & Kessler (1997) base rates as the target distribution for the four-style attachment taxonomy at persona generation, but the scoring engine does not predict attachment style — it estimates the two continuous attachment dimensions (Anxiety, Avoidance) and lets the user (or the application layer) decide whether to bin them into clusters.
The seventy-five-template scenario library
A persona is half the corpus. The other half is a scenario — the conversational situation that exercises the persona's traits.
Our scenario library is approximately seventy-five stratified templates. Each scenario is structured: it sets a conversational context (a workplace conflict, a moral dilemma, a self-reflective check-in, a social-coordination problem) and elicits roughly twelve turns of dialogue from the persona. The stratification matters: every persona sees a deliberate slice of life situations rather than a random sample.
Why stratification? Because different scenarios exercise different constructs. A scenario about "deciding whether to break a promise to help someone in need" exercises moral foundations (Care vs Loyalty), Honesty-Humility, and emotional intelligence; a scenario about "planning a complex project with shifting priorities" exercises Conscientiousness, cognitive flexibility, and resilience. If we used a random sample of scenarios, some constructs would receive thin evidence and others would be over-represented. The stratification ensures the calibration corpus has approximately balanced coverage across constructs.
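A stratified draw of this kind can be sketched with the standard library alone. The stratum names, the even 5 × 15 split of the seventy-five templates, and the two-per-stratum quota are hypothetical choices for illustration; this page does not specify how many strata exist or how many scenarios each persona sees.

```python
import random
from collections import Counter

def assign_scenarios(library, per_stratum, rng):
    """Pick `per_stratum` templates from every stratum for one persona."""
    by_stratum = {}
    for scenario_id, stratum in library:
        by_stratum.setdefault(stratum, []).append(scenario_id)
    chosen = []
    for stratum in sorted(by_stratum):
        # Sampling without replacement inside each stratum keeps coverage even.
        chosen.extend(rng.sample(by_stratum[stratum], per_stratum))
    return chosen

# Hypothetical strata; 5 strata x 15 templates = 75 templates.
STRATA = ("workplace_conflict", "moral_dilemma", "self_reflection",
          "social_coordination", "project_planning")
library = [(f"{stratum}_{i:02d}", stratum) for stratum in STRATA for i in range(15)]

rng = random.Random(0)
picked = assign_scenarios(library, per_stratum=2, rng=rng)

# Count how many picked scenarios fall in each stratum.
coverage = Counter(scenario_id.rsplit("_", 1)[0] for scenario_id in picked)
```

A purely random draw of ten templates from the same library could easily miss a stratum entirely; the per-stratum quota rules that out by construction.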
The scenario library is locked. Phase-1, Phase-2, and Phase-3 all use the same stratified set. Locking the scenarios means the calibration parameters we fit are about the measurement model, not about scenario drift. If we changed scenarios mid-flight, we could not tell whether a calibration shift was because the model is learning or because the inputs changed.
Citation roster
This page draws on the methodology that earned its place in published psychometrics over the past sixty-plus years:
- Sklar (1959), Joe (2014) — copula theory generally
- Demarta & McNeil (2005) — t-copula tail dependence
- Embrechts, McNeil & Straumann (2002) — why correlation alone is insufficient outside elliptical distributions
- Higham (2002) — nearest-PD projection
- Schmitt et al. (2008), Lee & Ashton (2020), Friesdorf, Conway & Gawronski (2015), Roberts, Walton & Viechtbauer (2006) — sex / age conditioning meta-analyses
- Mickelson, Shaver & Kessler (1997) — adult attachment base rates
- Anglim et al. (2017, 2020, 2024), Condon & Revelle (2017) — open empirical datasets backboning Tier 0 of Σ
- Cheung & Rensvold (2002), Embretson & Reise (2000) — measurement invariance + IRT discipline
The complete bibliography with primary-source URLs and one-sentence "what this paper gave us" annotations sits at /about/science/citations.
What this design does not claim
To set expectations honestly:
- Synthetic personas are not real users. Phase-1 calibration parameters will need recalibration against real-user data in Phase-3. We commit to that explicitly.
- The dependence model is the best one we can build from the published evidence. It is not a perfect representation of reality; it is a defensible representation of what the published evidence supports.
- The block-t copula on the Dark Triad uses ν = 5. This is the published value (Demarta & McNeil 2005); we ship Phase-1 at ν = 5 and hold a sensitivity check at ν = 8 in reserve for Phase-2 if the data suggests it.
- Age conditioning uses a uniform 18–70 prior at Phase-1. Real-user demographics will replace this in Phase-3.
- Cultural conditioning is deferred. Cross-cultural personality variation is real (Schmitt 2008 is itself a fifty-five-culture study), but adding cultural axes adds another five-to-ten dimensions to the conditioning structure. We defer this to Phase-3 once the real-user cohort tells us which subgroups actually matter for our customer base.
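The ν = 5 versus ν = 8 sensitivity check mentioned in the list above can be previewed analytically. For a bivariate t copula with correlation ρ and ν degrees of freedom, the tail-dependence coefficient is λ = 2 · t_{ν+1}(−√((ν + 1)(1 − ρ)/(1 + ρ))), where t_{ν+1} is the Student-t CDF with ν + 1 degrees of freedom. The sketch below assumes SciPy, and the ρ = 0.5 value is a placeholder rather than a Dark Triad estimate.

```python
import math
from scipy import stats

def t_copula_tail_dependence(rho, nu):
    """Tail-dependence coefficient of a bivariate t copula (upper = lower)."""
    arg = -math.sqrt((nu + 1.0) * (1.0 - rho) / (1.0 + rho))
    return 2.0 * float(stats.t.cdf(arg, df=nu + 1.0))

rho = 0.5  # placeholder pairwise correlation
lam_5 = t_copula_tail_dependence(rho, nu=5)
lam_8 = t_copula_tail_dependence(rho, nu=8)
print(f"nu=5: {lam_5:.3f}   nu=8: {lam_8:.3f}")
```

Raising ν moves the copula toward the Gaussian case, whose tail dependence is zero for any correlation below one, so the ν = 8 setting is the lighter-tailed of the two.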
The next page, the calibration loop, explains how we turn this corpus into calibrated trait parameters using the same item-response theory that scores the SAT and the LSAT.