Roadmap

Three phases, with honest dates and honest scope. Phase 1 — the foundational validation — is complete, and the readings SoulMap produces today ship on the pipeline it proved out. Phase 2 is the large-scale calibration campaign; Phase 3 is the first real-user recalibration. Each phase raises the evidential authority of the readings; none of them is a precondition for the readings being useful now. This page also names what is explicitly not on the roadmap, because what we are not building matters as much as what we are.

Phase 1 — done (foundational validation)

Status: complete. The end state is recorded in docs/shakedown_ledger.md (calibration pipeline) and the Foundation v1 receipts (platform refactor).

What Phase 1 produced:

The Σ-target correlation matrix on 29 trait dimensions across 10 instruments, four-tier hierarchy with full per-cell provenance (88 empirical + 29 meta-analytic + 120 path-implied + 169 shrinkage-prior cells; relative Frobenius distance ≈ 0.05 from source; Higham 2002 nearest-PD projection).
The persona library v2: 2,000 personas drawn from the Gaussian-copula architecture, with a block-t copula (ν = 5) on the Dark-Triad triplet, additive sex + age conditioning on the latent layer, and skew-normal margins on Honesty-Humility and the Dark Triad.
The scenario library v1: ~75 stratified templates, locked across all phases.
The Vertex AI batch-prediction pipeline shipped end-to-end on Google Cloud, EU-residency-preserving except for the /global/-only publisher model, with idempotent BigQuery ingest via load-job + MERGE.
The hierarchical Bayesian GRM trainer (NumPyro / JAX, NUTS, non-centered parameterisation), converged on real PSL-derived data at R-hat = 1.002, n_eff_min = 1,318, zero divergences (Probe 5).
The eight-check local QA harness that catches every bug class hit during the 18-smoke + 5-probe shakedown.
The cohort-keyed calibration profile registry that prevents Phase-1 smoke configurations from inheriting into Phase-2 production.
The Foundation v1 platform refactor (nexus_foundation_v1.md execution spec). Substantially landed: Workspace org neumatics.eu rooted; three workload projects live (neumatics-prod, neumatics-audit-logs, neumatics-network-host); CMEK keyring with HSM keys for high-sensitivity paths; AlloyDB regional cluster paused-by-default in prod (auto-idle by design); 24 per-domain Cloud Run services replacing the functions/main.py monolith with per-route cutover via BACKEND_ROUTING; Shared VPC + VPC-SC perimeter on prod; Datastream CDC AlloyDB → BigQuery; aggregated org-level audit sink to a sealed audit project; PAM mediator for per-grant elevation delays; audit alerter Cloud Run job. Receipts at /about/cloud-infrastructure/receipts; remaining ⏳ items (Firestore region cutover, Vertex Reasoning Engine redeploy in eu-west4, Knowledge Catalog tagging baseline) sequenced into the Phase-2 window.
This documentation surface — three audience tiers across /about/science, /about/cloud-infrastructure, and the gated /about/reports.

Phase 1 is the receipt that the inferential machinery works end-to-end on data it has never seen. The readings SoulMap ships today run on this validated pipeline.
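The provenance counts quoted for the Σ-target matrix can be checked by hand: a symmetric 29 × 29 correlation matrix has 29 · 28 / 2 = 406 unique off-diagonal cells, and the four tiers sum to exactly that. The sketch below encodes the check and shows one common way to compute a relative Frobenius distance. The function names are hypothetical, and the normalisation (difference norm divided by source norm) is an assumption; the source does not state which definition the ≈ 0.05 figure uses.

```python
from math import comb, sqrt

N_TRAITS = 29

# Provenance tiers as quoted in the Σ-target item above.
TIER_CELLS = {
    "empirical": 88,
    "meta_analytic": 29,
    "path_implied": 120,
    "shrinkage_prior": 169,
}


def off_diagonal_cells(n: int) -> int:
    """Unique off-diagonal cells in an n x n symmetric matrix."""
    return comb(n, 2)


def relative_frobenius(source, projected) -> float:
    """||projected - source||_F / ||source||_F for equal-shape nested lists.

    Assumed definition; the document does not spell out its normalisation.
    """
    diff_sq = sum(
        (p - s) ** 2
        for row_s, row_p in zip(source, projected)
        for s, p in zip(row_s, row_p)
    )
    src_sq = sum(s ** 2 for row in source for s in row)
    return sqrt(diff_sq / src_sq)


# Every off-diagonal cell is accounted for by exactly one tier.
assert sum(TIER_CELLS.values()) == off_diagonal_cells(N_TRAITS)  # 406 == 406
```

A distance of 0.0 means the projection left the source matrix unchanged; larger values mean the nearest-PD projection moved the matrix further from its source.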

Phase 2 — large-scale calibration, ready to execute

Status: design locked; the next stage of the validation programme. No further architectural work is needed.

What Phase 2 will produce:

One full corpus generation: 2,000 personas × 50 sessions × 12 turns = 1.2 M turns of conversational evidence; ~3 M LLM calls; 100,000 (persona, session) cells.
Ten calibration iterations: PSL re-score + family-parallel hierarchical GRM Bayesian fit + conformal recalibration + BBN CPT fit + Gate 8 / 9 / 10 verdicts.
Per-construct IRT parameters with reliability bounds + DIF (differential-item-functioning) flags.
Targeted test-retest sub-corpus: 100 personas × 2 paired sessions, run during Phase 2, to estimate intra-persona consistency (ICC).
Final reliability memo + production-deployable calibration model.
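The corpus figures above follow from simple multiplication. The sketch below reproduces them; the constant and function names are illustrative, and the calls-per-turn ratio is only as precise as the "~3 M" estimate it is derived from.

```python
PERSONAS = 2_000
SESSIONS_PER_PERSONA = 50
TURNS_PER_SESSION = 12
APPROX_LLM_CALLS = 3_000_000  # "~3 M" in the text; an approximate figure


def corpus_size(personas: int, sessions_per_persona: int, turns_per_session: int) -> dict:
    """Return the number of (persona, session) cells and conversational turns."""
    cells = personas * sessions_per_persona
    return {"cells": cells, "turns": cells * turns_per_session}


size = corpus_size(PERSONAS, SESSIONS_PER_PERSONA, TURNS_PER_SESSION)
calls_per_turn = APPROX_LLM_CALLS / size["turns"]
retest_sessions = 100 * 2  # test-retest sub-corpus: 100 personas x 2 paired sessions

print(size)             # {'cells': 100000, 'turns': 1200000}
print(calls_per_turn)   # 2.5
print(retest_sessions)  # 200
```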

What Phase 2 unlocks: the jump from pipeline-proven to corpus-calibrated. Today's readings ship on the Phase-1-validated inferential machinery; Phase 2 replaces the small-scale fits with per-construct IRT parameters estimated on the full 1.2 M-turn corpus. That moves every trait estimate to the next level of evidential authority — tighter reliability bounds, DIF screening on every item, and a production calibration model with published receipts.

Phase-2 timing: roughly two weeks of wall-clock time once the campaign starts. The bottlenecks are the calibration trainer's run time (~30 minutes per family per iteration) and the operator review between iterations.

Phase-2 hardening items (the ⏳-marked items from R5) should land before iteration 10 is cut:

Drill the auto-pause pathway end-to-end.
Build a CHAOS_FAULT-flagged worker image → drill the mid-shard crash recovery.
Drill the idempotency stress test (six concurrent re-fires of the same (iteration, shard_index)).
Resolve the three BigQuery write-path divergences flagged in R7.
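To make the idempotency drill concrete, here is a minimal sketch of the property it tests: six concurrent re-fires of the same (iteration, shard_index) must leave exactly one copy of each row. The `ShardStore` class is a hypothetical in-memory stand-in for the MERGE target, not the production BigQuery path, and the `row_id` key is an assumption made for illustration.

```python
import threading


class ShardStore:
    """In-memory stand-in for a MERGE target keyed on (iteration, shard_index, row_id)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def merge(self, iteration, shard_index, rows):
        # Upsert by key: re-writing an existing key overwrites it, never duplicates it.
        with self._lock:
            for row_id, payload in rows.items():
                self._rows[(iteration, shard_index, row_id)] = payload

    def count(self):
        with self._lock:
            return len(self._rows)


def refire(store, iteration, shard_index, rows, times=6):
    """Fire the same shard write `times` times concurrently; return the final row count."""
    threads = [
        threading.Thread(target=store.merge, args=(iteration, shard_index, rows))
        for _ in range(times)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.count()


rows = {f"turn-{i}": {"score": i} for i in range(12)}
store = ShardStore()
print(refire(store, iteration=3, shard_index=7, rows=rows))  # 12
```

The drill passes when the row count after the re-fires equals the row count of a single write. An append-only ingest would instead show a multiple of it.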

The 429 backoff drill is gated on Vertex AI quota actually saturating, which Phase-2's full corpus may incidentally exercise; if not, we will drill it deliberately.
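For reference, the behaviour a 429 drill would exercise can be sketched as capped exponential backoff with full jitter. This is a generic pattern, not the pipeline's actual retry code: `RateLimited` is a hypothetical stand-in for an HTTP 429 from the prediction API, and the delay parameters are illustrative defaults.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 (quota exhausted) response."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=32.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on RateLimited, wait a jittered, exponentially growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)


# Usage: a call that is rate-limited twice and then succeeds.
state = {"calls": 0}


def flaky_predict():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RateLimited()
    return "ok"


# A no-op sleep is injected so the example runs instantly.
print(call_with_backoff(flaky_predict, sleep=lambda seconds: None))  # ok
```

Injecting `sleep` and `rng` keeps the drill deterministic: a test can record the requested delays without actually waiting.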

Phase 3 — real-user invariance recalibration

Status: planning; gated on the live SoulMap user base growing to a defensible recalibration cohort.

What Phase 3 will produce:

A recalibration of the Phase-2 IRT parameters against real-user conversational evidence, with explicit consent flow.
Measurement-invariance testing across age × sex × education × geography, with subgroup-level diagnostic outputs.
A real-user cohort distribution to replace the Phase-2 uniform 18–70 age prior.
Updated trait estimates with real-population calibration anchors.
A second-generation calibration model that the consumer product ships against.

Phase-3 timing: dependent on SoulMap user growth; the IRB-equivalent consent flow and the cohort assembly are the long lead-time items. Earliest expected start: late 2026; realistic: 2027.

The Phase-3 design intentionally inherits from the Phase-2 design. We do not plan to redesign the IRT model, the conformal layer, or the BBN; we plan to refit them on real-user data and report what changed.

What's NOT on the roadmap

Boundaries:

No clinical assessment claims, ever. The 29 traits are normal-range psychological tendencies, not psychiatric diagnoses. The system is not a screening instrument; we do not plan to build one.
No cultural conditioning in Phase 2. Cross-cultural personality variation is real (Schmitt 2008 documents it across 55 cultures), but adding cultural axes would add 5–10 dimensions to the conditioning structure. We defer this to Phase 3+, once the real-user cohort tells us which subgroups actually matter for the SoulMap user base.
No per-user model. Calibration is global. Per-user models would mean different "true" trait scores per user, which is not what the system is for. The conformal layer plus the IRT machinery already produce per-user estimates with uncertainty without per-user parameter fitting.
No personalisation pipelines. SoulMap is a measurement instrument, not a recommender system. The system produces calibrated trait estimates; it does not predict "what the user wants next."
No real-time online learning. Calibration runs offline as a periodic batch job against the synthetic corpus (Phase-2) and against the real-user cohort (Phase-3). New calibrations replace old ones only if they pass all four KPIs.
No dimensional expansion beyond 29 in the foreseeable future. Adding traits costs both calibration data and inferential complexity. The 29 dimensions cover the well-established personality structure; expanding would require a defensible reason and the data to support it.
No demographic features as inputs to the scoring engine. This is the demographic-blind scoring policy (personas page). Hard line.

What would falsify the plan

A note we owe ourselves and reviewers: not every phase plan survives contact with reality. The conditions under which we would change the plan:

If Phase-2 calibration produces R-hat > 1.05 or n_eff_min < 200 on a converged-by-design model, we abort the iteration and investigate the trainer rather than ship the parameters.
If Phase-2 conformal coverage drifts more than 5 percentage points from target on more than two traits, we hold the calibration at the prior version and investigate the conformal layer rather than ship.
If the targeted test-retest sub-corpus produces ICC < 0.50 on more than two traits, we revisit the scenario library.
If Phase-3 real-user invariance testing produces ΔCFI > 0.02 on age × sex × education on more than four traits, we add invariance constraints to the IRT fit and hold the recalibration until the constraints land.
If a measurement-invariance violation surfaces that the methodology cannot fix without per-subgroup parameters, we will publish the violation rather than ship per-subgroup parameters.
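The numeric falsifiers above can be written down as a single gate. The sketch below is one possible encoding, not the project's actual gate code: the thresholds are the ones quoted in the list, while the function name, field names, and per-trait dictionaries are hypothetical. The last condition, publishing an unfixable invariance violation, is a judgement call and is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    rhat_max: float = 1.05
    n_eff_floor: float = 200
    coverage_drift_pp: float = 5.0   # percentage points from target
    coverage_trait_limit: int = 2    # "more than two traits"
    icc_floor: float = 0.50
    icc_trait_limit: int = 2         # "more than two traits"
    delta_cfi_max: float = 0.02
    delta_cfi_trait_limit: int = 4   # "more than four traits"


def tripped_falsifiers(rhat, n_eff_min, coverage_drift_pp, icc, delta_cfi, t=Thresholds()):
    """Return the actions triggered by the supplied diagnostics (empty list = proceed).

    coverage_drift_pp, icc and delta_cfi map trait name -> value.
    """
    tripped = []
    if rhat > t.rhat_max or n_eff_min < t.n_eff_floor:
        tripped.append("abort iteration and investigate the trainer")
    drifting = [k for k, d in coverage_drift_pp.items() if abs(d) > t.coverage_drift_pp]
    if len(drifting) > t.coverage_trait_limit:
        tripped.append("hold calibration at prior version and investigate the conformal layer")
    unstable = [k for k, v in icc.items() if v < t.icc_floor]
    if len(unstable) > t.icc_trait_limit:
        tripped.append("revisit the scenario library")
    non_invariant = [k for k, v in delta_cfi.items() if v > t.delta_cfi_max]
    if len(non_invariant) > t.delta_cfi_trait_limit:
        tripped.append("add invariance constraints and hold the recalibration")
    return tripped


# The Probe 5 convergence figures from Phase 1 clear the trainer check.
print(tripped_falsifiers(rhat=1.002, n_eff_min=1318,
                         coverage_drift_pp={}, icc={}, delta_cfi={}))  # []
```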

These conditions are explicit because they are how we keep the plan honest. The most useful methodology-level discipline is naming the falsifiers.

The next page, citations, is the alphabetical bibliography that backs every footnote on every page in this tier.