Roadmap

Three phases, with honest dates and honest scope. Phase 1 is done; Phase 2 is the milestone-review ask; Phase 3 is the first real-user calibration. This page also names what is explicitly not on the roadmap, because what we are not building matters as much as what we are.


Phase 1 — done (milestone-ask deliverable)

Status: complete. The end state is recorded in docs/shakedown_ledger.md (calibration pipeline) and the Foundation v1 receipts (platform refactor).

What Phase 1 produced:

  • The Σ-target correlation matrix on 29 trait dimensions across 10 instruments, four-tier hierarchy with full per-cell provenance (88 empirical + 29 meta-analytic + 120 path-implied + 169 shrinkage-prior cells; relative Frobenius distance ≈ 0.05 from source; Higham 2002 nearest-PD projection).
  • The persona library v2 — 2,000 personas drawn from the Gaussian-copula architecture with block-t copula on the Dark-Triad triplet at ν = 5, additive sex + age conditioning on the latent layer, skew-normal margins on Honesty-Humility and Dark Triad.
  • The scenario library v1 — ~75 stratified templates locked across all phases.
  • The Vertex AI batch-prediction pipeline shipped end-to-end on Google Cloud, EU-residency-preserving except for the /global/-only publisher model, with idempotent BigQuery ingest via load-job + MERGE.
  • The hierarchical Bayesian GRM trainer (NumPyro / JAX, NUTS, non-centered parameterisation), converged on real PSL-derived data at R-hat = 1.002, n_eff_min = 1,318, zero divergences (Probe 5).
  • The eight-check local QA harness that catches every bug class hit during the 18-smoke + 5-probe shakedown.
  • The cohort-keyed calibration profile registry that prevents Phase-1 smoke configurations from inheriting into Phase-2 production.
  • The Foundation v1 platform refactor (nexus_foundation_v1.md execution spec). Substantially landed: Workspace org neumatics.eu rooted; three workload projects live (neumatics-prod, neumatics-audit-logs, neumatics-network-host); CMEK keyring with HSM keys for high-sensitivity paths; AlloyDB regional cluster paused-by-default in prod (~€95–100/mo combined run-rate at pre-launch scale, ~€350/mo saved vs always-on); 24 per-domain Cloud Run services replacing the functions/main.py monolith with per-route cutover via BACKEND_ROUTING; Shared VPC + VPC-SC perimeter on prod; Datastream CDC AlloyDB → BigQuery; aggregated org-level audit sink to a sealed audit project; PAM mediator for per-grant elevation delays; audit alerter Cloud Run job. Receipts at /about/cloud-infrastructure/receipts; remaining ⏳ items (staging project, Firestore region cutover, Vertex Reasoning Engine redeploy in eu-west4, Knowledge Catalog tagging baseline) sequenced into Phase-2-window work.
  • This documentation surface — three audience tiers across /about/science, /about/cloud-infrastructure, and the gated /about/reports.
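
As a quick consistency check on the Σ-target bullet above: a symmetric correlation matrix on 29 dimensions has 29 × 28 / 2 = 406 unique off-diagonal cells, and the four provenance tiers should partition exactly that set. A minimal sketch of the check follows; the variable names are illustrative and not taken from the pipeline.

```python
# Cross-check: the four provenance tiers should cover every unique
# off-diagonal cell of a 29 x 29 correlation matrix exactly once.
# All names here are illustrative, not taken from the pipeline code.
N_TRAITS = 29

TIER_CELLS = {
    "empirical": 88,
    "meta_analytic": 29,
    "path_implied": 120,
    "shrinkage_prior": 169,
}


def unique_off_diagonal_cells(n: int) -> int:
    """Number of distinct (i, j) pairs with i < j in an n x n symmetric matrix."""
    return n * (n - 1) // 2


expected_cells = unique_off_diagonal_cells(N_TRAITS)
covered_cells = sum(TIER_CELLS.values())

# Share of the matrix contributed by each tier.
tier_share = {tier: count / expected_cells for tier, count in TIER_CELLS.items()}

print(expected_cells, covered_cells)
```

Under these counts, roughly 22% of cells are directly empirical and roughly 42% rest on the shrinkage prior.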

Phase-1 spend: ~$140 of the $500 Start Tier budget (~28%) for the calibration pipeline, plus the Foundation v1 platform run-rate, which holds at ~€100–150/mo at pre-launch scale. Buffer remaining: ~$360 within the Start Tier credit window.


Phase 2 — milestone-review ask, ready to execute

Status: awaits milestone-review approval and a roles/billing.user grant. No further architectural work is needed.

What Phase 2 will produce:

  • One full corpus generation: 2,000 personas × 50 sessions × 12 turns = 1.2 M turns of conversational evidence; ~3 M LLM calls; 100,000 (persona, session) cells.
  • Ten calibration iterations: PSL re-score + family-parallel hierarchical GRM Bayesian fit + conformal recalibration + BBN CPT fit + Gate 8 / 9 / 10 verdicts.
  • Per-construct IRT parameters with reliability bounds + DIF (differential-item-functioning) flags.
  • Targeted test-retest sub-corpus: 100 personas × 2 paired sessions during Phase 2 for intra-persona consistency ICC.
  • Final reliability memo + production-deployable calibration model.
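
For the test-retest sub-corpus, the roadmap does not state which ICC form the reliability memo will report. The sketch below assumes the one-way random-effects form, ICC(1,1), applied to one trait across personas with two paired sessions each. The function name and data layout are hypothetical.

```python
from statistics import mean


def icc_oneway(pairs: list[tuple[float, float]]) -> float:
    """ICC(1,1) for n subjects each measured k = 2 times.

    Assumption: one-way random-effects model. The roadmap does not say
    which ICC variant is used, so treat this as one plausible choice.
    """
    n, k = len(pairs), 2
    if n < 2:
        raise ValueError("need at least two subjects")

    grand_mean = mean(x for pair in pairs for x in pair)
    subject_means = [mean(pair) for pair in pairs]

    # Between-subject mean square: spread of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)

    # Within-subject mean square: disagreement between a subject's two sessions.
    ms_within = sum(
        (x - m) ** 2 for pair, m in zip(pairs, subject_means) for x in pair
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denominator


# Toy data: three personas whose paired sessions agree perfectly.
print(icc_oneway([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]))
```

With 100 personas per trait, `pairs` would hold 100 tuples, one per persona, and the function would be called once per trait.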

Phase-2 forecast (re-derived post-shakedown): $406–$580 per iteration; $4,060–$5,800 across the ten-iteration campaign. Pessimistic worst case: ~$8,000. All scenarios stay under the $25,000 AI Tier cap with $17,000+ of buffer. The full forecast and sensitivity tornado are on the cost engineering page; the per-iteration cost-lever attribution is in the gated R4 FinOps report.
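
The campaign totals and the buffer figure follow directly from the per-iteration range. A minimal sketch of that arithmetic, using only the numbers quoted above (the constant names are illustrative):

```python
# Forecast arithmetic using only the figures quoted in this section.
PER_ITERATION_USD = (406, 580)   # low and high per-iteration estimates
ITERATIONS = 10
WORST_CASE_USD = 8_000           # pessimistic campaign total
AI_TIER_CAP_USD = 25_000

campaign_low = PER_ITERATION_USD[0] * ITERATIONS
campaign_high = PER_ITERATION_USD[1] * ITERATIONS

# Headroom under the cap even if the pessimistic case materialises.
buffer_at_worst_case = AI_TIER_CAP_USD - WORST_CASE_USD
worst_case_share_of_cap = WORST_CASE_USD / AI_TIER_CAP_USD

print(campaign_low, campaign_high, buffer_at_worst_case)
```

Even the pessimistic case uses about a third of the cap, which is where the $17,000+ buffer comes from.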

Phase-2 timing: roughly two weeks of wall-clock time once approval lands. The bottlenecks are the calibration trainer's wall-clock time (~30 minutes per family per iteration) and the operator review between iterations.

Phase-2 hardening items (the ⏳-marked items from R5) should land before iteration 10 cuts:

  1. Grant roles/billing.user → configure Cloud Billing budgets → drill the auto-pause pathway.
  2. Build a CHAOS_FAULT-flagged worker image → drill the mid-shard crash recovery.
  3. Drill the idempotency stress test (six concurrent re-fires of the same (iteration, shard_index)).
  4. Resolve the three BigQuery write-path divergences flagged in R7.
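
The idempotency stress test in item 3 can be rehearsed locally before it is run against the real write path. The sketch below is an in-memory stand-in only: the production pipeline gets its idempotency from the BigQuery load-job + MERGE ingest, and `ShardLedger` and `drill` are hypothetical names introduced for illustration. It fires six threads at the same (iteration, shard_index) key and lets exactly one of them claim the work.

```python
import threading


class ShardLedger:
    """In-memory stand-in for a durable claim table keyed by (iteration, shard_index)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: set[tuple[int, int]] = set()

    def try_claim(self, iteration: int, shard_index: int) -> bool:
        """Return True for exactly one caller per key, False for every duplicate."""
        key = (iteration, shard_index)
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def drill(ledger: ShardLedger, iteration: int, shard_index: int, refires: int = 6) -> list[bool]:
    """Fire `refires` concurrent claims at the same key and collect the outcomes."""
    barrier = threading.Barrier(refires)  # release all threads at the same moment
    outcomes: list[bool] = []
    outcomes_lock = threading.Lock()

    def fire() -> None:
        barrier.wait()
        won = ledger.try_claim(iteration, shard_index)
        with outcomes_lock:
            outcomes.append(won)

    threads = [threading.Thread(target=fire) for _ in range(refires)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


outcomes = drill(ShardLedger(), iteration=3, shard_index=17)
print(sum(outcomes), "of", len(outcomes), "re-fires did the work")
```

The pass condition for the drill is the same as the assertion one would make against the real tables: one winner, five no-ops, and no duplicate rows.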

The 429 backoff drill is gated on Vertex AI quota actually saturating, which Phase-2's full corpus may incidentally exercise; if not, we will drill it deliberately.
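
The roadmap does not specify the retry policy the drill will exercise. As one common choice, the sketch below uses capped exponential backoff with full jitter. `QuotaExceeded` and `call_with_backoff` are hypothetical names, and the exception is a stand-in for whatever the client raises on an HTTP 429.

```python
import random
import time


class QuotaExceeded(Exception):
    """Stand-in for an HTTP 429 / quota-exhausted response from the model endpoint."""


def call_with_backoff(
    request,
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Retry `request` on QuotaExceeded with capped exponential backoff and full jitter.

    `sleep` and `rng` are injectable so a drill can run without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)


# Deliberate drill: a request that is throttled twice, then succeeds.
attempts = {"count": 0}


def throttled_twice() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise QuotaExceeded()
    return "ok"


recorded_delays: list[float] = []
result = call_with_backoff(throttled_twice, sleep=recorded_delays.append, rng=lambda: 1.0)
print(result, recorded_delays)
```

Injecting a fake `sleep` is what makes a deliberate drill cheap: the retry schedule can be asserted without waiting for the quota to saturate.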


Phase 3 — real-user invariance recalibration

Status: planning; gated on the consumer Echo product reaching enough users for a defensible recalibration cohort.

What Phase 3 will produce:

  • A recalibration of the Phase-2 IRT parameters against real-user conversational evidence, with explicit consent flow.
  • Measurement-invariance testing across age × sex × education × geography, with subgroup-level diagnostic outputs.
  • A real-user cohort distribution to replace the Phase-2 uniform 18–70 age prior.
  • Updated trait estimates with real-population calibration anchors.
  • A second-generation calibration model that the consumer product ships against.

Phase-3 timing: dependent on Echo product growth; the IRB-equivalent consent flow and the cohort assembly are the long lead-time items. Earliest expected start: late 2026; realistic: 2027.

The Phase-3 design intentionally inherits from the Phase-2 design. We do not plan to redesign the IRT model, the conformal layer, or the BBN; we plan to refit them on real-user data and report what changed.


What's NOT on the roadmap

Boundaries:

  • No clinical assessment claims, ever. The 29 traits are normal-range psychological tendencies, not psychiatric diagnoses. The system is not a screening instrument; we do not plan to build one.
  • No cultural conditioning in Phase 2. Cross-cultural personality variation is real (Schmitt 2008 documents it across 55 cultures), but adding cultural axes would add 5–10 dimensions to the conditioning structure. We defer this to Phase 3+, once the real-user cohort tells us which subgroups actually matter for our customer base.
  • No per-user model. Calibration is global. Per-user models would mean different "true" trait scores per user, which is not what the system is for. The conformal layer plus the IRT machinery already produce per-user estimates with uncertainty without per-user parameter fitting.
  • No personalisation pipelines. We are a measurement vendor, not a recommender system. The system produces calibrated trait estimates; it does not predict "what the user wants next."
  • No real-time online learning. Calibration runs offline as a periodic batch job against the synthetic corpus (Phase-2) and against the real-user cohort (Phase-3). New calibrations replace old ones only if they pass all four KPIs.
  • No dimensional expansion beyond 29 in the foreseeable future. Adding traits costs both calibration data and inferential complexity. The 29 dimensions cover the well-established personality structure; expanding would require a defensible reason and the data to support it.
  • No demographic features as inputs to the scoring engine. This is the demographic-blind scoring policy (personas page). Hard line.

What would falsify the plan

A note we owe ourselves and reviewers: not every phase plan survives contact with reality. These are the conditions under which we would change the plan:

  • If Phase-2 calibration produces R-hat > 1.05 or n_eff_min < 200 on a converged-by-design model, we abort the iteration and investigate the trainer rather than ship the parameters.
  • If Phase-2 conformal coverage drifts more than 5 percentage points from target on more than two traits, we hold the calibration at the prior version and investigate the conformal layer rather than ship.
  • If the targeted test-retest sub-corpus produces ICC < 0.50 on more than two traits, we revisit the scenario library.
  • If Phase-3 real-user invariance testing produces ΔCFI > 0.02 on age × sex × education on more than four traits, we add invariance constraints to the IRT fit and hold the recalibration until the constraints land.
  • If a measurement-invariance violation surfaces that the methodology cannot fix without per-subgroup parameters, we will publish the violation rather than ship per-subgroup parameters.
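
The numeric falsifiers above are simple enough to encode as an automated gate. The sketch below is one way to do that. The `CalibrationDiagnostics` layout and the function name are hypothetical, and only the thresholds come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class CalibrationDiagnostics:
    """Hypothetical container for one iteration's diagnostics."""
    r_hat_max: float
    n_eff_min: float
    # Per-trait |observed - target| conformal coverage, in percentage points.
    coverage_drift_pp: dict[str, float] = field(default_factory=dict)
    # Per-trait test-retest ICC.
    retest_icc: dict[str, float] = field(default_factory=dict)
    # Per-trait delta-CFI from invariance testing (Phase 3 only).
    delta_cfi: dict[str, float] = field(default_factory=dict)


def falsifiers_tripped(d: CalibrationDiagnostics) -> list[str]:
    """Return the actions triggered by the diagnostics (empty list means proceed)."""
    actions: list[str] = []
    if d.r_hat_max > 1.05 or d.n_eff_min < 200:
        actions.append("abort iteration and investigate the trainer")
    if sum(1 for v in d.coverage_drift_pp.values() if abs(v) > 5.0) > 2:
        actions.append("hold at prior calibration and investigate the conformal layer")
    if sum(1 for v in d.retest_icc.values() if v < 0.50) > 2:
        actions.append("revisit the scenario library")
    if sum(1 for v in d.delta_cfi.values() if v > 0.02) > 4:
        actions.append("add invariance constraints and hold the recalibration")
    return actions


# The Probe 5 convergence figures from Phase 1 pass the trainer check.
probe_5 = CalibrationDiagnostics(r_hat_max=1.002, n_eff_min=1318)
print(falsifiers_tripped(probe_5))
```

The last falsifier in the list (publishing a violation rather than shipping per-subgroup parameters) is a policy decision and is deliberately left out of the automated check.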

These conditions are explicit because they are how we keep the plan honest. The most useful methodology-level discipline is naming the falsifiers.


The next page, citations, is the alphabetical bibliography that backs every footnote on every page in this tier.