Data ethics — ML training & EU compliance posture

The most common question we get from research buyers and AI labs is: "Can I license your data to train models in an EU-compliant way?" The short answer is yes, through three convergent compliance paths. This page walks through them in plain language. The /privacy and /terms pages cover the user-facing legal mechanics; this one covers the architectural and regulatory reasoning underneath.


The non-negotiables

Before any of the paths below, four invariants hold across every Neumatics product:

  • 18+ only. Minors are excluded from trait-scoring distributions entirely.
  • No clinical diagnosis surface. Outputs are research and personalisation grade; they are not medical advice and we do not market them as such.
  • No hiring or educational-assessment licensing. These uses are classified as high-risk under the EU AI Act and are explicitly gated out of the consumer marketplace.
  • No-train guarantee on the API. Buyer API request data never trains shared models. This is a contractual commitment on top of the architectural separation.
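
As a minimal sketch of how the first and third invariants could be enforced in code: the use-case labels, the LicenseRequest type, and both function names below are hypothetical and chosen for illustration, not taken from the Neumatics codebase.

```python
from dataclasses import dataclass

# Hypothetical labels: the real marketplace taxonomy is not described on this page.
BLOCKED_USE_CASES = {"hiring", "educational_assessment"}
MIN_AGE_YEARS = 18


@dataclass(frozen=True)
class LicenseRequest:
    buyer_id: str
    use_case: str


def refusal_reasons(request: LicenseRequest) -> list[str]:
    """Return why a request is refused; an empty list means it may proceed."""
    reasons = []
    if request.use_case in BLOCKED_USE_CASES:
        reasons.append(f"use case '{request.use_case}' is gated out of the marketplace")
    return reasons


def eligible_for_scoring(age_years: int) -> bool:
    """18+ only: younger users never enter trait-scoring distributions."""
    return age_years >= MIN_AGE_YEARS
```

Keeping the blocked list as data rather than scattered conditionals makes the gate easy to audit, which matters more here than flexibility.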

Path 1 — Granular, separate, withdrawable consent

GDPR Article 6(1)(a) recognises consent as a lawful basis for processing, and Article 9(2)(a) permits processing of special-category data, including special-category inferences, where that consent is explicit. Neumatics exposes a dedicated, separable consent toggle in user settings, distinct from product-operation consent:

"License my anonymized profile data for AI / ML training research."

The toggle is off by default (opt-in) and can be withdrawn at any time. Revoking it cascades:

  1. Future dataset exports exclude that user immediately.
  2. Already-licensed datasets are tagged with the consent vintage of the users they include; license terms require buyers to re-check that vintage before any model retrain.
  3. The user's raw data is removed from the BigQuery warehouse on the next DSAR cycle.
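
Steps 1 and 2 of the cascade can be sketched as follows. This is an illustration under assumptions: ConsentRecord, DatasetExport, and the function names are hypothetical, and the registry is a plain dictionary standing in for whatever store holds consent state. Step 3 (warehouse removal) is not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ConsentRecord:
    user_id: str
    training_license: bool  # the dedicated toggle; False until the user opts in
    updated_at: datetime    # when the toggle last changed


@dataclass(frozen=True)
class DatasetExport:
    dataset_id: str
    user_ids: frozenset[str]
    consent_vintage: datetime  # consent state was read at this moment


def withdraw(record: ConsentRecord, now: datetime) -> None:
    """Flip the toggle off and stamp the change."""
    record.training_license = False
    record.updated_at = now


def exportable_users(registry: dict[str, ConsentRecord]) -> list[str]:
    """Step 1: only users with the toggle currently on enter a new export."""
    return sorted(uid for uid, rec in registry.items() if rec.training_license)


def users_to_recheck(export: DatasetExport, registry: dict[str, ConsentRecord]) -> list[str]:
    """Step 2: users in an existing export who no longer have valid consent.

    A missing registry entry is treated as no consent.
    """
    flagged = []
    for uid in export.user_ids:
        rec = registry.get(uid)
        if rec is None or not rec.training_license:
            flagged.append(uid)
    return sorted(flagged)
```

A buyer-side retrain check would call users_to_recheck against the current registry and drop or re-license the flagged subjects before training again.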

This is the most legally defensible basis available. It is also operationally the weakest of the three — consent can be withdrawn at any time, which is why we combine it with the next two.


Path 2 — Anonymization or synthesis (the GDPR Recital 26 path)

GDPR Recital 26 takes truly anonymous data out of GDPR scope entirely. The bar is that a person cannot be identified by any means reasonably likely to be used. The EDPB's Opinion 28/2024 on AI-model anonymity extends this to the model itself: a model is anonymous if it is very unlikely to directly or indirectly identify any training subject, including through queries against the model.

The Neumatics pipeline is built to meet this bar:

  • Salted SHA-256 hashing of user identifiers before any export to the BigQuery analytics warehouse.
  • Aggregation to construct-level distributions for marketplace samples — buyers see distributions across cohorts, not raw individual responses.
  • k-anonymity and differential-privacy controls on demographic facets before licensing. Cells below a minimum group size are suppressed.
  • Synthetic-data generation for use cases where downstream re-identification risk would otherwise be non-trivial.
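
The first and third controls can be illustrated with a short sketch. It assumes a per-deployment secret salt and a minimum group size k; neither value is stated on this page, and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def pseudonymize(user_id: str, salt: bytes) -> str:
    """Salted SHA-256 of a user identifier, as a 64-character hex digest."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()


def suppress_small_cells(
    facet_rows: list[tuple[str, ...]], k: int
) -> dict[tuple[str, ...], int]:
    """Count rows per demographic cell and drop every cell with fewer than k rows.

    k is a placeholder for the minimum group size.
    """
    counts = Counter(facet_rows)
    return {cell: n for cell, n in counts.items() if n >= k}
```

For example, with k=3 a cell holding three respondents is published with its count, while a cell holding a single respondent is dropped from the output altogether.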

A dataset that has passed the above pipeline is designed to sit outside GDPR scope under Recital 26. We do not rely on this alone: every dataset license also requires Path 1 consent vintages. It is the second layer of defence.
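
The differential-privacy control in the pipeline is not specified further on this page. As one common instance, the sketch below applies the Laplace mechanism to a cohort count. The mechanism choice, the epsilon value, and the function name are assumptions for illustration only.

```python
import random


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to epsilon.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rate = epsilon  # rate of each exponential = 1 / scale = epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed with mean 0.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise
```

Smaller epsilon means more noise and stronger protection. Passing the random generator in explicitly keeps the function testable; a production release would more likely use a vetted differential-privacy library than hand-rolled sampling.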


Path 3 — EU AI Act Article 10 training-data governance

The EU AI Act's requirements for high-risk AI systems listed in Annex III apply from 2 August 2026. Article 10 (data and data governance) requires training-data documentation: sources, preprocessing steps, anonymization methodology, bias and representativeness assessment, and known limitations.

Neumatics already has the architectural lineage that makes this paperwork tractable:

  • Consent registry — every Echo carries an explicit consent vector at write time.
  • Audit log — every dataset export references the underlying consent vintages, anonymization parameters, and aggregation thresholds.
  • Deletion cascades — DSAR requests propagate across Firestore, the BigQuery warehouse, and trace storage with an audit record.
  • Methodology releases — quarterly publication of measurement parameters, reliability metrics, and invariance proofs (see /about/science/methodology).
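
The deletion cascade with its audit record can be sketched as below. The in-memory store stands in for Firestore, BigQuery, and trace storage; no real client API is used, and every class and function name is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Protocol


class Store(Protocol):
    name: str

    def delete_user(self, user_id: str) -> int:
        """Delete everything held for user_id; return how many items were removed."""
        ...


class InMemoryStore:
    """Stand-in for a real backend, used only to make the sketch runnable."""

    def __init__(self, name: str, rows: dict[str, list[str]]):
        self.name = name
        self.rows = rows

    def delete_user(self, user_id: str) -> int:
        return len(self.rows.pop(user_id, []))


@dataclass(frozen=True)
class DeletionAudit:
    subject_ref: str         # a real log would likely hold a pseudonymous reference
    deleted: dict[str, int]  # store name -> number of items removed
    completed_at: datetime


def cascade_delete(user_id: str, stores: list[Store], now: datetime) -> DeletionAudit:
    """Propagate one DSAR deletion to every store and record what was removed."""
    deleted = {store.name: store.delete_user(user_id) for store in stores}
    return DeletionAudit(subject_ref=user_id, deleted=deleted, completed_at=now)
```

Recording a per-store count, including zeros, lets an auditor confirm that every store was actually visited rather than silently skipped.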

Buyers receive an Article 10–ready data card with every dataset export — sources, preprocessing, anonymization methodology, bias assessment, known limitations.
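
One way to picture the data card is as a record with one field per section listed above, plus a completeness check before export. The field names, the DataCard type, and the JSON serialisation are assumptions; the actual card format is not specified on this page.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DataCard:
    dataset_id: str
    sources: list[str]
    preprocessing: list[str]
    anonymization_methodology: list[str]
    bias_assessment: str
    known_limitations: list[str]

    def missing_sections(self) -> list[str]:
        """Names of sections that are empty and would block an export."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialise the card, refusing to emit one with empty sections."""
        missing = self.missing_sections()
        if missing:
            raise ValueError(f"data card incomplete: {', '.join(missing)}")
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Refusing to serialise an incomplete card turns the documentation requirement into a hard precondition of the export step rather than a follow-up task.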


Tension to be honest about

The /developers page advertises a no-train guarantee for API request data. The ML-training-dataset product is not a contradiction — it is a separable opt-in surface, with separate consent, separate provenance, and separate billing. We label them clearly on the landing page so a reader who skims is not confused.

We also explicitly avoid the trap of arguing that psychometric inferences are never special-category data under GDPR Article 9. A regulator could reasonably take a more conservative view if the inferences imply a mental-health condition. Our non-goals (no clinical diagnosis, no political orientation on the primary surface) and our Article 9(2)(a) explicit-consent posture together mitigate this, but we treat any unresolved ambiguity as a reason to be more conservative, not less.


Want to discuss a dataset license?

We are accepting a small number of design partners — AI labs, alignment teams, academic consortia, and EU-funded research consortia. Reach out via /contact with a one-paragraph use case and your data governance contact.