AI training data for regulated industries (HIPAA)

AI training data for regulated industries is data prepared so it can train models without exposing protected information — meeting standards like HIPAA, GDPR, and the EU AI Act. Teams have two compliant paths: de-identify the real data they already hold, redacting or synthetically replacing PII and PHI so records stay useful but no longer identify anyone, or generate synthetic training data that carries no real personal information at all. In healthcare and finance, most teams combine the two: de-identifying existing records where real-world signal matters, and generating synthetic data to fill gaps or avoid touching sensitive sources entirely.

What "compliant" AI training data actually means

The moment real patient records, insurance claims, or payment transactions enter a training pipeline, they bring legal obligations with them — and those obligations don't lapse because the data is "only" feeding a model. Training is a use of the data, and regulators treat it like any other. Before you can make AI training data compliant, then, you have to be precise about what you're holding.

Two kinds of sensitive information drive almost every rule that applies. PII, or personally identifiable information, is any data that picks out a specific person — directly, through something like a name, an email address, or a Social Security number, or indirectly, through a combination like a birth date plus a ZIP code plus a gender. PHI, or protected health information, is the health-specific subset HIPAA governs: medical information tied to an identifiable individual, down to the bare fact that someone is a patient. Healthcare, insurance, and financial records tend to be dense with both, often in the same document.
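The indirect case is the one teams tend to underestimate, so a small illustration may help. The sketch below counts how many rows in a toy table are the only ones with their particular birth date, ZIP code, and gender. The records, field names, and helper function are all hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical toy records: no names, only three quasi-identifiers each.
records = [
    {"birth_date": "1984-03-12", "zip": "94110", "gender": "F"},
    {"birth_date": "1984-03-12", "zip": "94110", "gender": "F"},
    {"birth_date": "1990-07-01", "zip": "10027", "gender": "M"},
    {"birth_date": "1975-11-23", "zip": "60614", "gender": "F"},
]

QUASI_IDENTIFIERS = ("birth_date", "zip", "gender")


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return the quasi-identifier combinations held by exactly one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


singled_out = unique_combinations(records)
print(len(singled_out))  # 2
```

Two of the four rows are singled out by the combination alone, even though no field is an identifier by itself. On a real table the same count gives a rough first look at indirect-identification risk, not a formal assessment.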

Compliant training data resolves to one of two end-states. Either the data has been de-identified (transformed so it no longer identifies anyone, to a standard a regulation recognizes) or it is synthetic, generated so that no real person is in it to begin with. Both give you something you can train on without carrying a live privacy liability into your model. Neither path guarantees quality on its own, though: a careless pass can quietly degrade the dataset, so it pays to know what separates good training data from bad. Most regulated teams end up using both paths.

The rules that govern training data: HIPAA, GDPR, CCPA, and the EU AI Act

Four regulatory regimes do most of the work in healthcare and finance, and each bites on training data in its own way. None of them forbids training on sensitive data outright; each sets the conditions under which it's lawful. Treating those conditions as a compliance requirement to design for, not a box to check at the end, keeps a training set usable.

HIPAA

HIPAA governs PHI, and it recognizes two ways to turn patient data into something you can train on freely. Safe Harbor requires removing 18 specified identifiers (names, geographic detail finer than a state, dates more precise than a year, and contact, record, and account numbers among them) and having no actual knowledge that what remains could identify someone; data that meets both conditions is no longer treated as PHI. Expert Determination is the alternative: a qualified expert applies statistical methods, concludes the risk of re-identifying anyone is very small, and documents how. The official HHS guidance sets out both.
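For structured records, the Safe Harbor idea can be sketched in a few lines. The example below is a partial illustration only: the field names are hypothetical, the field sets cover just a handful of the 18 identifier categories, and it does nothing about identifiers buried in free text. It drops direct identifiers, drops geography finer than a state, and reduces dates to a year.

```python
from datetime import date

# Assumption: an illustrative subset of identifier fields, NOT the full list of 18.
DIRECT_IDENTIFIER_FIELDS = {"name", "email", "phone", "ssn", "mrn", "account_number"}
# Assumption: geographic fields finer than a state in this hypothetical schema.
FINE_GEOGRAPHY_FIELDS = {"street", "city", "zip"}


def generalize_record(record):
    """Drop listed identifier fields and reduce any date value to its year."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS or field in FINE_GEOGRAPHY_FIELDS:
            continue  # removed entirely
        if isinstance(value, date):
            out[field + "_year"] = value.year  # keep the year only
        else:
            out[field] = value
    return out


example = {
    "name": "Maria Lopez",
    "mrn": "48213377",
    "city": "San Francisco",
    "zip": "94110",
    "state": "CA",
    "admit_date": date(2023, 5, 14),
    "diagnosis_code": "E11.9",
}

print(generalize_record(example))
# {'state': 'CA', 'admit_date_year': 2023, 'diagnosis_code': 'E11.9'}
```

A field-level pass like this is a starting point, not a determination. Whether a dataset actually meets Safe Harbor depends on the full identifier list and on what else remains in the record.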

GDPR and CCPA

GDPR and CCPA turn on purpose and consent rather than a fixed identifier list. Under GDPR, purpose limitation means data gathered for one reason can't be freely repurposed to train a model for another, and data-subject rights — access, deletion, objection — keep applying as long as the records identify real people. Fully anonymized data falls outside the regulation; pseudonymized data does not. CCPA works similarly for California residents, with a comparable carve-out for de-identified data.
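The gap between pseudonymized and anonymized data is easier to see in code. The sketch below uses keyed hashing (HMAC-SHA256 from the Python standard library) as one common way to pseudonymize an identifier. The key, the email addresses, and the function name are hypothetical, and nothing here is a method the regulation itself prescribes.

```python
import hashlib
import hmac

# Hypothetical secret held by whoever controls the data.
SECRET_KEY = b"hypothetical-key-kept-by-the-data-controller"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash, truncated for readability."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


token_a = pseudonymize("jane.doe@example.com")
token_b = pseudonymize("jane.doe@example.com")
token_c = pseudonymize("john.roe@example.com")

# Same person, same token: records remain linkable across the dataset.
print(token_a == token_b)  # True
print(token_a == token_c)  # False

# Anyone holding the key and a list of candidates can rebuild the mapping.
known_emails = ["jane.doe@example.com", "john.roe@example.com"]
lookup = {pseudonymize(email): email for email in known_emails}
print(lookup[token_a])  # jane.doe@example.com
```

Because the mapping can be rebuilt by whoever holds the key, the tokens still point at real people, which is the intuition behind pseudonymized data staying in scope.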

The EU AI Act

For high-risk AI systems — many medical and credit-decision models among them — Article 10 of the EU AI Act sets quality and governance duties for the data itself. Training, validation, and test sets must be relevant, sufficiently representative, and as free of errors as the purpose allows, with attention to bias and to the records that show how the data was sourced and prepared. The official text of Article 10 treats those records as part of the obligation, not an afterthought.

Path one: de-identify the data you already have

De-identification is the path to choose when the signal in your existing records is the point — when a model needs to learn from how real clinicians actually write, how real claims are actually coded, or how real transactions actually flow — and you have lawful access to that data inside a controlled environment. The goal is to keep everything that makes the data useful while removing everything that ties it to a real person.

Sensitive data hides in two very different places. In structured sources it sits in known database fields — a patient-ID column, an account-number field — where it's relatively easy to locate and transform. In unstructured sources it's scattered through free text, which is where much of the hardest and highest-value training data lives:

Clinical notes and discharge summaries
Claims correspondence and prior-authorization letters
Call-center and support transcripts
Contracts, statements, and other financial documents

There are two ways to transform a sensitive value once you've found it. Redaction removes it — the value is blacked out or replaced with a generic tag. Synthetic replacement swaps it for a realistic fake: a real patient name becomes a different, invented name of the same shape, so the text still reads naturally and a model can still learn its structure. For training data, synthetic replacement usually wins, because redaction leaves holes that distort the very patterns you're trying to teach.
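The difference between the two transformations shows up clearly on a single sentence. The sketch below assumes a detector has already returned entity spans; here a hypothetical `find_span` helper stands in for that step by looking up known strings. The note, the fake-name list, and every function name are invented for illustration and are not any product's API.

```python
import random

note = "Patient Maria Lopez (MRN 48213377) was seen by Dr. Chen on follow-up."


def find_span(text, value, label):
    """Stand-in for a real detector: locate a known string and label it."""
    start = text.index(value)
    return (start, start + len(value), label)


spans = [
    find_span(note, "Maria Lopez", "NAME"),
    find_span(note, "48213377", "MRN"),
    find_span(note, "Chen", "NAME"),
]


def transform(text, spans, replace):
    """Apply replace() to each span, right to left so earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + replace(text[start:end], label) + text[end:]
    return text


def redact(value, label):
    return f"[{label}]"


FAKE_NAMES = ["Dana Whitfield", "Priya Raman", "Tomas Keller"]  # hypothetical


def make_synthesizer(seed=0):
    rng = random.Random(seed)

    def synthesize(value, label):
        if label == "NAME":
            fake = rng.choice(FAKE_NAMES)
            # Keep the shape: surname only if the original was a single word.
            return fake if " " in value else fake.split()[-1]
        if label == "MRN":
            # Same length, digits only.
            return "".join(rng.choice("0123456789") for _ in value)
        return f"[{label}]"  # fall back to redaction for unknown types

    return synthesize


redacted = transform(note, spans, redact)
synthesized = transform(note, spans, make_synthesizer())

print(redacted)
# Patient [NAME] (MRN [MRN]) was seen by Dr. [NAME] on follow-up.
print(synthesized)
```

The redacted sentence is safe but no longer looks like something a clinician wrote. The synthesized one keeps the same shape with invented values, which is the property that matters for training.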

Tonic Textual is built for exactly this work on unstructured text. It relies on proprietary models for named entity recognition (NER), the task of locating and classifying spans like names, medical record numbers, account numbers, and domain-specific identifiers. Those models detect sensitive entities in free text, and Textual then either redacts or synthesizes each one. Because detection is model-based rather than a fixed pattern list, it catches the messy, in-context identifiers that regular expressions miss, and it supports Expert Determination for teams that need to clear the HIPAA bar on the result. Much of the work of preparing unstructured sources like documents, transcripts, and notes for training comes down to getting this detection-and-transformation step right.

The Tonic Advantage: keep the text, lose the identifiers. The hard part of de-identifying free text isn't removing names — it's removing them without flattening the language a model needs to learn from. Tonic Textual detects sensitive entities with proprietary NER models and replaces each one with a realistic synthetic substitute of the same type, so a de-identified clinical note still reads like a clinical note. The sentence structure, the clinical vocabulary, and the statistical texture survive; only the real person leaves.

Path two: generate synthetic training data with no real PII

Generation is the path when you can't or shouldn't touch real data at all — production is locked down and no copy may leave it, the application is greenfield and the data doesn't exist yet, or the cases you most need (a rare fraud pattern, an uncommon diagnosis) are too scarce in real records to train on. Synthetic data is produced by a model or algorithm rather than collected from real events, so it sidesteps the compliance question by construction: there is no real person in it to protect.

Tonic Fabricate generates this kind of data two ways. You can build a dataset from scratch, describing the schema and rules you need, or you can connect to an existing database with Live Connect and have Fabricate model new data on its real structure, distributions, and relationships without copying the underlying rows. Either way, because you produced the data, the ground-truth labels are built in: you already know the correct answer for each record because you defined the scenario that created it, which removes the separate annotation step real data demands. The techniques are the same ones that produce synthetic training data anywhere; in a regulated setting they have the added benefit that no real record is copied into the training set.

The vertical examples are concrete. In healthcare, you can generate a synthetic patient cohort — demographics, encounters, lab results, and notes that hold together as a coherent record — without exposing a single real patient. In finance, you can generate synthetic transaction streams seeded with the fraud patterns a detection model needs to see far more often than real traffic supplies them. And when real unstructured examples are scarce but not entirely off-limits, the two products pair: de-identify what you have with Textual, then point Fabricate at that safe set to generate more in the same shape — augmenting a small, sensitive corpus into a larger one without reintroducing the original PII.
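To make the "labels are built in" point concrete, here is a minimal hand-rolled sketch of a synthetic transaction generator. It is not how Fabricate works internally. The fraud scenario (large amounts in the small hours), the value ranges, the field names, and the default fraud rate are all assumptions chosen for illustration.

```python
import random


def generate_transactions(n, fraud_rate=0.2, seed=42):
    """Generate n synthetic transactions; the label comes from the scenario itself."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Hypothetical fraud scenario: large amount, small hours.
            amount = round(rng.uniform(900, 5000), 2)
            hour = rng.choice([1, 2, 3, 4])
        else:
            # Hypothetical normal scenario: everyday amount, daytime.
            amount = round(rng.uniform(5, 300), 2)
            hour = rng.randint(8, 22)
        rows.append(
            {
                "account_id": f"ACCT-{rng.randrange(10**6):06d}",  # invented ID
                "amount": amount,
                "hour": hour,
                "is_fraud": is_fraud,  # ground truth, no annotation step needed
            }
        )
    return rows


dataset = generate_transactions(1000)
fraud_share = sum(row["is_fraud"] for row in dataset) / len(dataset)
print(f"{len(dataset)} rows, fraud share {fraud_share:.2f}")
```

The `fraud_rate` parameter is the point of the exercise: the positive class can be raised well above what real traffic supplies, and every row's label is known because the generator decided it. A scenario this simple would be trivial for a model to learn, so realistic generation needs far richer rules than shown here.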

Choosing and combining the two paths in healthcare and finance

The choice between de-identifying and generating comes down to what you have and what you need. De-identify when the real-world distribution is what you're after and you have lawful access to the data — there's no substitute for how real people actually behave. Generate when the data doesn't exist, can't be touched, or has to be reshaped: when you need controllable volume, balanced classes, or coverage of edge cases real records underrepresent. Blend the two when you want real-world signal as a foundation and synthetic data to fill what it's missing.

In practice, most regulated teams blend. A healthcare team might de-identify its archive of real clinical notes to capture how clinicians actually document, then generate synthetic patient records to balance rare conditions the archive barely contains. A finance team might de-identify real transaction histories for their genuine spending patterns, then generate synthetic fraud cases to give a detection model enough positive examples to learn from. The de-identified data carries the truth of what happened; the synthetic data covers the parts of a specific clinical or financial domain where real data runs thin.

Wellthy, a healthcare care-management company, put previously unusable sensitive unstructured data to work in its AI workflow after de-identifying it, and reported a 50% reduction in flagged care team actions — a concrete sign that compliant data, prepared well, still carries enough signal to change how a model performs.

Criterion	De-identify real data	Generate synthetic
Best when	You need the true real-world distribution and have lawful access to the data	The data doesn't exist, can't be touched, or needs controllable volume and edge-case coverage
Data source	Real records you already hold, transformed in place	A specification, or the structure of an existing database modeled without copying it
Compliance basis	Removes the tie to a real person (Safe Harbor or Expert Determination under HIPAA)	No real personal data is present to begin with
Built-in labels / ground truth	Inherited from the real records and any existing annotations	Built in — you defined the scenario, so the correct answer ships with the data
Tonic product	Tonic Textual	Tonic Fabricate

Validating compliance and utility before you train

Whichever path produced your data, the last step before training is the same: confirm it is genuinely safe and still genuinely useful. Neither property is automatic. A de-identification pass can miss an identifier buried in free text; a synthetic set can be private but too narrow to teach a model anything. Validation catches both, and it isn't optional on either path.

Work through it in order:

1. Re-scan for residual sensitive data. Run a second detection pass over the output to catch PII or PHI the first pass missed: a name in an unexpected place, an account number sitting inside a sentence. This is the safety gate; nothing downstream matters if it fails.
2. Assess re-identification risk. For de-identified data, confirm the result actually clears your standard, whether that is Safe Harbor's identifier list or an Expert Determination that the risk of re-identifying anyone is very small.
3. Validate utility and fidelity. Train on the prepared data and test on held-out real data, and compare the prepared set's distributions against a real reference. A dataset that's safe but no longer representative will quietly teach a model the wrong lessons.
4. Record the governance trail. Document where the data came from, what transformations were applied, and what each validation step found. This is the audit record the EU AI Act expects, and it's what lets a compliance team vouch for the pipeline later.
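The first and last of the steps above lend themselves to a short sketch. The example below runs a deliberately simple pattern-based re-scan over prepared text and then writes the outcome into a governance record. The three regular expressions, the record fields, and the sample documents are assumptions for illustration. A pattern scan is only a cheap independent check and would miss identifiers that need context to recognize.

```python
import hashlib
import json
import re

# Assumption: a few illustrative US-style patterns, far from exhaustive.
RESIDUAL_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def rescan(documents):
    """Return one finding per pattern match; the matched text itself is not stored."""
    findings = []
    for doc_index, text in enumerate(documents):
        for label, pattern in RESIDUAL_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append({"doc": doc_index, "type": label, "span": match.span()})
    return findings


def governance_record(source, transformations, documents, findings):
    """Summarize provenance, processing, and the safety-gate result."""
    payload = "\n".join(documents).encode("utf-8")
    return {
        "source": source,
        "transformations": transformations,
        "output_sha256": hashlib.sha256(payload).hexdigest(),
        "residual_findings": len(findings),
        "passed_safety_gate": not findings,
    }


prepared = [
    "Patient Dana Whitfield called about a refill.",
    "Reach her at 415-555-0132 after 5pm.",  # a phone number slipped through
]

findings = rescan(prepared)
record = governance_record(
    source="hypothetical support transcripts",
    transformations=["entity detection", "synthetic replacement"],
    documents=prepared,
    findings=findings,
)
print(json.dumps(record, indent=2))
```

Here the gate fails on the second document, and the record says so. Hashing the output ties the record to the exact data that was checked, so a later reviewer can confirm nothing changed between validation and training.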

Configuring how each entity type is detected and transformed — and reviewing the results before you commit them — is handled in the Tonic Textual documentation. The same discipline applies to generated data: even with no real PII to remove, you still owe the model a check that the synthetic set is good — measured for coverage, labels, and bias — before you rely on it.
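A coverage check of this kind can start very simply. The sketch below compares the category mix of a prepared set against a reference using total variation distance, and lists categories the prepared set is missing entirely. The diagnosis labels and counts are invented, and what distance counts as acceptable is a judgment call that depends on whether the set was rebalanced on purpose.

```python
from collections import Counter


def distribution(values):
    """Map each category to its share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences: 0 means identical, 1 means disjoint."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def missing_categories(reference, candidate):
    """Categories present in the reference but absent from the candidate."""
    return sorted(set(reference) - set(candidate))


# Hypothetical label columns from a reference set and a prepared set.
reference = ["diabetes"] * 6 + ["asthma"] * 3 + ["sarcoidosis"] * 1
prepared = ["diabetes"] * 5 + ["asthma"] * 5

tvd = total_variation_distance(distribution(reference), distribution(prepared))
print(round(tvd, 3))  # 0.2
print(missing_categories(reference, prepared))  # ['sarcoidosis']
```

The missing rare category is the more actionable signal here: a set can sit close to the reference overall and still leave a model with no examples of the case it most needs to see.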
