How to build a dataset for LLM fine-tuning

Building a dataset for LLM fine-tuning means assembling task-specific examples — usually prompt-and-response pairs — that are high-quality, diverse, correctly formatted, and split into separate training and evaluation sets. You can source those examples from data you already have (de-identified first if it contains sensitive information) or generate them synthetically, which lets you control coverage, difficulty, and labels directly. The hard part is rarely raw volume; it's quality and format — clean, consistent, well-labeled examples in the exact structure your training framework expects.

What goes into a fine-tuning dataset, and when you need one

A fine-tuning dataset is a collection of task-specific examples that teach an already-capable model to produce a particular behavior, format, or domain expertise it doesn't reliably generate on its own. Fine-tuning a large language model means continuing its training on these examples so the target behavior is absorbed into the model's weights, rather than coaxed out at inference time. Fine-tuning data is a specialized slice of AI training data — the broad category of examples any model learns from — narrowed to the specific task you want one particular model to get better at.

Most fine-tuning datasets take one of two shapes:

Supervised / instruction examples. Pairs of a prompt and the ideal response. Supervised fine-tuning (SFT) trains the model to reproduce the response given the prompt; instruction tuning is the common variant where the prompt is a natural-language instruction and the response is the behavior you want back. This is the workhorse format for teaching tone, structure, and domain tasks.
Preference and RL-style data. Instead of a single correct answer, the model learns from a signal about which output is better — a chosen-versus-rejected pair for preference methods, or a task with a verifiable reward in reinforcement learning. This shapes judgment where there is no single canonical response.
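
As a rough illustration of the two shapes, the sketch below writes one record of each kind as a Python dictionary. The field names (`prompt`, `response`, `chosen`, `rejected`) and the helper `record_shape` are assumptions made for this example; the names your training framework expects may differ.

```python
# Supervised / instruction shape: a prompt paired with the ideal response.
sft_example = {
    "prompt": "How do I export my mailbox?",
    "response": "Open Settings, choose Export, pick a format, and confirm.",
}

# Preference shape: the same prompt with a better and a worse response.
preference_example = {
    "prompt": "How do I export my mailbox?",
    "chosen": "Open Settings, choose Export, pick a format, and confirm.",
    "rejected": "You can probably find it somewhere in the menus.",
}


def record_shape(record: dict) -> str:
    """Classify a record by the keys it carries (hypothetical helper)."""
    keys = set(record)
    if {"prompt", "chosen", "rejected"} <= keys:
        return "preference"
    if {"prompt", "response"} <= keys:
        return "supervised"
    return "unknown"
```

A check like this is mostly useful as a guard when records from several sources are merged into one file.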

Which shape you need follows from a decision that comes before any data work: whether fine-tuning is even the right tool. If you mainly need the model to know current or proprietary facts, retrieval-augmented generation (RAG), which fetches relevant documents at inference time and places them in the prompt, is usually the better fit, and it needs no training data at all. If you can get the behavior you want by rewriting the prompt, prompt engineering is faster and cheaper still. Fine-tuning earns its cost when you need consistent formatting or behavior across many calls, a domain skill prompting can't reliably produce, or shorter prompts at inference for lower latency and cost. Settling that question first tells you what the dataset has to contain. It also places the work within the wider practice of preparing training data for AI, where the same trade-offs recur, including how much data you'll ultimately need.

Source it or generate it: where the examples come from

The examples in a fine-tuning set come from one of a few places, and the honest starting point is that you may not need anything exotic. Clean, non-sensitive data you already own — support transcripts you have the rights to, internal documents with no personal information, your own labeled records — can be formatted and used directly, with no de-identification and no generation step. Most teams, though, run into one of two walls: the relevant data is sensitive, or there isn't enough of it. The realistic sourcing paths, and what each costs you, look like this:

Proprietary / internal data — the highest relevance to your task, but often sensitive, unevenly distributed, and unlabeled.
Public datasets — fast and cheap to start with, but generic, sometimes restrictively licensed, and a contamination risk if they overlap your evaluation set.
Human-written examples — high quality and fully under your control, but slow and costly to produce at the scale fine-tuning wants.
Synthetic generation — lets you control coverage, difficulty, and labels and scales quickly, with quality bounded by how good the generator is.

Two of these paths are what actually unblock most teams. The first is making sensitive real data safe to train on. A great deal of the most useful fine-tuning material is free text — tickets, notes, chat logs — with names, account numbers, and other identifiers scattered through the prose. Tonic Textual handles this by extracting the text, detecting sensitive entities with proprietary named-entity-recognition (NER) models, and either redacting them or replacing them with realistic synthetic substitutes. Because it synthesizes rather than simply blanking values out, the surrounding text and its statistical properties stay intact — which is what makes turning documents, transcripts, and notes into training-ready data possible without stripping out the signal a model needs.
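
To make the difference between blanking and substituting concrete, here is a deliberately simple sketch. It is not how Tonic Textual works: it swaps NER models for two regular expressions, and the patterns, placeholder formats, and the function name `pseudonymize` are all assumptions made for illustration. What it does show is the idea of replacing each identifier with a consistent, realistic-looking stand-in so the surrounding sentence stays readable.

```python
import re

# Toy detectors; a production system would use trained entity models instead.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ACCOUNT_RE = re.compile(r"\b\d{8,12}\b")


def pseudonymize(text: str) -> str:
    """Replace emails and account numbers with stable synthetic stand-ins."""
    emails: dict[str, str] = {}
    accounts: dict[str, str] = {}

    def sub_email(match: re.Match) -> str:
        # The same original value always maps to the same replacement.
        return emails.setdefault(match.group(0), f"user{len(emails) + 1}@example.com")

    def sub_account(match: re.Match) -> str:
        return accounts.setdefault(match.group(0), f"ACCT-{len(accounts) + 1:04d}")

    text = EMAIL_RE.sub(sub_email, text)
    return ACCOUNT_RE.sub(sub_account, text)


ticket = (
    "Customer jane.doe@corp.com called about account 1234567890. "
    "Follow up with jane.doe@corp.com."
)
safe_ticket = pseudonymize(ticket)
```

Because the mapping is consistent within a document, a model trained on the output can still learn that both mentions refer to the same customer.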

The second path is generating net-new examples. When you generate the examples synthetically, you decide the schema, the coverage, and the labels up front instead of hoping a collected sample happens to contain them. That control is what makes generation the faster route to a balanced, well-labeled set, and it reframes the trade-off between generating labeled data and annotating it by hand: a generated example arrives with its correct answer already attached, while a collected one has to be labeled afterward. Tonic Fabricate is built around this approach, generating relationally consistent records and free text from a prompt or schema, with the correct answers built in by construction.
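
The point about answers arriving already attached can be sketched with a small template-based generator. This is only an illustration of the principle, not Tonic Fabricate's method or API; the `SPEC` structure, the intents, the `change_signature` answer, and `generate_examples` are hypothetical.

```python
SYSTEM_PROMPT = "You are a support agent for an email client."

# Hypothetical specification: each intent lists user phrasings and the one
# answer that counts as correct for all of them.
SPEC = {
    "export_mailbox": {
        "phrasings": ["How do I export my mailbox?", "I need a copy of all my email."],
        "answer": "Open Settings, choose Export, pick a format, and confirm.",
    },
    "change_signature": {
        "phrasings": ["How do I change my signature?", "My email signature is out of date."],
        "answer": "Open Settings, choose Signature, edit the text, and save.",
    },
}


def generate_examples(spec: dict, system_prompt: str) -> list[dict]:
    """Expand the specification into chat-format examples with known labels."""
    examples = []
    for intent, entry in spec.items():
        for phrasing in entry["phrasings"]:
            examples.append({
                "intent": intent,  # metadata for coverage checks, not a training field
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": phrasing},
                    {"role": "assistant", "content": entry["answer"]},
                ],
            })
    return examples


examples = generate_examples(SPEC, SYSTEM_PROMPT)
```

Every generated record carries its correct answer because the answer came from the same specification entry as the prompt, so no labeling pass is needed afterward.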

Format and structure: getting examples into a trainable shape

Even well-chosen examples will train poorly if they're in the wrong shape, and format problems are one of the most common — and least obvious — reasons a fine-tune underperforms, independent of how good the underlying data is. The raw material of a supervised set is the prompt-and-response pair, but how that pair is encoded matters. Most modern instruction tuning uses a chat template: each example is a sequence of messages tagged with roles — typically system, user, and assistant — and the model learns to produce the assistant turn given everything before it. The system message sets persistent behavior; the user and assistant messages carry the exchange.
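
A small sketch may help show what a chat template does with those messages. The role markers below (`<|system|>`, `<|end|>`, and so on) are made up for this example; real models each define their own template, and the function `render` is hypothetical.

```python
def render(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Flatten role-tagged messages into one string using a made-up template."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # At inference, the string ends where the assistant turn should begin.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a support agent for an email client."},
    {"role": "user", "content": "How do I export my mailbox?"},
    {"role": "assistant", "content": "Open Settings, choose Export, pick a format, and confirm."},
]

training_text = render(messages)
inference_prompt = render(messages[:-1], add_generation_prompt=True)
```

The training text is the inference prompt plus the assistant's reply and its end marker, which is why the template used for training has to be the same one used when serving.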

The common container for all of this is JSONL — JSON Lines — where each line of the file is one complete, self-contained JSON object representing a single training example. A minimal supervised example looks like this:

{"messages": [{"role": "system", "content": "You are a support agent for an email client."}, {"role": "user", "content": "How do I export my mailbox?"}, {"role": "assistant", "content": "Open Settings, choose Export, pick a format, and confirm."}]}
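
Writing and reading this container takes only the standard library. The sketch below is one way to do it; the helper names `write_jsonl` and `read_jsonl` are assumptions, and an in-memory buffer stands in for a real file.

```python
import io
import json


def write_jsonl(records, handle) -> None:
    """Write one JSON object per line."""
    for record in records:
        # json.dumps escapes newlines inside strings, so each record stays on one line.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle) -> list:
    """Parse every non-blank line as its own JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


records = [
    {"messages": [
        {"role": "user", "content": "How do I export my mailbox?"},
        {"role": "assistant", "content": "Open Settings,\nchoose Export, pick a format, and confirm."},
    ]},
    {"messages": [
        {"role": "user", "content": "Where is the Export option?"},
        {"role": "assistant", "content": "It is under Settings."},
    ]},
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```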

The failures that quietly degrade a fine-tune usually live in these details. Fields that vary from row to row, a chat template at training time that doesn't match the one used at inference, ragged structure where some examples carry a system message and others don't, mislabeled roles, and stray whitespace or special tokens all push the model toward learning the noise instead of the task. None of these show up as bad data — every example might be individually correct — yet a model trained on a structurally inconsistent file behaves as if the data were low quality. The fix is mechanical but easy to skip: validate that every record parses, that the schema is identical across the set, and that the template you train on is the one you'll serve.
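
The checks described above can be sketched as a small validator. It assumes the `messages` layout shown earlier, with exactly `role` and `content` on each message; the function name `validate_jsonl` and the specific rules are illustrative choices, not a standard.

```python
import json

ALLOWED_ROLES = ("system", "user", "assistant")


def validate_jsonl(lines) -> list:
    """Return sorted (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    starts_with_system = {}  # line number -> whether the record opens with a system message
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "line is not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) and set(m) == {"role", "content"} for m in messages):
            problems.append((number, "each message needs exactly 'role' and 'content'"))
            continue
        roles = [m["role"] for m in messages]
        if any(role not in ALLOWED_ROLES for role in roles):
            problems.append((number, "unknown role"))
        if roles[-1] != "assistant":
            problems.append((number, "last message is not the assistant turn"))
        if any(not isinstance(m["content"], str) or m["content"] != m["content"].strip()
               for m in messages):
            problems.append((number, "content is not a string or has stray whitespace"))
        starts_with_system[number] = roles[0] == "system"

    # Ragged structure: compare every record against the first well-formed one.
    if starts_with_system:
        expected = next(iter(starts_with_system.values()))
        for number, flag in starts_with_system.items():
            if flag != expected:
                problems.append((number, "system message presence differs from the first record"))
    return sorted(problems)
```

Running a check like this before every training run is cheap, and it turns silent structural drift into an explicit list of line numbers to fix. It does not verify that the chat template matches the one used at inference; that still has to be confirmed against the serving setup.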

Data quality: diversity, labels, dedup, and contamination

Once the format is right, quality is what decides whether fine-tuning helps or hurts — and quality beats volume almost every time. A few thousand clean, diverse, correctly labeled examples will usually produce a better model than ten times as many noisy ones, because fine-tuning amplifies whatever patterns are in the data, including the mistakes. Concretely, a strong fine-tuning set holds up on five dimensions:

Coverage and diversity — the examples span the range of inputs the model will actually see, including the harder and less frequent cases, not just the easy center of the distribution.
Label accuracy — each response is genuinely the ideal one; a wrong "correct" answer teaches the model the wrong lesson with full confidence.
Deduplication — near-duplicate examples waste model capacity and quietly bias the model toward whatever they over-represent.
A clean train/eval split — a held-out evaluation set the model never trains on, so your measurements reflect generalization rather than memorization.
No contamination — contamination is when evaluation examples leak into the training set; it inflates your scores and hides the failures you most need to see.
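
The deduplication, split, and contamination items above can be sketched in a few functions. This version only catches duplicates that match after lowercasing and whitespace normalization, and it keys on the prompt alone; both are simplifying assumptions, and true near-duplicate detection would need a fuzzier comparison. The function names and the `prompt`/`response` fields are hypothetical.

```python
import hashlib
import random
import re


def fingerprint(example: dict) -> str:
    """Hash the normalized prompt so trivially different copies collide."""
    normalized = re.sub(r"\s+", " ", example["prompt"]).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(examples: list) -> list:
    """Keep the first example for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for example in examples:
        key = fingerprint(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def split(examples: list, eval_fraction: float = 0.2, seed: int = 0):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def contaminated(train: list, eval_set: list) -> list:
    """Return training examples whose fingerprint also appears in the eval set."""
    eval_keys = {fingerprint(example) for example in eval_set}
    return [example for example in train if fingerprint(example) in eval_keys]
```

Deduplicating before the split is what keeps copies from landing on both sides; the `contaminated` check is still worth running afterward, especially when public data is merged in later.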

Coverage and labels are where collected data tends to struggle most. Real samples are dense in common cases and thin in the rare ones that matter, and every collected example has to be labeled after the fact — a slow, error-prone pass that becomes its own project at scale. Contamination is the quieter risk: it often enters through duplicates that straddle the split, or through public data that already contains your evaluation examples, so deduplicating across the train and eval sets — not just within each — is part of getting a split you can trust. This is the structural advantage of generated data, and it sits at the center of the trade-off between generating labeled data and annotating it by hand: when you produce an example from a specification, you already know the correct answer, because you defined the scenario that created it. Ground truth — the verified correct answer each example is graded against — ships with the data instead of being reconstructed later. With Tonic Fabricate, that built-in ground truth is part of how the data is generated rather than a separate labeling step, which is what makes a balanced, correctly labeled set practical to produce at volume.

The evidence that well-constructed synthetic data can carry a real fine-tune is concrete. In a Tonic.ai benchmark, an open-source model (Qwen3.5-35B-A3B) fine-tuned only on Fabricate-generated synthetic email data improved on the real-world Enron email benchmark from 80.5% to 86%, outperforming o3 and gpt-4.1-mini without training on a single real email; the corpus and tasks are published as an open, reproducible dataset. That is task-specific evidence, not a claim that every fine-tune should be synthetic — collected data still captures real behavior no specification anticipates. What it shows is that when fidelity, coverage, and labels are right, generated examples can match or beat real ones on the task you are training for, which turns what makes a training dataset good into a question of quality and control rather than raw origin.

How much data you actually need, and how to scale it

There's no magic number of examples for fine-tuning, and any source that quotes one is guessing — the right amount depends on the task, the base model, and how much the behavior you want already overlaps what the model knows. The reliable approach is empirical: find the point where more data stops helping by measuring, not by aiming at a round figure. A representative seed matters more than a large one here — a thousand examples that cover the task's real variety teach more than ten thousand near-identical ones, and they keep each training run cheap enough to iterate on.

Start with a small, high-quality seed set — often a few hundred to a few thousand clean examples is enough to see signal.
Fine-tune, then evaluate on your held-out real eval set.
Add more data, re-train, and re-measure, watching the eval curve.

When the curve flattens — each new batch of examples moves the score less — you've hit diminishing returns, and more raw volume is unlikely to be the lever. What usually helps at that point is precision, not quantity. If the eval shows the model failing on a specific kind of input — a rare intent, an underrepresented format, an edge case — generating synthetic examples that target exactly that gap is faster and more controllable than collecting more general data and hoping the missing cases turn up. This is the same logic that governs how much data you need to train a model from scratch: past a point, what moves performance is filling the specific gaps the evaluation exposes, and generation lets you fill them on demand.
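
One simple way to make "the curve flattens" operational is to compare the gain from each added batch against a threshold. The sketch below does that; the function name `find_plateau`, the threshold of half a percentage point, and the sizes and scores are all invented for illustration and are not measurements from any real run.

```python
def find_plateau(sizes, scores, min_gain=0.005):
    """Return the dataset size after which the next batch gained less than min_gain.

    Returns None if every step still improved the score by at least min_gain.
    """
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < min_gain:
            return sizes[i - 1]
    return None


# Illustrative numbers only: eval accuracy after training on each dataset size.
sizes = [250, 500, 1000, 2000, 4000]
scores = [0.71, 0.78, 0.82, 0.823, 0.824]

plateau_size = find_plateau(sizes, scores)  # 1000 for these made-up numbers
```

Eval scores are noisy, so in practice a single small gain is weak evidence; repeating a run or requiring two flat steps in a row would make the signal more trustworthy before you switch from adding volume to filling specific gaps.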

Putting it together: a practical workflow

Putting the pieces in order gives a repeatable sequence: decide whether fine-tuning is the right tool, source or generate your examples, get them into a consistent trainable format, quality-check for coverage and clean splits, then size and scale by measuring where the eval curve flattens. The steps that tend to stall a project are the middle ones — producing enough well-labeled, correctly formatted examples — which is exactly where generating the data earns its place. The sequence is rarely a single pass, either: the eval in the last step usually sends you back to sourcing or generation to fill the specific gaps it exposed, and the loop tightens with each turn as the dataset converges on the cases the model still gets wrong.

Tonic Fabricate works as a concrete example of how those middle steps collapse into one. You describe the data you need in a prompt or a schema, and a Data Agent generates it; a Validation Agent then reviews what was produced and prompts refinements in a loop, so the set converges on something usable even when the initial prompt was imprecise. The labels are defined by the specification, so ground truth is built in rather than added afterward, and the result can be exported in the formats a training pipeline expects, JSONL among them, ready to drop into a fine-tuning run. Format and labeling, the two things most likely to quietly sink a fine-tune, are handled as the data is created rather than patched once it exists.

The Tonic Advantage: formatted and labeled by construction

The hardest parts of a fine-tuning dataset (consistent format, accurate labels, uniqueness constraints) are usually handled in separate cleanup passes run after the data already exists. With Tonic Fabricate, the Data Agent and Validation Agent produce examples that are structured, labeled, and unique as they're generated: the ground truth comes from the specification, the format is defined up front, and the validation loop catches the inconsistencies that would otherwise surface only when training underperforms. The quality-and-format problems that sink most fine-tunes become properties of how the data was made, not chores to fix after collection.
