There's no universal number: the amount of training data a model needs depends on the complexity of the task, the type and size of the model, the number of classes it must distinguish, the quality and diversity of the data, and the accuracy you're targeting. The reliable way to find your number is empirical — train on progressively larger subsets of data and watch for the point where adding more stops improving performance. And when you can't collect enough, because the data is scarce, expensive, or privacy-restricted, generating synthetic data or de-identifying data you already have is often faster and safer than gathering more real examples.
How much data you need depends on five factors
There's no fixed number of examples that guarantees a working model, and any source that hands you one is offering a shortcut that doesn't exist. How much data you need to train a model depends on several factors working together, and the honest first move is to understand them before you start counting rows. Five do most of the work, and quality and diversity routinely matter more than raw volume:
- Task complexity. The harder the underlying pattern — many interacting variables, subtle distinctions, long-range dependencies — the more examples a model has to see before it generalizes rather than guesses. Sorting spam from not-spam is far less demanding than translating legal contracts.
- Model type and size. A small logistic-regression classifier can learn from a few hundred examples, while a deep neural network with millions of parameters needs far more before it stops memorizing and starts generalizing. More expressive models are hungrier by nature.
- Number of classes or size of the output space. A binary classifier needs enough examples of just two categories; a model that sorts inputs into a thousand categories needs enough of each one, which multiplies the requirement.
- Data quality and diversity. Clean, accurately labeled, varied examples teach more per row than a larger pile of noisy or near-duplicate ones. Coverage of the edge cases and rare scenarios your model will meet in production often matters more than sheer count — worth holding onto, because it's exactly where synthetic data earns its place later.
- Target accuracy. The last few points of performance are the most expensive to buy. Reaching 80% accuracy might take a modest dataset; pushing to 95% can take an order of magnitude more, because the model now has to learn the rare and ambiguous cases that the first batch of data never showed it.
Read together, these factors turn "how much data?" from an unanswerable question into an estimate you can actually make.
Rules of thumb, and where they break down
Before running any experiment, most teams reach for a rule of thumb for training data, and a few are worth knowing — as long as you know where they stop working. The best-known is the 10× rule: roughly ten examples for every parameter, or every degree of freedom, in the model. A degree of freedom is an independent value the model can adjust to fit the data — loosely, one of the "knobs" it can turn — so the rule says give the model about ten examples per knob so it has enough signal to set each one instead of just memorizing. Related shortcuts say the same thing in different units: about ten examples per feature or per class for classification, and "more observations than parameters" for classical time-series and regression models.
For small, classical models — linear and logistic regression, modest decision trees, traditional forecasting — these heuristics give a defensible starting estimate, and they're fair to use that way. They encode a real intuition: a model with few parameters needs only enough data to pin those parameters down.
They break down the moment you move to deep learning. A modern neural network has parameters numbering in the hundreds of millions, and a large language model runs to the hundreds of billions. Ten examples per parameter would imply billions to trillions of examples, which is not a number anyone acts on. At that scale, data needs are governed by the task and the architecture (and, for LLMs, by how much the model already absorbed during pretraining), not by parameter count. Treat heuristics as a way to get a ballpark for simple models, never as the answer for modern ones.
| Heuristic | When it's useful | Where it breaks |
|---|---|---|
| 10× rule (≈10 examples per parameter / degree of freedom) | Small classical models with few parameters | Deep nets and LLMs — parameter counts make it meaningless |
| ≈10 examples per feature or per class | Low-dimensional classification | High class counts or high-dimensional inputs |
| More observations than parameters | Classical time-series and regression | Over-parameterized / deep models |
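To make the arithmetic behind the 10× rule concrete, here is a minimal sketch. The function names, the 20-feature binary classifier, and the 7-billion-parameter figure are illustrative assumptions, not values from a specific project.

```python
def logistic_regression_params(n_features: int, n_classes: int) -> int:
    """Count weights plus biases for a logistic-regression classifier.

    Assumption: a binary model has one weight vector, and a k-class
    softmax model is counted with k weight vectors (some libraries
    parameterize it with k - 1).
    """
    n_outputs = 1 if n_classes == 2 else n_classes
    return n_outputs * (n_features + 1)


def ten_x_rule(n_params: int, examples_per_param: int = 10) -> int:
    """Ballpark example count under the 10x rule of thumb."""
    return examples_per_param * n_params


# Small classical model: 20 features, 2 classes -> 21 parameters.
small = logistic_regression_params(n_features=20, n_classes=2)
print(ten_x_rule(small))  # 210

# Hypothetical 7-billion-parameter network: the rule stops being actionable.
print(f"{ten_x_rule(7_000_000_000):,}")  # 70,000,000,000
```

The first result is a dataset most teams can collect; the second is the kind of number the table flags as meaningless for deep models.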
How to estimate it for your model: learning curves
The dependable way to answer the question for your model is to measure it, with a learning curve — a plot of model performance against the amount of training data used. Instead of guessing from a heuristic, you let the model show you how much data it actually needs. The method is simple enough to run this week:
- Hold out a fixed validation set that you never train on, so every measurement is comparable across runs.
- Train your model on progressively larger random subsets of the remaining data — for example 10%, 25%, 50%, and 100%.
- After each run, measure performance on that same held-out validation set.
- Plot performance on the vertical axis against training-set size on the horizontal axis.
- Where you can, repeat each subset size with a couple of different random draws and average them, so an unlucky sample doesn't skew the picture.
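The steps above can be sketched in a few dozen lines. This is a toy illustration using only the Python standard library: the synthetic two-class data, the nearest-centroid "model", and every function name are assumptions chosen to keep the example self-contained, so swap in your own data, model, and metric.

```python
import random
from statistics import mean


def make_dataset(n, rng):
    """Toy 2-class data: two overlapping 2-D Gaussian blobs (illustrative only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        point = (rng.gauss(center, 1.5), rng.gauss(center, 1.5))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Stand-in model: the mean point of each class."""
    centroids = {}
    for label in {y for _, y in train}:
        points = [x for x, y in train if y == label]
        centroids[label] = tuple(mean(coord) for coord in zip(*points))
    return centroids


def accuracy(centroids, data):
    """Fraction of rows whose nearest centroid matches the true label."""
    def predict(x):
        return min(
            centroids,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )
    return mean(1.0 if predict(x) == y else 0.0 for x, y in data)


def learning_curve(pool, validation, fractions, repeats, rng):
    """Train on growing random subsets; score each on the same validation set."""
    curve = []
    for frac in fractions:
        size = max(2, int(len(pool) * frac))
        scores = [
            accuracy(fit_centroids(rng.sample(pool, size)), validation)
            for _ in range(repeats)  # average several random draws per size
        ]
        curve.append((size, mean(scores)))
    return curve


rng = random.Random(0)
data = make_dataset(2000, rng)
validation, pool = data[:500], data[500:]  # fixed hold-out, never trained on
curve = learning_curve(pool, validation, [0.10, 0.25, 0.50, 1.00], repeats=3, rng=rng)
for size, score in curve:
    print(f"{size:>5} training rows -> validation accuracy {score:.3f}")
```

Plotting `curve` with size on the horizontal axis and score on the vertical axis gives the learning curve. At 100% every draw is the same set, so the repeats only add information at the smaller sizes.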
A few practical choices make the curve trustworthy. Space your subset sizes so the steps widen as they grow (doubling, or roughly geometric steps), because gains typically shrink as the dataset grows and equal-sized steps at the large end reveal little. Hold everything else fixed across runs, including the architecture, hyperparameters, and training budget, so the only variable is data size. The sweep is usually cheap next to the cost of collecting or generating more data, so it is worth running before you commit real budget to either.
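As a small illustration of geometric spacing, the helper below produces subset sizes that double at each step and always finish on the full dataset. The function name, the starting size of 500, and the total of 10,000 rows are hypothetical.

```python
def geometric_sizes(n_total: int, start: int, factor: float = 2.0) -> list[int]:
    """Subset sizes that grow by `factor` each step and end at n_total."""
    if start < 1 or factor <= 1.0:
        raise ValueError("start must be >= 1 and factor must be > 1")
    sizes = []
    size = float(start)
    while size < n_total:
        sizes.append(int(size))
        size *= factor
    sizes.append(n_total)  # always finish on the full dataset
    return sizes


print(geometric_sizes(10_000, start=500))
# [500, 1000, 2000, 4000, 8000, 10000]
```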
Reading the curve is where the answer lives. If the curve is still climbing steeply when you reach 100% of your data, the model is still learning from every example you add — more data, real or generated, will likely keep raising performance, and collecting it is worth the effort. If the curve has bent over into a plateau, you've hit diminishing returns: more of the same data won't move the metric much, and your bottleneck is somewhere else, whether that's model capacity, feature quality, or label noise. The shape of the curve, not a magic row count, is what tells you where you stand and whether more data is the right investment at all.
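One rough way to turn "still climbing" versus "plateau" into a check is to look at the gain from the last step of the curve. The sketch below assumes the curve is a list of (training size, score) pairs at roughly geometric sizes. The 0.005 threshold and both example curves are made-up illustrative values, not measurements.

```python
def marginal_gains(curve):
    """Score gain between consecutive points of a learning curve.

    `curve` is a list of (training_size, score) pairs sorted by size.
    """
    return [
        (n_next, s_next - s_prev)
        for (_, s_prev), (n_next, s_next) in zip(curve, curve[1:])
    ]


def has_plateaued(curve, min_gain=0.005):
    """True if the last step improved the score by less than `min_gain`."""
    gains = marginal_gains(curve)
    if not gains:
        raise ValueError("need at least two points on the curve")
    return gains[-1][1] < min_gain


# Illustrative curves (invented numbers).
climbing = [(1000, 0.70), (2000, 0.76), (4000, 0.81), (8000, 0.85)]
flat = [(1000, 0.70), (2000, 0.80), (4000, 0.84), (8000, 0.842)]
print(has_plateaued(climbing))  # False
print(has_plateaued(flat))      # True
```

The threshold should sit above the run-to-run noise you see when repeating a subset size; otherwise a noisy step can look like a plateau or a gain.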
Why the answer differs for fine-tuning, RL, and classical ML
The sizing logic so far assumes you're training a model from scratch. Many teams aren't, and the data question changes shape depending on what you're actually doing — so it helps to place your project in one of three buckets.
Fine-tuning an LLM
Fine-tuning means adapting a pretrained model to a narrower task or domain by training it further on a smaller, targeted dataset. Because the base model already encodes broad language and reasoning ability, fine-tuning often needs only hundreds to a few thousand high-quality, diverse examples — what you supply is the specific behavior, tone, or format you want, not general competence. Here especially, the quality and coverage of the cases you care about matter far more than volume.
Reinforcement learning
A reinforcement learning environment is the simulated world an agent acts in, learning from feedback on its actions rather than from a labeled dataset. RL doesn't consume a fixed pile of rows the way supervised learning does; what it needs is a realistic environment and enough task variety to learn a strategy that generalizes. The constraint is rarely a row count — it's whether you can build reinforcement learning environments rich and varied enough to train against.
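To show what "an environment instead of rows" looks like in code, here is a toy sketch. `LineWorld`, its reward values, and the hand-written `greedy_policy` are all hypothetical; a real project would train the policy rather than hard-code it, and would use a far richer environment.

```python
import random


class LineWorld:
    """Toy environment: walk along a line of cells to reach a goal cell.

    Each reset() draws a new start and goal, which stands in for the
    task variety an agent has to generalize across.
    """

    def __init__(self, length=10, max_steps=20, seed=None):
        self.length = length
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.position = self.rng.randrange(self.length)
        self.goal = self.rng.randrange(self.length)
        self.steps = 0
        return (self.position, self.goal)

    def step(self, action):
        """action is -1 (left) or +1 (right). Returns (observation, reward, done)."""
        self.position = min(self.length - 1, max(0, self.position + action))
        self.steps += 1
        reached = self.position == self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else -0.01  # feedback replaces labels
        return (self.position, self.goal), reward, done


def greedy_policy(observation):
    """Hand-written stand-in for a learned policy: step toward the goal."""
    position, goal = observation
    return 1 if goal > position else -1


env = LineWorld(seed=0)
returns = []
for _ in range(5):  # five episodes, each a freshly sampled task
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(greedy_policy(obs))
        total += reward
    returns.append(total)
print(returns)
```

Nothing here is a labeled dataset: the agent only ever sees observations and rewards, which is why the quality and variety of the environment, not a row count, sets the ceiling.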
Classical ML
For traditional models — the classifiers, regressors, and tree-based models most teams still run in production — the sizing logic from the earlier sections applies directly: estimate with a heuristic, then confirm with a learning curve.
Across all three, the same pattern surfaces: for fine-tuning and reinforcement learning especially, the real constraint is often not how much data you can afford to label, but that suitable data doesn't exist yet, or can't be used because it's sensitive — which changes the question from how much data you need to where it comes from.
What to do when you can't get enough data
When collecting more real data is impractical, because the data is scarce, expensive, or blocked by privacy rules, two practical paths close the gap without gathering more real examples: generate the data you need, or unlock data you already hold but can't currently use.
The first path is synthetic data generation. Tonic Fabricate generates synthetic data in a few modes. It can model an existing database through Live Connect or build a dataset from scratch, maintaining referential integrity and giving you control over schema, coverage, and ground truth. It also generates unstructured and free-text data, not only structured databases. And it builds simulated multi-agent environments for agent and reinforcement-learning training, populated with realistic context (emails, support tickets, CRM activity) over a structured metadata layer that supports verifiable tasks of graded difficulty.
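For readers unfamiliar with the term, referential integrity means every foreign key in a generated table points at a row that actually exists. The sketch below illustrates the idea in plain Python. It is a stand-in, not Fabricate's API or method, and the table and column names are hypothetical.

```python
import random


def make_customers(n, rng):
    """Generate n synthetic customer rows with sequential primary keys."""
    return [
        {"customer_id": i, "segment": rng.choice(["smb", "enterprise"])}
        for i in range(1, n + 1)
    ]


def make_orders(n, customers, rng):
    """Each order draws its customer_id from existing customers,
    so the foreign key always resolves."""
    ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": i,
            "customer_id": rng.choice(ids),
            "amount": round(rng.uniform(5, 500), 2),
        }
        for i in range(1, n + 1)
    ]


def orphaned_orders(orders, customers):
    """Orders whose customer_id matches no customer row."""
    known = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known]


rng = random.Random(42)
customers = make_customers(50, rng)
orders = make_orders(200, customers, rng)
print(len(orphaned_orders(orders, customers)))  # 0
```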
The second path applies when the data exists but you can't touch it. Much of the most useful training material (clinical notes, support transcripts, internal documents) sits behind privacy and compliance constraints. Tonic Textual, another Tonic.ai product, uses proprietary NER (named entity recognition) models to detect and then redact or synthesize the sensitive entities in free text, so the text becomes usable for training without exposing the people described in it.
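The detect-then-redact pattern can be illustrated with a deliberately simple stand-in. The sketch below uses regular expressions rather than NER models, and it is not Tonic Textual's API. The patterns, the placeholder format, and the sample sentence are assumptions for illustration only.

```python
import re

# Regex stand-ins for entity detectors (assumed, simplified patterns).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a typed placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Contact Dana at dana.lee@example.com or 555-013-2447 about the refund."
print(redact(note))
# Contact Dana at [EMAIL] or [PHONE] about the refund.
```

Note that the name "Dana" passes through untouched. Pattern matching only catches entities with a rigid shape, which is the gap that model-based entity recognition is meant to close.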
The Tonic Advantage: For reinforcement learning and agent training, the hard part usually isn't volume — it's building a world realistic enough to learn from when no real corpus exists. Fabricate generates a simulated environment with a structured metadata layer underneath it, so you can produce training data, including unstructured and free-text content, with controllable coverage and verifiable tasks of graded difficulty. That lets you extend the learning curve past the point where collecting real data stalls, instead of waiting on data that may never arrive.
How far this can go is worth a concrete example. In a Tonic.ai benchmark, an open-source model (Qwen3.5-35B-A3B) was fine-tuned only on a synthetic email corpus generated by Fabricate — a fictional company invented for the test — and then evaluated on the real-world Enron email benchmark it had never seen. It reached 86%, a 5.5-point gain that put it ahead of both o3 and gpt-4.1-mini (each at 85%), without training on a single real email.
This is the learning-curve idea taken one step further: synthetic generation extends the curve when real-data collection has stalled, and for agents and RL it builds usable training data where no real corpus exists at all.