RL evaluation datasets need three things real-world data rarely delivers together: verifiable ground truth, controllable task difficulty, and freedom from contamination or privacy risk. Synthetic data solves this by generating the evaluation set and its ground-truth labels in the same step, turning difficulty and coverage into deliberate design choices instead of whatever a real-world corpus happens to contain. Teams building RL and agent evaluation harnesses increasingly rely on this approach to test agents against known-correct answers instead of noisy, unlabeled, or already-memorized benchmarks.

Why real-world data makes a poor RL evaluation benchmark

Real-world sources are the default starting point for RL and agent evaluation sets, and they fail in three predictable ways once a team tries to build something rigorous from them. Benchmarks assembled from GitHub issues, live websites, or scraped support queues (the kind of raw material behind reference points like SWE-bench and WebArena) are useful examples of the category, but they inherit the same structural weaknesses regardless of domain.

Scarcity. The tasks worth testing an agent against — a genuinely ambiguous support ticket, a multi-step debugging session, a contract clause that only shows up in edge cases — are rare in any single real-world corpus, and there's no way to ask for more of them.
Contamination. Frontier models are trained on enormous slices of the public internet, so a benchmark built from public repositories or well-known websites has often already been seen during pretraining, which inflates scores without reflecting genuine task competence.
No controllable ground truth. Real-world data doesn't arrive with a verified correct answer attached. Someone has to reconstruct what "correct" means after the fact, usually by hand, and that reconstruction is slow, inconsistent between annotators, and hard to audit at scale.

These three problems compound rather than average out: a benchmark that's both scarce and contaminated gives a false signal that's also expensive to refresh. None of this is a critique of SWE-bench, WebArena, or similar benchmarks for what they were built to measure — the point is structural. Building an evaluation set from data no one authored for evaluation purposes means the ground truth, the difficulty curve, and the coverage all have to be discovered after the fact rather than designed up front.

What a trustworthy RL evaluation set actually needs

A trustworthy RL evaluation set has to clear three bars that real-world data clears only by accident. First, every task needs a known-correct answer tied to structured metadata: not a label bolted on afterward, but a piece of data that exists because the underlying scenario was designed to produce it. Second, task difficulty and coverage need to be something you set, not something you discover. You should be able to specify a mix of single-hop lookups and multi-hop reasoning chains and get exactly that distribution. Third, the set has to be reproducible. The same evaluation run should mean the same thing next quarter, without the benchmark quietly drifting because an underlying data source changed. Only then can a later score improvement be trusted as the model getting better rather than the benchmark moving.
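
As a minimal sketch of those three bars, the record below carries its answer and evidence alongside the question, exposes the difficulty mix as something you can count, and gives the whole set a stable fingerprint for reproducibility checks. The `EvalTask` fields, `hop_mix`, and `fingerprint` are illustrative names chosen for this example, not part of any Tonic product or API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical task record; field names are assumptions for illustration."""
    task_id: str
    question: str
    answer: str           # known-correct answer, fixed at design time
    hops: int             # 1 = single-hop lookup, >1 = multi-hop chain
    evidence_ids: tuple   # ids of the metadata records the answer derives from


def hop_mix(tasks):
    """Count tasks per difficulty tier so the mix can be checked against a spec."""
    counts = {}
    for task in tasks:
        counts[task.hops] = counts.get(task.hops, 0) + 1
    return counts


def fingerprint(tasks):
    """Order-independent SHA-256 of the eval set: same tasks, same fingerprint."""
    ordered = sorted(tasks, key=lambda task: task.task_id)
    payload = json.dumps([asdict(task) for task in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Recording the fingerprint next to each reported score is one simple way to confirm that two runs a quarter apart were scored against the same benchmark.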

Structured, controllable generation is one of the harder problems in AI training data to solve well, because it requires building a coherent world first and only then generating tasks against it. Tonic Fabricate is built around that sequencing: it constructs the underlying entities, timeline, and relationships before generating any evaluation content, so ground truth is a property of the data rather than something layered on top of it afterward. That same structured-world approach is what makes Fabricate a fit for populating the reinforcement learning environments an agent trains and is scored against, not just one-off test sets.

The Tonic Advantage: structured metadata, not reconstructed labels. Fabricate builds a metadata layer alongside the world it generates — timeline events, entity relationships, thread and task specifications — that ties every generated scenario to its correct answer at creation time. Because that metadata is designed in, Fabricate can design hierarchies of verifiable tasks at controllable difficulty, from single-hop lookups to multi-hop reasoning chains, instead of hoping a real-world corpus happens to contain the right mix.

Building ground truth into the data instead of labeling it afterward

The generate-first principle is what separates ground truth built into an evaluation set from ground truth added afterward. Because Tonic Fabricate constructs the underlying world — the characters, the organization, the timeline of events — before any email, ticket, or document exists, the correct answer to a task is known at the moment the task is generated. There's no separate annotation pass reconstructing what "correct" means from an already-written artifact; the correctness is part of the specification that produced the artifact in the first place, which removes the slowest and most error-prone step in building an evaluation set.
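
The generate-first idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Fabricate's implementation: the `Event` record, `build_world`, `render_email`, and `make_task` are hypothetical names, and the "world" is just three seeded timeline events. The point it shows is the ordering: the structured event exists first, and both the email text and the correct answer are derived from it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical timeline event: the structured source of truth."""
    event_id: str
    day: int
    owner: str
    topic: str


def build_world(seed):
    """Build the structured world first; no free text exists yet."""
    rng = random.Random(seed)
    people = ["Avery", "Blake", "Casey", "Devon"]
    topics = ["budget review", "vendor renewal", "security audit"]
    return [
        Event(f"evt-{i}", rng.randint(1, 28), rng.choice(people), topic)
        for i, topic in enumerate(topics)
    ]


def render_email(event):
    """Render free text from the event; the text is downstream of the metadata."""
    return (
        f"From: {event.owner}\n"
        f"Subject: {event.topic}\n\n"
        f"I'll take the {event.topic}; it is due on day {event.day}."
    )


def make_task(event):
    """The answer is read off the event, never reconstructed from the email."""
    return {
        "question": f"Who owns the {event.topic}?",
        "answer": event.owner,
        "evidence": [event.event_id],
        "artifact": render_email(event),
    }
```

Because the answer comes from the same record that produced the artifact, there is no annotation pass to disagree with it.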

The clearest evidence that this produces evaluation-grade data comes from a Tonic.ai research benchmark. Fabricate generated a complete synthetic corporate email environment: a fictional 100-person company with roughly 1,964 emails, plus a structured metadata layer of timeline events, thread specifications, and cross-references. That layer supports tasks ranging from single-email lookups to multi-hop reasoning across threads. An open-source model fine-tuned only on that synthetic data improved on the real-world Enron email benchmark from 80.5% to 86%, outperforming o3 and gpt-4.1-mini on email it had never seen. Because the model never touched a single real email during training, the result is hard to attribute to anything but the quality of the ground truth Fabricate built in from the start.

The practical shift this enables is generating data with labels attached, rather than annotating it after the fact. With manual labeling, someone has to look at a finished example and decide what it should have been, one example at a time — and two annotators looking at the same ambiguous ticket or thread often disagree, so the "ground truth" itself carries noise before a model is ever scored against it. With a generated evaluation set, that decision is made once, at design time, and every example inherits it automatically and consistently, which is what makes it practical to build eval sets at the scale RL and agent training actually needs.

Controlling task difficulty and coverage

Difficulty in an RL evaluation set is a dial you can turn, not a property you inherit from whatever data happens to exist. Because Tonic Fabricate builds the underlying structured world before generating tasks against it, difficulty can be designed as a deliberate gradient: single-hop lookups that test basic retrieval, multi-step tasks that require connecting information across several messages or systems, and reasoning chains that force an agent to hold several pieces of context in mind at once. Each tier is a design choice, not a byproduct of what a real-world corpus happened to contain.
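
To make the difficulty dial concrete, here is a small sketch that generates single-hop and two-hop tasks against the same structured world and honors a requested mix exactly. The `MANAGER` and `OFFICE` tables, the builder functions, and `generate` are all invented for this example; a real world model would be far richer.

```python
import random

# Hypothetical structured world: who reports to whom, and who sits where.
MANAGER = {"Avery": "Blake", "Casey": "Blake", "Devon": "Emery", "Blake": "Emery"}
OFFICE = {
    "Avery": "Lisbon",
    "Blake": "Lisbon",
    "Casey": "Denver",
    "Devon": "Osaka",
    "Emery": "Osaka",
}


def single_hop(person):
    """One lookup: person -> manager."""
    return {
        "hops": 1,
        "question": f"Who is {person}'s manager?",
        "answer": MANAGER[person],
    }


def two_hop(person):
    """Two chained lookups: person -> manager -> office."""
    return {
        "hops": 2,
        "question": f"Which office does {person}'s manager work in?",
        "answer": OFFICE[MANAGER[person]],
    }


def generate(mix, seed=0):
    """Generate exactly mix[hops] tasks per tier, e.g. {1: 3, 2: 5}."""
    rng = random.Random(seed)
    builders = {1: single_hop, 2: two_hop}
    people = sorted(MANAGER)
    tasks = []
    for hops, count in sorted(mix.items()):
        for _ in range(count):
            tasks.append(builders[hops](rng.choice(people)))
    return tasks
```

Calling `generate({1: 3, 2: 5})` yields three single-hop and five two-hop tasks, each with an answer computed from the same tables the question was built from.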

Coverage works the same way. If an agent needs to handle a rare escalation path, an unusual account state, or a scenario that occurs once in ten thousand real interactions, that scenario can be generated as often as the evaluation set needs it — rather than waiting for enough real occurrences to accumulate, or accepting an eval set where the hardest cases are underrepresented by definition.
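
A quick back-of-the-envelope sketch shows why waiting for real occurrences is costly. The function and scenario names below are hypothetical, and the first calculation is an average-case estimate only.

```python
def real_interactions_needed(target_examples, one_in):
    """Real interactions needed, on average, to observe `target_examples`
    occurrences of a scenario that appears once per `one_in` interactions."""
    return target_examples * one_in


def plan_generation(quotas):
    """Expand per-scenario quotas into a flat generation plan."""
    plan = []
    for scenario, count in sorted(quotas.items()):
        plan.extend([scenario] * count)
    return plan


# Illustrative quotas: the rare path is requested directly, not waited for.
quotas = {"routine_refund": 100, "rare_escalation": 50}
plan = plan_generation(quotas)
```

Under these assumptions, collecting 50 examples of a one-in-ten-thousand scenario from real traffic takes about 500,000 interactions on average, while a generated set simply asks for 50.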

The Tonic Advantage: task hierarchies, not a fixed difficulty curve. Because the structured metadata layer ties every task to the entities and relationships that produced it, Fabricate can generate a graded hierarchy of tasks — from single-hop to multi-hop — across the same underlying world, so a team can test an agent's reasoning depth deliberately instead of hoping the eval set's difficulty distribution happens to match what they need to measure.

These same difficulty and coverage choices carry over directly into the RL environment your agent trains against, not just the one-off benchmark it's scored against at the end.

Keeping eval data free of PII and contamination risk

If an RL evaluation set is built from real support tickets, chat transcripts, or emails, that free text usually carries PII — names, account numbers, order details — mixed into the prose rather than confined to a column you can strip out. Using that text to score or train an agent means handling the PII first, and the mechanism matters: naive redaction that blacks out every sensitive span tends to also strip the context an agent needs to be evaluated fairly, since a blacked-out name or account number changes the shape of the text the model is reasoning over.

Tonic Textual addresses this by detecting sensitive entities in free text using proprietary NER models and replacing them with realistic synthetic substitutes rather than blacking them out — so a real name becomes a different, plausible name, and the surrounding sentence structure and statistical properties stay intact. That distinction matters specifically for evaluation data: a task built around a synthesized-but-realistic ticket still tests the same reasoning skill a real ticket would, while a heavily redacted one can turn into pattern-matching against blank fields rather than a genuine test of an agent's judgment.
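
The difference between redaction and substitution is easy to see in a toy example. The sketch below is not Tonic Textual's API and does no NER: it assumes the sensitive names have already been detected upstream, and it uses an invented `ACCT-` followed by six digits account format purely for illustration.

```python
import re

# Hypothetical account-number format, assumed for this example only.
ACCOUNT_RE = re.compile(r"\bACCT-\d{6}\b")


def redact(text, names):
    """Naive redaction: every sensitive span becomes the same opaque token."""
    for name in names:
        text = text.replace(name, "[REDACTED]")
    return ACCOUNT_RE.sub("[REDACTED]", text)


def substitute(text, name_map):
    """Consistent substitution: each real value maps to one plausible fake."""
    if name_map:
        # Single pass over all names so a fake name is never re-replaced.
        longest_first = sorted(name_map, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(name) for name in longest_first))
        text = pattern.sub(lambda match: name_map[match.group(0)], text)

    accounts = {}

    def fake_account(match):
        real = match.group(0)
        if real not in accounts:
            accounts[real] = f"ACCT-{900000 + len(accounts):06d}"
        return accounts[real]

    return ACCOUNT_RE.sub(fake_account, text)


ticket = (
    "Maria Lopez wrote in about ACCT-123456. Maria Lopez says ACCT-123456 "
    "was charged twice, but ACCT-654321 was not."
)
redacted = redact(ticket, ["Maria Lopez"])
synthetic = substitute(ticket, {"Maria Lopez": "Dana Whitfield"})
```

In `synthetic`, the two mentions of the same account still match each other and the third account stays distinct, so a task like "which account was charged twice?" remains answerable. In `redacted`, all five spans collapse into the same token and that distinction is gone.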

That claim is testable, not just descriptive. Tonic.ai's PrivacyBench benchmark scores exactly this kind of free-text de-identification on email and Slack data, measuring both how much PII a pipeline catches and how coherent its synthetic replacements are. Pairing Textual's detection with an LLM for synthesis lifted end-to-end de-identification accuracy to 92.0%, ahead of using an LLM alone for both stages at 87.7%, and held detection recall at 95% against 88.8% for LLM-only detection. That gap is the same distinction that matters for an eval set: a pipeline that misses PII or replaces it incoherently doesn't just create privacy exposure, it distorts the task the agent is being evaluated on.
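
For readers unfamiliar with the metric, detection recall is the share of true PII spans that a pipeline catches. The sketch below uses the standard definition over exact-match spans; it is an assumption for illustration and may differ from how PrivacyBench scores partial or overlapping matches.

```python
def detection_recall(gold_spans, predicted_spans):
    """Fraction of gold PII spans that the detector also produced.

    Spans are (start, end, label) tuples and must match exactly.
    """
    gold = set(gold_spans)
    if not gold:
        return 1.0  # nothing to find, so nothing was missed
    return len(gold & set(predicted_spans)) / len(gold)


gold = [(0, 11, "NAME"), (27, 38, "ACCOUNT"), (40, 51, "NAME"), (57, 68, "ACCOUNT")]
predicted = [(0, 11, "NAME"), (27, 38, "ACCOUNT"), (40, 51, "NAME"), (70, 75, "NAME")]
recall = detection_recall(gold, predicted)
```

Here the detector finds three of four gold spans, so `recall` is 0.75; the extra predicted span hurts precision, not recall.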

Contamination and privacy risk share a root cause worth naming directly: both come from using text that was written for some other purpose — a real customer interaction, a public repository — and repurposing it for evaluation without controlling what's in it or who has already seen it. Generating the evaluation set, or de-identifying real data before using it, addresses that root cause directly rather than working around it after the fact.

A practical workflow for building an RL eval set

Building an RL evaluation set that holds up in practice follows a consistent sequence, whether the domain is customer support, coding, or internal tooling.

Define the domain and declare correctness before generating any text. Specify the entities, characters, and organization the evaluation set will simulate, and decide what a correct answer looks like for each task type before a single email, ticket, or document is generated.
Generate the structured metadata layer first. Build the timeline, relationships, and thread specifications that link each task to its correct answer, so the metadata exists independently of the free text that will sit on top of it.
Build difficulty tiers from single-hop to multi-hop. Generate tasks against the structured layer at each difficulty level deliberately, rather than generating text first and sorting it into difficulty buckets afterward.
Run the eval harness and watch for harness noise, not just model performance. The harness itself can move scores independently of the model being tested. In Tonic.ai's own RL benchmark work, which used the same synthetic Vectrix eval set behind the Enron result above, reviewing agent traces surfaced two issues that belonged to the harness rather than the model. An incidental phrase in the agent's system prompt ("you may take up to 10 turns") was suppressing scores by up to 14 percentage points for some models until it was reworded. Running the judge model at its default temperature rather than temperature zero introduced roughly 19% noise into scoring. Fix the harness configuration before comparing models or runs, or noise like this can swamp the signal the eval set is meant to capture.
Re-run and version the benchmark as the domain evolves. Treat the evaluation set the same way you'd treat production code — versioned, reproducible, and updated deliberately rather than left to drift.
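
One way to check the harness step in the sequence above is to score the same responses several times and count how often the verdict changes. The sketch below is a hypothetical illustration, not the method behind the figures quoted above: `flip_rate`, `exact_match_judge`, and `make_noisy_judge` are invented names, and the noisy judge only simulates a non-deterministic scorer with a seeded random flip.

```python
import random


def exact_match_judge(item):
    """Deterministic judge: case-insensitive exact match against ground truth."""
    return item["response"].strip().lower() == item["answer"].strip().lower()


def make_noisy_judge(seed, noise=0.2):
    """Simulated non-deterministic judge that flips its verdict with
    probability `noise`. A stand-in for a sampled LLM judge, for illustration."""
    rng = random.Random(seed)

    def judge(item):
        verdict = exact_match_judge(item)
        return (not verdict) if rng.random() < noise else verdict

    return judge


def flip_rate(judge, items, repeats=5):
    """Fraction of items whose verdict is not identical across repeated judging."""
    unstable = 0
    for item in items:
        verdicts = {judge(item) for _ in range(repeats)}
        if len(verdicts) > 1:
            unstable += 1
    return unstable / len(items)


items = [
    {"answer": "Blake", "response": "blake"},
    {"answer": "Lisbon", "response": "Osaka"},
] * 100
```

A deterministic judge gives a flip rate of zero on `items`, which is the baseline to aim for before comparing models. Any nonzero rate from the judge under test is harness noise rather than model signal.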

This sequence produces a defined, repeatable outcome: an evaluation set where the ground truth, the difficulty distribution, and the harness itself are all things a team designed on purpose, rather than artifacts of whatever data happened to be available.
