Synthetic Data Generation: Create Test Datasets from a Schema
Synthetic Data Generation lets you define a column schema with types, constraints, and categorical distributions, then generate structured test datasets directly from the FutureAGI dashboard. You can review the output, iterate on it, and run quality evals, all without writing code.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | Dashboard only |
- FutureAGI account → app.futureagi.com
Tutorial
Start the synthetic data wizard
- Go to app.futureagi.com, then Dataset, then Add Dataset
- Select Create Synthetic Data
Add details
| Field | Value |
|---|---|
| Name | support-qa-synthetic |
| Description | Customer support Q&A pairs for an e-commerce company covering returns, shipping, billing, and account issues |
| Objective | Fine-tuning a support chatbot |
| Pattern | Questions phrased naturally as a customer would ask. Answers professional, concise, and actionable. |
| Enter No. of rows | 20 |
Select knowledge base (optional): If you have a Knowledge Base with your product docs, select it here to ground the generated data in your domain. The generator will use your KB documents as context and produce Q&A pairs that are verifiable against your actual content. Leave empty to generate without domain grounding.
To set up a KB first, see the Knowledge Base cookbook. You can also start directly from the KB detail view; click Create Synthetic data in the action bar, and the wizard opens with your KB pre-selected.
Click Next.
Add column properties
Add three columns using the Add columns button:
Column 1: question
- Column Type: Text
- Properties: Min Length = 20, Max Length = 200
Column 2: answer
- Column Type: Text
- Properties: Min Length = 50, Max Length = 500
Column 3: category
- Column Type: Text
- Properties: Set Value to Categorical with:
  - shipping: 25%
  - billing: 25%
  - returns: 25%
  - account: 25%
Note
Category percentages must sum to 100%. Use Add more properties to add constraints per column. See Dataset overview for all supported column types and properties.
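The dashboard enforces the 100% rule for you, but it can help to sanity-check a planned split before typing it in. The following is a minimal local sketch, not part of FutureAGI: `CATEGORY_SHARES`, `check_shares`, and `expected_counts` are hypothetical names, and the rounding rule (largest remainder) is an assumption about how a split could be turned into whole rows, not a description of how the platform does it.

```python
# Hypothetical local mirror of the category split entered in the wizard.
# Nothing here calls FutureAGI; it only checks the arithmetic.
CATEGORY_SHARES = {"shipping": 25, "billing": 25, "returns": 25, "account": 25}


def check_shares(shares: dict[str, int]) -> None:
    """Raise ValueError unless the percentages sum to exactly 100."""
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"category percentages sum to {total}, expected 100")


def expected_counts(shares: dict[str, int], n_rows: int) -> dict[str, int]:
    """Split n_rows by percentage using largest-remainder rounding (an assumption)."""
    check_shares(shares)
    exact = {name: n_rows * pct / 100 for name, pct in shares.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = n_rows - sum(counts.values())
    # Hand any remaining rows to the categories with the largest fractional part.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(expected_counts(CATEGORY_SHARES, 20))
# {'shipping': 5, 'billing': 5, 'returns': 5, 'account': 5}
```

With 20 rows and four equal shares, the target is 5 rows per category. The generated dataset may not match these counts exactly, so treat them as a reference point when reviewing the output.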
Click Next.
Add description
Write a description for each column. Use {{column_name}} to reference other columns. Each reference creates a dependency, so generated values are contextually related.
Column 1: question
A realistic customer support question about {{category}} issues.
Phrased as a real customer would type it in a chat widget.
Column 2: answer
A professional support response to {{question}} about {{category}}.
Directly addresses the concern with a clear next step.
Column 3: category
The support category this Q&A pair belongs to.
Generate
Click Create Dataset. The platform generates rows server-side and redirects you to the new dataset.
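To see what the {{column_name}} references in the column descriptions imply, the sketch below extracts them and derives an order in which the columns could be filled. It is an illustration only: how the platform resolves dependencies internally is not documented here, and `REFERENCE`, `dependencies`, and `generation_order` are hypothetical names. It uses Python's standard `graphlib` module (Python 3.9+).

```python
import re
from graphlib import TopologicalSorter

# Column descriptions copied from the tutorial.
DESCRIPTIONS = {
    "question": (
        "A realistic customer support question about {{category}} issues. "
        "Phrased as a real customer would type it in a chat widget."
    ),
    "answer": (
        "A professional support response to {{question}} about {{category}}. "
        "Directly addresses the concern with a clear next step."
    ),
    "category": "The support category this Q&A pair belongs to.",
}

# Matches {{name}} with optional spaces inside the braces.
REFERENCE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def dependencies(descriptions: dict[str, str]) -> dict[str, set[str]]:
    """Map each column to the set of columns its description references."""
    deps = {}
    for column, text in descriptions.items():
        refs = set(REFERENCE.findall(text))
        unknown = refs - set(descriptions)
        if unknown:
            raise ValueError(f"{column} references unknown columns: {sorted(unknown)}")
        deps[column] = refs
    return deps


def generation_order(descriptions: dict[str, str]) -> list[str]:
    """Order columns so each one comes after the columns it references."""
    # TopologicalSorter raises graphlib.CycleError if references are circular.
    return list(TopologicalSorter(dependencies(descriptions)).static_order())


print(generation_order(DESCRIPTIONS))
# ['category', 'question', 'answer']
```

The order shows why category carries no references: question depends on it, and answer depends on both. A description that referenced itself or formed a loop (for example, question referencing answer) would have no valid order.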
Review and iterate
- Sort/filter rows to inspect quality
- To re-generate or modify: click Configure Synthetic Data in the dataset toolbar. The Synthetic Data Details drawer opens.
  - Re-Generate same Configuration: retry with the same settings
  - Edit Configuration: modify the settings, then choose one of:
    - Replace the current dataset: overwrite with new rows
    - Create as new dataset: keep the original and generate a separate dataset
    - Add it to existing dataset: append new rows
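If you prefer a scripted check alongside manual inspection, the sketch below tests rows against the constraints set earlier: the length limits on question and answer, and the observed category split. It assumes the rows have been copied or exported into a list of dictionaries (the export route is not covered here), that Min Length and Max Length count characters, and that the sample rows are made up for illustration. `length_violations` and `category_shares` are hypothetical helper names.

```python
from collections import Counter

# Length limits from the "Add column properties" step, assumed to be in characters.
LENGTH_LIMITS = {"question": (20, 200), "answer": (50, 500)}


def length_violations(rows, limits=LENGTH_LIMITS):
    """Return (row_index, column, length) for every value outside its limits."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in limits.items():
            size = len(row.get(column, ""))
            if not low <= size <= high:
                problems.append((index, column, size))
    return problems


def category_shares(rows):
    """Return the observed percentage of rows per category."""
    counts = Counter(row["category"] for row in rows)
    return {name: 100 * n / len(rows) for name, n in counts.items()}


# Made-up rows for illustration only; real rows would come from your dataset.
sample = [
    {
        "question": "Where is my order? It was due last week.",
        "answer": "I'm sorry for the delay. Please share your order number so I can check tracking.",
        "category": "shipping",
    },
    {
        "question": "Refund?",  # too short: 7 characters, minimum is 20
        "answer": "You can start a return from the Orders page within 30 days of delivery.",
        "category": "returns",
    },
]

print(length_violations(sample))  # [(1, 'question', 7)]
print(category_shares(sample))    # {'shipping': 50.0, 'returns': 50.0}
```

Rows flagged here are candidates for Edit Configuration, for example by tightening the Pattern text or a column description before re-generating.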
Run evals on the generated data
- Click Evaluate in the dataset toolbar
- Add Evaluations → select completeness
- Map keys: output → answer, input → question
- Add & Run
Scores appear as a new column. Filter out low-quality rows before using the dataset for fine-tuning.
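Filtering can be done in the dashboard. If you work with the rows outside it, a sketch like the one below drops rows whose score falls under a cutoff. Everything specific here is an assumption: that the score column is named `completeness`, that scores are numeric on a 0 to 1 scale, and that 0.7 is a sensible threshold. Adjust all three to match what your dataset actually shows. The sample rows are made up.

```python
def keep_high_quality(rows, score_column="completeness", threshold=0.7):
    """Keep rows whose score is numeric and at least the threshold.

    Rows with a missing or non-numeric score are dropped rather than guessed at.
    """
    kept = []
    for row in rows:
        try:
            score = float(row[score_column])
        except (KeyError, TypeError, ValueError):
            continue  # no usable score, so exclude the row
        if score >= threshold:
            kept.append(row)
    return kept


# Made-up rows for illustration only.
scored = [
    {"question": "q1", "answer": "a1", "completeness": 0.92},
    {"question": "q2", "answer": "a2", "completeness": 0.41},
    {"question": "q3", "answer": "a3", "completeness": None},
]

print([row["question"] for row in keep_high_quality(scored)])  # ['q1']
```

Dropping unscored rows is the conservative choice for a fine-tuning set. If many rows lack a score, re-running the eval is likely a better fix than lowering the bar.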
For batch evaluation via SDK, see Dataset SDK: Batch Evaluation.
What you built
You can now generate synthetic datasets with categorical distribution, iterate on the output, and run quality evals from the FutureAGI dashboard.
- Generated 20 synthetic Q&A rows with categorical distribution across support topics
- Used {{column_name}} references to create interdependent columns
- Reviewed and iterated on generation via the Configure Synthetic Data drawer
- Ran quality evals on the generated data