Synthetic Data Generation: Create Test Datasets from a Schema

Use FutureAGI's Synthetic Data Generation feature to define column schemas, set categorical distributions, and generate structured test datasets — no code required.

TL;DR

Synthetic Data Generation lets you define a column schema with types, constraints, and categorical distributions, then generate structured test datasets directly from the FutureAGI dashboard. Review, iterate, and run quality evals on the output — all without writing code.

Time      Difficulty   Package
10 min    Beginner     Dashboard only
Prerequisites

Tutorial

Start the synthetic data wizard

  1. Go to app.futureagi.com, then Dataset, then Add Dataset
  2. Select Create Synthetic Data

Add details

Field               Value
Name                support-qa-synthetic
Description         Customer support Q&A pairs for an e-commerce company covering returns, shipping, billing, and account issues
Objective           Fine-tuning a support chatbot
Pattern             Questions phrased naturally as a customer would ask. Answers professional, concise, and actionable.
Enter No. of rows   20

Select knowledge base (optional): If you have a Knowledge Base with your product docs, select it here to ground the generated data in your domain. The generator will use your KB documents as context and produce Q&A pairs that are verifiable against your actual content. Leave empty to generate without domain grounding.

To set up a KB first, see the Knowledge Base cookbook. You can also start directly from the KB detail view; click Create Synthetic data in the action bar, and the wizard opens with your KB pre-selected.

Click Next.

Add column properties

Add three columns using the Add columns button:

Column 1: question

  • Column Type: Text
  • Properties: Min Length = 20, Max Length = 200

Column 2: answer

  • Column Type: Text
  • Properties: Min Length = 50, Max Length = 500

Column 3: category

  • Column Type: Text
  • Properties: Set Value to Categorical with:
    • shipping — 25%
    • billing — 25%
    • returns — 25%
    • account — 25%

Note

Category percentages must sum to 100%. Use Add more properties to add constraints per column. See Dataset overview for all supported column types and properties.
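
To see what the percentages imply for a 20-row dataset, the arithmetic can be sketched in Python. This is an optional illustration, not part of the dashboard workflow: the function name `expected_counts` and the largest-remainder rounding are assumptions made for the sketch, and the platform's own rounding behavior is not documented here, so actual counts may differ slightly.

```python
from fractions import Fraction


def expected_counts(distribution, total_rows):
    """Return target row counts per category for a percentage distribution."""
    if sum(distribution.values()) != 100:
        raise ValueError("Category percentages must sum to 100")
    # Exact (possibly fractional) share of rows for each category
    exact = {k: Fraction(p * total_rows, 100) for k, p in distribution.items()}
    # Round down first, then hand leftover rows to the largest remainders
    # (assumed rounding rule, chosen so the counts always sum to total_rows)
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total_rows - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


distribution = {"shipping": 25, "billing": 25, "returns": 25, "account": 25}
print(expected_counts(distribution, 20))
# {'shipping': 5, 'billing': 5, 'returns': 5, 'account': 5}
```

With the settings in this tutorial, each category should account for roughly 5 of the 20 rows.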

Click Next.

Add description

Write a description for each column. Use {{column_name}} to reference other columns — this creates dependencies so generated values are contextually related.

Column 1: question

A realistic customer support question about {{category}} issues.
Phrased as a real customer would type it in a chat widget.

Column 2: answer

A professional support response to {{question}} about {{category}}.
Directly addresses the concern with a clear next step.

Column 3: category

The support category this Q&A pair belongs to.
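
The dependencies created by the {{column_name}} references above can be made explicit with a short Python sketch. It extracts the references from each description and derives an order in which every column comes after the columns it refers to. The regular expression and the `generation_order` helper are illustrative assumptions; how the platform actually resolves references internally is not documented here.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Column descriptions from this tutorial (first line of each)
descriptions = {
    "question": "A realistic customer support question about {{category}} issues.",
    "answer": "A professional support response to {{question}} about {{category}}.",
    "category": "The support category this Q&A pair belongs to.",
}

# Matches {{name}}, allowing optional spaces inside the braces (assumed syntax)
REFERENCE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(descriptions):
    """Return column names ordered so referenced columns come first."""
    deps = {}
    for column, text in descriptions.items():
        refs = set(REFERENCE.findall(text))
        unknown = refs - set(descriptions)
        if unknown:
            raise ValueError(f"{column} references unknown columns: {sorted(unknown)}")
        deps[column] = refs
    # static_order() raises graphlib.CycleError if columns reference each other in a loop
    return list(TopologicalSorter(deps).static_order())


print(generation_order(descriptions))
# ['category', 'question', 'answer']
```

The sketch also surfaces two mistakes worth avoiding when writing descriptions: a reference to a column that does not exist, and two columns that reference each other.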

Generate

Click Create Dataset. The platform generates rows server-side and redirects you to the new dataset.

Review and iterate

  • Sort/filter rows to inspect quality
  • To re-generate or modify: click Configure Synthetic Data in the dataset toolbar. The Synthetic Data Details drawer opens.
    • Re-Generate same Configuration: retry with same settings
    • Edit Configuration: modify and choose:
      • Replace the current dataset: overwrite with new rows
      • Create as new dataset: keep original, generate a separate dataset
      • Add it to existing dataset: append new rows
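
If you prefer to check rows programmatically before deciding whether to re-generate, the column properties from this tutorial can be expressed as a small Python check. This is a sketch under assumptions: the sample rows are hypothetical stand-ins for generated data, and Min Length and Max Length are treated as character counts, which may not match how the platform measures length.

```python
from collections import Counter

# Constraints copied from the column properties in this tutorial
# (assumed to be character counts)
LENGTH_LIMITS = {"question": (20, 200), "answer": (50, 500)}
CATEGORIES = {"shipping", "billing", "returns", "account"}


def check_rows(rows):
    """Return (problems, category_counts) for a list of row dicts."""
    problems = []
    for i, row in enumerate(rows):
        for column, (lo, hi) in LENGTH_LIMITS.items():
            n = len(row[column])
            if not lo <= n <= hi:
                problems.append((i, column, f"length {n} outside {lo}-{hi}"))
        if row["category"] not in CATEGORIES:
            problems.append((i, "category", f"unexpected value {row['category']!r}"))
    return problems, Counter(row["category"] for row in rows)


# Hypothetical rows standing in for the generated dataset
rows = [
    {
        "question": "Where is my order? It was due last Friday.",
        "answer": "Sorry for the delay. Please share your order number "
                  "so we can check the courier status.",
        "category": "shipping",
    },
    {"question": "Refund?", "answer": "Yes.", "category": "refunds"},
]

problems, counts = check_rows(rows)
for problem in problems:
    print(problem)
print(dict(counts))
```

Here the second row is flagged three times: both text fields are too short and its category is not one of the four configured values.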

Run evals on the generated data

  1. Click Evaluate in the dataset toolbar
  2. Add Evaluations → select completeness
  3. Map keys: output → answer, input → question
  4. Add & Run

Scores appear as a new column. Filter out low-quality rows before using the dataset for fine-tuning.
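
Filtering can be done in the dashboard. For readers who work with the scored rows outside it, the same step can be sketched in Python. Everything specific here is an assumption for illustration: the CSV layout, the score column name `completeness`, the 0 to 1 score scale, and the 0.7 threshold are all hypothetical.

```python
import csv
import io

# Hypothetical CSV of scored rows; the column name and 0-1 scale are assumptions
exported = io.StringIO(
    "question,answer,category,completeness\n"
    "Where is my order?,Please share your order number.,shipping,0.92\n"
    "How do I get a refund?,Contact us.,returns,0.35\n"
)


def keep_high_quality(csv_file, score_column="completeness", threshold=0.7):
    """Return rows whose score is at or above the threshold."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if float(row[score_column]) >= threshold]


kept = keep_high_quality(exported)
print(len(kept), kept[0]["category"])
# 1 shipping
```

Pick the threshold by inspecting a few rows on either side of it rather than relying on a fixed number.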

For batch evaluation via SDK, see Dataset SDK: Batch Evaluation.

What you built

You can now generate synthetic datasets with a categorical distribution, iterate on the output, and run quality evals from the FutureAGI dashboard.

  • Generated 20 synthetic Q&A rows with a categorical distribution across support topics
  • Used {{column_name}} references to create interdependent columns
  • Reviewed and iterated on generation via the Configure Synthetic Data drawer
  • Ran quality evals on the generated data
