Synthetic Data Generation for AI Testing in Future AGI

Generate schema-driven test datasets without using real user data. Define column types, constraints, and descriptions, and Future AGI generates rows that match them.

About

Synthetic data is artificially generated data that follows real-world patterns without using actual user data. In Future AGI, you define a schema (columns, types, descriptions, and constraints) and the platform generates rows that match your specification.

For example, defining this schema:

Column	Type	Description
customer_query	text	A realistic customer support question
sentiment	text	One of: positive, negative, neutral
priority	integer	1 (low) to 5 (urgent)

Produces rows like:

customer_query	sentiment	priority
I haven’t received my order and it’s been two weeks	negative	4
Can I change the shipping address on my recent order?	neutral	2
Your product is fantastic, just wanted to say thanks!	positive	1

The generated data follows the constraints you set (sentiment is always one of three values, priority stays in range) while producing varied, realistic content.
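Once rows are exported, the constraints in this example can be checked mechanically. The following is a minimal Python sketch, not Future AGI's API: the `SCHEMA` dictionary layout and the `validate_row` helper are hypothetical names chosen for illustration, and the rows are the ones shown in the table above.

```python
# Hypothetical local representation of the example schema. This is not
# Future AGI's schema format, just a plain dictionary for illustration.
SCHEMA = {
    "customer_query": {"type": str},
    "sentiment": {"type": str, "allowed": {"positive", "negative", "neutral"}},
    "priority": {"type": int, "min": 1, "max": 5},
}


def validate_row(row, schema=SCHEMA):
    """Return a list of constraint violations for one row (empty if valid)."""
    errors = []
    for column, rules in schema.items():
        if column not in row:
            errors.append(f"{column}: missing")
            continue
        value = row[column]
        if not isinstance(value, rules["type"]):
            errors.append(f"{column}: expected {rules['type'].__name__}")
            continue
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{column}: {value!r} is not an allowed value")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{column}: {value} is below {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{column}: {value} is above {rules['max']}")
    return errors


# The three example rows from the table above.
rows = [
    {"customer_query": "I haven't received my order and it's been two weeks",
     "sentiment": "negative", "priority": 4},
    {"customer_query": "Can I change the shipping address on my recent order?",
     "sentiment": "neutral", "priority": 2},
    {"customer_query": "Your product is fantastic, just wanted to say thanks!",
     "sentiment": "positive", "priority": 1},
]

for row in rows:
    print(validate_row(row))
```

A check like this is useful as a review step after generation: any row that returns a non-empty list has drifted from the schema and can be edited or dropped.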

When to Use

No real data available: You’re building a new feature and don’t have production data yet
Privacy constraints: Real data contains PII or sensitive information that can’t be used for testing
Edge case testing: You need specific scenarios (angry customers, rare errors, multilingual queries) that are hard to find in real data
Scale testing: You need thousands of rows to stress-test evaluations or prompts
Balanced datasets: Real data is skewed (e.g. 95% positive reviews) and you need more balanced distributions

How It Works

Define the schema: column names, data types, and descriptions
Set constraints: value ranges, categorical options, patterns
Optionally connect a Knowledge Base to ground generation with your own documents
Choose the number of rows to generate
The platform generates the dataset. You can review, edit, and use it immediately.
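The steps above can be sketched locally to show how the pieces fit together. This is an illustrative Python stub under stated assumptions, not the platform's implementation or SDK: the `schema` list layout, the `knowledge_base` placeholder, and the `generate_rows` function are all hypothetical, and the stub samples constrained values at random instead of producing realistic text from the descriptions.

```python
import random

# Define the schema and set constraints (hypothetical layout).
schema = [
    {"name": "customer_query", "type": "text",
     "description": "A realistic customer support question"},
    {"name": "sentiment", "type": "text",
     "description": "One of: positive, negative, neutral",
     "options": ["positive", "negative", "neutral"]},
    {"name": "priority", "type": "integer",
     "description": "1 (low) to 5 (urgent)", "min": 1, "max": 5},
]

# Optional Knowledge Base grounding; a placeholder that this stub does not use.
knowledge_base = None

# Choose the number of rows to generate.
num_rows = 5


def generate_rows(schema, num_rows, seed=0):
    """Stand-in generator: sample values that satisfy each column's constraints.

    Free-text columns get a placeholder string, since realistic text
    generation is what the platform itself provides.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(num_rows):
        row = {}
        for col in schema:
            if "options" in col:
                row[col["name"]] = rng.choice(col["options"])
            elif col["type"] == "integer":
                row[col["name"]] = rng.randint(col["min"], col["max"])
            else:
                row[col["name"]] = f"<generated text {i + 1}: {col['description']}>"
        rows.append(row)
    return rows


# Generate the dataset, then review it.
dataset = generate_rows(schema, num_rows)
for row in dataset:
    print(row)
```

Keeping the schema as data, separate from the generator, mirrors the schema-driven workflow: changing a constraint or the row count does not require touching the generation logic.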

Key Properties

Schema-driven: You control the structure. Every column has a type, description, and optional constraints that guide generation.
Realistic distribution: Generated data follows natural patterns and distributions, not random values. Descriptions give the generator context to produce relevant content.
Safe by default: Generated data does not contain real PII, credentials, or sensitive information.

Next Steps

Generate Synthetic Data: Step-by-step quickstart for creating your first synthetic dataset
Static Columns: How static columns store the data you provide
Dynamic Columns: How to add model outputs and evaluations on top of your synthetic data
Knowledge Base: Ground synthetic generation with your own documents
