Synthetic Data
Generate realistic datasets from a schema without using real user data.
About
Synthetic data is artificially generated data that follows real-world patterns without using actual user data. In Future AGI, you define a schema (columns, types, descriptions, and constraints) and the platform generates rows that match your specification.
For example, suppose you define this schema:
| Column | Type | Description |
|---|---|---|
| customer_query | text | A realistic customer support question |
| sentiment | text | One of: positive, negative, neutral |
| priority | integer | 1 (low) to 5 (urgent) |
The platform then produces rows like:
| customer_query | sentiment | priority |
|---|---|---|
| I haven’t received my order and it’s been two weeks | negative | 4 |
| Can I change the shipping address on my recent order? | neutral | 2 |
| Your product is fantastic, just wanted to say thanks! | positive | 1 |
The generated data follows the constraints you set (sentiment is always one of three values, priority stays in range) while producing varied, realistic content.
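To make the idea of constraints concrete, here is a minimal Python sketch that expresses the example schema as plain data and checks rows against it. This is an illustration only, not Future AGI's API: the `Column` class, the `SCHEMA` list, and the `violations` helper are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """One schema column (hypothetical structure, for illustration only)."""
    name: str
    dtype: type
    description: str
    allowed: Optional[set] = None     # categorical options, if any
    min_value: Optional[int] = None   # inclusive lower bound, if any
    max_value: Optional[int] = None   # inclusive upper bound, if any


# The example schema from the table above.
SCHEMA = [
    Column("customer_query", str, "A realistic customer support question"),
    Column("sentiment", str, "One of: positive, negative, neutral",
           allowed={"positive", "negative", "neutral"}),
    Column("priority", int, "1 (low) to 5 (urgent)", min_value=1, max_value=5),
]


def violations(row: dict, schema: list) -> list:
    """Return human-readable constraint violations for a single row."""
    problems = []
    for col in schema:
        if col.name not in row:
            problems.append(f"{col.name}: missing")
            continue
        value = row[col.name]
        if not isinstance(value, col.dtype):
            problems.append(f"{col.name}: expected {col.dtype.__name__}")
            continue
        if col.allowed is not None and value not in col.allowed:
            problems.append(f"{col.name}: {value!r} is not an allowed value")
        if col.min_value is not None and value < col.min_value:
            problems.append(f"{col.name}: {value} is below {col.min_value}")
        if col.max_value is not None and value > col.max_value:
            problems.append(f"{col.name}: {value} is above {col.max_value}")
    return problems


good_row = {
    "customer_query": "Can I change the shipping address on my recent order?",
    "sentiment": "neutral",
    "priority": 2,
}
bad_row = {"customer_query": "Where is my order?", "sentiment": "angry", "priority": 9}

print(violations(good_row, SCHEMA))
print(violations(bad_row, SCHEMA))
```

A check like this can be useful when reviewing a generated dataset: a row that satisfies the schema yields an empty list, while a row with an out-of-set sentiment or an out-of-range priority yields one message per broken constraint.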
When to Use
- No real data available: You’re building a new feature and don’t have production data yet
- Privacy constraints: Real data contains PII or sensitive information that can’t be used for testing
- Edge case testing: You need specific scenarios (angry customers, rare errors, multilingual queries) that are hard to find in real data
- Scale testing: You need thousands of rows to stress-test evaluations or prompts
- Balanced datasets: Real data is skewed (e.g., 95% positive reviews) and you need a more balanced distribution
How It Works
- Define the schema: column names, data types, and descriptions
- Set constraints: value ranges, categorical options, patterns
- Optionally connect a Knowledge Base to ground generation with your own documents
- Choose the number of rows to generate
- Generate the dataset: the platform produces the rows, which you can review, edit, and use immediately
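The steps above can be sketched as assembling a single generation spec. The following Python example is a rough sketch under stated assumptions: the `build_generation_spec` function and every field name (`columns`, `constraints`, `choices`, `min`, `max`, `knowledge_base`, `num_rows`) are hypothetical and do not reflect the platform's actual request format.

```python
import json
from typing import Optional


def build_generation_spec(columns: list, num_rows: int,
                          knowledge_base: Optional[str] = None) -> dict:
    """Assemble a generation spec as a plain dict (hypothetical field names)."""
    if num_rows < 1:
        raise ValueError("num_rows must be at least 1")
    names = [col["name"] for col in columns]
    if len(names) != len(set(names)):
        raise ValueError("column names must be unique")
    spec = {"columns": columns, "num_rows": num_rows}
    # Knowledge Base grounding is optional, so the key is only set when given.
    if knowledge_base is not None:
        spec["knowledge_base"] = knowledge_base
    return spec


# Steps 1 and 2: define the columns and their constraints.
columns = [
    {"name": "customer_query", "type": "text",
     "description": "A realistic customer support question"},
    {"name": "sentiment", "type": "text",
     "description": "One of: positive, negative, neutral",
     "constraints": {"choices": ["positive", "negative", "neutral"]}},
    {"name": "priority", "type": "integer",
     "description": "1 (low) to 5 (urgent)",
     "constraints": {"min": 1, "max": 5}},
]

# Steps 3 and 4: skip the optional Knowledge Base and choose a row count.
spec = build_generation_spec(columns, num_rows=1000)
print(json.dumps(spec, indent=2))
```

Keeping the spec as plain data makes it easy to store alongside the resulting dataset, so you can later see which schema and row count produced it.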
Key Properties
- Schema-driven: You control the structure. Every column has a type, description, and optional constraints that guide generation.
- Realistic distribution: Generated data follows natural patterns and distributions, not random values. Descriptions give the generator context to produce relevant content.
- Safe by default: Generated data does not contain real PII, credentials, or sensitive information.
Next Steps
- Generate Synthetic Data: Step-by-step quickstart for creating your first synthetic dataset
- Static Columns: How static columns store the data you provide
- Dynamic Columns: How to add model outputs and evaluations on top of your synthetic data
- Knowledge Base: Ground synthetic generation with your own documents