Synthetic Data

Generate realistic datasets from a schema without using real user data.

About

Synthetic data is artificially generated data that follows real-world patterns without using actual user data. In Future AGI, you define a schema (columns, types, descriptions, and constraints) and the platform generates rows that match your specification.

For example, defining this schema:

| Column | Type | Description |
|---|---|---|
| customer_query | text | A realistic customer support question |
| sentiment | text | One of: positive, negative, neutral |
| priority | integer | 1 (low) to 5 (urgent) |

Produces rows like:

| customer_query | sentiment | priority |
|---|---|---|
| I haven’t received my order and it’s been two weeks | negative | 4 |
| Can I change the shipping address on my recent order? | neutral | 2 |
| Your product is fantastic, just wanted to say thanks! | positive | 1 |

The generated data follows the constraints you set (sentiment is always one of three values, priority stays in range) while producing varied, realistic content.
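
To make the constraints concrete, the example schema can be written down as plain Python data and used to check rows after generation. This is a minimal sketch, not the Future AGI API: `Column`, `SCHEMA`, and `violations` are hypothetical names introduced here for illustration, and the non-empty check on `customer_query` is an added assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Column:
    """One schema column: name, Python type, description, and a constraint check."""
    name: str
    type: type
    description: str
    check: Callable[[Any], bool]


# The example schema from this page, expressed as data (hypothetical structure).
SCHEMA = [
    Column("customer_query", str, "A realistic customer support question",
           lambda v: len(v.strip()) > 0),  # assumption: queries must be non-empty
    Column("sentiment", str, "One of: positive, negative, neutral",
           lambda v: v in {"positive", "negative", "neutral"}),
    Column("priority", int, "1 (low) to 5 (urgent)",
           lambda v: 1 <= v <= 5),
]


def violations(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row conforms."""
    problems = []
    for col in SCHEMA:
        if col.name not in row:
            problems.append(f"{col.name}: missing")
            continue
        value = row[col.name]
        if not isinstance(value, col.type):
            problems.append(f"{col.name}: expected {col.type.__name__}")
        elif not col.check(value):
            problems.append(f"{col.name}: constraint failed")
    return problems


good_row = {
    "customer_query": "Can I change the shipping address on my recent order?",
    "sentiment": "neutral",
    "priority": 2,
}
bad_row = {"customer_query": "Where is my refund?", "sentiment": "angry", "priority": 9}

print(violations(good_row))
print(violations(bad_row))
```

A check like this is useful as a review step on any generated dataset, because it reports exactly which column broke which rule.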


When to Use

  • No real data available: You’re building a new feature and don’t have production data yet
  • Privacy constraints: Real data contains PII or sensitive information that can’t be used for testing
  • Edge case testing: You need specific scenarios (angry customers, rare errors, multilingual queries) that are hard to find in real data
  • Scale testing: You need thousands of rows to stress-test evaluations or prompts
  • Balanced datasets: Real data is skewed (e.g. 95% positive reviews) and you need more balanced distributions
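
To judge whether a dataset is skewed in the sense of the last bullet, it helps to measure how the values of a column are distributed. The sketch below is a generic check written for this page; `label_shares` and `is_skewed` are hypothetical helpers, and the 0.8 threshold is an arbitrary assumption rather than a platform setting.

```python
from collections import Counter


def label_shares(rows: list[dict], column: str) -> dict[str, float]:
    """Return the fraction of rows holding each distinct value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_skewed(shares: dict[str, float], threshold: float = 0.8) -> bool:
    """Flag the column if any single value exceeds `threshold` (assumed cutoff)."""
    return max(shares.values(), default=0.0) > threshold


# Toy data: 19 positive reviews and 1 negative review, i.e. 95% positive.
reviews = [{"sentiment": "positive"}] * 19 + [{"sentiment": "negative"}]
shares = label_shares(reviews, "sentiment")
print(shares, is_skewed(shares))
```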

How It Works

  1. Define the schema: column names, data types, and descriptions
  2. Set constraints: value ranges, categorical options, patterns
  3. Optionally connect a Knowledge Base to ground generation with your own documents
  4. Choose the number of rows to generate
  5. The platform generates the dataset. You can review, edit, and use it immediately.
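
Steps 1 to 4 amount to assembling a small specification before generation starts. The sketch below models that specification locally and checks it for obvious mistakes. It is an illustration under stated assumptions, not the platform's request format: `ColumnSpec`, `GenerationRequest`, `ALLOWED_TYPES`, and the `knowledge_base` field are hypothetical names, and the type list contains only the two types shown on this page.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: only the column types that appear on this page.
ALLOWED_TYPES = {"text", "integer"}


@dataclass
class ColumnSpec:
    name: str                      # step 1: column name
    type: str                      # step 1: data type
    description: str               # step 1: description that guides generation
    constraints: dict = field(default_factory=dict)  # step 2: ranges, options, patterns


@dataclass
class GenerationRequest:
    columns: list[ColumnSpec]
    num_rows: int                          # step 4: number of rows
    knowledge_base: Optional[str] = None   # step 3: optional, hypothetical identifier

    def problems(self) -> list[str]:
        """Return spec problems found before any generation is attempted."""
        found = []
        if not self.columns:
            found.append("schema has no columns")
        names = [c.name for c in self.columns]
        if len(names) != len(set(names)):
            found.append("duplicate column names")
        for c in self.columns:
            if c.type not in ALLOWED_TYPES:
                found.append(f"{c.name}: unknown type {c.type!r}")
            if not c.description.strip():
                found.append(f"{c.name}: empty description")
        if self.num_rows < 1:
            found.append("num_rows must be at least 1")
        return found


request = GenerationRequest(
    columns=[
        ColumnSpec("customer_query", "text", "A realistic customer support question"),
        ColumnSpec("sentiment", "text", "One of: positive, negative, neutral",
                   {"options": ["positive", "negative", "neutral"]}),
        ColumnSpec("priority", "integer", "1 (low) to 5 (urgent)",
                   {"min": 1, "max": 5}),
    ],
    num_rows=1000,
)
print(request.problems())
```

Keeping descriptions mandatory in the check reflects the point made on this page that descriptions give the generator the context it needs.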

Key Properties

  • Schema-driven: You control the structure. Every column has a type, description, and optional constraints that guide generation.
  • Realistic distribution: Generated data follows natural patterns and distributions, not random values. Descriptions give the generator context to produce relevant content.
  • Safe by default: Generated data does not contain real PII, credentials, or sensitive information.
