Create Synthetic Data
Synthetic data generation lets you create realistic, structured datasets without using real-world data. This feature helps you:
- Prototype AI Applications — Build and test applications with representative data before collecting real data
- Augment Training Sets — Expand limited datasets with diverse synthetic examples to improve model performance
- Test Edge Cases — Generate rare scenarios that might be difficult to find in real-world data
- Ensure Privacy Compliance — Avoid data privacy concerns by using synthetic alternatives to sensitive information
- Balance Datasets — Create balanced class distributions for more effective model training
1. Open the Tool
Navigate to the Dataset section in the sidebar.
Click Add Dataset → Create Synthetic Data.
This opens the interface where you define the structure and patterns of the synthetic dataset you want to generate.
2. Set Dataset Details
Start by providing basic metadata:
- Name (required): Give your dataset a clear, descriptive title.
- Description (required): Describe the dataset you are generating, including the purpose of the generation.
- Use Case: Specify how the dataset will be used. For example:
  - “Simulated customer support logs for LLM fine-tuning”
  - “Classification dataset with evenly distributed labels”
- Pattern (optional): Describe the structure or style the dataset should follow. For example:
  - “Follow a Conversational pattern while generating the dataset”
  - “Keep the tone formal for all the data points”
This context helps organize datasets in large projects and enables team collaboration.
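If you draft these details outside the UI before entering them, a small structure that checks the required fields can catch omissions early. The sketch below is illustrative only: `DatasetDetails` and its field names are hypothetical and do not correspond to any API of the tool. Use Case is not marked as required in this step, so the sketch assumes it is optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetDetails:
    """Hypothetical container mirroring the form fields in this step."""
    name: str
    description: str
    use_case: Optional[str] = None  # assumed optional; the step does not say
    pattern: Optional[str] = None

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are empty or blank."""
        required = {"name": self.name, "description": self.description}
        return [field for field, value in required.items() if not value.strip()]


details = DatasetDetails(
    name="support-logs-v1",
    description="Simulated customer support logs for LLM fine-tuning",
    pattern="Keep the tone formal for all the data points",
)
print(details.missing_required())  # empty list: both required fields are set
```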
3. Define the Schema
Click Add Column to define the structure of each row.
For every column:
- Name: Name of the column (e.g., message, label, timestamp, transcript)
- Type: Choose from: text, float, integer, boolean, array, json, datetime
- Properties:
  - Add constraints (like min/max, string patterns, etc.) to ensure realistic value ranges.
  - When choosing the Value property, you can specify the categorical labels yourself, or choose dynamic and let the generator decide the labels.
  - You can create more properties based on your use case by specifying the name and description of the property.
This step is where you define how your data behaves—whether it mimics user queries, numerical values, or system logs.
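When a schema has many columns, it can be convenient to sanity-check a draft before typing it into the form. The following is a minimal sketch, not part of the tool: the dictionary layout (`name`, `type`, `properties`) and the `check_columns` helper are assumptions made for illustration. Only the list of type names comes from this step.

```python
# Type names as listed in this step.
ALLOWED_TYPES = {"text", "float", "integer", "boolean", "array", "json", "datetime"}


def check_columns(columns):
    """Return a list of problems found in a draft list of column definitions."""
    problems = []
    for col in columns:
        name = col.get("name") or "<unnamed>"
        if col.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {col.get('type')!r}")
        props = col.get("properties", {})
        # A min above the max can never be satisfied by any generated value.
        if "min" in props and "max" in props and props["min"] > props["max"]:
            problems.append(f"{name}: min is greater than max")
    return problems


draft = [
    {"name": "message", "type": "text"},
    {"name": "label", "type": "string"},  # not one of the listed types
    {"name": "score", "type": "float", "properties": {"min": 1.0, "max": 0.0}},
]
for problem in check_columns(draft):
    print(problem)
```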
3.1 Example Schema Definition
Let’s illustrate with an example. Suppose you’re creating a dataset for product reviews. You might define the following columns:
- Column 1:
  - Name: review_text
  - Type: text
  - Properties: None specific, as the content is freeform.
- Column 2:
  - Name: rating
  - Type: integer
  - Properties:
    - min: 1 (ensures ratings are at least 1 star)
    - max: 5 (ensures ratings are at most 5 stars)
- Column 3:
  - Name: sentiment
  - Type: text
  - Properties:
    - Value: positive, negative, neutral (specifies allowed categorical values)
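The product-review schema above can also be written down as a plain data structure, which makes it easy to check individual rows against the constraints. This is a hedged sketch rather than the tool's own format: the key names (`type`, `min`, `max`, `values`) and the `row_errors` helper are hypothetical, and the two sample rows are hand-written for illustration.

```python
# Hypothetical, hand-written representation of the example schema.
REVIEW_SCHEMA = {
    "review_text": {"type": "text"},
    "rating": {"type": "integer", "min": 1, "max": 5},
    "sentiment": {"type": "text", "values": ["positive", "negative", "neutral"]},
}


def row_errors(row, schema=REVIEW_SCHEMA):
    """Return a list of human-readable problems for one row."""
    errors = []
    for column, spec in schema.items():
        if column not in row:
            errors.append(f"{column}: missing")
            continue
        value = row[column]
        if spec["type"] == "text" and not isinstance(value, str):
            errors.append(f"{column}: expected text")
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if spec["type"] == "integer" and (
            isinstance(value, bool) or not isinstance(value, int)
        ):
            errors.append(f"{column}: expected integer")
            continue
        if "min" in spec and value < spec["min"]:
            errors.append(f"{column}: below min {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{column}: above max {spec['max']}")
        if "values" in spec and value not in spec["values"]:
            errors.append(f"{column}: {value!r} is not an allowed value")
    return errors


good = {"review_text": "Arrived on time and works well.", "rating": 5, "sentiment": "positive"}
bad = {"review_text": "It was fine.", "rating": 7, "sentiment": "mixed"}
print(row_errors(good))
print(row_errors(bad))
```

The good row produces no errors, while the bad row is flagged for a rating above 5 and a sentiment outside the allowed values.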
4. Set Row Count
Specify how many rows you want the dataset to contain.
The generator will create this many entries based on your schema.
Click Next.
5. Define Column Descriptions
Describe each column you defined in the schema.
These descriptions give the generator the information it needs about each column to produce a rich dataset that matches your intent.
6. Generate the Dataset
- Click Next to preview the schema and example values.
- Review and make adjustments if needed.
- Click Create to generate the full dataset.
Once complete, the dataset is saved and ready for exploration or use in downstream tasks.
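If one goal is a balanced dataset, a quick check of the label distribution is a reasonable first downstream step. The sketch below assumes the rows have already been loaded into Python as a list of dictionaries; how to export them is not covered on this page. The `label_shares` helper is hypothetical, and the rows are hand-written placeholders, not real generator output.

```python
from collections import Counter

# Placeholder rows written by hand for illustration only.
rows = [
    {"rating": 5, "sentiment": "positive"},
    {"rating": 1, "sentiment": "negative"},
    {"rating": 3, "sentiment": "neutral"},
    {"rating": 4, "sentiment": "positive"},
]


def label_shares(rows, column):
    """Return the fraction of rows holding each distinct value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


shares = label_shares(rows, "sentiment")
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

A strongly skewed result suggests revisiting the Value property or the column descriptions before regenerating.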
What’s Next?
Once your synthetic dataset is created, you can:
- Explore the Data: Click on the dataset name to view the generated rows and columns.
- Use in Experiments: Integrate your dataset into Experimentation Workflows.
- Add Annotations: Enhance the dataset with Annotations.