Understanding Datasets
A dataset in Future AGI is a structured collection of data that serves as the foundation for executing LLM prompts, conducting experiments, and optimizing AI-generated responses.
It organizes data in rows and columns: each row represents one instance, and each column defines an attribute of that instance. Datasets provide the context, inputs, and evaluation references needed for prompt execution and iterative improvement.
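As a rough illustration of the row-and-column layout, the sketch below models a dataset in plain Python as a list of dictionaries. It does not use the Future AGI SDK. The column names (`question`, `context`, `expected_answer`) and the prompt template are hypothetical, chosen only to show an input, its context, and an evaluation reference side by side.

```python
# Hypothetical example: plain Python, not Future AGI SDK code.
# Each dict is one row (an instance); each key is a column (an attribute).
rows = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe.",
        "expected_answer": "Paris",
    },
    {
        "question": "How many days are in a leap year?",
        "context": "A leap year adds one day to February.",
        "expected_answer": "366",
    },
]

# The column set is the union of keys across rows, kept in first-seen order.
columns = list(dict.fromkeys(key for row in rows for key in row))

# A prompt template can draw its inputs from the columns of a single row.
# Columns the template does not mention (here, expected_answer) are ignored.
template = "Context: {context}\nQuestion: {question}"
prompts = [template.format(**row) for row in rows]
```

The `expected_answer` column is left out of the prompt on purpose: in this sketch it plays the role of the evaluation reference rather than an input.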
Core Components of a Dataset
- Dataset Name: A user-defined label to distinguish different datasets.
- Column Order & Configuration: Maintains the structure of dataset columns, data types, and processing configurations.
- Organization & Permissions: Defines access control, ensuring datasets are linked to specific teams or projects.
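One way to picture how these three components fit together is the minimal data model below. This is an assumption-laden sketch, not the platform's actual schema: the class names `ColumnConfig` and `DatasetSpec`, the field names, and the data-type strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnConfig:
    """Hypothetical per-column configuration."""
    name: str
    data_type: str  # illustrative type label, e.g. "text" or "integer"


@dataclass
class DatasetSpec:
    """Hypothetical container for the three core components."""
    name: str                      # Dataset Name
    organization: str              # Organization & Permissions (owner only)
    columns: List[ColumnConfig] = field(default_factory=list)  # Column Order & Configuration

    def column_order(self) -> List[str]:
        # The list position of each ColumnConfig defines the column order.
        return [column.name for column in self.columns]


spec = DatasetSpec(
    name="support-tickets-v1",
    organization="acme-support-team",
    columns=[
        ColumnConfig("ticket_text", "text"),
        ColumnConfig("priority", "integer"),
    ],
)
```

Calling `spec.column_order()` returns the column names in their configured order. A real access-control model would carry more than a single owner string; it is reduced to one field here to keep the sketch short.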
Dataset Lifecycle
The dataset system supports the full lifecycle of data management, from creation through enrichment to ongoing maintenance, so the same dataset can be reused across different AI workflows.
1. Creation
Datasets can be created through multiple methods:
- Manual Creation: Users can create datasets by defining structure and adding data manually. Learn more →
- Automated Generation: The system can generate synthetic datasets for controlled testing. Learn more →
- Importing from External Sources: Future AGI supports imports from CSV, Excel, JSON, JSONL, and Hugging Face datasets. Learn more →
- Derived from Experiments: Users can convert experiment results into datasets, allowing further analysis and refinements. Learn more →
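To make the import path more concrete, the sketch below shows how CSV and JSONL content both reduce to the same row-of-columns shape. It uses only the Python standard library and is not the Future AGI import API; the helper names `rows_from_csv` and `rows_from_jsonl` and the sample data are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List


def rows_from_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text; the header line supplies the column names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_jsonl(text: str) -> List[dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample data expressing the same single row in both formats.
csv_text = "question,expected_answer\nWhat is 2 + 2?,4\n"
jsonl_text = '{"question": "What is 2 + 2?", "expected_answer": "4"}\n'

csv_rows = rows_from_csv(csv_text)
jsonl_rows = rows_from_jsonl(jsonl_text)
```

Note that CSV values always arrive as strings, while JSON and JSONL can carry numbers, booleans, and nested objects. Column data types may therefore need to be set explicitly after a CSV import.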
2. Enrichment
Datasets can be enriched with additional metadata and evaluations, including:
- Annotations: Users can manually label a dataset using their own rules and label sets. Future AGI also provides auto-annotation, which learns from human-in-the-loop labels and helps annotate the remaining data points. Learn more →
- Evaluations: Users can run Future AGI Evaluations on a dataset, for example to identify and filter out noisy rows.
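Conceptually, enrichment adds new columns to existing rows, and those columns can then drive filtering. The sketch below illustrates that idea in plain Python. The scorer `toy_eval`, the `eval_score` column, and the 0.5 threshold are hypothetical stand-ins and do not represent Future AGI Evaluations.

```python
def toy_eval(row: dict) -> float:
    """Stand-in scorer: 1.0 if the response is non-empty, else 0.0."""
    return 1.0 if row["response"].strip() else 0.0


rows = [
    {"prompt": "Summarize the ticket.", "response": "Customer cannot log in."},
    {"prompt": "Summarize the ticket.", "response": "   "},  # whitespace only
]

# Enrichment: write the score into a new column; the original rows are not mutated.
scored = [{**row, "eval_score": toy_eval(row)} for row in rows]

# Filtering: keep only rows at or above an arbitrary threshold.
clean = [row for row in scored if row["eval_score"] >= 0.5]
```

Keeping the score as a column, rather than dropping rows immediately, leaves the filtering decision reversible: the threshold can be changed later without re-running the evaluation.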
3. Maintenance
Datasets are dynamic and evolve over time. The system enables:
- Schema Updates: Columns and metadata can be modified without disrupting existing data.
- Archival & Cleanup: Old datasets can be archived, merged, or deleted, keeping workflows optimized.
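The sketch below shows, in generic terms, what a non-disruptive schema update can look like: adding a column with a default value and renaming a column while every existing value is preserved. It is plain Python for illustration only; the helpers `add_column` and `rename_column` are hypothetical and say nothing about how the platform implements schema changes.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_column(rows: List[Row], name: str, default: Any = None) -> List[Row]:
    """Return new rows with `name` added; values already present are kept."""
    return [{**row, name: row.get(name, default)} for row in rows]


def rename_column(rows: List[Row], old: str, new: str) -> List[Row]:
    """Return new rows with column `old` renamed to `new`, keeping key order."""
    return [
        {(new if key == old else key): value for key, value in row.items()}
        for row in rows
    ]


original = [{"question": "What is 2 + 2?", "answer": "4"}]

# Add a "source" column, then rename "answer" to "expected_answer".
updated = rename_column(
    add_column(original, "source", default="manual"),
    "answer",
    "expected_answer",
)
```

Both helpers return new lists instead of editing rows in place, so `original` stays available if the change needs to be rolled back.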