Concept
A dataset in Future AGI is a structured collection of data that serves as the foundation for executing LLM prompts, conducting experiments, and optimizing AI-generated responses. It organizes data in rows and columns, where each row represents an instance and each column defines an attribute of that instance.
Datasets provide the context, inputs, and evaluation references needed for prompt execution and iterative improvement.
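As a rough illustration of the row-and-column model, the sketch below represents a dataset as plain Python dictionaries and fills a prompt template from each row. The column names, the template, and the `render_prompt` helper are assumptions made for this example; they are not part of the Future AGI SDK.

```python
# Illustrative only: a plain-Python stand-in for a dataset, not the Future AGI SDK.
# Each column is an attribute; each row is one instance.
columns = ["question", "context", "expected_answer"]

rows = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe.",
        "expected_answer": "Paris",
    },
    {
        "question": "Who wrote Hamlet?",
        "context": "Hamlet is a tragedy written around 1600.",
        "expected_answer": "William Shakespeare",
    },
]


def render_prompt(template: str, row: dict) -> str:
    """Fill a prompt template with the attribute values of one row."""
    # str.format ignores row keys the template does not mention.
    return template.format(**row)


template = "Context: {context}\nQuestion: {question}"
prompts = [render_prompt(template, row) for row in rows]
```

Here `question` and `context` supply the inputs for prompt execution, while `expected_answer` is held back as the evaluation reference.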
Core Components of a Dataset
- Dataset Source Type: Specifies how the dataset was created, whether it was uploaded, imported, or generated.
- Dataset Name: A user-defined label to distinguish different datasets.
- Column Order & Configuration: Maintains the structure of dataset columns, including metadata, data types, and processing configurations.
- Organization & Permissions: Defines access control, ensuring datasets are linked to specific teams or projects.
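The four components above could be modeled along these lines. This is a minimal sketch under assumed names: `DatasetRecord`, `ColumnConfig`, `SourceType`, and every field are hypothetical and do not reflect the actual Future AGI schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceType(Enum):
    """How the dataset was created (hypothetical values)."""
    UPLOADED = "uploaded"
    IMPORTED = "imported"
    GENERATED = "generated"


@dataclass
class ColumnConfig:
    """Per-column settings: data type plus free-form metadata."""
    name: str
    data_type: str = "text"
    metadata: dict = field(default_factory=dict)


@dataclass
class DatasetRecord:
    """Descriptive record for one dataset; the row data lives elsewhere."""
    name: str                              # user-defined label
    source_type: SourceType                # dataset source type
    column_order: list[str]                # display and processing order
    column_config: dict[str, ColumnConfig]
    organization_id: str                   # owning organization
    allowed_teams: set[str] = field(default_factory=set)

    def can_access(self, team: str) -> bool:
        """Return True if the given team may use this dataset."""
        return team in self.allowed_teams


qa_dataset = DatasetRecord(
    name="support-qa-v1",
    source_type=SourceType.IMPORTED,
    column_order=["question", "expected_answer"],
    column_config={
        "question": ColumnConfig("question"),
        "expected_answer": ColumnConfig("expected_answer"),
    },
    organization_id="org-1",
    allowed_teams={"research"},
)
```

Keeping `column_order` separate from `column_config` means columns can be reordered without touching their settings.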
Dataset Lifecycle
The dataset system supports the full data management lifecycle in four stages: creation, enrichment, analysis, and maintenance.
1. Creation
Datasets can be created through multiple methods:
- Manual Creation: Users can create datasets by defining structure and adding data manually.
- Automated Generation: The system can generate synthetic datasets for controlled testing.
- Importing from External Sources: Future AGI supports imports from CSV, Excel, JSON, JSONL, and Hugging Face datasets.
- Derived from Experiments: Users can convert experiment results into datasets, allowing further analysis and refinements.
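To make the import path concrete, the sketch below parses CSV and JSONL text into the same list-of-rows shape using only the Python standard library. It shows what an importer has to do conceptually; the helper names are assumptions, and the platform's own import mechanism may work differently.

```python
import csv
import io
import json


def rows_from_csv(text: str) -> list[dict]:
    """Parse CSV text; the header line supplies the column names."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_jsonl(text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


csv_text = "question,expected_answer\nWhat is 2 + 2?,4\n"
jsonl_text = '{"question": "What is 2 + 2?", "expected_answer": "4"}\n'

csv_rows = rows_from_csv(csv_text)
jsonl_rows = rows_from_jsonl(jsonl_text)
```

One difference worth noting: CSV values always arrive as strings, whereas JSONL preserves numbers, booleans, and nested objects, so column data types may need to be set explicitly after a CSV import.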
2. Enrichment
Datasets can be enriched with additional metadata and evaluations, including:
- Annotations & Feedback: Users can manually add comments, corrections, or human review insights.
- Custom Evaluation Metrics: Datasets can store evaluation results, tracking accuracy, fluency, token usage, and performance.
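Enrichment can be thought of as appending new columns to existing rows. The sketch below adds evaluation results and a human annotation to each row; the exact-match metric, the whitespace-based token count, and all function and column names are simplifying assumptions chosen for illustration.

```python
def exact_match_eval(row: dict) -> dict:
    """Toy metric: score 1.0 when the output equals the expected answer."""
    output = row["model_output"].strip()
    return {
        "accuracy": 1.0 if output == row["expected_answer"].strip() else 0.0,
        # Whitespace word count: a crude stand-in for real tokenizer counts.
        "token_usage": len(output.split()),
    }


def enrich(rows, evaluate, annotations=None):
    """Return new rows with evaluation columns and an annotation column."""
    annotations = annotations or {}
    enriched = []
    for index, row in enumerate(rows):
        new_row = {**row, **evaluate(row)}  # the original row is not mutated
        new_row["annotation"] = annotations.get(index, "")
        enriched.append(new_row)
    return enriched


rows = [
    {"expected_answer": "Paris", "model_output": "Paris"},
    {"expected_answer": "4", "model_output": "The answer is 5"},
]
enriched = enrich(rows, exact_match_eval, annotations={1: "Arithmetic error"})
```

Returning new rows instead of modifying the originals keeps the source data intact, so the same dataset can be evaluated again with a different metric.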
3. Analysis & Performance Tracking
Future AGI provides built-in analysis tools to measure dataset impact on AI models:
- Quality Assessment: Users can evaluate data integrity and consistency.
- Performance Tracking: Monitors model response times, accuracy trends, and token usage for cost efficiency.
- Trend Analysis: Compares multiple datasets to detect shifts in model performance over time.
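The kind of aggregation behind performance tracking and trend analysis might look like the following. This is a sketch only: the column names (`accuracy`, `latency_ms`, `token_usage`) and the sample values are invented for the example and say nothing about how the built-in tools compute their metrics.

```python
from statistics import mean


def summarize(rows):
    """Aggregate per-row evaluation columns into dataset-level metrics."""
    return {
        "mean_accuracy": mean(r["accuracy"] for r in rows),
        "mean_latency_ms": mean(r["latency_ms"] for r in rows),
        "total_tokens": sum(r["token_usage"] for r in rows),
    }


def compare(baseline, candidate):
    """Per-metric difference: candidate minus baseline."""
    return {key: candidate[key] - baseline[key] for key in baseline}


# Two hypothetical runs over the same two prompts.
run_a = [
    {"accuracy": 1.0, "latency_ms": 400.0, "token_usage": 120},
    {"accuracy": 0.0, "latency_ms": 600.0, "token_usage": 180},
]
run_b = [
    {"accuracy": 1.0, "latency_ms": 300.0, "token_usage": 100},
    {"accuracy": 1.0, "latency_ms": 500.0, "token_usage": 140},
]

summary_a = summarize(run_a)
summary_b = summarize(run_b)
delta = compare(summary_a, summary_b)
```

A positive accuracy delta together with negative latency and token deltas would indicate that the second run is both better and cheaper.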
4. Maintenance
Datasets are dynamic and evolve over time. The system enables:
- Schema Updates: Columns and metadata can be modified without disrupting existing data.
- Archival & Cleanup: Old datasets can be archived, merged, or deleted, keeping workflows optimized.
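A schema update that leaves existing data undisturbed can be sketched as follows, again with hypothetical helper names on a plain-Python dataset. Adding a column fills a default value for existing rows, and renaming a column preserves every stored value.

```python
def add_column(columns, rows, name, default=None):
    """Add a column; existing values are kept and the new cell gets a default."""
    if name in columns:
        raise ValueError(f"column {name!r} already exists")
    new_columns = columns + [name]
    new_rows = [{**row, name: default} for row in rows]
    return new_columns, new_rows


def rename_column(columns, rows, old, new):
    """Rename a column while preserving its position and values."""
    if old not in columns:
        raise KeyError(old)
    if new in columns:
        raise ValueError(f"column {new!r} already exists")
    new_columns = [new if c == old else c for c in columns]
    new_rows = [
        {(new if key == old else key): value for key, value in row.items()}
        for row in rows
    ]
    return new_columns, new_rows


columns = ["question", "answer"]
rows = [{"question": "What is 2 + 2?", "answer": "4"}]

columns_v2, rows_v2 = add_column(columns, rows, "reviewed", default=False)
columns_v3, rows_v3 = rename_column(columns_v2, rows_v2, "answer", "expected_answer")
```

Both helpers return new structures, so the earlier schema version stays available if a change needs to be rolled back.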