Import Datasets from Hugging Face
Import any public Hugging Face dataset into FutureAGI with a single SDK call, run evaluations on it, and download scored results.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | futureagi, ai-evaluation |
By the end of this guide you will have imported a public Hugging Face dataset into FutureAGI, explored it in the dashboard, run a batch evaluation across every row, and downloaded the scored results.
- FutureAGI account → app.futureagi.com
- API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
- Python 3.9+
Install
pip install futureagi ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
What is Hugging Face dataset import?
FutureAGI can pull rows directly from any public Hugging Face dataset without manual download or CSV conversion. You specify the dataset name, an optional subset and split, and the number of rows you want. The SDK handles the rest.
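A row-limited import corresponds to Hugging Face's split-slicing convention, where the number of rows is encoded in the split string (for example, `train[:50]`). A minimal sketch of that convention; the `split_spec` helper below is hypothetical and is not part of the FutureAGI SDK or the `datasets` library:

```python
from typing import Optional


def split_spec(split: str, num_rows: Optional[int] = None) -> str:
    """Return a Hugging Face split string, optionally limited to the first N rows."""
    if num_rows is None:
        return split  # whole split, e.g. "train"
    return f"{split}[:{num_rows}]"  # first N rows, e.g. "train[:50]"


print(split_spec("train", 50))  # -> train[:50]
print(split_spec("train"))      # -> train
```

In the SDK flow you never build this string yourself; `HuggingfaceDatasetConfig` takes `split` and `num_rows` as separate fields and handles the selection for you.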
Tutorial
Import a Hugging Face dataset
Use HuggingfaceDatasetConfig to specify which dataset, subset, split, and how many rows to pull. Pass it as the source argument to dataset.create().
This example imports 50 rows from the SmolLM-Corpus cosmopedia-v2 subset — a collection of synthetic textbook-style content with prompts, generated text, audience labels, and format tags.
import os
from fi.datasets import Dataset, DatasetConfig, HuggingfaceDatasetConfig
from fi.utils.types import ModelTypes
hf_config = HuggingfaceDatasetConfig(
name="HuggingFaceTB/smollm-corpus",
subset="cosmopedia-v2",
split="train",
num_rows=50,
)
dataset = Dataset(
dataset_config=DatasetConfig(
name="smollm-cosmopedia-import",
model_type=ModelTypes.GENERATIVE_LLM,
),
fi_api_key=os.environ["FI_API_KEY"],
fi_secret_key=os.environ["FI_SECRET_KEY"],
)
dataset = dataset.create(source=hf_config)
print(f"Dataset created: {dataset.dataset_config.name}")
print(f"Dataset ID: {dataset.dataset_config.id}")Expected output:
Dataset created: smollm-cosmopedia-import
Dataset ID: a1b2c3d4-...

Tip
HuggingfaceDatasetConfig accepts four parameters: name (required; the Hugging Face dataset path), subset (defaults to "default"), split (defaults to "train"), and num_rows (optional; omit to import the entire split).
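For example, to import an entire split, omit `num_rows`. A sketch based on the defaults described above; note that full splits of large corpora can contain millions of rows:

```python
from fi.datasets import HuggingfaceDatasetConfig

# No num_rows: the whole "train" split is imported.
# split is omitted here and falls back to its documented default ("train").
hf_config_full = HuggingfaceDatasetConfig(
    name="HuggingFaceTB/smollm-corpus",
    subset="cosmopedia-v2",
)
```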
View the imported dataset in the dashboard
Navigate to Dataset in the left sidebar. Your new dataset appears in the list. Click it to browse the imported rows and columns.
The cosmopedia-v2 subset includes columns like prompt, text, audience, format, and token_length — ready for evaluation.
Run an evaluation on the imported data
The prompt column contains the generation instruction and text contains the generated output — a natural fit for a completeness evaluation that checks whether the output fully addresses the input.
dataset = dataset.add_evaluation(
name="completeness-check",
eval_template="completeness",
required_keys_to_column_names={
"input": "prompt",
"output": "text",
},
model="turing_small",
run=True,
reason_column=True,
)
print("Evaluation 'completeness-check' started")Expected output:
Evaluation 'completeness-check' startedNote
Column names depend on the Hugging Face dataset schema. Open the dataset in the dashboard to confirm exact column names before mapping.
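Since a typo in the mapping only surfaces once the evaluation runs, it can help to sanity-check it first. A minimal, hypothetical check (not an SDK feature) against the column names you confirmed in the dashboard:

```python
def validate_mapping(mapping: dict, dataset_columns: list) -> list:
    """Return mapped column names that do not exist in the dataset."""
    return [col for col in mapping.values() if col not in dataset_columns]


# Columns observed in the dashboard for cosmopedia-v2
columns = ["prompt", "text", "audience", "format", "token_length"]
mapping = {"input": "prompt", "output": "text"}

missing = validate_mapping(mapping, columns)
assert not missing, f"Unknown columns in mapping: {missing}"
```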
Download scored results
Pull the evaluated dataset back as a CSV or a pandas DataFrame.
As CSV:
dataset.download(file_path="smollm_scored.csv")
print("Downloaded scored results to smollm_scored.csv")As pandas DataFrame:
df = dataset.download(load_to_pandas=True)
print("Columns:", list(df.columns))
print(df.head())

Expected output:
Columns: ['prompt', 'text', 'token_length', 'audience', 'format', 'seed_data', 'completeness-check', 'completeness-check_reason']
prompt text ...
0 Write a children's story about... ... ...

Clean up
dataset.delete()
print("Dataset deleted") What you built
You can now import any public Hugging Face dataset into FutureAGI, run evaluations on it, and download the scored results.
- Imported 50 rows from the SmolLM-Corpus Hugging Face dataset with a single SDK call
- Browsed the imported data in the FutureAGI dashboard
- Ran a completeness evaluation across every row
- Downloaded scored results as CSV and pandas DataFrame
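As a next step, the scored DataFrame can be sliced by evaluation score, for example to surface rows that need review. A sketch using a toy frame that mirrors the columns above; the 0-1 score scale and the 0.5 threshold are assumptions, so check the actual value range in your downloaded results:

```python
import pandas as pd

# Toy stand-in for the downloaded results; in practice these values
# come from dataset.download(load_to_pandas=True).
df = pd.DataFrame({
    "prompt": ["Write a children's story about...", "Explain photosynthesis..."],
    "text": ["Once upon a time...", "Plants convert sunlight..."],
    "completeness-check": [0.92, 0.41],
    "completeness-check_reason": ["Covers the full prompt.", "Stops mid-explanation."],
})

# Flag rows whose completeness score falls below the chosen threshold.
low = df[df["completeness-check"] < 0.5]
print(f"{len(low)} of {len(df)} rows scored below 0.5")
print(low[["prompt", "completeness-check_reason"]])
```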