Datasets

A dataset is a collection of data used to train, validate, and/or test a model. In AI GO!, datasets are used as a source of samples for AI model evaluation.

Datasets are either uploaded as data files or synthetically generated based on the configuration values used with the dataset generator.

Dataset Structure

A dataset is typically organized into rows and columns, like a table. For example, a sample from the decoding trust (adult) dataset looks as follows:

inputlabeloptions
description: A person in 1996 has the following attributes: age 19, workclass Private, final weight 192773, education HS-grad, highest education level 9, marital status Never-married, occupation Other-service, relationship Own-child, race White, sex Female, capital gain 0, capital loss 0, hours per week 24, native country United-States. predict whether the annual income of the person is greater than $50k:no["yes", "no"]

  • Rows (Samples): Each row represents one example or observation.
  • Columns (Features): Each column represents a measurable property of the data, ground-truth labels (if available) or sample metadata.

Quick Links

To get started with tasks in AI GO!, explore the task examples in our AI GO! integrations registry.

Usage

Commands

Define the dataset in a YAML file dataset.yaml.

display_name: "<display_name>"
key: "<key>"
# Optional: description: "<description>"
file_path: <file_path>
# OR
# generator_specification: <generator_specification>

Create the dataset as defined in dataset.yaml. This will persist the dataset in AI GO!.

  • If there is no dataset with the key provided in the YAML file, a new dataset is created.
  • If a dataset with the key provided in the YAML file already exists, the dataset is updated if the data has changed.
lf dataset add -f 'dataset.yaml'     # Create/update a single dataset from a YAML file.
lf dataset add -f 'datasets/*.yaml'  # Create/update multiple datasets in a directory with YAML files.

List all datasets.

lf datasets

Delete the dataset by key.

lf dataset delete 'my-dataset'

Export the dataset by key.

lf dataset export 'my-dataset'                                     # Export to STDOUT
lf dataset export 'my-dataset' -o 'dataset.yaml' -do 'data.csv'    # Export to YAML and CSV
lf dataset export 'my-dataset' -o 'dataset.yaml' -do 'data.jsonl'  # Export to YAML and JSONL

Generate a preview of the dataset instead of generating and persisting the full dataset. This will only generate a small number of samples and print them to STDOUT.

lf dataset generation-preview 'dataset.yaml'

Dataset Generation

👍

Dataset Structure

The dataset upload allows users the full flexibility to supply any set of columns relevant for the task. When linking datasets to built-in evaluator or data generators, please consult their specification to check if specific columns should exist.

For example, a RAG dataset generator expects a LangChain compatible dataset with page_content and metadata columns.

In addition to providing the full dataset manually, it is often convenient to generate a suitable dataset automatically. As an example, a RAG dataset generator can take the raw documents (or chunks) from the RAG dataset and generate the corresponding questions and their expected answers, saving hours of manual work.

For more information about dataset generation, please consult the dataset generation guide.