Dataset Generators
Dataset Generators provide a modular, declarative way to create synthetic datasets in AI GO!.
To interact with dataset generators using the CLI, use the lf dataset-generator command.
Dataset Generator Overview
Properties
key string required
Unique identifier assigned to the entity in AI GO!.
Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250
display_name string required
The dataset generator's name displayed to the user.
description string required
Short description of the dataset generator.
long_description string
Long description of the dataset generator. Supports Markdown formatting.
Default: None
config_spec array[FloatParameterSpec, IntParameterSpec, BooleanParameterSpec, StringParameterSpec, ModelParameterSpec, DatasetParameterSpec, DatasetColumnParameterSpec, ListParameterSpec, CategoricalParameterSpec]
Configuration specification for the dataset generator.
Default: []
definition SDKDeclarativeDatasetGeneratorDefinitionTemplate required
Declarative dataset generator definition. It must be of type SDKDeclarativeDatasetGeneratorDefinitionTemplate.
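The constraints on key listed above can be checked locally before a generator is submitted. The following is a minimal sketch, not part of the AI GO! SDK: the helper name is_valid_key is hypothetical, and only the pattern and the maximum length are taken from the property list.

```python
import re

# Pattern and length limit copied from the `key` property above.
KEY_PATTERN = re.compile(r"^[a-zA-Z0-9_\-]+$")
MAX_KEY_LENGTH = 250


def is_valid_key(key: str) -> bool:
    """Hypothetical helper: True if `key` satisfies the documented pattern and max length."""
    # fullmatch is used so that `$` cannot match just before a trailing newline.
    return len(key) <= MAX_KEY_LENGTH and KEY_PATTERN.fullmatch(key) is not None


print(is_valid_key("question-generator"))  # True
print(is_valid_key("question generator"))  # False (contains a space)
```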
display_name: "My Question Generator"
key: "question-generator"
description: "Generates questions about arbitrary topics."
config_spec:
  - key: "topic"
    type: "string"
    display_name: "Topic"
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt_template: "You are a helpful dataset generator."
      user_prompt_template: >
        Produce 5 deep knowledge questions about << config.topic >>, including the
        correct answer.
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"
Definitions
SDKDeclarativeDatasetGeneratorDefinitionTemplate
Properties
type Literal "declarative_dataset_generator"
Refers to user-defined dataset generators.
Default: declarative_dataset_generator
data_source EmptyDataSourceTemplate, DatasetSampleDataSourceTemplate, InlineSamplesDataSourceTemplate, DatasetSampleCombinationsDataSourceTemplate required
Data source used by the dataset generator.
synthesizer EmptySynthesizerTemplate, LLMSynthesizerTemplate, QuestionAnsweringSynthesizerTemplate, TemplateSynthesizerTemplate, PythonSynthesizerTemplate, DropColumnsSynthesizerTemplate
Data synthesizer configuration used by the dataset generator. This field is deprecated and will be removed in future versions. Please use synthesizers instead.
Default: None
synthesizers array[EmptySynthesizerTemplate, LLMSynthesizerTemplate, QuestionAnsweringSynthesizerTemplate, TemplateSynthesizerTemplate, PythonSynthesizerTemplate, DropColumnsSynthesizerTemplate]
Data synthesizer configurations used by the dataset generator. Must contain at least one synthesizer if provided.
Default: None
...
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt_template: "You are a helpful dataset generator."
      user_prompt_template: >
        Produce 5 deep knowledge questions about << config.topic >>, including the
        correct answer.
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"
EmptyDataSourceTemplate
Properties
type Literal "empty" required
The type of the data source.
num_samples integer, TemplateValue
The number of (empty) source samples returned.
Default: 1
...
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
DatasetSampleDataSourceTemplate
Properties
type Literal "dataset_samples" required
The type of the data source.
dataset_key Key, TemplateValue required
The ID of the dataset whose samples should be used.
random_seed integer, TemplateValue
The random seed to use for dataset generation. If not provided, the processing order will be sequential and deterministic.
Default: None
...
definition:
  ...
  data_source:
    type: "dataset_samples"
    dataset_key: "fairllm_sensitive_attributes-country"
Use the CLI command lf datasets to list all available datasets.
InlineSamplesDataSourceTemplate
Properties
type Literal "inline_samples" required
The type of the data source.
samples array[object] required
The samples to use as the data source.
...
definition:
  ...
  data_source:
    type: "inline_samples"
    samples:
      - topic: "Biology"
        style: "Multiple choice"
      - topic: "Chemistry"
        style: "Open-ended"
      - topic: "Physics"
        style: "Proofs"
DatasetSampleCombinationsDataSourceTemplate
Properties
type Literal "dataset_sample_combinations" required
The type of the data source.
dataset_keys array[Key, TemplateValue] required
The datasets whose samples should be used to generate combinations.
random_seed integer, TemplateValue
The random seed to use for dataset generation. If not provided, the processing order will be sequential and deterministic.
Default: None
...
definition:
  ...
  data_source:
    type: "dataset_sample_combinations"
    dataset_keys:
      - "cities_dataset"
      - "languages_dataset"
Use the CLI command lf datasets to list all available datasets.
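To make the idea of sample combinations concrete, the sketch below pairs every sample of one dataset with every sample of another and merges each pair into a single source sample. This is an illustration under stated assumptions, not the platform's implementation: the page does not say how combinations are built, so the Cartesian product, the merge strategy, the seeded shuffle, and the helper name combine_samples are all assumptions.

```python
import itertools
import random
from typing import Any, Optional


def combine_samples(
    datasets: list[list[dict[str, Any]]],
    random_seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: merge one sample from each dataset per combination."""
    combinations = []
    for picks in itertools.product(*datasets):
        merged: dict[str, Any] = {}
        for sample in picks:
            merged.update(sample)  # assumes column names do not clash across datasets
        combinations.append(merged)
    if random_seed is not None:
        # Assumption: a seed gives a shuffled but reproducible order;
        # without one, the order stays sequential and deterministic.
        random.Random(random_seed).shuffle(combinations)
    return combinations


# Stand-ins for the samples of "cities_dataset" and "languages_dataset".
cities = [{"city": "Paris"}, {"city": "Rome"}]
languages = [{"language": "French"}, {"language": "Italian"}]
for source in combine_samples([cities, languages]):
    print(source)
```

Under these assumptions, two datasets with two samples each yield four source samples, and a synthesizer could then read fields such as source.city and source.language from each of them.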
EmptySynthesizerTemplate
Properties
type Literal "empty" required
The type of the synthesizer.
LLMSynthesizerTemplate
Properties
type Literal "llm" required
The type of the synthesizer.
model_key Key, TemplateValue required
The key of the chat completion model to be used as a synthesizer.
system_prompt string, TemplateValue
The Jinja template used to create the system prompt. The source sample is available in the Jinja context as source. If not provided, no system message is sent and format instructions are appended to the user prompt. The system prompt must be provided if file_ids_field is used, as it is the only text channel available to instruct the model.
Default: None
user_prompt_template string, TemplateValue
The Jinja template used to create the user prompt. The source sample is available in the Jinja context as source. Not used when file_ids_field is provided.
Default: None
file_ids_field string, TemplateValue
The name of the field in the source sample that contains file IDs. If provided, the user message will use file content type instead of text. The field value should be a list of strings. The model must be chat-completion-compatible to support file content messages.
Default: None
format_instructions string, TemplateValue
Instructions that control the expected output format. When use_structured_outputs is False (default), they are appended to the system prompt if one is provided, or to the user prompt otherwise. When use_structured_outputs is True, they are placed in response_format.json_schema.description instead. If not provided, they will be derived from the sample_properties. When parsing non-structured output, the model output is expected to be valid JSON contained in
Default: None
sample_properties object required
The 'properties' field of the JSON schema for the output samples.
use_structured_outputs boolean
Whether to use structured outputs. Defaults to False. It is recommended to enable structured outputs if the model supports it.
Default: False
system_prompt_template string, TemplateValue
Deprecated. Use system_prompt instead.
Default: None
system_prompt_format_instructions string, TemplateValue
Deprecated. Use format_instructions instead.
Default: None
...
definition:
  ...
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt_template: "You are a helpful dataset generator."
      user_prompt_template: >
        Generate a question-answer pair about {{ source.city }},
        the capital of {{ source.country }}, written in
        {{ source.language }}.
      sample_properties:
        question:
          type: string
        answer:
          type: string
QuestionAnsweringSynthesizerTemplate
Properties
type Literal "question_answering" required
The type of the synthesizer.
model_key Key, TemplateValue required
The key of the chat completion model to be used as a synthesizer.
qa_type string required
The type of question answering synthesizer. Supports 'multiple_choice' and 'open_ended'.
content_column string, TemplateValue
The name of the text content column in the dataset. This column provides the source text when generating a QA dataset. Defaults to 'text'.
Default: text
title_column string, TemplateValue
The name of the title column in the dataset. This is supplementary information and is not required for generating a QA dataset. Defaults to 'document_title'.
Default: document_title
summary_column string, TemplateValue
The name of the summary column in the dataset. This is supplementary information and is not required for generating a QA dataset. Defaults to 'document_summary'.
Default: document_summary
system_prompt string, TemplateValue
The system prompt to use. If not provided, a default prompt will be used.
Default: None
user_prompt string, TemplateValue
The user prompt to use. If not provided, a default prompt will be used.
Default: None
additional_instructions string, TemplateValue
Additional instructions to provide to the generator model.
Default: None
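The required fields and defaults listed above can be summarized in a small sketch that checks a question_answering synthesizer entry and fills in the documented defaults. The helper resolve_qa_synthesizer and the column name "body" are hypothetical and only illustrate how the properties relate; the model key is reused from the examples elsewhere on this page.

```python
from typing import Any

# Defaults and allowed values copied from the property list above.
QA_DEFAULTS = {
    "content_column": "text",
    "title_column": "document_title",
    "summary_column": "document_summary",
    "system_prompt": None,
    "user_prompt": None,
    "additional_instructions": None,
}
QA_TYPES = {"multiple_choice", "open_ended"}


def resolve_qa_synthesizer(config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: check required fields and fill in documented defaults."""
    for required in ("type", "model_key", "qa_type"):
        if required not in config:
            raise ValueError(f"missing required field: {required}")
    if config["type"] != "question_answering":
        raise ValueError("type must be 'question_answering'")
    if config["qa_type"] not in QA_TYPES:
        raise ValueError(f"unsupported qa_type: {config['qa_type']}")
    # Explicit values in `config` override the defaults.
    return {**QA_DEFAULTS, **config}


resolved = resolve_qa_synthesizer(
    {
        "type": "question_answering",
        "model_key": "openai$gpt-4-1-nano",
        "qa_type": "open_ended",
        "content_column": "body",  # hypothetical column name
    }
)
print(resolved["content_column"], resolved["title_column"])  # body document_title
```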
TemplateSynthesizerTemplate
Properties
type Literal "template" required
The type of the synthesizer.
template string, TemplateValue, array[string, TemplateValue] required
The template string used for synthesis.
The following Jinja context is available:
- source: The sample for which the template is being rendered. Individual fields can be accessed using the Jinja {{ source.field_name }} syntax.
- individual fields: The individual fields of the source sample. These can be accessed using the Python string formatting {field_name} syntax.
fields object required
A dictionary where keys are placeholder names in the template string, and values are lists of possible values for those placeholders.
...
definition:
  ...
  synthesizers:
    - type: "template"
      template: |
        I'm in {{ source.city }}, the capital of {{ source.country }}.
        I want to travel {{ source.distance }} in the {direction} direction.
        What is the name of the city I will arrive at?
      fields:
        direction:
          - "north"
          - "south"
          - "east"
          - "west"
PythonSynthesizerTemplate
Properties
type Literal "python" required
The type of the synthesizer.
synthesize_snippet string, TemplateValue required
The Python snippet defining how output samples are generated from a single source sample. It must define a synthesize function with the following API:
def synthesize(source: dict[str, Any]) -> list[dict[str, Any]]:
where:
- source is a dictionary representing a single source sample.
- The return value is a list of dictionaries, each representing an output sample.
Both def and async def are supported.
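As an illustration of the API above, here is a minimal sketch of a synthesize function that turns one source sample into several output samples. The source column name text and the output fields sentence and num_words are assumptions made for this example; whether the snippet may contain its own imports is not stated on this page.

```python
from typing import Any


def synthesize(source: dict[str, Any]) -> list[dict[str, Any]]:
    """Emit one output sample per non-empty sentence in the source text."""
    # `text` is a hypothetical column name; use whatever your data source provides.
    text = source.get("text", "")
    sentences = [part.strip() for part in text.split(".") if part.strip()]
    return [
        {"sentence": sentence, "num_words": len(sentence.split())}
        for sentence in sentences
    ]


# Local check only; this call would not be part of synthesize_snippet.
print(synthesize({"text": "Cells divide. DNA stores information."}))
```

Returning an empty list for a source sample without usable text is one way to drop that sample from the output, assuming the platform accepts empty results.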
DropColumnsSynthesizerTemplate
Properties
type Literal "drop_columns" required
The type of the synthesizer.
columns array[string] required
The list of column names to remove from each output sample. All other columns are kept. Specified columns that are not present in the sample are silently ignored.
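The behavior described for columns can be sketched in plain Python as follows. This only mirrors the documented semantics (listed columns are removed, all others are kept, unknown names are ignored); the function name drop_columns and the sample fields are hypothetical.

```python
from typing import Any


def drop_columns(sample: dict[str, Any], columns: list[str]) -> dict[str, Any]:
    """Return a copy of `sample` without the listed columns; missing names are ignored."""
    to_drop = set(columns)
    return {name: value for name, value in sample.items() if name not in to_drop}


sample = {"question": "What is DNA?", "answer": "A nucleic acid.", "scratchpad": "..."}
print(drop_columns(sample, ["scratchpad", "not_a_column"]))
```

A step like this is typically useful at the end of a synthesizers list, to remove intermediate fields that earlier synthesizers needed but the final dataset should not contain.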
