Dataset Generators

Dataset Generators provide a modular, declarative way to create synthetic datasets in AI GO!.

To interact with dataset generators using the CLI, use the lf dataset-generator command.

Dataset Generator Overview

Properties


key string required

Unique identifier assigned to the entity in AI GO!.

Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250

display_name string required

The dataset generator's name displayed to the user.


description string required

Short description of the dataset generator.


long_description string

Long description of the dataset generator. Supports Markdown formatting.

Default: None

config_spec array[FloatParameterSpec, IntParameterSpec, BooleanParameterSpec, StringParameterSpec, ModelParameterSpec, DatasetParameterSpec, DatasetColumnParameterSpec, ListParameterSpec, CategoricalParameterSpec]

Configuration specification for the dataset generator.

Default: []

definition SDKDeclarativeDatasetGeneratorDefinitionTemplate required

Declarative dataset generator definition. It must be of type SDKDeclarativeDatasetGeneratorDefinitionTemplate.

display_name: "My Question Generator"
key: "question-generator"
description: "Generates questions about arbitrary topics."
config_spec:
  - key: "topic"
    type: "string"
    display_name: "Topic"
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: "You are a helpful dataset generator."
      user_prompt_template: >
        Produce 5 deep knowledge questions about << config.topic >>, including the
        correct answer.
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"

Definitions

SDKDeclarativeDatasetGeneratorDefinitionTemplate

Properties


type Literal "declarative_dataset_generator"

Refers to user-defined dataset generators.

Default: declarative_dataset_generator

data_source EmptyDataSourceTemplate, DatasetSampleDataSourceTemplate, InlineSamplesDataSourceTemplate, DatasetSampleCombinationsDataSourceTemplate required

Data source used by the dataset generator.


synthesizer EmptySynthesizerTemplate, LLMSynthesizerTemplate, QuestionAnsweringSynthesizerTemplate, TemplateSynthesizerTemplate, PythonSynthesizerTemplate, DropColumnsSynthesizerTemplate

Data synthesizer configuration used by the dataset generator. This field is deprecated and will be removed in future versions. Please use synthesizers instead.

Default: None

synthesizers array[EmptySynthesizerTemplate, LLMSynthesizerTemplate, QuestionAnsweringSynthesizerTemplate, TemplateSynthesizerTemplate, PythonSynthesizerTemplate, DropColumnsSynthesizerTemplate]

Data synthesizer configurations used by the dataset generator. Must contain at least one synthesizer if provided.

Default: None
...
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: "You are a helpful dataset generator."
      user_prompt_template: >
        Produce 5 deep knowledge questions about << config.topic >>, including the
        correct answer.
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"

EmptyDataSourceTemplate

Properties


type Literal "empty" required

The type of the data source.


num_samples integer, TemplateValue

The number of (empty) source samples returned.

Default: 1
...
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
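
The example above relies on the default of one empty source sample. As a minimal sketch, num_samples can be set explicitly to start from several empty source samples. The value 10 is an arbitrary illustration, and the synthesizers are omitted.

```yaml
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "empty"
    # Illustrative value: return 10 empty source samples instead of the default 1.
    num_samples: 10
  # synthesizers omitted for brevity
```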

DatasetSampleDataSourceTemplate

Properties


type Literal "dataset_samples" required

The type of the data source.


dataset_key Key, TemplateValue required

The ID of the dataset whose samples should be used.


random_seed integer, TemplateValue

The random seed to use for dataset generation. If not provided, the processing order will be sequential and deterministic.

Default: None
...
definition:
  ...
  data_source:
    type: "dataset_samples"
    dataset_key: "fairllm_sensitive_attributes-country"

💡 Use the CLI command lf datasets to list all available datasets.
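
As a sketch of the optional random_seed property, the same data source can be given a fixed seed. The seed value 42 is arbitrary, and the exact effect on sample ordering is an assumption inferred from the property description (without a seed, processing is sequential and deterministic).

```yaml
definition:
  # other definition fields omitted
  data_source:
    type: "dataset_samples"
    dataset_key: "fairllm_sensitive_attributes-country"
    # Arbitrary illustrative seed; omit it for sequential, deterministic order.
    random_seed: 42
```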

InlineSamplesDataSourceTemplate

Properties


type Literal "inline_samples" required

The type of the data source.


samples array[object] required

The samples to use as the data source.

...
definition:
  ...
  data_source:
    type: "inline_samples"
    samples:
      - topic: "Biology"
        style: "Multiple choice"
      - topic: "Chemistry"
        style: "Open-ended"
      - topic: "Physics"
        style: "Proofs"
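
The fields of each inline sample can be referenced by a synthesizer through the source variable described under LLMSynthesizerTemplate. The following is a minimal sketch that pairs one inline sample with an LLM synthesizer. The prompt wording and the output properties are illustrative assumptions, not a prescribed setup.

```yaml
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "inline_samples"
    samples:
      - topic: "Biology"
        style: "Multiple choice"
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      # Each inline sample is available in the Jinja context as `source`.
      user_prompt_template: >
        Write one {{ source.style }} question about {{ source.topic }},
        including the correct answer.
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"
```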

DatasetSampleCombinationsDataSourceTemplate

Properties


type Literal "dataset_sample_combinations" required

The type of the data source.


dataset_keys array[Key, TemplateValue] required

The datasets whose samples should be used to generate combinations.


random_seed integer, TemplateValue

The random seed to use for dataset generation. If not provided, the processing order will be sequential and deterministic.

Default: None
...
definition:
  ...
  data_source:
    type: "dataset_sample_combinations"
    dataset_keys:
      - "cities_dataset"
      - "languages_dataset"

💡 Use the CLI command lf datasets to list all available datasets.

EmptySynthesizerTemplate

Properties


type Literal "empty" required

The type of the synthesizer.

LLMSynthesizerTemplate

Properties


type Literal "llm" required

The type of the synthesizer.


model_key Key, TemplateValue required

The key of the chat completion model to be used as a synthesizer.


system_prompt string, TemplateValue

The Jinja template used to create the system prompt. The source sample is available in the Jinja context as source. If not provided, no system message is sent and format instructions are appended to the user prompt. The system prompt must be provided if file_ids_field is used, as it is the only text channel available to instruct the model.

Default: None

user_prompt_template string, TemplateValue

The Jinja template used to create the user prompt. The source sample is available in the Jinja context as source. Not used when file_ids_field is provided.

Default: None

file_ids_field string, TemplateValue

The name of the field in the source sample that contains file IDs. If provided, the user message will use file content type instead of text. The field value should be a list of strings. The model must be chat-completion-compatible to support file content messages.

Default: None

format_instructions string, TemplateValue

Instructions that control the expected output format. When use_structured_outputs is False (default), they are appended to the system prompt if one is provided, or to the user prompt otherwise. When use_structured_outputs is True, they are placed in response_format.json_schema.description instead. If not provided, they will be derived from the sample_properties. When parsing non-structured output, the model output is expected to be valid JSON contained in ... tags, meaning that the format instructions should instruct the model to follow this format.

Default: None

sample_properties object required

The 'properties' field of the JSON schema for the output samples.


use_structured_outputs boolean

Whether to use structured outputs. Defaults to False. Enabling structured outputs is recommended if the model supports them.

Default: False
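
A minimal sketch of enabling structured outputs, assuming the chosen model supports them. The prompt text and the format_instructions wording are illustrative. In this mode the format instructions are placed in response_format.json_schema.description rather than in a prompt, as described under format_instructions.

```yaml
definition:
  # other definition fields omitted
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: "You are a helpful dataset generator."
      user_prompt_template: "Write a question about {{ source.topic }}."
      # Assumes the model supports structured outputs.
      use_structured_outputs: true
      format_instructions: "Return one question and its correct answer."
      sample_properties:
        question:
          type: "string"
        answer:
          type: "string"
```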

system_prompt_template string, TemplateValue

Deprecated. Use system_prompt instead.

Default: None

system_prompt_format_instructions string, TemplateValue

Deprecated. Use format_instructions instead.

Default: None
...
definition:
  ...
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: "You are a helpful dataset generator."
      user_prompt_template: >
        Generate a question-answer pair about {{ source.city }},
        the capital of {{ source.country }}, written in
        {{ source.language }}.
      sample_properties:
        question:
          type: string
        answer:
          type: string
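
When source samples reference files, file_ids_field replaces the text user prompt. The sketch below assumes a hypothetical source field named file_ids that holds a list of file ID strings, and a model that accepts file content messages (whether a given model does is not stated here). The system_prompt carries all instructions because user_prompt_template is not used in this mode.

```yaml
definition:
  # other definition fields omitted
  synthesizers:
    - type: "llm"
      model_key: "openai$gpt-4-1-nano"
      # Required here: the only text channel when file_ids_field is set.
      system_prompt: "Summarize the attached files in two sentences."
      # Hypothetical field name; its value should be a list of strings.
      file_ids_field: "file_ids"
      sample_properties:
        summary:
          type: "string"
```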

QuestionAnsweringSynthesizerTemplate

Properties


type Literal "question_answering" required

The type of the synthesizer.


model_key Key, TemplateValue required

The key of the chat completion model to be used as a synthesizer.


qa_type string required

The type of question answering synthesizer. Supports 'multiple_choice' and 'open_ended'.


content_column string, TemplateValue

The name of the text content column in the dataset. This defines the source text when generating a QA dataset. Defaults to 'text'.

Default: text

title_column string, TemplateValue

The name of the title column in the dataset. This is supplementary information and is not required for generating a QA dataset. Defaults to 'document_title'.

Default: document_title

summary_column string, TemplateValue

The name of the summary column in the dataset. This is supplementary information and is not required for generating a QA dataset. Defaults to 'document_summary'.

Default: document_summary

system_prompt string, TemplateValue

The system prompt to use. If not provided, a default prompt will be used.

Default: None

user_prompt string, TemplateValue

The user prompt to use. If not provided, a default prompt will be used.

Default: None

additional_instructions string, TemplateValue

Additional instructions to provide to the generator model.

Default: None
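
A minimal sketch of a question-answering generator built from the properties above. The dataset key my_documents is hypothetical, the dataset is assumed to contain a text column, and the instruction wording is illustrative.

```yaml
definition:
  type: "declarative_dataset_generator"
  data_source:
    type: "dataset_samples"
    # Hypothetical dataset key; the dataset is assumed to have a "text" column.
    dataset_key: "my_documents"
  synthesizers:
    - type: "question_answering"
      model_key: "openai$gpt-4-1-nano"
      qa_type: "multiple_choice"
      # Matches the default; shown only to make the column mapping explicit.
      content_column: "text"
      additional_instructions: "Keep each question answerable from the text alone."
```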

TemplateSynthesizerTemplate

Properties


type Literal "template" required

The type of the synthesizer.


template string, TemplateValue, array[string, TemplateValue] required

The template string used for synthesis.

The following Jinja context is available:

  • source: The sample for which the template is being rendered. Individual fields can be accessed using the Jinja {{ source.field_name }} syntax.
  • individual fields: The individual fields of the source sample. These can be accessed using the Python string formatting {field_name} syntax.

fields object required

A dictionary where keys are placeholder names in the template string, and values are lists of possible values for those placeholders.

...
definition:
  ...
  synthesizers:
    - type: "template"
      template: |
        I'm in {{ source.city }}, the capital of {{ source.country }}.

        I want to travel {{ source.distance }} in the {direction} direction.

        What is the name of the city I will arrive at?
      fields:
        direction:
          - "north"
          - "south"
          - "east"
          - "west"

PythonSynthesizerTemplate

Properties


type Literal "python" required

The type of the synthesizer.


synthesize_snippet string, TemplateValue required

The Python snippet defining how output samples are generated from a single source sample. It must define a synthesize function, with the following API:

def synthesize(source: dict[str, Any]) -> list[dict[str, Any]]:

where:

  • source is a dictionary representing a single source sample.
  • The return value is a list of dictionaries, each representing an output sample.

Both def and async def are supported.
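
A minimal sketch of a Python synthesizer configuration that follows the signature above. The field names question and question_length are hypothetical, and the snippet assumes each source sample may carry a question field. It returns a single output sample per source sample.

```yaml
definition:
  # other definition fields omitted
  synthesizers:
    - type: "python"
      synthesize_snippet: |
        def synthesize(source: dict) -> list[dict]:
            # Copy the source sample and add one derived field.
            question = source.get("question", "")
            return [{**source, "question_length": len(question)}]
```

Returning a list lets one source sample yield zero, one, or several output samples, which is why the example wraps its single result in a list.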

DropColumnsSynthesizerTemplate

Properties


type Literal "drop_columns" required

The type of the synthesizer.


columns array[string] required

The list of column names to remove from each output sample. All other columns are kept. Specified columns that are not present in the sample are silently ignored.
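
A minimal sketch of the configuration, using hypothetical column names. Placing this synthesizer last in the synthesizers list to remove intermediate columns is a plausible use, though the order in which synthesizers are applied is an assumption not stated here.

```yaml
definition:
  # other definition fields omitted
  synthesizers:
    - type: "drop_columns"
      # Hypothetical column names; names absent from a sample are ignored.
      columns:
        - "country"
        - "language"
```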