Tasks

A Task defines the execution flow of an evaluation of a model or a dataset.

To interact with a task using the CLI, use the lf task command.

Task Overview

Properties


key string required

Unique identifier assigned to the entity in AI GO!.

Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250
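
The pattern and length limit above can be checked locally before a task is submitted. This is a minimal sketch using Python's re module; the helper name is_valid_key is illustrative and is not part of the AI GO! SDK.

```python
import re

# Pattern and maximum length as documented for the `key` property.
KEY_PATTERN = re.compile(r"^[a-zA-Z0-9_\-]+$")
KEY_MAX_LENGTH = 250


def is_valid_key(key: str) -> bool:
    """Return True if `key` satisfies the documented pattern and length limit."""
    # fullmatch requires the whole string to match, so a trailing newline fails.
    return len(key) <= KEY_MAX_LENGTH and KEY_PATTERN.fullmatch(key) is not None


print(is_valid_key("hp-trivia"))  # True
print(is_valid_key("hp trivia"))  # False: spaces are not allowed
print(is_valid_key("a" * 251))    # False: longer than 250 characters
```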

display_name string required

The task's name displayed to the user.


description string required

Short description of the task.


long_description string

Long description of the task. Supports Markdown formatting.

Default: None

tasks array[enum MLTask]

ML tasks supported by the task.

Default: []

Possible MLTask values

The type of machine learning task to be performed.

Allowed Values:

  • chat_completion
  • embeddings
  • custom

evaluated_entity_type enum EvaluatedEntityType

Deprecated. Use definition.evaluated_entity_type instead.

Default: None

Possible EvaluatedEntityType values

Allowed Values:

  • dataset
  • model

config_spec array[FloatParameterSpec, IntParameterSpec, BooleanParameterSpec, StringParameterSpec, ModelParameterSpec, DatasetParameterSpec, DatasetColumnParameterSpec, ListParameterSpec, CategoricalParameterSpec]

Configuration specification of the task.

Default: []

definition SDKBenchmarkTaskDefinitionTemplate, SDKSystemTaskDefinitionTemplate required

Definition of the task.


tags array[string]

Tags associated with the task.

Default: []
display_name: "Single-turn Solver Generic Input"
key: "singleturn-generic-input"
description: "Example task that uses a single-turn solver with generic input builder."
config_spec: []
definition:
  dataset:
    key: "qa-single-answers"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: generic
      template: >
        {
          "messages": [
            {
              "role": "system",
              "content": "You are a helpful assistant. Answer with no punctuation."
            },
            {
              "role": "user",
              "content": "{{ sample.question }}"
            }
          ]
        }
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.target }}"
display_name: "Uniqueness Task"
key: "uniqueness-task"
description: >
  Evaluates the uniqueness rate of values in a field across samples in a dataset.
tags: ["Data Quality"]
config_spec:
  - type: "string"
    key: "field"
    display_name: "Field"
definition:
  type: "benchmark_task"
  evaluated_entity_type: "dataset"
  scorers:
    - type: "python_all_samples"
      compute_scores_snippet: !include "uniqueness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_unique"
          name: "Uniqueness Rate"
from __future__ import annotations

from collections import Counter
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Placeholder for the "field" parameter declared in config_spec.
    field_name = "<< config.field >>"
    # Samples that lack the field are recorded as None.
    values = [sample.get(field_name) for sample in samples]
    counter = Counter(values)

    # A value is unique if it occurs exactly once; missing values count as unique.
    return [
        {"is_unique": value is None or counter[value] == 1}
        for value in values
    ]
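
To see what the scorer above yields, the sketch below reruns the same logic on a tiny in-memory dataset. It assumes the `<< config.field >>` placeholder has already been replaced by the configured field name (here "email", a made-up value), so the field name is passed as an explicit argument instead. The mean of is_unique is computed by hand only to mirror the "mean" metric; on the platform both steps happen automatically.

```python
from collections import Counter
from typing import Any


def compute_scores(samples: list[dict[str, Any]], field_name: str) -> list[dict[str, Any]]:
    # Same logic as the scorer above, with the field name passed explicitly
    # instead of being injected through the placeholder.
    values = [sample.get(field_name) for sample in samples]
    counter = Counter(values)
    return [
        {"is_unique": value is None or counter[value] == 1}
        for value in values
    ]


samples = [
    {"email": "a@example.com"},
    {"email": "b@example.com"},
    {"email": "a@example.com"},  # duplicate of the first sample
    {"name": "no email field"},  # a missing field counts as unique
]

scores = compute_scores(samples, "email")
uniqueness_rate = sum(s["is_unique"] for s in scores) / len(scores)

print([s["is_unique"] for s in scores])  # [False, True, False, True]
print(uniqueness_rate)                   # 0.5
```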

Definitions

SDKBenchmarkTaskDefinitionTemplate

Properties


type Literal "benchmark_task"

Type of the task definition, set to benchmark_task (previously declarative_task).

Default: benchmark_task

evaluated_entity_type enum EvaluatedEntityType

Type of entity being evaluated: model or dataset.

Default: model

Possible EvaluatedEntityType values

Allowed Values:

  • dataset
  • model

dataset SDKTaskDatasetTemplate

The (benchmark) dataset used by the task. Required for tasks evaluating models. Should not be provided for tasks evaluating datasets, since the dataset is supplied when the task is used in an evaluation.

Default: None

solver SingleTurnSolverTemplate, MultiTurnSolverTemplate, PassThroughSolverTemplate, PythonSolverTemplate

Solver used by the task. Required for tasks evaluating models. Should not be provided for tasks evaluating datasets.

Default: None

scorers array[BLEUScorerTemplate, StringEqualsScorerTemplate, StringEqualsMCQAScorerTemplate, ModelAsAJudgeClassifierScorerTemplate, LabelerViaModelScorerTemplate, ModelAsAJudgeScorerTemplate, PythonScorerTemplate, AllSamplesPythonScorerTemplate, RAGCheckerScorerTemplate, FunctionCallCoverageScorerTemplate]

List of scorers used by the task.

Default: []

actions array[ActionRule]

Action rules used by this task.

Default: None
display_name: "Harry Potter Trivia Task"
key: "hp-trivia"
description: "Assesses the model's knowledge of Harry Potter trivia."
config_spec: []
definition:
  evaluated_entity_type: "model"
  dataset:
    key: "hp-trivia-dataset"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "system"
          content: "You are a helpful assistant."
        - role: "user"
          content: "Respond to this question concisely. {{ sample.question }}"
  scorers:
    - type: "model_as_a_judge_classifier"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: |
        You are a helpful assistant and rate whether the ground truth answer and the
        candidate answer are semantically the same. Semantically the same means that
        you do not care about small spelling mistakes, capitalization or whether
        additional information is given. This is all still considered correct.

        Respond only with 'correct' or 'incorrect' and nothing else.
      user_prompt: |
        Ground Truth Answer: {{ sample.gt_answer }}
        Candidate Answer: {{ solver_output.output }}
      correct_labels:
        - "correct"
      incorrect_labels:
        - "incorrect"
      use_structured_outputs: true
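
The property descriptions above imply a consistency rule between evaluated_entity_type, dataset and solver. Below is a rough sketch of that rule as a local check on a parsed definition. The function check_definition is hypothetical, and the platform's own validation may be stricter or behave differently.

```python
from typing import Any


def check_definition(definition: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a benchmark task definition."""
    problems: list[str] = []
    # "model" is the documented default for evaluated_entity_type.
    entity_type = definition.get("evaluated_entity_type", "model")
    has_dataset = definition.get("dataset") is not None
    has_solver = definition.get("solver") is not None

    if entity_type == "model":
        if not has_dataset:
            problems.append("model tasks require a dataset")
        if not has_solver:
            problems.append("model tasks require a solver")
    elif entity_type == "dataset":
        if has_dataset:
            problems.append("dataset tasks should not define a dataset")
        if has_solver:
            problems.append("dataset tasks should not define a solver")
    else:
        problems.append(f"unknown evaluated_entity_type: {entity_type!r}")
    return problems


model_task = {
    "evaluated_entity_type": "model",
    "dataset": {"key": "hp-trivia-dataset"},
    "solver": {"type": "single_turn_solver"},
}
dataset_task = {"evaluated_entity_type": "dataset", "solver": {"type": "python"}}

print(check_definition(model_task))    # []
print(check_definition(dataset_task))  # ['dataset tasks should not define a solver']
```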

SDKSystemTaskDefinitionTemplate

Properties


type Literal "system_task"

Type of the task definition, set to system_task.

Default: system_task

compute_evidence_snippet string required

Python code snippet that computes evidence for the task.

SDKTaskDatasetTemplate

Properties


key string required

Key of the dataset to be used for the task.


fast_subset_size integer, string

Size of the fast subset. This field is deprecated and will be removed in future versions.

Default: None
...
definition:
  ...
  dataset:
    key: "hp-trivia-dataset"
Tip: Use the CLI command lf datasets to list all available datasets.

PythonSolverTemplate

Properties


type Literal "python" required

The type of the solver.


run_solver_snippet string required

The Python code snippet defining how to run the solver. It must define a run_solver function with the following API:

async def run_solver(sample, model, trace) -> SolverTrace

where:

  • sample is a dictionary representing the current sample.
  • model is the model object used for inference (with a predict method).
  • trace is a SolverTrace object that can be used to record the conversation and should be returned by the function.
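
A minimal sketch of what a run_solver function might look like. Only the function signature and the existence of model.predict come from the description above. The stub classes, the argument passed to predict, the assumption that predict is awaitable, and the way messages are recorded on the trace are all placeholders so that the example runs on its own; the real SolverTrace and model interfaces should be taken from the SDK.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubTrace:
    # Hypothetical stand-in for SolverTrace: it only collects messages.
    messages: list[dict[str, str]] = field(default_factory=list)


class StubModel:
    # Hypothetical stand-in for the model object; the real predict()
    # signature is not documented here and may differ.
    async def predict(self, messages: list[dict[str, str]]) -> str:
        return "Hedwig"


async def run_solver(sample: dict[str, Any], model: Any, trace: Any) -> Any:
    # Build the input from the sample, query the model once, record both
    # sides of the exchange on the trace, and return the trace.
    trace.messages.append({"role": "user", "content": sample["question"]})
    answer = await model.predict(trace.messages)
    trace.messages.append({"role": "assistant", "content": answer})
    return trace


trace = asyncio.run(
    run_solver({"question": "What is the name of Harry's owl?"}, StubModel(), StubTrace())
)
print(trace.messages[-1]["content"])  # Hedwig
```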

PythonScorerTemplate

Properties


key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose enum ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score

Possible ScorerPurpose values

Allowed Values:

  • score
  • qa

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None

display_name string

The display name of the scorer.

Default: None

type Literal "python" required

The type of the scorer.


compute_scores_snippet string, TemplateValue required

The Python code snippet defining how to compute the scores. It must define a compute_scores function with one of the following APIs:

For model tasks:

def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]:

For dataset tasks:

def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:

Both def and async def are supported.

where:

  • sample is a dictionary representing the current sample

  • solver_output is an object with 2 attributes:

    1. output: the last model output
    2. messages: a list of input/output messages representing the interaction history

The function can return scores in one of two formats:

Scores only - return a flat dict of score key/value pairs:

return {"accuracy": 0.95, "is_correct": True}

Scores with metadata - return a structured dict with "scores" and "metadata" keys. Metadata is stored alongside the scores but is not aggregated into metrics:

return {
    "scores": {"accuracy": 0.95, "is_correct": True},
    "metadata": {"model": "gpt-4", "tokens": 123},
}
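
As an illustration of the model-task signature and the "scores with metadata" return format, here is a small sketch that can be run locally. The column name target is an assumption about the dataset, and SimpleNamespace is used only as a stand-in for SolverOutput with its two documented attributes.

```python
from types import SimpleNamespace
from typing import Any


def compute_scores(sample: dict[str, Any], solver_output: Any) -> dict[str, Any]:
    # Compare the last model output with the sample's reference answer,
    # ignoring case and surrounding whitespace.
    predicted = solver_output.output.strip().lower()
    expected = str(sample["target"]).strip().lower()
    return {
        "scores": {"is_correct": predicted == expected},
        "metadata": {"num_messages": len(solver_output.messages)},
    }


# SimpleNamespace mimics the `output` and `messages` attributes of SolverOutput.
fake_output = SimpleNamespace(
    output=" Paris ",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": " Paris "},
    ],
)
result = compute_scores({"target": "paris"}, fake_output)
print(result)
# {'scores': {'is_correct': True}, 'metadata': {'num_messages': 2}}
```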

AllSamplesPythonScorerTemplate

Properties


key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose enum ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score

Possible ScorerPurpose values

Allowed Values:

  • score
  • qa

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None

display_name string

The display name of the scorer.

Default: None

type Literal "python_all_samples" required

The type of the scorer.


compute_scores_snippet string, TemplateValue required

The Python code snippet defining how to compute the scores. It must define a compute_scores function with the following API:

def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:

Both def and async def are supported.

where:

  • samples is the list of samples (in the same order as the dataset)
  • the returned list must contain one entry per sample

Each entry can be in one of two formats:

Scores only - a flat dict of score key/value pairs:

[
    {"accuracy": 0.95, "is_correct": True},
    ...
]

Scores with metadata - a structured dict with "scores" and "metadata" keys. Metadata is stored alongside the scores but is not aggregated into metrics:

[
    {
        "scores": {"accuracy": 0.95, "is_correct": True},
        "metadata": {"model": "gpt-4", "tokens": 123},
    },
    ...
]
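
An all-samples scorer is useful when a score depends on a dataset-level statistic. The sketch below flags samples whose text is longer than the dataset average and returns the "scores with metadata" format. The column name text is a made-up example, not something the platform requires.

```python
from statistics import mean
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # The dataset-level average has to be known before any sample can be
    # scored, which is why this scorer receives all samples at once.
    lengths = [len(str(sample.get("text", ""))) for sample in samples]
    average = mean(lengths) if lengths else 0.0
    # One entry per sample, in the same order as the input.
    return [
        {
            "scores": {"is_longer_than_average": length > average},
            "metadata": {"length": length, "dataset_average": average},
        }
        for length in lengths
    ]


results = compute_scores([{"text": "hi"}, {"text": "hello"}, {"text": "hey there"}])
print([r["scores"]["is_longer_than_average"] for r in results])  # [False, False, True]
```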

ModelAsAJudgeScorerTemplate

Properties


key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose enum ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score

Possible ScorerPurpose values

Allowed Values:

  • score
  • qa

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None

display_name string

The display name of the scorer.

Default: None

type Literal "model_as_a_judge_scorer" required

The type of the scorer.


model_key Key, TemplateValue required

The model to be used as the judge.


system_prompt string, TemplateValue

The system prompt given to the judge model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to judge the output of another model.

user_prompt string, TemplateValue required

The user prompt given to the judge model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
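
To make the substitution concrete, the sketch below renders a user prompt with a deliberately simplified stand-in for the template engine. It is not the platform's renderer: it only understands plain names and dotted paths, and does not support bracket expressions such as messages[0]['content'].

```python
import re
from typing import Any

# Matches {{ name }} and {{ name.attr }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, variables: dict[str, Any]) -> str:
    """Replace placeholders with values looked up in `variables`."""

    def lookup(match: re.Match) -> str:
        value: Any = variables
        # Walk dotted paths such as `sample.gt_answer` through nested dicts.
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


user_prompt = (
    "Ground Truth Answer: {{ sample.gt_answer }}\n"
    "Candidate Answer: {{ model_output }}"
)
rendered = render(
    user_prompt,
    {"sample": {"gt_answer": "Hedwig"}, "model_output": "Her name is Hedwig."},
)
print(rendered)
```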

score_min number, TemplateValue

The minimum score that the judge model can predict.

Default: 0.0

score_max number, TemplateValue

The maximum score that the judge model can predict.

Default: 1.0

use_structured_outputs boolean

Whether to use structured outputs. It is recommended to enable this if the model supports it.

Default: False

LabelerViaModelScorerTemplate

Properties


key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose enum ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score

Possible ScorerPurpose values

Allowed Values:

  • score
  • qa

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None

display_name string

The display name of the scorer.

Default: None

type Literal "labeler_via_model" required

The type of the scorer.


model_key Key, TemplateValue required

The model to be used as the labeler.


system_prompt string, TemplateValue

The system prompt given to the labeler model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to label the output of another model or a dataset sample.

user_prompt string, TemplateValue required

The user prompt given to the labeler model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})

valid_labels array[string, TemplateValue]

The list of valid labels. To allow any label, use an empty list (default).

Default: []
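
The "empty list allows any label" rule can be expressed in a few lines. This is only a sketch of the rule as described; how the platform treats a label outside the list is not specified here, and the helper name is illustrative.

```python
def is_label_accepted(label: str, valid_labels: list[str]) -> bool:
    # An empty list means "no restriction", matching the documented default.
    return not valid_labels or label in valid_labels


print(is_label_accepted("toxic", ["toxic", "safe"]))  # True
print(is_label_accepted("spam", ["toxic", "safe"]))   # False
print(is_label_accepted("spam", []))                  # True
```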

use_structured_outputs boolean

Whether to use structured outputs. It is recommended to enable this if the model supports it.

Default: False

ActionRule

Properties


key string required

Key: 1-250 chars, allowed: a-z A-Z 0-9 _ -

Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250

action enum ActionRuleAction required

The action to be applied to samples that match the filter.


Possible ActionRuleAction values

Allowed Values:

  • exclude_from_metrics

filter FilterComparison, FilterMembership, FilterUnary required

The filter that determines which samples the action applies to.

FilterUnary

Properties


op enum FilterUnaryOp required

Possible FilterUnaryOp values

The unary operator to apply.

Allowed Values:

  • exists
  • not_exists
  • is_true
  • is_false

expression string required

An expression encoding what to apply the unary operator to.

Depending on the context, it can refer to different variables:

  • When filtering a dataset: it can refer to column values by name (ex: {{ category }}).
  • When used within a task action: it can refer to the sample, the solver_output, or the scores (a mapping from each scorer key to that scorer's dict of score values).
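
The four unary operators can be pictured as predicates over the value the expression evaluates to. The semantics below are assumptions made for illustration ("exists" meaning "is not None", and is_true / is_false testing for the exact booleans); the platform may define them differently.

```python
from typing import Any, Callable

# Assumed semantics for each FilterUnaryOp value.
UNARY_OPS: dict[str, Callable[[Any], bool]] = {
    "exists": lambda v: v is not None,
    "not_exists": lambda v: v is None,
    "is_true": lambda v: v is True,
    "is_false": lambda v: v is False,
}


def apply_unary(op: str, value: Any) -> bool:
    """Apply a unary operator name to an already-evaluated expression value."""
    return UNARY_OPS[op](value)


sample = {"category": "math", "is_verified": False}
print(apply_unary("exists", sample.get("category")))       # True
print(apply_unary("not_exists", sample.get("language")))   # True
print(apply_unary("is_false", sample.get("is_verified")))  # True
```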

FilterComparison

Properties


op enum FilterComparisonOp required

Possible FilterComparisonOp values

The comparison operator to apply.

Allowed Values:

  • equals
  • not_equals
  • greater_than
  • less_than
  • greater_or_equal
  • less_or_equal

expression string required

An expression encoding what to compare against the value.

Depending on the context, it can refer to different variables:

  • When filtering a dataset: it can refer to the sample and use dot or bracket notation to access the columns. If filtering a dataset with column names that are illegal under jinja substitution rules (e.g. containing spaces), use bracket notation to access the column.
  • When used within a task action: it can refer to the sample, the solver_output, or the scores (a mapping from each scorer key to that scorer's dict of score values).

value string, number, integer, boolean required

The value against which the expression is compared.
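
Each comparison operator corresponds naturally to a function in Python's operator module. The mapping below is a sketch of that correspondence, applied to an expression value that has already been evaluated; it is not the platform's implementation.

```python
import operator
from typing import Any, Callable

# One standard-library function per FilterComparisonOp value.
COMPARISON_OPS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_or_equal": operator.ge,
    "less_or_equal": operator.le,
}


def compare(op: str, evaluated: Any, value: Any) -> bool:
    """Compare the evaluated expression (left) with the filter's value (right)."""
    return COMPARISON_OPS[op](evaluated, value)


print(compare("greater_or_equal", 0.8, 0.5))  # True
print(compare("equals", "math", "physics"))   # False
```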

FilterMembership

Properties


op enum FilterMembershipOp required

Possible FilterMembershipOp values

The membership operator to apply.

Allowed Values:

  • in
  • not_in

expression string required

An expression encoding the value whose membership in values is checked.

Depending on the context, it can refer to different variables:

  • When filtering a dataset: it can refer to column values by name (ex: {{ category }}).
  • When used within a task action: it can refer to the sample, the solver_output, or the scores (a mapping from each scorer key to that scorer's dict of score values).

values array[string, number, boolean] required

The set of values to test membership against.
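
The sketch below combines a membership filter with the exclude_from_metrics action to show the intended effect on an aggregate. The rule is written as a plain dict of the kind a parsed ActionRule might produce; its key, the expression string, the column names, and the hand-resolved expression are all illustrative assumptions.

```python
from typing import Any


def is_member(op: str, value: Any, values: list[Any]) -> bool:
    if op == "in":
        return value in values
    if op == "not_in":
        return value not in values
    raise ValueError(f"unsupported membership operator: {op}")


# Hypothetical action rule with a FilterMembership filter.
rule = {
    "key": "skip-unverified",
    "action": "exclude_from_metrics",
    "filter": {
        "op": "in",
        "expression": "{{ sample.category }}",
        "values": ["draft", "unverified"],
    },
}

rows = [
    {"category": "verified", "is_correct": True},
    {"category": "draft", "is_correct": False},
    {"category": "verified", "is_correct": False},
]

# The expression is resolved by hand here: sample.category -> row["category"].
flt = rule["filter"]
kept = [row for row in rows if not is_member(flt["op"], row["category"], flt["values"])]
accuracy = sum(row["is_correct"] for row in kept) / len(kept)

print(len(kept), accuracy)  # 2 0.5
```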