Tasks
A Task defines the execution flow of an evaluation of a model or a dataset.
To interact with a task using the CLI, use the lf task command.
Task Overview
Properties
key string required
Unique identifier assigned to the entity in AI GO!.
Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250
display_name string required
The task's name displayed to the user.
description string required
Short description of the task.
long_description string
Long description of the task. Supports Markdown formatting.
Default: None
tasks array[enum MLTask]
ML tasks supported by the task.
Default: []
Possible MLTask values
The type of machine learning task to be performed.
Allowed Values:
chat_completion
embeddings
custom
evaluated_entity_type enum EvaluatedEntityType
Deprecated. Use definition.evaluated_entity_type instead.
Default: None
Possible EvaluatedEntityType values
Allowed Values:
dataset
model
config_spec array[FloatParameterSpec, IntParameterSpec, BooleanParameterSpec, StringParameterSpec, ModelParameterSpec, DatasetParameterSpec, DatasetColumnParameterSpec, ListParameterSpec, CategoricalParameterSpec]
Configuration specification of the task.
Default: []
definition SDKBenchmarkTaskDefinitionTemplate, SDKSystemTaskDefinitionTemplate required
Definition of the task.
tags array[string]
Tags associated with the task.
Default: []

display_name: "Single-turn Solver Generic Input"
key: "singleturn-generic-input"
description: "Example task that uses a single-turn solver with generic input builder."
config_spec: []
definition:
  dataset:
    key: "qa-single-answers"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: generic
      template: >
        {
          "messages": [
            {
              "role": "system",
              "content": "You are a helpful assistant. Answer with no punctuation."
            },
            {
              "role": "user",
              "content": "{{ sample.question }}"
            }
          ]
        }
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.target }}"

display_name: "Uniqueness Task"
key: "uniqueness-task"
description: >
  Evaluates the uniqueness rate of values in a field across samples in a dataset.
tags: ["Data Quality"]
config_spec:
  - type: "string"
    key: "field"
    display_name: "Field"
definition:
  type: "benchmark_task"
  evaluated_entity_type: "dataset"
  scorers:
    - type: "python_all_samples"
      compute_scores_snippet: !include "uniqueness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_unique"
          name: "Uniqueness Rate"

from __future__ import annotations
from collections import Counter
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    field_name = "<< config.field >>"
    values = [
        sample[field_name] if field_name in sample else None for sample in samples
    ]
    counter = Counter(values)
    return [
        {"is_unique": counter[value] == 1 if value is not None else True}
        for value in values
    ]

Definitions
SDKBenchmarkTaskDefinitionTemplate
Properties
type Literal "benchmark_task"
Type of the task definition which is set to benchmark_task (previously declarative_task).
Default: benchmark_task
evaluated_entity_type enum EvaluatedEntityType
Type of entity being evaluated: model or dataset.
Default: model
Possible EvaluatedEntityType values
Allowed Values:
dataset
model
dataset SDKTaskDatasetTemplate
The (benchmark) dataset used by the task. Required for tasks evaluating models. Should not be provided for tasks evaluating datasets (since the dataset is provided when using the task in an evaluation).
Default: None
solver SingleTurnSolverTemplate, MultiTurnSolverTemplate, PassThroughSolverTemplate, PythonSolverTemplate
Solver used by the task. Required for tasks evaluating models. Should not be provided for tasks evaluating datasets.
Default: None
scorers array[BLEUScorerTemplate, StringEqualsScorerTemplate, StringEqualsMCQAScorerTemplate, ModelAsAJudgeClassifierScorerTemplate, LabelerViaModelScorerTemplate, ModelAsAJudgeScorerTemplate, PythonScorerTemplate, AllSamplesPythonScorerTemplate, RAGCheckerScorerTemplate, FunctionCallCoverageScorerTemplate]
List of scorers used by the task.
Default: []
actions array[ActionRule]
Action rules used by this task.
Default: None

display_name: "Harry Potter Trivia Task"
key: "hp-trivia"
description: "Assesses the model knowledge of Harry Potter trivia."
config_spec: []
definition:
  evaluated_entity_type: "model"
  dataset:
    key: "hp-trivia-dataset"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "system"
          content: "You are a helpful assistant."
        - role: "user"
          content: "Respond to this question concisely. {{ sample.question }}"
  scorers:
    - type: "model_as_a_judge_classifier"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: |
        You are a helpful assistant and rate whether the ground truth answer and the
        candidate answer are semantically the same. Semantically the same means that
        you do not care about small spelling mistakes, capitalization or whether
        additional information is given. This is all still considered correct.
        Respond only with 'correct' or 'incorrect' and nothing else.
      user_prompt: |
        Ground Truth Answer: {{ sample.gt_answer }}
        Candidate Answer: {{ solver_output.output }}
      correct_labels:
        - "correct"
      incorrect_labels:
        - "incorrect"
      use_structured_outputs: true

SDKSystemTaskDefinitionTemplate
Properties
type Literal "system_task"
Type of the task definition, set to system_task.
Default: system_task
compute_evidence_snippet string required
Python code snippet that computes evidence for the task.
SDKTaskDatasetTemplate
Properties
key string required
Key of the dataset to be used for the task.
fast_subset_size integer, string
Size of the fast subset. This field is deprecated and will be removed in future versions.
Default: None

...
definition:
  ...
  dataset:
    key: "hp-trivia-dataset"

Use the CLI command lf datasets to list all available datasets.
PythonSolverTemplate
Properties
type Literal "python" required
The type of the solver.
run_solver_snippet string required
The Python code snippet defining how to run the solver. It must define
a run_solver function with the following API:
async def run_solver(sample, model, trace) -> SolverTrace
where:
- sample is a dictionary representing the current sample.
- model is the model object used for inference (with a predict method).
- trace is a SolverTrace object that can be used to record the conversation and should be returned by the function.
PythonScorerTemplate
Properties
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose enum ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
Possible ScorerPurpose values
Allowed Values:
score
qa
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
display_name string
The display name of the scorer.
Default: None
type Literal "python" required
The type of the scorer.
compute_scores_snippet string, TemplateValue required
The Python code snippet defining how to compute the scores. It must define
a compute_scores function with one of the following APIs:
For model tasks:
def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]:
For dataset tasks:
def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
Both def and async def are supported.
where:
- sample is a dictionary representing the current sample
- solver_output is an object with 2 attributes:
  - output: the last model output
  - messages: a list of input/output messages representing the interaction history
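As a minimal sketch of the model-task signature described above, the snippet below marks a sample as correct when the last model output matches the ground truth after light normalization. The `target` column name and the `_normalize` helper are assumptions made for illustration. The small `SolverOutput` dataclass is only a local stand-in so the example runs on its own; in a real task the platform passes its own object with the documented `output` and `messages` attributes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SolverOutput:
    """Local stand-in (assumption) exposing the two documented attributes."""

    output: str
    messages: list[dict[str, Any]] = field(default_factory=list)


def _normalize(text: str) -> str:
    # Hypothetical helper: ignore case and surrounding whitespace.
    return text.strip().lower()


def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]:
    # "target" is an assumed column name holding the ground truth answer.
    expected = _normalize(str(sample["target"]))
    predicted = _normalize(str(solver_output.output))
    # Return a flat dict of score key/value pairs.
    return {"is_correct": predicted == expected}
```

Called with `{"target": "Paris"}` and an output of `" paris "`, this sketch returns `{"is_correct": True}`.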
The function can return scores in one of two formats:
Scores only - return a flat dict of score key/value pairs:
return {"accuracy": 0.95, "is_correct": True}
Scores with metadata - return a structured dict with "scores" and "metadata" keys.
Metadata is stored alongside the scores but is not aggregated into metrics:
return {
    "scores": {"accuracy": 0.95, "is_correct": True},
    "metadata": {"model": "gpt-4", "tokens": 123},
}
AllSamplesPythonScorerTemplate
Properties
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose enum ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
Possible ScorerPurpose values
Allowed Values:
score
qa
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
display_name string
The display name of the scorer.
Default: None
type Literal "python_all_samples" required
The type of the scorer.
compute_scores_snippet string, TemplateValue required
The Python code snippet defining how to compute the scores. It must define
a compute_scores function with the following API:
def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
Both def and async def are supported.
where:
- samples is the list of samples (in the same order as the dataset)
- the returned list must contain one entry per sample
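To illustrate a score that needs the whole dataset at once, the sketch below compares each sample's text length with the dataset average. The `text` column name and the score keys are assumptions chosen for the example; the only parts taken from the reference are the function signature and the rule that the result has one entry per sample, in input order.

```python
from __future__ import annotations

from statistics import mean
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # "text" is an assumed column name; missing or empty values count as length 0.
    lengths = [len(str(sample.get("text") or "")) for sample in samples]
    if not lengths:
        return []
    average = mean(lengths)
    # One entry per sample, in the same order as the input list.
    return [
        {"length": length, "is_longer_than_average": length > average}
        for length in lengths
    ]
```

A mean metric over `is_longer_than_average` would then report the share of samples above the average length.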
Each entry can be in one of two formats:
Scores only - a flat dict of score key/value pairs:
[
    {"accuracy": 0.95, "is_correct": True},
    ...
]
Scores with metadata - a structured dict with "scores" and "metadata" keys.
Metadata is stored alongside the scores but is not aggregated into metrics:
[
    {
        "scores": {"accuracy": 0.95, "is_correct": True},
        "metadata": {"model": "gpt-4", "tokens": 123},
    },
    ...
]
ModelAsAJudgeScorerTemplate
Properties
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose enum ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
Possible ScorerPurpose values
Allowed Values:
score
qa
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
display_name string
The display name of the scorer.
Default: None
type Literal "model_as_a_judge_scorer" required
The type of the scorer.
model_key Key, TemplateValue required
The model to be used as the judge.
system_prompt string, TemplateValue
The system prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
- sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
- model_output: The last model output (ex: {{ model_output }})
- messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
- input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to judge the output of another model.
user_prompt string, TemplateValue required
The user prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
- sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
- model_output: The last model output (ex: {{ model_output }})
- messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
- input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
score_min number, TemplateValue
The minimum score that the judge model can predict.
Default: 0.0
score_max number, TemplateValue
The maximum score that the judge model can predict.
Default: 1.0
use_structured_outputs boolean
Whether to use structured outputs. It is recommended to enable this if the model supports it.
Default: False
LabelerViaModelScorerTemplate
Properties
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose enum ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
Possible ScorerPurpose values
Allowed Values:
score
qa
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
display_name string
The display name of the scorer.
Default: None
type Literal "labeler_via_model" required
The type of the scorer.
model_key Key, TemplateValue required
The model to be used as the labeler.
system_prompt string, TemplateValue
The system prompt given to the labeler model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
- sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
- model_output: The last model output (ex: {{ model_output }})
- messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
- input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to label the output of another model or a dataset sample.
user_prompt string, TemplateValue required
The user prompt given to the labeler model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
- sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
- model_output: The last model output (ex: {{ model_output }})
- messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
- input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
valid_labels array[string, TemplateValue]
The list of valid labels. To allow any label, use an empty list (default).
Default: []
use_structured_outputs boolean
Whether to use structured outputs. It is recommended to enable this if the model supports it.
Default: False
ActionRule
Properties
key string required
Key: 1-250 chars, allowed: a-z A-Z 0-9 _ -
Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250
action enum ActionRuleAction required
The action to be applied to samples that match the filter.
Possible ActionRuleAction values
Allowed Values:
exclude_from_metrics
filter FilterComparison, FilterMembership, FilterUnary required
The filter that determines which samples the action applies to.
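The key constraints listed above (allowed characters and a maximum of 250 characters) can be checked locally before submitting a definition. The following is a minimal sketch; `is_valid_key` is a hypothetical helper name, and the platform's own validation remains authoritative.

```python
import re

# Constraints copied from the key property: allowed characters and max length.
KEY_PATTERN = re.compile(r"^[a-zA-Z0-9_\-]+$")
KEY_MAX_LENGTH = 250


def is_valid_key(key: str) -> bool:
    """Return True if the key satisfies the documented pattern and length."""
    # fullmatch is used so that a trailing newline is rejected as well.
    return len(key) <= KEY_MAX_LENGTH and KEY_PATTERN.fullmatch(key) is not None
```

For example, `is_valid_key("hp-trivia")` is true, while a key containing a space or an empty key is rejected.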
FilterUnary
Properties
op enum FilterUnaryOp required
Possible FilterUnaryOp values
The unary operator to apply.
Allowed Values:
exists
not_exists
is_true
is_false
expression string required
An expression encoding what to apply the unary operator to.
Depending on the context, it can refer to different variables:
- When filtering a dataset: it can refer to column values by name (ex: {{ category }}).
- When used within a task action: it can refer to the sample, the solver_output or the scores (which is a mapping between scorer keys and their corresponding score values dict).
FilterComparison
Properties
op enum FilterComparisonOp required
Possible FilterComparisonOp values
The comparison operator to apply.
Allowed Values:
equals
not_equals
greater_than
less_than
greater_or_equal
less_or_equal
expression string required
An expression encoding what to compare against the value.
Depending on the context, it can refer to different variables:
- When filtering a dataset: it can refer to the sample and use dot or bracket notation to access the columns. If filtering a dataset with column names that are illegal under Jinja substitution rules (e.g. containing spaces), use bracket notation to access the column.
- When used within a task action: it can refer to the sample, the solver_output or the scores (which is a mapping between scorer keys and their corresponding score values dict).
value string, number, integer, boolean required
The value against which the expression is compared.
FilterMembership
Properties
op enum FilterMembershipOp required
Possible FilterMembershipOp values
The membership operator to apply.
Allowed Values:
in
not_in
expression string required
An expression encoding the value whose membership in values is checked.
Depending on the context, it can refer to different variables:
- When filtering a dataset: it can refer to column values by name (ex: {{ category }}).
- When used within a task action: it can refer to the sample, the solver_output or the scores (which is a mapping between scorer keys and their corresponding score values dict).
values array[string, number, boolean] required
The set of values to test membership against.
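The following sketch illustrates how an action rule with the exclude_from_metrics action could interact with the three filter types. It is an illustration of the operator names only, not the platform's implementation. The `matches` and `mean_with_exclusions` helpers are hypothetical, the expression is assumed to be already resolved to a plain Python value (with `None` standing for a missing value), and the exact semantics the platform applies to each operator may differ.

```python
from __future__ import annotations

import operator
from statistics import mean
from typing import Any, Callable

# Operator names come from the reference; the Python semantics are assumptions.
COMPARISON_OPS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_or_equal": operator.ge,
    "less_or_equal": operator.le,
}


def matches(filter_spec: dict[str, Any], resolved: Any) -> bool:
    """Apply a filter to an already-resolved expression value (None = missing)."""
    op = filter_spec["op"]
    if op in COMPARISON_OPS:  # FilterComparison
        return COMPARISON_OPS[op](resolved, filter_spec["value"])
    if op == "in":  # FilterMembership
        return resolved in filter_spec["values"]
    if op == "not_in":
        return resolved not in filter_spec["values"]
    if op == "exists":  # FilterUnary
        return resolved is not None
    if op == "not_exists":
        return resolved is None
    if op == "is_true":
        return resolved is True
    if op == "is_false":
        return resolved is False
    raise ValueError(f"Unknown filter op: {op}")


def mean_with_exclusions(
    scores: list[float], resolved_values: list[Any], filter_spec: dict[str, Any]
) -> float | None:
    """Average the scores of samples that do NOT match the exclusion filter."""
    kept = [s for s, r in zip(scores, resolved_values) if not matches(filter_spec, r)]
    return mean(kept) if kept else None


# Hypothetical rule: drop samples whose category is "refusal" from the metrics.
rule = {
    "key": "skip-refusals",
    "action": "exclude_from_metrics",
    "filter": {"op": "in", "expression": "{{ sample.category }}", "values": ["refusal"]},
}
average = mean_with_exclusions([1.0, 0.0, 1.0], ["ok", "refusal", "ok"], rule["filter"])
```

Rendering the expression string itself is left out of the sketch, since that step depends on the platform's templating.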
