Tasks define the execution flow of an evaluation of a model or a dataset. AI GO! tasks take inspiration from the Inspect AI framework.

To ensure traceability and reproducibility, all tasks are defined declaratively, i.e. in a YAML file.

Quick Links

Single-turn Task

Learn how to define an evaluation task with single-turn model interaction.

Multi-turn Task

Learn how to define an evaluation task with multi-turn model interaction.

Data Quality Tasks

Learn how to define a data quality evaluation task.

CLI

Learn about CLI commands.

Task Reference

Learn about task definition in the reference.

Define an Evaluation Task

Let's start with an example of defining a new task that measures how good the model is at following instructions.

First, we create an OpenAI model that we want to evaluate.

lf model add -p 'openai/gpt-5-nano'

Second, we download and create a dataset of test examples, where each sample has the following structure: {"options": "[a, b, c]", "first": "a", "last": "c"}. We will use this dataset to define the task.

curl 'https://cdn.latticeflow.cloud/_aigo/demos/docs/1.1.0/jsonl_instructions_categorical_options.yaml' -o 'jsonl_instructions_categorical_options.yaml'
curl 'https://cdn.latticeflow.cloud/_aigo/demos/docs/1.1.0/simple_instructions_options.jsonl' -o 'simple_instructions_options.jsonl'
lf dataset add -f 'jsonl_instructions_categorical_options.yaml'
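
Before registering the dataset, it can help to confirm locally that every line of the JSONL file has the columns the task will reference. The sketch below is plain Python and not part of the lf CLI; the helper name check_samples and the two inline sample lines are illustrative assumptions, written to match the sample structure described above.

```python
import json

# Columns that the task templates reference as {{ sample.<column> }}.
REQUIRED_COLUMNS = {"options", "first", "last"}


def check_samples(jsonl_text: str) -> list[dict]:
    """Parse JSONL text and fail early if a sample lacks a required column."""
    samples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        sample = json.loads(line)
        missing = REQUIRED_COLUMNS - set(sample)
        if missing:
            raise ValueError(f"line {line_no}: missing columns {sorted(missing)}")
        samples.append(sample)
    return samples


# Two hand-written lines in the documented structure (not the real file).
example = (
    '{"options": "[a, b, c]", "first": "a", "last": "c"}\n'
    '{"options": "[x, y]", "first": "x", "last": "y"}\n'
)
samples = check_samples(example)
print(len(samples), samples[0]["first"])
```

To check the downloaded file instead, the same function could be called with the text read from simple_instructions_options.jsonl.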

Next, we create a task in which, for every sample in the dataset, the model is prompted to respond with the first option in a list. The list of options for each sample is given in the options column. The task checks correctness by testing whether the model output exactly matches the first column of the sample. The task is defined below; create a file task.yaml and copy the contents into it.

display_name: "Simple Instructions Following"
key: "simple-instructions-following"
description: "Evaluates how good the model is at following instructions."
config_spec: []
definition:
  dataset:
    key: "instructions-categorical-options"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "user"
          content: >
            Example: If asked for first option in [A, B, C], correct response is: A.
            
            Instructions: The list of allowed responses is {{ sample.options }}. 
            Pick the first option. No punctuation, quotes or explanations.
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.first }}"
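
To make the definition above concrete, the following sketch walks through what happens for a single sample: the prompt is rendered from the sample, and the model answer is compared to the rendered ground truth. It is an illustration only. The render function is a tiny stand-in that handles just {{ sample.<column> }} placeholders (the real templates are Jinja), and the exact-comparison behavior of string_equals is assumed from the description above.

```python
import re

PROMPT_TEMPLATE = (
    "Example: If asked for first option in [A, B, C], correct response is: A. "
    "Instructions: The list of allowed responses is {{ sample.options }}. "
    "Pick the first option. No punctuation, quotes or explanations."
)


def render(template: str, sample: dict) -> str:
    """Fill {{ sample.<column> }} placeholders; a minimal stand-in for Jinja."""
    return re.sub(
        r"\{\{\s*sample\.(\w+)\s*\}\}",
        lambda match: str(sample[match.group(1)]),
        template,
    )


def string_equals(model_output: str, ground_truth: str) -> bool:
    """Exact comparison (assumed semantics of the string_equals scorer)."""
    return model_output == ground_truth


sample = {"options": "[a, b, c]", "first": "a", "last": "c"}
prompt = render(PROMPT_TEMPLATE, sample)
ground_truth = render("{{ sample.first }}", sample)

print(prompt)
print(string_equals("a", ground_truth))   # a correct answer
print(string_equals("a.", ground_truth))  # trailing punctuation breaks the match
```

The second comparison shows why the prompt asks for no punctuation, quotes or explanations: with an exact match, any extra character counts as a wrong answer.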

Now, we can create the task in AI GO! and test it using the CLI.

lf task add -f 'task.yaml'
lf task test 'simple-instructions-following' --model-key 'openai$gpt-5-nano' -n 1

Congrats, you have successfully defined your first task. 🚀

Suppose we are not satisfied with the rigidity of the defined task: instead of requiring that the model always responds with the first option, we want to configure the task on each run, so that the model is expected to return either the first or the last option.

We can do this by extending the task definition and exposing a parameter in the task's config_spec. The new definition includes a parameter ground_truth that is later used in the solver and the scorer to configure their behavior based on the configuration value.

To update the task, copy the new YAML below into task.yaml.

display_name: "Simple Instructions Following"
key: "simple-instructions-following"
description: "Evaluates how good the model is at following instructions."
config_spec:
  - display_name: "Ground Truth Option"
    key: "ground_truth"
    type: "categorical"
    description: >
      Parameter that controls whether the model is expected to return the first or the
      last option.
    allowed_values: ["first", "last"]
definition:
  dataset:
    key: "instructions-categorical-options"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "user"
          content: >
            Example: If asked for first option in [A, B, C], correct response is: A.
            Example: If asked for last option in [A, B, C], correct response is: C.

            Instructions: The list of allowed responses is {{ sample.options }}. 
            Pick the << config.ground_truth >> option. No punctuation, quotes or 
            explanations.
  scorers:
    - type: "string_equals"
      ground_truth: >
        {{ sample.first if "<< config.ground_truth >>" == "first" else sample.last }}
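
The definition above mixes two placeholder styles: << config.<key> >> and {{ ... }}. The sketch below illustrates one plausible reading, in which config placeholders are substituted textually first and the remaining Jinja expression is evaluated per sample afterwards. This ordering is an assumption inferred from the quotes around the placeholder; the helper substitute_config is hypothetical and not part of AI GO!.

```python
import re

SCORER_TEMPLATE = (
    '{{ sample.first if "<< config.ground_truth >>" == "first" else sample.last }}'
)


def substitute_config(template: str, config: dict) -> str:
    """Replace each << config.<key> >> placeholder with its configured value."""
    return re.sub(
        r"<<\s*config\.(\w+)\s*>>",
        lambda match: str(config[match.group(1)]),
        template,
    )


# Show the expression that remains for each allowed value of the parameter.
for value in ("first", "last"):
    print(substitute_config(SCORER_TEMPLATE, {"ground_truth": value}))
```

Under this reading, the quotes around the placeholder matter: after substitution they turn the configured value into a string literal, so the remaining expression compares two strings instead of looking up a variable named first or last.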

Since the task exposes a parameter, we first need to create a config file config.yaml with the contents below to be able to run it. In the configuration file, we set the value of the ground_truth parameter to last. This changes the task behavior to check whether the model knows how to respond with the last option in the list.

ground_truth: "last"

Now, we can use the CLI to update the task definition and run it again.

lf task add -f 'task.yaml'
lf task test 'simple-instructions-following' --model-key 'openai$gpt-5-nano' --config 'config.yaml'

Definition

📘
Reference
This page describes the most important fields when defining a task; to see the full specification, please see this page.

A task definition includes multiple parts:

Metadata: Contains information on the name, the key and the description of the task.
Evaluated Entity Type: A task can be used to run an evaluation on the model to measure performance and behavior or on the dataset to measure the quality of the data. This is defined by the evaluated_entity_type field that can be set to model or dataset, respectively.
Configuration Specification: A task can expose parameters whose values are supplied when the task is run.
Dataset: A task uses a dataset as a collection of test scenarios used in the evaluation.
Solver: The task's execution plan is defined by the solver - it defines how test cases in the dataset are used to interact with the model. The idea of a solver comes from the Inspect AI framework.
Scorers: A task uses a list of scorers. A scorer defines one or more scoring functions, i.e. functions that compute a score of interest for each sample, and one or more metrics, i.e. functions that aggregate the per-sample scores.

In general, a task definition follows the structure of this YAML template:

display_name: "<display_name>"
key: "<key>"
description: "<description>"
evaluated_entity_type: "<evaluated_entity_type>"
config_spec: <config_spec>
definition:
  dataset: <dataset>  
  solver: <solver>
  scorers: <scorers>
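
The way these parts fit together can be sketched as a small loop: the solver turns each dataset sample into a model output, the scorers compute per-sample scores, and a metric aggregates them. This is a simplified mental model, not the AI GO! implementation. The names run_task, stub_solver and exact_match are hypothetical, and the stub solver replaces a real model call.

```python
from statistics import mean
from typing import Callable

Sample = dict
Scores = dict


def run_task(
    dataset: list[Sample],
    solver: Callable[[Sample], str],
    scorers: list[Callable[[Sample, str], Scores]],
    metric_field: str,
) -> float:
    """Run solver and scorers on every sample, then aggregate one score field."""
    per_sample_scores = []
    for sample in dataset:
        output = solver(sample)  # solver: sample -> model output
        scores: Scores = {}
        for scorer in scorers:   # scorers: per-sample scores
            scores.update(scorer(sample, output))
        per_sample_scores.append(scores)
    # Metric: aggregate the per-sample scores (here, a plain mean).
    return mean(float(s[metric_field]) for s in per_sample_scores)


def stub_solver(sample: Sample) -> str:
    """Stand-in for a model call: always answers with the first option."""
    return sample["options"].strip("[]").split(",")[0].strip()


def exact_match(sample: Sample, output: str) -> Scores:
    """Per-sample score: does the output equal the 'first' column?"""
    return {"prediction_is_correct": output == sample["first"]}


dataset = [
    {"options": "[a, b, c]", "first": "a", "last": "c"},
    {"options": "[x, y]", "first": "x", "last": "y"},
]
accuracy = run_task(dataset, stub_solver, [exact_match], "prediction_is_correct")
print(accuracy)
```

One simplification to keep in mind: the sketch merges all scorer outputs into one dictionary and applies a single metric, whereas in the YAML each scorer declares its own metrics.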

Configuration Specification

The specification is defined under the config_spec field in the YAML file. The configuration specification defines a list of parameters.

A parameter is defined by its name, key and type. The value of a parameter can be used in the task definition as << config.<key> >> to control the task's execution logic. For more details, see the reference.

config_spec:
  # Parameter 1
  - type: "<type>"
    key: "<key>"
    display_name: "<name>"
  # Parameter 2
  - ...

If a task does not expose any parameters, then the configuration specification is set to an empty list.

config_spec: []

Dataset

The dataset used by the task is configured by referring to its dataset key in the YAML field key. For more details, see the reference.

definition:
  dataset:
    key: "<key>"

If a task does not use a fixed dataset, but instead exposes a dataset parameter in the configuration specification, then we can get the key from the config as:

config_spec:
  - type: "dataset"
    key: "my_dataset"
    display_name: "My Dataset"
  - ...
definition:
  ...
  dataset:
    key: "<< config.my_dataset >>"

Solver

The solver specification is defined under the definition > solver field in the YAML file. The structure of the solver is defined by its type and the input builder. All available types and parameters are defined in the reference.

When evaluating the single-turn behavior and performance of models, use the single_turn_solver solver type.

The input builder allows us to build the messages that will be sent to the models. For chat completion models, use the chat_completion input builder type that helps you define prompts.

definition:
  ...
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "user"
          content: "{{ sample.question }}"

When evaluating custom models, use the generic input builder type, which allows you to define custom JSON messages by writing a Jinja template.

definition:
  ...
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "generic"
      template: |
        {"job_description": {{ sample.question | tojson }}}
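
The tojson filter in the template above is what keeps the rendered message valid JSON when a sample value contains quotes or newlines. The sketch below illustrates the idea with the standard library, where json.dumps plays roughly the role of tojson; the example question text is made up for illustration.

```python
import json

question = 'Senior "Data" Engineer\nRemote, EU'

# Naive string interpolation yields invalid JSON: the quotes and the newline
# inside the value are not escaped.
naive = '{"job_description": "' + question + '"}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False

# json.dumps escapes the value and adds the surrounding quotes itself.
escaped = '{"job_description": ' + json.dumps(question) + "}"
parsed = json.loads(escaped)

print(naive_is_valid)
print(parsed["job_description"] == question)
```

This also explains why the template has no quotes around the {{ ... | tojson }} expression: the filter emits the quotes as part of the encoded string.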

The solver's definition can use the configuration specification to control its behavior. Access configuration parameter values as << config.<key> >>.

config_spec:
  - type: "dataset_column"
    key: "dataset_column"
    display_name: "Dataset column that contains the job description."
  - ...
definition:
  ...
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "generic"
      template: |
        {"job_description": {{ sample.<< config.dataset_column >> | tojson }}}

Scorers

The scorer specification is defined under the definition > scorers field in the YAML file. A task can define multiple scorers. A scorer is defined by its type and parameters specific to its type. All available types and parameters are defined in the reference.

definition:
  ...
  scorers:
  - type: string_equals
    ground_truth: '{{ sample.ground_truth }}'

The scorer's definition can use the configuration specification to control its behavior. Access configuration parameter values as << config.<key> >>.

config_spec:
  - display_name: "Ground Truth Option"
    key: "ground_truth"
    type: "string"
  - ...
definition:
  ...
  scorers:
    - type: "string_equals"
      ground_truth: >
        {{ sample.first if "<< config.ground_truth >>" == "first" else sample.last }}

If none of the built-in scorers matches your needs, you can define your own scorer with custom Python logic. For example, to score if the model, given the country, can predict the correct capital of that country, we can use this snippet:

def compute_scores(sample: dict, model_input, model_output) -> dict:
    # Normalize both sides so that casing and surrounding whitespace do not
    # cause a false mismatch (e.g. "Paris" vs. "paris").
    model_prediction = model_output['choices'][0]['message']['content'].strip().lower()
    capital = sample['capital']
    country = sample['country']
    return {
        'country': country,
        'capital': capital,
        'model_prediction': model_prediction,
        'is_correct': model_prediction == capital.strip().lower(),
    }

This logic can then be used within a task to define the scorer:

definition:
  ...
  scorers:
  - key: "my_geography_scorer"
    type: "python"
    compute_scores_snippet: !include score_sample.py
    metrics:
    - type: "mean"
      field: "is_correct"

You can indicate that a scorer was added for the purpose of QA. Such scorers are useful to evaluate the quality of the sample or the solver output. If a row doesn’t meet your quality criteria, you can use action rules to ensure the metrics ignore it.

definition:
  ...
  scorers:
  - key: "language_labeler"
    purpose: "qa"
    type: "labeler_via_model"
    model_key: << config.labeler_model >>
    labels: ["english", "non-english"]
    user_prompt: |
      Please analyse the language of the following text in the <Text> section. If the
      language is English, output 'english'. Otherwise, output 'non-english'.
      Do not output anything else.
      	
      <Text>
      {{ solver_output.output.choices[0].message.content }}
      </Text>

Action Rules

The action rules specification is defined under the definition > actions field in the YAML file. A task can define multiple action rules. An action rule is defined by its action and filter.

definition:
  ...
  actions:
  - key: 'excluded_non_english_solver_output'
    action: exclude_from_metrics
    filter:
      op: equals
      expression: "{{ scores.language_labeler.label }}"
      value: 'non-english'

If a row doesn’t meet your quality criteria, you can use actions to ensure the metrics ignore it. To do so, specify which rows should be ignored (via the filter) and set the action to exclude_from_metrics. The filter can operate over the sample, the solver_output (if available) and the scores.
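
The effect of an exclude_from_metrics rule can be sketched as a filter applied to the rows before aggregation. The code below is an illustration under assumptions, not the AI GO! implementation. The row layout, the helper names lookup and matches, and the path tuple that stands in for the templated expression are all hypothetical.

```python
from statistics import mean

# Hypothetical per-row results: a QA label plus a correctness score.
rows = [
    {"scores": {"language_labeler": {"label": "english"}, "is_correct": True}},
    {"scores": {"language_labeler": {"label": "non-english"}, "is_correct": False}},
    {"scores": {"language_labeler": {"label": "english"}, "is_correct": False}},
]

# Stand-in for the YAML rule; "path" replaces the templated expression.
rule = {
    "action": "exclude_from_metrics",
    "filter": {
        "op": "equals",
        "path": ("scores", "language_labeler", "label"),
        "value": "non-english",
    },
}


def lookup(row: dict, path: tuple) -> object:
    """Follow a sequence of keys into a nested dictionary."""
    value = row
    for key in path:
        value = value[key]
    return value


def matches(row: dict, rule_filter: dict) -> bool:
    """Evaluate the filter for one row; only 'equals' is sketched here."""
    if rule_filter["op"] != "equals":
        raise NotImplementedError(rule_filter["op"])
    return lookup(row, rule_filter["path"]) == rule_filter["value"]


kept = [row for row in rows if not matches(row, rule["filter"])]
accuracy_all = mean(float(r["scores"]["is_correct"]) for r in rows)
accuracy_kept = mean(float(r["scores"]["is_correct"]) for r in kept)
print(len(kept), accuracy_all, accuracy_kept)
```

In this toy data, the excluded row changes the mean from one correct answer out of three to one out of two, which is the kind of shift an action rule is meant to make explicit.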

Metrics

The metrics specification is defined under the definition > scorers > metrics field in the YAML file. A scorer can define multiple metrics that perform different aggregations of scores computed by the scorer.

A metric is defined by its type and parameters specific to its type. All available types and parameters are defined in the reference.

definition:
  ...
  scorers:
  - type: "string_equals"
    ground_truth: "{{ sample.ground_truth }}"
    metrics:
    - type: "mean"
      field: "prediction_is_correct"

If none of the built-in metrics matches your needs, you can add your own metric with custom Python logic. For example, to evaluate the precision and recall for a binary classification problem, the following snippet can be used:

def compute_scores(scores) -> dict[str, int | float]:
    num_scores = len(scores)
    if num_scores == 0:
        raise ValueError(
            "Cannot compute metrics: received 0 sample scores."
        )

    def safe_division(a, b, default):
        return a/b if b != 0 else default

    num_tp = 0
    num_fp = 0
    num_p = 0

    for sample_scores in scores:
        gt = sample_scores['gt']
        pred = sample_scores['pred']
        if gt is True and pred is True:
            num_tp += 1
            num_p += 1
        elif gt is True and pred is False:
            num_p += 1
        elif gt is False and pred is True:
            num_fp += 1

    return {
        "precision": safe_division(num_tp, num_tp + num_fp, 1.0),
        "recall": safe_division(num_tp, num_p, 1.0),
    }

This logic can then be used within a task in the following way:

definition:
  dataset: ...
  solver: ...
  scorers:
  - key: "my_binary_classification_scorer"
    type: "python"
    compute_scores_snippet: !include score_sample.py  # Produces scores containing 'gt' and 'pred'.
    metrics:
    - type: "python"
      compute_metrics_snippet: !include compute_metrics.py
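
As a quick sanity check of the counting logic in the precision and recall snippet, the same quantities can be computed by hand on a few made-up per-sample scores. The five rows below are illustrative, and the compact formulas are an equivalent rewrite of the counting, not the snippet itself.

```python
# Hand-made per-sample scores with the 'gt' and 'pred' fields.
scores = [
    {"gt": True, "pred": True},    # true positive
    {"gt": True, "pred": False},   # false negative
    {"gt": False, "pred": True},   # false positive
    {"gt": False, "pred": False},  # true negative
    {"gt": True, "pred": True},    # true positive
]

num_tp = sum(1 for s in scores if s["gt"] and s["pred"])
num_fp = sum(1 for s in scores if not s["gt"] and s["pred"])
num_p = sum(1 for s in scores if s["gt"])

# Same convention as the snippet: an empty denominator yields 1.0.
precision = num_tp / (num_tp + num_fp) if (num_tp + num_fp) else 1.0
recall = num_tp / num_p if num_p else 1.0
print(num_tp, num_fp, num_p, precision, recall)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so both precision and recall come out at two thirds.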

Usage

Commands

Create the task as defined in task.yaml. This will persist the task in AI GO!.

If there is no task with the key provided in the YAML file, a new task is created.
If a task with the key provided in the YAML file already exists, the task is updated if the task specification has changed.

lf task add -f 'task.yaml'     # Create/update a single task from a YAML file.
lf task add -f 'tasks/*.yaml'  # Create/update multiple tasks in a directory with YAML files.

List all tasks.

lf tasks

Delete the task by key.

lf task delete 'my-task'

Export the task by key.

lf task export 'my-task'                   # Export to STDOUT
lf task export 'my-task' -o 'task.yaml'    # Export to YAML

Testing

After defining a task, the first question is often whether the task actually works as expected. A task can have two general types of issues: issues that prevent the execution of the task, and issues that affect the correctness of the results. The task testing tool can be used to detect and address both types. It shows the execution flow with results and error messages for each stage.

To run the task for testing, use the CLI and specify the key of the task to test and the key of the model to use for evaluation. For more details, see the reference.

lf task test 'my-task' --model-key 'my-model' --config 'my-config.yaml'

If your task exposes configuration parameters, you additionally need to create a YAML file that maps configuration parameters to their values. This file needs to contain a value for every parameter in the config_spec defined as part of the task. An example YAML file would look like:

dataset: "my-dataset"
dataset_column: "question"

To run the task now, pass the path to the configuration file as the CLI option.

lf task test 'my-task' --model-key 'my-model' --config 'config.yaml'
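
Since the configuration file must provide a value for every parameter in config_spec, a small local check can catch mistakes before invoking the CLI. The sketch below is a hypothetical helper, not something AI GO! ships. It only checks for missing values and, where a parameter declares allowed_values, for membership; the actual validation performed by lf task test may differ.

```python
def validate_config(config_spec: list[dict], config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for param in config_spec:
        key = param["key"]
        if key not in config:
            problems.append(f"missing value for parameter '{key}'")
            continue
        allowed = param.get("allowed_values")
        if allowed is not None and config[key] not in allowed:
            problems.append(
                f"'{key}' must be one of {allowed}, got {config[key]!r}"
            )
    return problems


# The parameter from the walkthrough task, written as a Python dictionary.
config_spec = [
    {
        "key": "ground_truth",
        "type": "categorical",
        "allowed_values": ["first", "last"],
    },
]

print(validate_config(config_spec, {"ground_truth": "last"}))
print(validate_config(config_spec, {"ground_truth": "middle"}))
print(validate_config(config_spec, {}))
```

In practice, both dictionaries would be loaded from task.yaml and config.yaml with a YAML parser; they are written inline here to keep the sketch self-contained.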

Example: Task Testing Output

If successful, the output of the lf task test command will be:

Checking configuration
Configuration is valid.
=====================================================================================================================
Processing sample 1/1
---------------------------------------------------------------------------------------------------------------------
Dataset sample
                                                                                                                   
options: '[a, b, c]'                                                                                                 
first: a                                                                                                             
last: c                                                                                                              
sample_id: 0                                                                                                         
                                                                                                                   
---------------------------------------------------------------------------------------------------------------------
Solver
                                                                                                                   
output:                                                                                                              
messages:                                                                                                          
- role: user                                                                                                       
  content: |-                                                                                                      
    Example: If asked for first option in [A, B, C], correct response is: A. Example: If asked for last option in [A, B, C], correct response is: C.                                                                                    
    Instructions: The list of allowed responses is [a, b, c].  Pick the last option. No punctuation, quotes or  explanations.                                                                                                           
- role: assistant                                                                                                  
  content: c                                                                                                       
output:                                                                                                            
  choices:                                                                                                         
  - message:                                                                                                       
      role: assistant                                                                                              
      content: c                                                                                                   
                                                                                                                   
---------------------------------------------------------------------------------------------------------------------
Scores
                                                                                                                   
- prediction_is_correct: true                                                                                        
answer_gt: c                                                                                                       
answer_pred: c                                                                                                     
                                                                                                                   
---------------------------------------------------------------------------------------------------------------------
Processed all samples.
=====================================================================================================================
Metrics
                                                                                                                   
string_equality_mean: 1.0                                                                                            
                                                                                                                   
---------------------------------------------------------------------------------------------------------------------
Successfully tested configuration of task with key 'simple-instructions-following'