Evaluations

An evaluation is a set of configured tasks (i.e. task specifications) that are run on a model or a dataset.

To interact with an evaluation using the CLI, it is recommended to use it as part of a Run.

Evaluation Overview

Properties


key string required

Unique identifier assigned to the evaluation in AI GO!.

Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250
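
The pattern and length limit above can be checked locally before a configuration is submitted. The following is a minimal sketch, assuming both constraints apply exactly as listed; the helper name `is_valid_key` is hypothetical and not part of AI GO!.

```python
import re

# Pattern and length limit copied from the property description above.
KEY_PATTERN = re.compile(r"^[a-zA-Z0-9_\-]+$")
MAX_KEY_LENGTH = 250


def is_valid_key(key: str) -> bool:
    """Return True if `key` matches the documented pattern and length limit.

    Hypothetical helper for illustration; not an AI GO! function.
    """
    return len(key) <= MAX_KEY_LENGTH and KEY_PATTERN.fullmatch(key) is not None
```

For example, `is_valid_key("hp-trivia-evaluation")` passes, while a key containing a space or an empty key does not.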

display_name string required

The evaluation's name displayed to the user.


mode string

The mode of evaluation to be performed. Supported modes are 'debug' and 'full'. This field is deprecated and will be removed in a future version; use the num_samples and subsampling fields instead. If this field is set, it is translated as follows: 'debug' sets num_samples to 10, while 'full' sets num_samples to None (the entire dataset).

Default: None
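
When migrating away from the deprecated mode field, the documented translation can be written out explicitly. This is a small sketch of that mapping only; the function name `num_samples_for_mode` is hypothetical, and raising an error for other values is an assumption.

```python
from typing import Optional


def num_samples_for_mode(mode: str) -> Optional[int]:
    """Map a deprecated `mode` value to the equivalent `num_samples`.

    Hypothetical helper mirroring the translation described above.
    """
    if mode == "debug":
        return 10
    if mode == "full":
        return None  # None means the entire dataset is used
    # Assumption: any other value is rejected.
    raise ValueError(f"Unsupported mode: {mode!r}")
```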

num_samples integer

Maximum number of samples from each dataset to be used for the evaluation. This argument only applies to tasks that do not specify num_samples in their task specifications. If not provided, the entire dataset will be used.

Default: None
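
The precedence between the evaluation-level and task-level num_samples can be sketched as follows. This assumes that a task-level value of None means "not specified"; the function name `effective_num_samples` is hypothetical.

```python
from typing import Optional


def effective_num_samples(
    evaluation_num_samples: Optional[int],
    task_num_samples: Optional[int],
) -> Optional[int]:
    """Return the sample limit that applies to one task specification.

    Assumption: None at the task level means the task does not specify
    num_samples, so the evaluation-level value is used instead.
    """
    if task_num_samples is not None:
        return task_num_samples
    return evaluation_num_samples  # may be None: use the entire dataset
```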

subsampling enum Subsampling

The method used to subsample the dataset when num_samples is specified. 'head' selects the first N samples, while 'random' selects N samples randomly from the dataset. If not provided, defaults to 'head'.

Default: None

Possible Subsampling values

The subsampling strategy to use when selecting samples for evaluation. Supported values:

  • head - Select the first N samples.
  • random - Select N random samples. The random seed is fixed for reproducibility.

If not specified, the strategy defaults to 'head'.

Allowed Values:

  • head
  • random
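
The two strategies can be illustrated with a short sketch. The seed value, the function name `subsample`, and the behavior when num_samples exceeds the dataset size are assumptions; the source only states that the seed is fixed.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Assumption: the actual seed used by AI GO! is not documented.
ASSUMED_SEED = 0


def subsample(
    samples: Sequence[T],
    num_samples: Optional[int],
    subsampling: Optional[str] = None,
) -> List[T]:
    """Select at most `num_samples` items using 'head' or 'random'."""
    # No limit, or a limit at least as large as the dataset: keep everything.
    if num_samples is None or num_samples >= len(samples):
        return list(samples)
    strategy = subsampling or "head"  # documented fallback
    if strategy == "head":
        return list(samples[:num_samples])
    if strategy == "random":
        # A fresh generator with a fixed seed gives the same selection each call.
        rng = random.Random(ASSUMED_SEED)
        return rng.sample(list(samples), num_samples)
    raise ValueError(f"Unsupported subsampling strategy: {strategy!r}")
```

Because the generator is re-seeded on every call, repeated 'random' runs over the same dataset select the same samples, which is the property the fixed seed is meant to provide.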

cache_policy CachePolicy

The caching policy to use for the task results in the evaluation.

Default: reuse

task_specifications array[SDKTaskSpecification] required

List of task specifications for the evaluation.


tags array[string]

Tags associated with the evaluation.

Default: []

description string

Short description of the evaluation.

Default: None

long_description string

Long description of the evaluation. Supports Markdown formatting.

Default: None

Examples

evaluation:
  key: "hp-trivia-evaluation"
  display_name: "Harry Potter Trivia evaluation"
  task_specifications:
    - key: "hp-trivia-gpt-4-1-nano"
      task_key: "hp-trivia"
      display_name: "Harry Potter Trivia OpenAI GPT-4.1 Nano"
      task_config:
        config_key: []
      model_key: "openai$gpt-4-1-nano"
...
evaluation:
  display_name: "Data Quality"
  key: "data-quality"
  tags: ["Data Quality", "Uniqueness", "Completeness"]
  task_specifications:
    - key: "uniqueness-task-hello-dataset"
      task_key: "uniqueness-task"
      task_config:
        field: "content"
      dataset_key: "hello-dataset"
      display_name: "Sample Uniqueness in Hello Dataset"
    - key: "completeness-task-hello-dataset"
      task_key: "completeness-task"
      task_config:
        field: "content"
      dataset_key: "hello-dataset"
      display_name: "Sample Completeness in Hello Dataset"
💡 The task_config field lets you instantiate the same Task template with different settings. Always refer to the Task definition for the available configuration options.
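
As an illustration of this point, the sketch below builds two task specifications from the same task_key with different task_config values. It uses plain Python dictionaries that mirror the YAML structure above; the field value "title" and the derived keys are hypothetical, and how such dictionaries would be passed to AI GO! is not shown.

```python
from typing import Any, Dict, List


def uniqueness_spec(field: str, dataset_key: str) -> Dict[str, Any]:
    """Build one task specification for the uniqueness task (illustrative)."""
    return {
        "key": f"uniqueness-task-{field}-{dataset_key}",
        "task_key": "uniqueness-task",
        "task_config": {"field": field},
        "dataset_key": dataset_key,
    }


# Same task template, two configurations.
# Assumption: "title" is a hypothetical field of the dataset.
task_specifications: List[Dict[str, Any]] = [
    uniqueness_spec("content", "hello-dataset"),
    uniqueness_spec("title", "hello-dataset"),
]
```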

Definitions

SDKTaskSpecification

Properties


key string

Unique identifier assigned to the task specification in AI GO!.

Default: None

task_key string required

Unique identifier of the task that this specification instantiates.

Pattern: ^[a-zA-Z0-9_\-]+$
Max Length: 250

task_config object

Configuration for the specified task. Must match the task's config spec.

Default: {}

model_key string

Reference to an existing model in AI GO! on which the task is run.

Default: None

dataset_key string

Reference to an existing dataset in AI GO! on which the task is run.

Default: None

num_samples integer

Maximum number of samples from the dataset to be used for the task. If not provided, the entire dataset will be used.

Default: None

subsampling enum Subsampling

The method used to subsample the dataset when num_samples is specified. 'head' selects the first N samples, while 'random' selects N samples randomly from the dataset. If not provided, defaults to 'head'.

Default: None

Possible Subsampling values

The subsampling strategy to use when selecting samples for evaluation. Supported values:

  • head - Select the first N samples.
  • random - Select N random samples. The random seed is fixed for reproducibility.

If not specified, the strategy defaults to 'head'.

Allowed Values:

  • head
  • random

display_name string

The task specification's name displayed to the user.

Default: None

task_result_log_path string

Path to the task result log file.

Default: None

CachePolicy

The caching policy to use for the task results in the evaluation. Supported values:

  • reuse - Use a cached task result if one is available (the default). Partial task results are also reused automatically: if a task matches another, completed task in every part of its configuration except the scorers configuration, only the scores, metrics, and the errors and failures related to them are recomputed. This avoids repeating model queries during the solver part of the evaluation.
  • update - Do not use cached task results, but cache the results of the execution.
  • no-cache - Do not use cached task results and do not cache the results of the execution.

Allowed Values:

  • reuse
  • update
  • no-cache
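
The three policies differ in whether cached results are read and whether new results are written. The sketch below summarizes this as a pair of flags. The function name `cache_behavior` is hypothetical, and the assumption that 'reuse' also writes newly computed results to the cache is not stated in the source.

```python
from typing import Dict, Tuple

# (read_from_cache, write_to_cache) per policy.
# Assumption: 'reuse' stores new results as well; the source only says it
# uses a cached result when one is available.
_CACHE_BEHAVIOR: Dict[str, Tuple[bool, bool]] = {
    "reuse": (True, True),
    "update": (False, True),
    "no-cache": (False, False),
}


def cache_behavior(cache_policy: str = "reuse") -> Tuple[bool, bool]:
    """Return (read_from_cache, write_to_cache) for a cache policy."""
    try:
        return _CACHE_BEHAVIOR[cache_policy]
    except KeyError:
        raise ValueError(f"Unsupported cache policy: {cache_policy!r}") from None
```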