Evaluations
An evaluation is a set of configured tasks (i.e. task specifications) that are run on a model or a dataset.
To interact with an evaluation from the CLI, it is recommended to use it as part of a Run.
Evaluation Overview
Properties
key string required
Unique identifier assigned to the entity in AI GO!.
Pattern:^[a-zA-Z0-9_\-]+$
Max Length:
250
display_name string required
The evaluation's name displayed to the user.
mode string
The mode of evaluation to be performed. Supported modes are 'debug' and 'full'. This field is deprecated and will be removed in a future version; use the num_samples and subsampling fields instead. If set, this field is translated as follows: 'debug' sets num_samples to 10, while 'full' sets num_samples to None (the entire dataset).
Default:None
num_samples integer
Maximum number of samples from each dataset to be used for the evaluation. This argument only applies to tasks that do not specify num_samples in their task specifications. If not provided, the entire dataset will be used.
Default:None
subsampling enum Subsampling
The method used to subsample the dataset when num_samples is specified. 'head' selects the first N samples, while 'random' selects N samples randomly from the dataset. If not provided, defaults to 'head'.
Default:None
Possible Subsampling values
The subsampling strategy to use when selecting samples for evaluation. Supported values:
- head - Select the first N samples.
- random - Select N random samples. The random seed is fixed for reproducibility.
If not specified, defaults to 'head'.
Allowed Values:
head, random
cache_policy CachePolicy
The caching policy to use for the task results in the evaluation.
Default:reuse
task_specifications array[SDKTaskSpecification] required
List of task specifications for the evaluation.
tags array[string]
Tags associated with the evaluation.
Default:[]
description string
Short description of the evaluation.
Default:None
long_description string
Long description of the evaluation. Supports Markdown formatting.
Default:None
...
evaluation:
  key: "hp-trivia-evaluation"
  display_name: "Harry Potter Trivia evaluation"
  task_specifications:
    - key: "hp-trivia-gpt-4-1-nano"
      task_key: "hp-trivia"
      display_name: "Harry Potter Trivia OpenAI GPT-4.1 Nano"
      task_config:
        config_key: []
      model_key: "openai$gpt-4-1-nano"
...
evaluation:
  display_name: "Data Quality"
  key: "data-quality"
  tags: ["Data Quality", "Uniqueness", "Completeness"]
  task_specifications:
    - key: "uniqueness-task-hello-dataset"
      task_key: "uniqueness-task"
      task_config:
        field: "content"
      dataset_key: "hello-dataset"
      display_name: "Sample Uniqueness in Hello Dataset"
    - key: "completeness-task-hello-dataset"
      task_key: "completeness-task"
      task_config:
        field: "content"
      dataset_key: "hello-dataset"
      display_name: "Sample Completeness in Hello Dataset"
The task_config allows instantiating the same Task template in different ways. Always refer to the Task definition for the available configuration options.
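The key constraints listed in the properties (the pattern and the maximum length of 250 characters) can be checked before an evaluation is submitted. The following is a minimal sketch, assuming the evaluation is held as a plain Python dict that mirrors the YAML layout; the helper names is_valid_key and validate_keys are hypothetical and not part of AI GO!.

```python
import re

# Pattern and maximum length as documented for `key` and `task_key`.
KEY_PATTERN = re.compile(r"^[a-zA-Z0-9_\-]+$")
MAX_KEY_LENGTH = 250


def is_valid_key(value):
    """Return True if `value` satisfies the documented key constraints."""
    return (
        isinstance(value, str)
        and len(value) <= MAX_KEY_LENGTH
        and KEY_PATTERN.fullmatch(value) is not None
    )


def validate_keys(evaluation):
    """Collect invalid `key` / `task_key` values of an evaluation dict.

    Hypothetical helper: the dict mirrors the YAML layout of an evaluation.
    """
    problems = []
    if not is_valid_key(evaluation.get("key")):
        problems.append(("key", evaluation.get("key")))
    for spec in evaluation.get("task_specifications", []):
        # `key` is optional on a task specification; `task_key` is required.
        if spec.get("key") is not None and not is_valid_key(spec["key"]):
            problems.append(("task_specifications.key", spec["key"]))
        if not is_valid_key(spec.get("task_key")):
            problems.append(("task_specifications.task_key", spec.get("task_key")))
    return problems


evaluation = {
    "key": "data-quality",
    "display_name": "Data Quality",
    "task_specifications": [
        {"key": "uniqueness-task-hello-dataset", "task_key": "uniqueness-task"},
        {"key": "bad key!", "task_key": "completeness-task"},
    ],
}

# Only the second task specification's key violates the pattern.
print(validate_keys(evaluation))
```

Only key and task_key are checked here, because the reference states a pattern for those two fields only; model_key and dataset_key are references to other entities and are left untouched.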
Definitions
SDKTaskSpecification
Properties
key string
Unique identifier assigned to the entity in AI GO!.
Default:None
task_key string required
Unique identifier assigned to the entity in AI GO!.
Pattern:^[a-zA-Z0-9_\-]+$
Max Length:
250
task_config object
Configuration for the specified task. Must match the task's config spec.
Default:{}
model_key string
Reference to an existing entity in AI GO!.
Default:None
dataset_key string
Reference to an existing entity in AI GO!.
Default:None
num_samples integer
Maximum number of samples from the dataset to be used for the task. If not provided, the entire dataset will be used.
Default:None
subsampling enum Subsampling
The method used to subsample the dataset when num_samples is specified. 'head' selects the first N samples, while 'random' selects N samples randomly from the dataset. If not provided, defaults to 'head'.
Default:None
Possible Subsampling values
The subsampling strategy to use when selecting samples for evaluation. Supported values:
- head - Select the first N samples.
- random - Select N random samples. The random seed is fixed for reproducibility.
If not specified, defaults to 'head'.
Allowed Values:
head, random
display_name string
The task specification's name displayed to the user.
Default:None
task_result_log_path string
Path to the task result log file.
Default:None
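The interplay between the deprecated mode field, the evaluation-level num_samples, the task-level num_samples, and subsampling can be sketched as follows. This is an illustration of the documented behavior, not the AI GO! implementation: the function names are hypothetical, the seed value is an assumption (the reference only says the seed is fixed), and returning the whole dataset when num_samples exceeds its size is an assumption derived from the word "maximum".

```python
import random

# Assumption: the docs say the random seed is fixed but do not state its value.
ASSUMED_SEED = 0


def translate_mode(mode):
    """Map the deprecated `mode` field to a num_samples value."""
    if mode == "debug":
        return 10
    if mode == "full":
        return None  # None means: use the entire dataset
    raise ValueError(f"unsupported mode: {mode!r}")


def effective_num_samples(task_num_samples, evaluation_num_samples):
    """A task-level num_samples takes precedence over the evaluation-level one."""
    if task_num_samples is not None:
        return task_num_samples
    return evaluation_num_samples


def subsample(dataset, num_samples=None, subsampling=None):
    """Select at most `num_samples` items using 'head' (default) or 'random'."""
    if num_samples is None or num_samples >= len(dataset):
        return list(dataset)
    method = subsampling or "head"
    if method == "head":
        return list(dataset[:num_samples])
    if method == "random":
        # A fresh generator with a fixed seed gives the same selection each run.
        return random.Random(ASSUMED_SEED).sample(list(dataset), num_samples)
    raise ValueError(f"unsupported subsampling: {subsampling!r}")


dataset = list(range(100))

# The task sets no num_samples, so the evaluation-level value (from 'debug') applies.
n = effective_num_samples(None, translate_mode("debug"))
print(subsample(dataset, n))            # the first 10 items
print(subsample(dataset, 5, "random"))  # 5 items, identical on every call
```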
CachePolicy
The caching policy to use for the task results in the evaluation. Supported values:
- reuse - Use a cached task result if one is available (the default). Partial task results are also reused automatically: if a task matches another, completed task in all of its configuration except the scorers configuration, only the scores, metrics, and the errors and failures related to them are recomputed. This saves queries to the model during the solver part of the evaluation.
- update - Do not use cached task results, but cache the results of the execution.
- no-cache - Do not use cached task results and do not cache the results of the execution.
Allowed Values:
reuse, update, no-cache
