Python Batch Scorer
Scores all samples at once using a custom Python function, enabling cross-sample scoring such as uniqueness or aggregate comparisons. Use this when your evaluation logic depends on the full set of samples. For per-sample logic, use Python Scorer instead, which is simpler and scores each sample in parallel.
Output
The score keys and values are entirely defined by your compute_scores
function. Each entry in the returned list can use one of two formats:
- Scores only - {"uniqueness": 1.0}
- Scores with metadata - {"scores": {"uniqueness": 1.0}, "metadata": {"duplicates": []}}
Metadata is stored alongside scores but not aggregated into metrics.
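As an illustration of the second format, the sketch below returns a uniqueness score together with metadata listing where each duplicate occurs. The "output" field name and the choice to store duplicate indices are assumptions made for this example, not something the scorer prescribes.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Assumption: each sample stores the text to compare under "output".
    positions: dict[Any, list[int]] = defaultdict(list)
    for index, sample in enumerate(samples):
        positions[sample.get("output")].append(index)

    results = []
    for index, sample in enumerate(samples):
        # Indices of the other samples that share this sample's output.
        others = [i for i in positions[sample.get("output")] if i != index]
        results.append(
            {
                "scores": {"uniqueness": 0.0 if others else 1.0},
                "metadata": {"duplicates": others},
            }
        )
    return results
```

Only the values under "scores" would feed into metrics here; the "duplicates" list is kept purely for inspection.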
Function Signature
Both def and async def are supported. The function signature is:
def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]: ...
where samples is the full list of samples in dataset order. The
returned list must contain one entry (of scores and optional metadata) per sample,
in the same order.
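Since async def is accepted, a scorer that needs to await something (for example an I/O call) can be written as a coroutine. The sketch below is only illustrative: normalize is a hypothetical helper standing in for real asynchronous work, and the "answer" field name and "agreement" score key are assumptions.

```python
from __future__ import annotations

import asyncio
from collections import Counter
from typing import Any


async def normalize(text: str) -> str:
    # Hypothetical helper: stands in for real async work such as an I/O call.
    await asyncio.sleep(0)
    return text.strip().lower()


async def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Assumption: each sample has a string "answer" field.
    # gather() preserves input order, so results stay in dataset order.
    normalized = await asyncio.gather(*(normalize(s["answer"]) for s in samples))
    counts = Counter(normalized)
    # Share of samples that gave the same normalized answer as this one.
    return [{"agreement": counts[value] / len(samples)} for value in normalized]
```

For a quick local check outside the platform, the coroutine can be driven with asyncio.run(compute_scores(samples)).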
Examples
Example: Output Uniqueness. Receives all outputs at once and assigns lower scores to responses that are duplicates of others.
...
definition:
  ...
  scorers:
    - type: "python_all_samples"
      compute_scores_snippet: !include "uniqueness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_unique"
          name: "Uniqueness Rate"

# uniqueness_scorer.py
from __future__ import annotations

from collections import Counter
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Template placeholder, substituted with the value of config.field.
    field_name = "<< config.field >>"
    # Samples that lack the field are recorded as None.
    values = [
        sample[field_name] if field_name in sample else None for sample in samples
    ]
    counter = Counter(values)
    # A value is unique if it occurs exactly once; missing values count as unique.
    return [
        {"is_unique": counter[value] == 1 if value is not None else True}
        for value in values
    ]

With config.field = "answer":
| sample["answer"] | is_unique |
|---|---|
| "Paris" | 0.0 |
| "London" | 1.0 |
| "Paris" | 0.0 |
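To see how the table and the "Uniqueness Rate" metric relate, the example can be reproduced locally. This sketch hard-codes field_name = "answer" in place of the template placeholder and assumes the mean metric treats True/False as 1.0/0.0, which matches the values shown in the table.

```python
from __future__ import annotations

from collections import Counter
from statistics import mean
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    field_name = "answer"  # stands in for "<< config.field >>"
    values = [sample.get(field_name) for sample in samples]
    counter = Counter(values)
    return [
        {"is_unique": counter[value] == 1 if value is not None else True}
        for value in values
    ]


samples = [{"answer": "Paris"}, {"answer": "London"}, {"answer": "Paris"}]
scores = compute_scores(samples)
# Assumption: booleans are averaged as 0.0 / 1.0.
uniqueness_rate = mean(float(entry["is_unique"]) for entry in scores)
print(scores)
print(round(uniqueness_rate, 3))  # 0.333
```

Only "London" appears once, so one of three samples is unique and the mean comes out at roughly 0.333.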
Configuration
Properties
type Literal "python_all_samples" required
The type of the scorer.
compute_scores_snippet string, TemplateValue required
The Python code snippet defining how to compute the scores. It must define
a compute_scores function with the following API:
def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]: ...
where
- samples is the list of samples (in the same order as the dataset)
- the returned list must contain one entry per sample
Both def and async def are supported.
Each entry can be in one of two formats:
Scores only - a flat dict of score key/value pairs:
[
    {"accuracy": 0.95, "is_correct": True},
    ...
]
Scores with metadata - a structured dict with "scores" and "metadata" keys.
Metadata is stored alongside the scores but is not aggregated into metrics:
[
    {
        "scores": {"accuracy": 0.95, "is_correct": True},
        "metadata": {"model": "gpt-4", "tokens": 123},
    },
    ...
]
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
display_name string
The display name of the scorer.
Default: None
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
