Python Batch Scorer

Scores all samples at once using a custom Python function, enabling cross-sample scoring such as uniqueness or aggregate comparisons. Use this when your evaluation logic depends on the full set of samples. For per-sample logic, use Python Scorer instead, which is simpler and scores each sample in parallel.

Output

The score keys and values are entirely defined by your compute_scores function. Each entry in the returned list can use one of two formats:

Scores only - {"uniqueness": 1.0}
Scores with metadata - {"scores": {"uniqueness": 1.0}, "metadata": {"duplicates": []}}

Metadata is stored alongside scores but not aggregated into metrics.
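As an illustration of the second format, here is a minimal sketch of a compute_scores function that returns a "uniqueness" score and records the positions of duplicate samples as metadata. The "answer" field is an assumption made for this example; it is not a field the scorer requires.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any


def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Group sample indices by value. "answer" is an assumed field name, and
    # its values must be hashable to be used as dictionary keys.
    positions: dict[Any, list[int]] = defaultdict(list)
    for index, sample in enumerate(samples):
        positions[sample.get("answer")].append(index)

    results: list[dict[str, Any]] = []
    for index, sample in enumerate(samples):
        # Indices of the other samples that share this sample's value.
        others = [i for i in positions[sample.get("answer")] if i != index]
        results.append(
            {
                "scores": {"uniqueness": 0.0 if others else 1.0},
                "metadata": {"duplicates": others},
            }
        )
    return results
```

In this sketch, samples that lack the assumed field all map to None and are therefore treated as duplicates of each other; adjust that if missing values should be handled differently.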

Function Signature

Both def and async def are supported. The function signature is:

def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]: ...

where samples is the full list of samples in dataset order. The returned list must contain one entry (of scores and optional metadata) per sample, in the same order.
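The contract above can be sketched with an async def variant that performs an aggregate comparison: each sample is scored against the batch average. This is only an illustration; the "output" field and the "longer_than_average" score key are assumptions chosen for the example.

```python
from __future__ import annotations

import asyncio
from statistics import mean
from typing import Any


async def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # "output" is an assumed field name; a missing value counts as an empty string.
    lengths = [len(str(sample.get("output", ""))) for sample in samples]
    if not lengths:
        return []

    # The batch average is only available because all samples arrive together.
    average = mean(lengths)

    # One entry per sample, in the same order as the input list.
    return [
        {"longer_than_average": 1.0 if length > average else 0.0}
        for length in lengths
    ]


if __name__ == "__main__":
    # Local dry run outside the scoring platform.
    demo = [{"output": "a"}, {"output": "abcd"}, {"output": "ab"}]
    print(asyncio.run(compute_scores(demo)))
```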

Examples

Example: Output Uniqueness. Receives all outputs at once and assigns lower scores to responses that are duplicates of others.

...
definition:
  ...
  scorers:
    - type: "python_all_samples"
      compute_scores_snippet: !include "uniqueness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_unique"
          name: "Uniqueness Rate"

from __future__ import annotations

from collections import Counter
from typing import Any


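def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Template placeholder, resolved from config.field (e.g. "answer").
    field_name = "<< config.field >>"
    # Samples that lack the field yield None.
    values = [sample.get(field_name) for sample in samples]
    # Count how often each value occurs across the whole batch.
    counter = Counter(values)

    # A value is unique if it occurs exactly once; samples without the
    # field are counted as unique.
    return [
        {"is_unique": value is None or counter[value] == 1}
        for value in values
    ]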

With config.field = "answer":

`sample["answer"]`	`is_unique`
`"Paris"`	`0.0`
`"London"`	`1.0`
`"Paris"`	`0.0`
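The table values and the resulting metric can be checked by hand. The sketch below recomputes the is_unique column for the three answers and averages it, assuming the "mean" metric is a plain arithmetic mean over all samples (the exact aggregation is not spelled out here).

```python
from collections import Counter
from statistics import mean

# The three sample["answer"] values from the table above.
answers = ["Paris", "London", "Paris"]
counts = Counter(answers)

# 1.0 when the answer occurs exactly once in the batch, otherwise 0.0.
is_unique = [1.0 if counts[answer] == 1 else 0.0 for answer in answers]

# Assumed behaviour of the "mean" metric: average of the per-sample values.
uniqueness_rate = mean(is_unique)
```

Under that assumption, one unique answer out of three gives a Uniqueness Rate of about 0.33.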

Configuration

Properties

type Literal "python_all_samples" required

The type of the scorer.

compute_scores_snippet string, TemplateValue required

The Python code snippet defining how to compute the scores. It must define a compute_scores function with the following API:

def compute_scores(samples: list[dict[str, Any]]) -> list[dict[str, Any]]: ...

where

samples is the list of samples (in the same order as the dataset)
the returned list must contain one entry per sample, in the same order

Both def and async def are supported.

Each entry can be in one of two formats:

Scores only - a flat dict of score key/value pairs:

[
    {"accuracy": 0.95, "is_correct": True},
    ...
]

Scores with metadata - a structured dict with "scores" and "metadata" keys. Metadata is stored alongside the scores but is not aggregated into metrics:

[
    {
        "scores": {"accuracy": 0.95, "is_correct": True},
        "metadata": {"model": "gpt-4", "tokens": 123},
    },
    ...
]
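Before wiring a snippet into a scorer, it can help to check its return value locally against the two formats above. The helpers below are hypothetical and not part of the scorer API; they assume that an entry containing a "scores" key is in the structured format and that any other dict is a flat scores dict. The scorer's actual detection logic may differ.

```python
from __future__ import annotations

from typing import Any


def split_entry(entry: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (scores, metadata) for an entry in either format (hypothetical helper)."""
    if "scores" in entry:
        # Structured format; metadata is optional and defaults to empty.
        return dict(entry["scores"]), dict(entry.get("metadata") or {})
    # Flat format: the whole dict is the scores.
    return dict(entry), {}


def check_result(samples: list[dict[str, Any]], result: list[dict[str, Any]]) -> None:
    """Raise if the result does not contain one dict entry per sample."""
    if len(result) != len(samples):
        raise ValueError(
            f"expected {len(samples)} entries, got {len(result)}"
        )
    for index, entry in enumerate(result):
        if not isinstance(entry, dict):
            raise TypeError(f"entry {index} is not a dict")
        split_entry(entry)  # Fails early if "scores" is not dict-like.
```

Calling check_result(samples, compute_scores(samples)) in a unit test catches length mismatches before the snippet is run as a scorer.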

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose ScorerPurpose

The purpose of this scorer.

score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics to compute from this scorer's scores. Each one produces a per-task metric.

Default: None