Python Scorer

Scores each sample or model output using a custom Python function, giving you full programmatic control over the scoring logic. Use this when built-in scorers don't cover your criterion - e.g. regex extraction, JSON schema validation, or numeric range checks. For scores that depend on the full set of samples, use Python Batch Scorer instead.

Output

The score keys and values are entirely defined by your compute_scores function. Two return formats are supported:

Scores only - {"accuracy": 0.95, "is_correct": True}
Scores with metadata - {"scores": {"accuracy": 0.95}, "metadata": {"tokens": 123}}

Metadata is stored alongside scores but not aggregated into metrics.
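
As an illustration of the second format, here is a minimal sketch of a numeric range check that returns a score together with metadata. The field name "prediction", the [0, 1] bounds, and the metadata keys are assumptions made for this example, not part of the scorer's API.

```python
from __future__ import annotations

from typing import Any


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # "prediction" is a hypothetical field name chosen for this sketch.
    raw = sample.get("prediction")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Missing or non-numeric values fail the range check.
        return {
            "scores": {"in_range": False},
            "metadata": {"raw_value": repr(raw), "parsed": False},
        }
    return {
        # The [0, 1] bounds are illustrative only.
        "scores": {"in_range": 0.0 <= value <= 1.0},
        "metadata": {"raw_value": repr(raw), "parsed": True},
    }
```

Here only in_range would be available to metrics; raw_value and parsed are kept for inspection.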

Function Signature

Both def and async def are supported. Two signatures are accepted:

# Model tasks (solver output available)
def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]: ...

# Dataset tasks (no solver output)
def compute_scores(sample: dict[str, Any]) -> dict[str, Any]: ...

where sample is the current dataset row and solver_output has two attributes: output (the last model output) and messages (the full interaction history).
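
A minimal sketch of the model-task signature, using regex extraction on the model output. It assumes solver_output.output can be treated as text and that the sample carries a hypothetical "expected" field. The SolverOutput dataclass below is only a local stand-in so the snippet can be run outside the platform; how the real class is made available to a snippet is not specified here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SolverOutput:
    """Local stand-in for testing only; the real class comes from the platform."""

    output: str
    messages: list[Any] = field(default_factory=list)


# Hypothetical answer format: "Answer: <number>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)")


def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]:
    # Assumption: the last model output is text; str() guards against other types.
    text = str(solver_output.output)
    match = ANSWER_RE.search(text)
    extracted = match.group(1) if match else None
    # "expected" is a hypothetical sample field holding the reference answer.
    expected = str(sample.get("expected"))
    return {
        "has_answer": match is not None,
        "is_correct": extracted is not None and extracted == expected,
    }
```

Returning has_answer separately from is_correct makes it possible to tell formatting failures apart from wrong answers.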

Examples

Example: Field Completeness. Checks whether a specific field is present and non-empty in the dataset sample.

...
definition:
  ...
  scorers:
    - type: "python"
      compute_scores_snippet: !include "completeness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_complete"
          name: "Field Completeness"

from __future__ import annotations

from typing import Any


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
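    # Template placeholder, resolved to the configured field name (see below).
    field_name = "<< config.field >>"

    field_value = sample.get(field_name, None)
    # Any float value is treated as missing (e.g. a NaN placeholder).
    has_field = field_name in sample and not isinstance(field_value, float)

    # A float value or an empty string counts as empty.
    empty_field = isinstance(field_value, float) or (
        isinstance(field_value, str) and len(field_value) == 0
    )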

    is_complete = has_field and not empty_field
    return {
        "is_complete": is_complete,
        "has_field": has_field,
        "empty_field": empty_field,
    }

With config.field = "answer":

`sample`	`is_complete`	`has_field`	`empty_field`
`{"answer": "Paris"}`	`1.0`	`1.0`	`0.0`
`{"answer": ""}`	`0.0`	`1.0`	`1.0`
`{}`	`0.0`	`0.0`	`0.0`
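
Example: JSON Validity (illustrative sketch). Checks whether a text field parses as JSON and contains a set of required keys, as a simple stand-in for schema validation. The field name "response" and the required keys are assumptions made for this example.

```python
from __future__ import annotations

import json
from typing import Any

# Hypothetical required keys; adjust to the structure you expect.
REQUIRED_KEYS = ("name", "age")


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # "response" is a hypothetical field holding the JSON text to validate.
    raw = sample.get("response", "")
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers non-string inputs.
        return {"is_valid_json": False, "has_required_keys": False}

    has_keys = isinstance(parsed, dict) and all(k in parsed for k in REQUIRED_KEYS)
    return {"is_valid_json": True, "has_required_keys": has_keys}
```

Full schema validation would typically need a dedicated library; this sketch only uses the standard library.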

Configuration

Properties

type Literal "python" required

The type of the scorer.

compute_scores_snippet string, TemplateValue required

The Python code snippet defining how to compute the scores. It must define a compute_scores function with one of the following signatures:

For model tasks:

def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]:

For dataset tasks:

def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:

where

sample is a dictionary representing the current sample
solver_output is an object with two attributes:
1. output: the last model output
2. messages: a list of input/output messages representing the interaction history

Both def and async def are supported.
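
Since async def is accepted, a scorer can await I/O-bound work. The sketch below is an assumption-laden illustration: lookup_reference is a hypothetical helper standing in for any awaitable call, and the "id" and "answer" fields are made up for the example.

```python
from __future__ import annotations

import asyncio
from typing import Any


async def lookup_reference(sample_id: str) -> str:
    """Hypothetical async helper standing in for an awaitable I/O call."""
    await asyncio.sleep(0)  # placeholder for real I/O
    return {"q1": "Paris"}.get(sample_id, "")


async def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # "id" and "answer" are hypothetical sample fields.
    reference = await lookup_reference(str(sample.get("id")))
    answer = str(sample.get("answer", "")).strip()
    # No reference found counts as a non-match.
    return {"exact_match": bool(reference) and answer == reference}
```

To try it locally, run the coroutine with asyncio.run(compute_scores({"id": "q1", "answer": "Paris"})).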

The function can return scores in one of two formats:

Scores only - return a flat dict of score key/value pairs:

return {"accuracy": 0.95, "is_correct": True}

Scores with metadata - return a structured dict with "scores" and "metadata" keys. Metadata is stored alongside the scores but is not aggregated into metrics:

return {
    "scores": {"accuracy": 0.95, "is_correct": True},
    "metadata": {"model": "gpt-4", "tokens": 123},
}

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose ScorerPurpose

The purpose of this scorer.

score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None