Python Scorer
Scores each sample or model output using a custom Python function, giving you full programmatic control over the scoring logic. Use this when built-in scorers don't cover your criterion - e.g. regex extraction, JSON schema validation, or numeric range checks. For scores that depend on the full set of samples, use Python Batch Scorer instead.
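As an illustration of the kind of criterion mentioned above, here is a minimal sketch of a numeric range check for a dataset task. The column names "response", "min", and "max" are hypothetical and not part of the scorer API; only the compute_scores name and its signature come from this page.

```python
from __future__ import annotations

import re
from typing import Any


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical columns: "response" holds free text, "min"/"max" hold bounds.
    text = str(sample.get("response", ""))

    # Extract the first integer or decimal number found in the text.
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return {"has_number": False, "in_range": False}

    value = float(match.group())
    low = float(sample.get("min", 0.0))
    high = float(sample.get("max", 1.0))
    return {"has_number": True, "in_range": low <= value <= high}
```

Each returned key ("has_number", "in_range") could then be referenced by a metric through its field name, as the Field Completeness example does with "is_complete".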
Output
The score keys and values are entirely defined by your compute_scores
function. Two return formats are supported:
- Scores only:
  {"accuracy": 0.95, "is_correct": True}
- Scores with metadata:
  {"scores": {"accuracy": 0.95}, "metadata": {"tokens": 123}}
Metadata is stored alongside scores but not aggregated into metrics.
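The second return format can be sketched as follows: a JSON validity check that keeps the score itself minimal and puts diagnostic details in metadata. The "response" column and the metadata key names are assumptions made for this example, not names defined by the scorer.

```python
from __future__ import annotations

import json
from typing import Any


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical column: "response" is expected to hold a JSON string.
    raw = sample.get("response", "")
    try:
        parsed = json.loads(raw)
        error = None
    except (TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError; TypeError covers non-string input.
        parsed = None
        error = str(exc)

    return {
        # Only the entries under "scores" are available to metrics.
        "scores": {"is_valid_json": error is None},
        # Metadata is stored with the scores but not aggregated.
        "metadata": {
            "parse_error": error,
            "top_level_type": type(parsed).__name__,
        },
    }
```

Keeping the error message in metadata leaves the score a simple boolean while still recording why a sample failed.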
Function Signature
Both def and async def are supported. Two signatures are accepted:
# Model tasks (solver output available)
def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]: ...
# Dataset tasks (no solver output)
def compute_scores(sample: dict[str, Any]) -> dict[str, Any]: ...
Here sample is the current dataset row, and solver_output has two
attributes: output (the last model output) and messages (the full
interaction history).
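A minimal sketch of the model-task signature, written as async def, might look like this. The SolverOutput dataclass below is a local stand-in for testing only (the platform supplies the real type), the "expected" column is hypothetical, and the sketch assumes output can be converted to text with str().

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SolverOutput:
    """Local stand-in for testing only; not the platform's real class."""

    output: Any
    messages: list[Any] = field(default_factory=list)


async def compute_scores(
    sample: dict[str, Any], solver_output: SolverOutput
) -> dict[str, Any]:
    # Hypothetical column: "expected" holds the reference answer.
    expected = str(sample.get("expected", "")).strip().lower()
    # Assumption: the last model output is text-like.
    actual = str(solver_output.output).strip().lower()
    return {
        "exact_match": actual == expected,
        "num_messages": len(solver_output.messages),
    }


if __name__ == "__main__":
    # Local smoke test; the scorer runtime would call compute_scores itself.
    demo = SolverOutput(output=" paris ", messages=["question", "answer"])
    print(asyncio.run(compute_scores({"expected": "Paris"}, demo)))
```

The stand-in class and the main guard exist only so the function can be exercised locally; a snippet supplied to the scorer would need only the compute_scores definition.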
Examples
Example: Field Completeness
Checks whether a specific field is present and non-empty in the sample.
...
definition:
  ...
  scorers:
    - type: "python"
      compute_scores_snippet: !include "completeness_scorer.py"
      metrics:
        - type: "mean"
          field: "is_complete"
          name: "Field Completeness"

from __future__ import annotations
from typing import Any


def compute_scores(sample: dict[str, Any]) -> dict[str, Any]:
    # Template placeholder, filled in from config.field.
    field_name = "<< config.field >>"
    field_value = sample.get(field_name, None)
    # A float value is treated as a missing field in this example.
    has_field = field_name in sample and not isinstance(field_value, float)
    empty_field = isinstance(field_value, float) or (
        isinstance(field_value, str) and len(field_value) == 0
    )
    is_complete = has_field and not empty_field
    return {
        "is_complete": is_complete,
        "has_field": has_field,
        "empty_field": empty_field,
    }

With config.field = "answer":
| sample | is_complete | has_field | empty_field |
|---|---|---|---|
| {"answer": "Paris"} | 1.0 | 1.0 | 0.0 |
| {"answer": ""} | 0.0 | 1.0 | 1.0 |
| {} | 0.0 | 0.0 | 0.0 |
Configuration
Properties
type Literal "python" required
The type of the scorer.
compute_scores_snippet string, TemplateValue required
The Python code snippet defining how to compute the scores. It must define
a compute_scores function with one of the following APIs:
For model tasks:
def compute_scores(sample: dict[str, Any], solver_output: SolverOutput) -> dict[str, Any]: ...
For dataset tasks:
def compute_scores(sample: dict[str, Any]) -> dict[str, Any]: ...
Both def and async def are supported. In these signatures:
- sample is a dictionary representing the current sample
- solver_output is an object with 2 attributes:
  - output: the last model output
  - messages: a list of input/output messages representing the interaction history
The function can return scores in one of two formats:
Scores only: return a flat dict of score key/value pairs:
return {"accuracy": 0.95, "is_correct": True}
Scores with metadata: return a structured dict with "scores" and "metadata" keys.
Metadata is stored alongside the scores but is not aggregated into metrics:
return {
    "scores": {"accuracy": 0.95, "is_correct": True},
    "metadata": {"model": "gpt-4", "tokens": 123},
}
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
purpose ScorerPurpose
The purpose of this scorer.
- score: The scorer is used to score the solver output or the dataset sample.
- qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
display_name string
The display name of the scorer.
Default: None
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
