Scorers

Scorers evaluate the output produced by a solver (or a dataset sample directly) and assign one or more numeric scores. Each scorer is defined inline in your task YAML file under the scorers key and produces score values that are aggregated into metrics.

# tasks/my_task.yaml
definition:
  scorers:
    - type: string_equals
      ground_truth: "{{ sample.expected_answer }}"

A task can define multiple scorers. Each scorer runs independently and contributes its own named scores.
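As a conceptual illustration of how independent scorers contribute named scores that are then aggregated into metrics, here is a minimal Python sketch. It is not the framework's implementation: the function names (exact_match, first_char_match, run_scorers), the score names, and the choice of the mean as the aggregate are all assumptions made for illustration. The two scorers loosely mirror the descriptions of string_equals and string_equals_mcqa.

```python
from statistics import mean

# Hypothetical scorers: each takes (output, ground_truth) and returns
# a dict of named scores. These are illustrative, not the framework's API.
def exact_match(output, ground_truth):
    return {"exact_match": 1.0 if output == ground_truth else 0.0}

def first_char_match(output, ground_truth):
    # Compare only the first character, as a multiple-choice check might.
    return {"first_char_match": 1.0 if output[:1] == ground_truth[:1] else 0.0}

def run_scorers(samples, scorers):
    """Run every scorer on every sample, then average each named score."""
    collected = {}
    for output, ground_truth in samples:
        for scorer in scorers:  # scorers do not see each other's results
            for name, value in scorer(output, ground_truth).items():
                collected.setdefault(name, []).append(value)
    # Assumption: the metric is the mean of the per-sample scores.
    return {name: mean(values) for name, values in collected.items()}

samples = [("B", "B"), ("B) Paris", "B"), ("C", "B")]
metrics = run_scorers(samples, [exact_match, first_char_match])
print(metrics)
```

Because each scorer writes under its own score name, adding or removing a scorer does not affect the metrics produced by the others.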

Available Scorers

| Scorer | Type value | Description |
| --- | --- | --- |
| String Equality | string_equals | Exact string match against a ground-truth answer. |
| String Equality (Multiple Choice) | string_equals_mcqa | Exact match on the first character of the output against a multiple-choice ground-truth. |
| Model As A Judge Classifier | model_as_a_judge_classifier | Uses an LLM to classify the output and checks it against a set of correct labels. |
| Labeler (via Model) | labeler_via_model | Uses an LLM to attach a free-form label to the sample or solver output. |
| Model Scorer | model_as_a_judge_scorer | Uses an LLM to produce a numeric score for the output. |
| Python Scorer | python | Runs a custom Python function to score each sample individually. |
| Python Batch Scorer | python_all_samples | Runs a custom Python function over all samples at once, enabling cross-sample metrics. |
| Text Similarity (BLEU Score) | bleu | Computes BLEU n-gram overlap between the output and a ground-truth string. |
| RAG Checker | rag_checker | Evaluates RAG outputs for faithfulness and relevance using the RAGChecker framework. |
| Function Call Coverage | function_call_coverage | Checks whether an agent made all required function calls in the correct order. |
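
To make the "all required function calls in the correct order" check concrete, the sketch below treats it as an ordered-subsequence test over function names. This is only one plausible reading: the helper name calls_cover_in_order is hypothetical, and the assumptions that extra calls between required ones are allowed and that only function names (not arguments) are compared are not confirmed by the description above. The actual function_call_coverage scorer may be stricter or may report partial coverage.

```python
def calls_cover_in_order(required, actual):
    """Return True if every name in `required` occurs in `actual`,
    in the same relative order (extra calls in between are allowed)."""
    remaining = iter(actual)
    # `name in remaining` advances the iterator until it finds `name`,
    # so each later required call must appear after the earlier one.
    return all(name in remaining for name in required)

# Hypothetical agent traces, identified by function name only.
print(calls_cover_in_order(["search", "book"], ["search", "lookup", "book"]))
print(calls_cover_in_order(["search", "book"], ["book", "search"]))
```

The first call passes because both required functions appear in order, while the second fails because the order is reversed.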