Scorers
Scorers evaluate the output produced by a solver (or a dataset sample directly) and assign one or more numeric scores. Each scorer is defined inline in your task YAML file under the scorers key and produces score values that are aggregated into metrics.
```yaml
# tasks/my_task.yaml
definition:
  scorers:
    - type: string_equals
      ground_truth: "{{ sample.expected_answer }}"
```

A task can define multiple scorers. Each scorer runs independently and contributes its own named scores.
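
As a rough sketch of what a multi-scorer task might look like, the fragment below lists two of the scorers from the table that follows under the same scorers key. It assumes that string_equals_mcqa accepts the same ground_truth field as string_equals, and sample.expected_letter is a hypothetical dataset field; consult each scorer's reference for its actual parameters.

```yaml
# tasks/my_task.yaml (illustrative sketch, not a verified configuration)
definition:
  scorers:
    # Exact match against the full expected answer.
    - type: string_equals
      ground_truth: "{{ sample.expected_answer }}"
    # Assumption: string_equals_mcqa takes the same ground_truth field.
    # sample.expected_letter is a hypothetical dataset field.
    - type: string_equals_mcqa
      ground_truth: "{{ sample.expected_letter }}"
```

Because each scorer runs independently, adding or removing one entry in the list should not change the scores reported by the others.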
Available Scorers
| Scorer | Type value | Description |
|---|---|---|
| String Equality | string_equals | Exact string match against a ground-truth answer. |
| String Equality (Multiple Choice) | string_equals_mcqa | Exact match on the first character of the output against a multiple-choice ground-truth. |
| Model As A Judge Classifier | model_as_a_judge_classifier | Uses an LLM to classify the output and checks it against a set of correct labels. |
| Labeler (via Model) | labeler_via_model | Uses an LLM to attach a free-form label to the sample or solver output. |
| Model Scorer | model_as_a_judge_scorer | Uses an LLM to produce a numeric score for the output. |
| Python Scorer | python | Runs a custom Python function to score each sample individually. |
| Python Batch Scorer | python_all_samples | Runs a custom Python function over all samples at once, enabling cross-sample metrics. |
| Text Similarity (BLEU Score) | bleu | Computes BLEU n-gram overlap between the output and a ground-truth string. |
| RAG Checker | rag_checker | Evaluates RAG outputs for faithfulness and relevance using the RAGChecker framework. |
| Function Call Coverage | function_call_coverage | Checks whether an agent made all required function calls in the correct order. |
