Metrics

Metrics aggregate the per-sample scores produced by a scorer into named summary values. Each metric is defined inline in your task YAML, under the `metrics` key of a scorer.

```yaml
# tasks/my_task.yaml
definition:
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.expected_answer }}"
      metrics:
        - type: "mean"
          field: "is_correct"
          name: "Accuracy"
```

A scorer can define multiple metrics. Each metric runs independently over the full set of sample scores produced by that scorer.
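
For instance, a single scorer might report both the average and the spread of one score field. The sketch below reuses the `string_equals` scorer from the first example and assumes that `std_dev` accepts the same `field` and `name` keys as `mean`; the metric names are illustrative only.

```yaml
# tasks/my_task.yaml
definition:
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.expected_answer }}"
      metrics:
        # Each entry is computed independently over the same sample scores.
        - type: "mean"
          field: "is_correct"
          name: "Accuracy"
        # Assumption: std_dev takes the same keys as mean.
        - type: "std_dev"
          field: "is_correct"
          name: "Accuracy Std Dev"
```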

Available Metrics

Metric	Type value	Description
Mean	`mean`	Arithmetic mean of a numeric score field across all samples.
Min	`min`	Minimum value of a numeric score field across all samples.
Max	`max`	Maximum value of a numeric score field across all samples.
Standard Deviation	`std_dev`	Standard deviation of a numeric score field across all samples.
Frequency	`frequency`	Relative frequency of each distinct value in a score field across all samples.
Precision	`precision`	Precision (TP / (TP + FP)) computed from per-sample true positive and false positive counts.
Recall	`recall`	Recall (TP / (TP + FN)) computed from per-sample true positive and false negative counts.
F1 Score	`f1_score`	F1 score (harmonic mean of precision and recall) from per-sample TP/FP/FN counts.
Binary Classification	`binary-classification`	Full binary classification report: accuracy, precision, recall, F1, and confusion matrix.
Multiclass Classification	`multiclass-classification`	Per-class and macro-averaged precision, recall, and F1, plus overall accuracy.
Python	`python`	Fully custom aggregation using a Python function.
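
As a worked illustration of the precision, recall, and F1 rows, suppose two samples carry the counts (TP, FP, FN) = (2, 1, 0) and (1, 0, 2). Assuming the counts are summed across samples before the ratios are taken (the table does not state whether this or per-sample averaging is used), the values would be:

```latex
\begin{align*}
\mathrm{TP} &= 2 + 1 = 3, \qquad \mathrm{FP} = 1 + 0 = 1, \qquad \mathrm{FN} = 0 + 2 = 2 \\
\text{Precision} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{3}{4} = 0.75 \\
\text{Recall} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{3}{5} = 0.6 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \cdot 0.75 \cdot 0.6}{0.75 + 0.6} = \frac{0.9}{1.35} \approx 0.667
\end{align*}
```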