Metrics

Metrics aggregate the per-sample scores produced by a scorer into named summary values. Each metric is defined inline in your task YAML, under the metrics key of the scorer whose scores it aggregates.

# tasks/my_task.yaml
definition:
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.expected_answer }}"
      metrics:
        - type: "mean"
          field: "is_correct"
          name: "Accuracy"

A scorer can define multiple metrics. Each metric runs independently over the full set of sample scores produced by that scorer.
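
To make the aggregation step concrete, the Python sketch below mimics what two metrics attached to one scorer would compute. It is an illustration under stated assumptions, not the framework's implementation: the sample_scores data, the aggregate helper, the AGGREGATORS mapping, and the use of population standard deviation for std_dev are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical per-sample scores, shaped as a string_equals scorer might emit them.
sample_scores = [
    {"is_correct": 1},
    {"is_correct": 0},
    {"is_correct": 1},
    {"is_correct": 1},
]

# Hypothetical metric definitions mirroring the YAML keys: type, field, name.
metric_defs = [
    {"type": "mean", "field": "is_correct", "name": "Accuracy"},
    {"type": "std_dev", "field": "is_correct", "name": "Accuracy Std Dev"},
]

# Assumption: std_dev is the population standard deviation; the real
# framework may use the sample standard deviation instead.
AGGREGATORS = {"mean": mean, "min": min, "max": max, "std_dev": pstdev}


def aggregate(scores, defs):
    """Run each metric independently over the full list of sample scores."""
    results = {}
    for d in defs:
        values = [s[d["field"]] for s in scores]  # same input for every metric
        results[d["name"]] = AGGREGATORS[d["type"]](values)
    return results


report = aggregate(sample_scores, metric_defs)
print(report)
```

Each metric reads the same list of per-sample scores and writes one named entry to the result, so adding or removing a metric does not affect the others.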

Available Metrics

| Metric | Type value | Description |
| --- | --- | --- |
| Mean | mean | Arithmetic mean of a numeric score field across all samples. |
| Min | min | Minimum value of a numeric score field across all samples. |
| Max | max | Maximum value of a numeric score field across all samples. |
| Standard Deviation | std_dev | Standard deviation of a numeric score field across all samples. |
| Frequency | frequency | Relative frequency of each distinct value in a score field across all samples. |
| Precision | precision | Precision (TP / (TP + FP)) computed from per-sample true positive and false positive counts. |
| Recall | recall | Recall (TP / (TP + FN)) computed from per-sample true positive and false negative counts. |
| F1 Score | f1_score | F1 score (harmonic mean of precision and recall) computed from per-sample TP/FP/FN counts. |
| Binary Classification | binary-classification | Full binary classification report: accuracy, precision, recall, F1, and confusion matrix. |
| Multiclass Classification | multiclass-classification | Per-class and macro-averaged precision, recall, and F1, plus overall accuracy. |
| Python | python | Fully custom aggregation using a Python function. |
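
The count-based and frequency metrics listed above can be sketched in Python as follows. This is a hedged illustration, not the framework's code: the field names tp, fp, fn, and label are hypothetical, the counts are assumed to be pooled across samples before the ratios are taken (micro-averaging), and a zero denominator is assumed to yield 0.0.

```python
from collections import Counter

# Hypothetical per-sample scores carrying TP/FP/FN counts and a categorical label.
sample_scores = [
    {"tp": 2, "fp": 0, "fn": 0, "label": "yes"},
    {"tp": 1, "fp": 0, "fn": 1, "label": "no"},
    {"tp": 1, "fp": 1, "fn": 2, "label": "yes"},
]


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def precision_recall_f1(scores):
    """Pool the per-sample counts, then compute precision, recall, and F1."""
    tp = sum(s["tp"] for s in scores)
    fp = sum(s["fp"] for s in scores)
    fn = sum(s["fn"] for s in scores)
    precision = safe_div(tp, tp + fp)   # TP / (TP + FP)
    recall = safe_div(tp, tp + fn)      # TP / (TP + FN)
    f1 = safe_div(2 * precision * recall, precision + recall)  # harmonic mean
    return precision, recall, f1


def frequency(scores, field):
    """Relative frequency of each distinct value of a score field."""
    counts = Counter(s[field] for s in scores)
    total = len(scores)
    return {value: n / total for value, n in counts.items()}


precision, recall, f1 = precision_recall_f1(sample_scores)
label_freq = frequency(sample_scores, "label")
print(precision, recall, f1, label_freq)
```

Pooling counts before dividing weights every positive equally; averaging per-sample ratios instead would weight every sample equally and can give different values on the same data.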