Metrics
Metrics aggregate the per-sample scores produced by a scorer into a set of named metric values.
Each metric is defined inline in your task YAML under the metrics key of a scorer.

```yaml
# tasks/my_task.yaml
definition:
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.expected_answer }}"
      metrics:
        - type: "mean"
          field: "is_correct"
          name: "Accuracy"
```

A scorer can define multiple metrics. Each metric runs independently over the full set of sample scores produced by that scorer.
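
To make the idea of independent aggregation concrete, here is a minimal Python sketch of several metrics running over the same sample scores. It is an illustration only, not the framework's implementation: the `aggregate` helper, the `metric_specs` list, and the metric names other than "Accuracy" are hypothetical, and the sample scores are made up.

```python
from statistics import mean

# Hypothetical per-sample scores, as a scorer might produce them.
sample_scores = [
    {"is_correct": 1.0},
    {"is_correct": 0.0},
    {"is_correct": 1.0},
    {"is_correct": 1.0},
]

# Each tuple mirrors one entry under `metrics`: (name, field, aggregation).
# The aggregation functions stand in for the metric types; this mapping
# is an assumption made for illustration.
metric_specs = [
    ("Accuracy", "is_correct", mean),
    ("Worst case", "is_correct", min),
    ("Best case", "is_correct", max),
]


def aggregate(scores, specs):
    """Run every metric over the full list of scores, independently."""
    results = {}
    for name, field, fn in specs:
        # Each metric reads its own field from all samples.
        values = [score[field] for score in scores]
        results[name] = fn(values)
    return results


metric_values = aggregate(sample_scores, metric_specs)
print(metric_values)
```

With these four samples, "Accuracy" comes out as 0.75, "Worst case" as 0.0, and "Best case" as 1.0. No metric depends on another's result, so adding or removing one leaves the others unchanged.
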
Available Metrics
| Metric | Type value | Description |
|---|---|---|
| Mean | mean | Arithmetic mean of a numeric score field across all samples. |
| Min | min | Minimum value of a numeric score field across all samples. |
| Max | max | Maximum value of a numeric score field across all samples. |
| Standard Deviation | std_dev | Standard deviation of a numeric score field across all samples. |
| Frequency | frequency | Relative frequency of each distinct value in a score field across all samples. |
| Precision | precision | Precision (TP / (TP + FP)) computed from per-sample true and false positive counts. |
| Recall | recall | Recall (TP / (TP + FN)) computed from per-sample true positive and false negative counts. |
| F1 Score | f1_score | F1 score (harmonic mean of precision and recall) from per-sample TP/FP/FN counts. |
| Binary Classification | binary-classification | Full binary classification report: accuracy, precision, recall, F1, and confusion matrix. |
| Multiclass Classification | multiclass-classification | Per-class and macro-averaged precision, recall, and F1, plus overall accuracy. |
| Python | python | Fully custom aggregation using a Python function. |
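
The formulas in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's code: the `tp`, `fp`, and `fn` key names are hypothetical, the counts are summed across samples before dividing (the framework may aggregate differently), and undefined ratios fall back to 0.0 by choice.

```python
from collections import Counter


def relative_frequency(values):
    """Share of samples taking each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def precision_recall_f1(samples):
    """Precision, recall, and F1 from per-sample TP/FP/FN counts.

    Assumption: counts are summed over all samples first, then the
    ratios are computed once on the totals.
    """
    tp = sum(s["tp"] for s in samples)
    fp = sum(s["fp"] for s in samples)
    fn = sum(s["fn"] for s in samples)

    # Fall back to 0.0 when a denominator is zero (an illustrative choice).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up per-sample counts with hypothetical key names.
samples = [
    {"tp": 2, "fp": 0, "fn": 1},
    {"tp": 1, "fp": 1, "fn": 0},
    {"tp": 0, "fp": 0, "fn": 2},
]

report = precision_recall_f1(samples)
frequencies = relative_frequency(["A", "B", "A", "A"])
print(report)
print(frequencies)
```

Here the totals are TP = 3, FP = 1, and FN = 3, giving a precision of 0.75, a recall of 0.5, and an F1 of about 0.6. The frequency example maps "A" to 0.75 and "B" to 0.25.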
