Evaluating Samples Across Multiple Trials

Trials let you run each sample multiple times before the final metrics are computed. This is useful in two cases:

1. Handling Unstable Scorers

Use trials when a scorer may not produce the same judgment every time.

For example, an LLM judge may sometimes score the same model output as a refusal and sometimes not. Instead of relying on one judgment, LatticeFlow can run the scorer multiple times and aggregate the trial scores into one sample-level score.

Example: if the refusal scores for one sample are true, true, false, true, true, then four of the five trials are true and the aggregated sample score is refusal.mean = 0.8.

2. Measuring Model Reliability

Use trials when you want to estimate how reliably the model succeeds on a sample. For each trial, LatticeFlow runs the solver on the same sample, scores the result, and then aggregates the trial scores into a reliability score.

For example, an agent may solve the same task in some trials but fail in others. LatticeFlow can estimate the probability that the model succeeds at least once across k independent attempts (pass@k), or the probability that it succeeds on all k independent attempts (pass^k).

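Example: if the is_correct scores for one sample are true, false, true, false, true, then the aggregated (reliability) scores are pass@5 = 0.98976 and pass^5 = 0.07776. These values correspond to an estimated per-attempt success rate of 3/5 = 0.6: pass@5 = 1 - 0.4^5 and pass^5 = 0.6^5.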
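The arithmetic behind this example can be reproduced with a short Python sketch. It assumes a plug-in estimate, where the per-attempt success rate is the fraction of successful trials. That assumption matches the numbers above, but it is not necessarily how LatticeFlow computes these scores internally. The function names pass_at_k and pass_hat_k are hypothetical.

```python
def pass_at_k(trial_scores: list[bool], k: int) -> float:
    """Plug-in estimate of P(at least one success in k independent attempts)."""
    p = sum(trial_scores) / len(trial_scores)  # observed per-attempt success rate
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(trial_scores: list[bool], k: int) -> float:
    """Plug-in estimate of P(all k independent attempts succeed), i.e. pass^k."""
    p = sum(trial_scores) / len(trial_scores)
    return p ** k


is_correct = [True, False, True, False, True]  # 3 successes in 5 trials
print(round(pass_at_k(is_correct, 5), 5))   # 0.98976
print(round(pass_hat_k(is_correct, 5), 5))  # 0.07776
```

The two scores answer different questions: pass@k rewards a model that succeeds occasionally, while pass^k drops quickly unless the model succeeds on almost every attempt.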

Understanding Trial Results

A task with trials runs each sample multiple times. Each trial produces its own solver output and scores. LatticeFlow then aggregates the trial scores into one score per sample, and metrics are computed from those aggregated sample scores.

Consider the following example, where each sample is run five times and receives a binary refusal score:

                           sample A   sample B   sample C
                           ────────   ────────   ────────
trial 1:  refusal score →      1          0          1
trial 2:  refusal score →      1          1          1
trial 3:  refusal score →      0          1          1
trial 4:  refusal score →      1          1          1
trial 5:  refusal score →      1          0          1
                               ↓          ↓          ↓
refusal.mean score:           0.80       0.60       1.00
                               └──────────┴──────────┘
                                          ↓
                            refusal_rate metric: 0.80

For each sample, LatticeFlow aggregates the scores from its trials into one sample-level score. In this example, sample A has trial scores 1, 1, 0, 1, 1, so its aggregated score is refusal.mean: 0.80. Sample B gets refusal.mean: 0.60, and sample C gets refusal.mean: 1.00.

The metric is then computed from these aggregated sample-level refusal.mean scores. Here, the final refusal_rate is 0.80, the average of 0.80, 0.60, and 1.00.
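The two aggregation steps can be sketched in Python using the trial scores from the table above. This is only an illustration of the data flow: the variable names are hypothetical, and it assumes that refusal_rate is the average of the sample-level scores, which is consistent with the example values.

```python
from statistics import mean

# Binary refusal scores per sample, one entry per trial (taken from the table).
trial_scores = {
    "sample A": [1, 1, 0, 1, 1],
    "sample B": [0, 1, 1, 1, 0],
    "sample C": [1, 1, 1, 1, 1],
}

# Step 1: aggregate each sample's trial scores into one sample-level score.
refusal_mean = {sample: mean(scores) for sample, scores in trial_scores.items()}

# Step 2: compute the metric from the aggregated sample-level scores
# (assumed here to be a plain average).
refusal_rate = mean(list(refusal_mean.values()))

for sample, score in refusal_mean.items():
    print(f"{sample}: refusal.mean = {score:.2f}")
print(f"refusal_rate = {refusal_rate:.2f}")
```

Because every sample has the same number of trials, averaging the sample-level means gives the same value as averaging all 15 trial scores directly. The sample-level scores still matter because they show which samples are unstable.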

Score Aggregation

For each sample, LatticeFlow aggregates numeric and boolean trial scores into sample-level scores. If no custom aggregation is configured for a score, LatticeFlow uses mean.

The following aggregators are supported:

Aggregator	Meaning
`mean`	Average score across trials.
`min`	Lowest score across trials.
`max`	Highest score across trials.
`pass@k`	Estimated probability of at least one success across `k` independent attempts.
`pass^k`	Estimated probability of success on all `k` independent attempts.

Note: The pass@k and pass^k aggregators accept only boolean scores.
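As a rough illustration of how mean, min, and max behave on one sample's trial scores, and of the boolean-only restriction, consider the following Python sketch. The helper names aggregate and require_boolean_scores are hypothetical, and treating booleans as 1.0 and 0.0 for min and max is an assumption that mirrors the mean examples above.

```python
def aggregate(trial_scores, function):
    """Collapse one sample's trial scores with mean, min, or max."""
    values = [float(s) for s in trial_scores]  # True/False become 1.0/0.0
    if function == "mean":
        return sum(values) / len(values)
    if function == "min":
        return min(values)
    if function == "max":
        return max(values)
    raise ValueError(f"unsupported aggregator: {function}")


def require_boolean_scores(trial_scores, function):
    """Guard for pass@k and pass^k, which accept boolean scores only."""
    if not all(isinstance(s, bool) for s in trial_scores):
        raise TypeError(f"{function} requires boolean trial scores")


scores = [0.2, 0.9, 0.5]
print(aggregate(scores, "min"), aggregate(scores, "max"))  # 0.2 0.9
require_boolean_scores([True, False, True], "pass@k")  # passes silently
```

With boolean scores, min answers "did every trial succeed?" and max answers "did any trial succeed?", which makes them strict, non-probabilistic counterparts of pass^k and pass@k.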

Using Trials

Adding Trials to a Task

Define trials in the task when repeated trials are part of what the task is meant to measure. For example, an agent task may always be evaluated with multiple trials because the relevant question is how reliably it solves the task.

definition:
  type: benchmark_task
  evaluated_entity_type: model
  dataset: ...
  solver: ...
  scorers: ...
  trials:
    num_trials: 5
    score_aggregators:
      - score_name: is_correct
        aggregator:
          function: pass@k
          k: 5
          score_name: pass@5

In this example, each sample is run five times. The is_correct trial scores are aggregated using pass@5. All other numeric and boolean scores use the default mean aggregation.

Overriding the Number of Trials

To override the number of trials configured by the task, set num_trials in the trials section of the task specification:

evaluation:
  key: customer_support_agent_eval
  display_name: Customer Support Agent Eval
  task_specifications:
    - task_key: support_agent_resolution
      model_key: my_customer_support_model
      trials:
        num_trials: 3

This runs each sample in the support_agent_resolution task specification three times.

Only the number of trials can be configured in the task specification. Score aggregation is defined by the task, because it controls the meaning of the resulting sample-level scores.
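The override rule can be pictured with a small Python sketch. It is a conceptual model only, not LatticeFlow code: the function effective_trials is hypothetical, and the dictionaries simply mirror the YAML trials sections shown in this document.

```python
from typing import Optional


def effective_trials(task_trials: dict, spec_trials: Optional[dict]) -> dict:
    """Combine a task's trials block with a task-specification override."""
    resolved = dict(task_trials)  # score aggregators always come from the task
    if spec_trials and "num_trials" in spec_trials:
        # Only the number of trials may be overridden.
        resolved["num_trials"] = spec_trials["num_trials"]
    return resolved


task_trials = {
    "num_trials": 5,
    "score_aggregators": [
        {"score_name": "is_correct", "aggregator": {"function": "pass@k", "k": 5}},
    ],
}

resolved = effective_trials(task_trials, {"num_trials": 3})
print(resolved["num_trials"])  # 3
```

In this model, the task keeps ownership of score_aggregators, so the meaning of each sample-level score stays the same no matter how many trials a particular evaluation runs.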