Testing Tasks
The lf test task command runs a task on one or more dataset samples without scheduling a full evaluation. It executes the complete pipeline — solver, scorers, and metrics — and prints the intermediate results at every stage, making it easy to verify that a task is correctly defined before spending time on a full evaluation run.
Run it immediately after defining a new task to confirm:
- The solver correctly calls the model and produces output for the dataset samples.
- Each scorer correctly processes the solver output and produces scores.
- Each metric correctly aggregates the per-sample scores.
A task can have two general types of issues: those that prevent the task from executing and those that affect the correctness of its results. The task testing tool helps detect and address both.
No evidence is persisted — the results are printed to stdout and discarded.
Usage
lf test task -f run.yaml --spec-key <spec-key>
lf test task -f run.yaml --spec-key <spec-key> --num-samples 3
lf test task -f run.yaml --spec-key <spec-key> --sample-id 0 --sample-id 4
The command reads the task specification identified by --spec-key from a run config file, resolves the model, dataset, and task config, and then runs the pipeline on the selected samples. The default is one sample (the first in the dataset).
--num-samples and --sample-id are mutually exclusive. Use --sample-id (repeatable) to target specific rows by their sample_id value when you need to reproduce a particular failure or test particularly interesting samples.
The command requires the model referenced by the task specification to be registered and reachable. Run lf test model first if you have not verified the model yet.
Successful Case
The following example uses the task-model-as-a-judge guide, which defines a judge-qa-basic task that runs a single-turn QA solver and then scores each answer with two judge scorers: a classifier for correctness and a numeric scorer for answer quality.
lf test task -f run.yaml --spec-key judge-qa-basic-gpt-4-1-nano
Sample
Testing task 'judge-qa-basic' with model 'openai$gpt-4-1-nano' on 1 sample(s) ...
Checking configuration
Configuration is valid.
================================================================================
Processing sample 1/1
--------------------------------------------------------------------------------
Dataset sample
question: What is the capital of Australia?
gt_answer: Canberra
sample_id: 0
The raw dataset sample is printed first. Check that sample_id matches what you expect and that all fields your solver or scorer templates reference are present.
Solver Output
--------------------------------------------------------------------------------
Solver
output:
trace:
FORMAT: open_responses
items:
- role: system
content: You are a helpful assistant. Answer concisely.
- role: user
content: What is the capital of Australia?
- role: assistant
content: The capital of Australia is Canberra.
raw_outputs:
- choices:
- message:
role: assistant
content: The capital of Australia is Canberra.
usage:
num_completion_tokens: 7
num_prompt_tokens: 29
The solver output includes the full conversation trace and the raw model output. Use this to verify that the solver's input_builder is assembling the right messages and that the model is producing a sensible response.
Scores
--------------------------------------------------------------------------------
Scores
- scorer_key: correctness
scorer_purpose: score
scorer_name: Answer Correctness
values:
is_correct: true
metadata:
judge_output:
prediction: correct
confidence: 1.0
reasoning: The candidate answer correctly states the capital of Australia as
Canberra, matching the ground truth.
- scorer_key: answer_quality
scorer_purpose: score
scorer_name: Answer Quality
values:
score: 1.0
metadata:
judge_output:
score: 100.0
confidence: 1.0
reasoning: The answer correctly states the capital of Australia as Canberra,
matching the ground truth and providing a clear, complete response.
Each scorer block shows the values that feed into the metrics (e.g. is_correct, score) and metadata, which contains additional information, such as the judge model's reasoning.
Metrics
--------------------------------------------------------------------------------
Processed all samples.
================================================================================
Metrics
Accuracy: 1.0
Mean Quality: 1.0
Min Quality: 1.0
Max Quality: 1.0
--------------------------------------------------------------------------------
Successfully tested configuration of task with key 'judge-qa-basic'.
The metric values are printed after all samples are processed.
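As a rough illustration of how per-sample scores could turn into the four metric values above, here is a minimal sketch. It is not the tool's implementation: the `aggregate` function and the merged per-sample dict layout are assumptions made for this example, and only the metric names and the error text are taken from the output shown on this page.

```python
from statistics import mean


def aggregate(per_sample_scores):
    """Aggregate per-sample score values into metric values.

    Hypothetical helper: each entry is assumed to hold the 'is_correct'
    value from the correctness scorer and the 'score' value from the
    answer quality scorer for one sample.
    """
    if not per_sample_scores:
        # Aggregation needs at least one successfully scored sample.
        raise ValueError(
            "Cannot compute mean: received 0 sample scores, "
            "but at least 1 are required."
        )
    correct = [1.0 if s["is_correct"] else 0.0 for s in per_sample_scores]
    quality = [s["score"] for s in per_sample_scores]
    return {
        "Accuracy": mean(correct),
        "Mean Quality": mean(quality),
        "Min Quality": min(quality),
        "Max Quality": max(quality),
    }


# One sample, scored as in the example output above.
print(aggregate([{"is_correct": True, "score": 1.0}]))
```

With a single sample, every metric equals that sample's value, so testing with --num-samples greater than one gives a better check of the aggregation.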
Diagnosing Failures
The command stops at the first failing stage within a sample and prints a detailed error message. Three common failure patterns are shown below.
Judge Labels Do Not Match
If the judge system prompt instructs the model to respond with different labels than those listed in correct_labels/incorrect_labels, every sample will fail at the score stage.
In the example below, the prompt says "respond with 'yes' or 'no'" but the task configuration expects correct_labels: ["correct"] and incorrect_labels: ["incorrect"]:
Processing sample 1/1
...
--------------------------------------------------------------------------------
Encountered error during score stage:
ValueError
Invalid ModelAsAJudgeClassifier prediction: 'yes'. Valid labels are: ['correct', 'incorrect'].
--------------------------------------------------------------------------------
Processed all samples (1/1 failed).
================================================================================
Encountered error during metric stage:
ValueError
Cannot compute mean: received 0 sample scores, but at least 1 are required.
The prediction value ('yes') and the list of valid labels make the mismatch immediately clear. Fix either the prompt (change it to output 'correct'/'incorrect') or the task config (update correct_labels/incorrect_labels to ["yes"]/["no"]).
In this case, the metric aggregation also fails, as there are no values to aggregate. This will be resolved once the score computation succeeds.
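The label check behind this error can be pictured as a simple membership test. The sketch below is not the tool's code: `validate_prediction` and its parameters are hypothetical names chosen for illustration, and it only mirrors the behavior visible in the error message above.

```python
def validate_prediction(prediction, correct_labels, incorrect_labels):
    """Map a judge label to a boolean, rejecting labels the config does not list.

    Hypothetical helper that mirrors the error shown above.
    """
    valid_labels = list(correct_labels) + list(incorrect_labels)
    if prediction not in valid_labels:
        raise ValueError(
            f"Invalid prediction: {prediction!r}. "
            f"Valid labels are: {valid_labels}."
        )
    return prediction in correct_labels


# Labels as configured in the example task.
print(validate_prediction("correct", ["correct"], ["incorrect"]))  # True

# A prompt that asks for 'yes'/'no' yields a label the config does not know.
try:
    validate_prediction("yes", ["correct"], ["incorrect"])
except ValueError as exc:
    print(exc)
```

Running a check like this against a few judge outputs is a quick way to confirm that the prompt and the label lists agree before retesting the task.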
Python Scorer Crash
If a Python scorer raises an exception, the traceback is shown inline alongside the full inputs that were passed to the function — so you can reproduce the failure locally without running the entire task.
In the example below, the scorer accesses sample["expected_answer"] but the dataset field is named "capital":
Encountered error during score stage:
KeyError
Code execution failed.
Input:
{'sample': {'question': 'What is the capital of Japan?', 'country': 'Japan', 'capital': 'Tokyo', ...},
'solver_output': SolverTrace(...)}
Traceback:
line 11 (in function 'compute_scores')
return {
"is_correct": model_completion.lower().strip() == sample["expected_answer"].lower(),
}
KeyError: 'expected_answer'
--------------------------------------------------------------------------------
Processed all samples (1/1 failed).
================================================================================
Encountered error during metric stage:
ValueError
Cannot compute mean: received 0 sample scores, but at least 1 are required.
The Input block shows the exact sample dict that was passed in, so you can see every available key.
In this case, the metric aggregation also fails, as there are no values to aggregate. This will be resolved once the score computation succeeds.
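To reproduce and fix this kind of failure locally, the sample dict from the Input block can be pasted into a small script. The sketch below makes one simplifying assumption: the real scorer receives `sample` and `solver_output`, and how `model_completion` is extracted from the solver output is not shown here, so this version takes the completion as a plain string.

```python
def compute_scores(sample, model_completion):
    """Compare the model's answer with the ground truth.

    Simplified signature (assumption): the completion is passed in directly
    instead of being extracted from the solver output.
    """
    return {
        # Fixed: the dataset stores the expected answer under 'capital'.
        "is_correct": model_completion.lower().strip() == sample["capital"].lower(),
    }


# Sample dict copied from the Input block (elided fields left out).
sample = {
    "question": "What is the capital of Japan?",
    "country": "Japan",
    "capital": "Tokyo",
}

# List the available keys to spot the field-name mismatch.
print(sorted(sample))  # ['capital', 'country', 'question']

print(compute_scores(sample, " Tokyo "))  # {'is_correct': True}
```

Note that this comparison is an exact match after lowercasing and stripping whitespace, so a longer completion such as a full sentence would score as incorrect.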
Model Inference Failure
If the model returns an error during the solver stage — for example, because the API key is invalid or the endpoint is unreachable — the failure appears at the solver stage rather than the score stage:
Processing sample 1/1
--------------------------------------------------------------------------------
Dataset sample
question: What is the capital of Australia?
gt_answer: Canberra
sample_id: 0
--------------------------------------------------------------------------------
Encountered error during solver stage:
ValueError
Model adapter 'OpenAI Chat Completion' could not convert output. Received:
'{"error": {"message": "Incorrect API key provided: sk-inval**************ting.", ...}}',
status code: 401
Error: A non-200 status code (401) was returned by the model.
--------------------------------------------------------------------------------
Processed all samples (1/1 failed).
The raw error body and the HTTP status code are included. This indicates a model configuration problem rather than a task definition problem — run lf test model to diagnose and fix the model before retesting the task.
