A Task Result Log captures the full output of a task execution, including metrics, per-sample evidence, solver traces, and any errors that occurred during the run.

To download a task result log using the CLI, use the lf export eval command.

Task Result Logs Overview

Properties

format_version Literal "v1" required

Format Version

app_version string required

The version of AI GO that computed this task result log.

status string required

Status

evidence TaskResultEvidence required

specification TaskResultSpecification required

execution TaskExecution required

errors array[TaskResultError] required

Errors

Definitions

`TaskResultEvidence`

Properties

metrics array[MetricData] required

The metrics. Should only be None if an error occurred.

samples array[SampleEvidence]

The sample evidence (as produced by tasks).

Default: None

tabular_evidence object

A mapping between keys and Dataframes. During validation, non-DataFrame data is accepted as long as it can be validated into a pandas DataFrame.

Default: None

tabular_evidence_metadata object

A mapping betweens keys and metadata for a Dataframe.

Default: None

free_form_evidence object

Non-table-structured evidence can be recorded in this field

Default: None

errors array[TaskResultError]

A list of task-level errors.

Default: []

failures TaskResultFailures

Default: None

`SampleEvidence`

Properties

sample_id string, integer required

Sample Id

sample SampleData required

solver SolverData required

scores array[ScoresData] required

Scores

action_records null

Action Records

Default: None

errors array[TaskResultError] required

Errors

`SolverData`

Properties

output null required

Output

`SampleData`

The raw sample data for a single evaluated sample.

Properties

data object required

The sample's field values.

`ScoresData`

The score values produced by a single scorer for a single sample.

Properties

scorer_key string required

The key identifying the scorer.

scorer_purpose enum ScorerPurpose

Default: score

Possible ScorerPurpose values

Allowed Values:

score
qa

scorer_name string

Optional display name of the scorer.

Default: None

values object required

The score values produced by the scorer.

metadata object

Optional metadata associated with the scorer output.

Default: None

direct_ios array[null]

Raw model endpoint I/O for each prediction call (if scorer uses a model).

Default: []

`MetricData`

An object that contains the metric scores.

Properties

values object required

Values

metric_key string required

The key of the metric.

metric_type string

The type of the metric. Present for newly computed results and may be missing for legacy results.

Default: None

scorer_name string

The display name of the scorer to which the metric belongs. Present only for benchmark tasks.

Default: None

scorer_key string

The key of the scorer to which the metric belongs. Present only for benchmark tasks.

Default: None

scorer_purpose enum ScorerPurpose

The purpose of the scorer. Present only for benchmark tasks.

Default: score

Possible ScorerPurpose values

Allowed Values:

score
qa

reason string

A freeform explanation for the metric value. Present only for system tasks.

Default: None

`TaskResultError`

Properties

error_type string required

The type of the error.

message string required

The specific error message that occurred during evaluation.

hint string

The suggestion to try out to fix the issue.

Default: None

stage enum TaskResultErrorStage

Default: None

Possible TaskResultErrorStage values

Allowed Values:

configuration
dataset
solver
score
metric
action

`TaskResultFailures`

Properties

num_errors integer required

Num Errors

num_total integer required

Num Total

`TaskResultSpecification`

The task specification stored inside a TaskResultLog, capturing what was evaluated and how.

Properties

display_name string required

The display name of the evaluation.

task StoredTask required

config object required

Task configuration used for this evaluation.

evaluated_entity StoredDataset, StoredModel

The dataset or model that was evaluated. Present only for benchmark tasks.

Default: None

run_config null required

`TaskExecution`

Timing and resource usage information for a task execution.

Properties

runtime number required

The runtime of the task in seconds.

started_at integer required

A Unix timestamp in seconds.

ended_at integer required

A Unix timestamp in seconds.

model_usage null

Default: None

`Trace`

Represents a conversation trace between a user and an agent.

A trace stores a sequence of items in the Open Responses format, including user messages, assistant messages, function calls, and function call outputs. It provides helper methods to extract individual turns, find function calls, and inspect the conversation.

The preamble property exposes everything before the first user message (system messages, assistant greetings, initial function calls, etc.). Everything from the first user message onward is accessible via turns.

For multi-agent traces, an optional events field provides a richer execution record with span markers encoding agent hierarchy. Use Trace.from_events() to construct event-based traces; items is derived automatically.

Properties

FORMAT string

Format

Default: open_responses

items array[null] required

Items

metadata TraceMetadata

Default: None

events null

Events

Default: None

span_id string

Span Id

Default: None

span_name string

Span Name

Default: None

span_type string

Span Type

Default: None

`TraceMetadata`

Trace-level metadata capturing identity, provenance, and summary information.

All fields are optional. Only the trace data itself (items/events) is required. Metadata enriches the trace for filtering, grouping, and analysis.

Properties

trace_id string

Trace Id

Default: None

source_type string

Source Type

Default: None

source_uri string

Source Uri

Default: None

agent string

Agent

Default: None

model string

Model

Default: None

tags array[string]

`TableMetadata`

Properties

columns_metadata object

Mapping column names to their corresponding metadata.