Task repeatability, Trials, Solver postprocessor, Risk Dashboard.

What's new

Task repeatability helps you check whether an evaluation produces consistent metrics and scores across repeated runs.
Trials let you run each sample multiple times before the final metrics are computed.
Solver postprocessor: apply a Python postprocessing step to solver outputs before scoring.
"Used for metrics" indicator in evidence stats, so you can see at a glance which evidence contributes to metrics.
CLI: filter local datasets directly from the command line.
CLI: new lf integration list command to list your configured integrations.
CLI: load additional environment variables from extra env files.
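
To illustrate how trials, the solver postprocessor, and task repeatability relate to each other, here is a minimal sketch in plain Python. It does not use the product's API: every name in it (postprocess, run_with_trials, repeatability, the sample fields) is hypothetical, and averaging is only one possible way to aggregate trial scores.

```python
import random
import statistics

def postprocess(output: str) -> str:
    """Hypothetical postprocessing step: normalize raw solver output before scoring."""
    return output.strip().lower()

def run_with_trials(samples, solver, scorer, trials=3):
    """Score each sample `trials` times, average per sample, then average across samples."""
    per_sample = []
    for sample in samples:
        scores = [scorer(postprocess(solver(sample)), sample) for _ in range(trials)]
        per_sample.append(statistics.mean(scores))
    return statistics.mean(per_sample)

def repeatability(run_once, repeats=5):
    """Run the whole evaluation several times; return the mean and spread of the metric."""
    metrics = [run_once() for _ in range(repeats)]
    return statistics.mean(metrics), statistics.pstdev(metrics)

# Illustrative data and a deliberately flaky solver (assumptions, not real product behavior).
rng = random.Random(0)
samples = [
    {"input": "Capital of France?", "target": "paris"},
    {"input": "2 + 2 = ?", "target": "4"},
]

def flaky_solver(sample):
    # Answers correctly most of the time, otherwise returns a wrong answer.
    return sample["target"] if rng.random() < 0.8 else "unknown"

def exact_match(output, sample):
    return 1.0 if output == sample["target"] else 0.0

mean_metric, spread = repeatability(
    lambda: run_with_trials(samples, flaky_solver, exact_match, trials=3),
    repeats=5,
)
print(f"mean metric: {mean_metric:.3f}, spread across runs: {spread:.3f}")
```

In this sketch, a small spread across repeated runs suggests the evaluation is repeatable, and raising the number of trials per sample is one way to reduce that spread.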

Improvements

Empty risk policies page now links directly to the documentation to help you get started.
Faster model adapters thanks to bulk conversion, eliminating an N+1 query bottleneck.
Redesigned and rebranded AI platform login page, including refreshed labels, spacing, and styling.
Polished model configuration form in the UI.
Added a cancel button to editing forms so you can discard changes more easily.
Added a Beta tag to the risk overview.
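
For readers unfamiliar with the N+1 pattern mentioned above, the following sketch contrasts per-item lookups with a single bulk lookup. It is an illustration under stated assumptions, not the actual adapter code: FakeStore, convert_one_by_one, convert_bulk, and the field names are all hypothetical.

```python
class FakeStore:
    """Stand-in for a database that counts how many queries are issued."""

    def __init__(self, providers):
        self._providers = providers
        self.query_count = 0

    def get_provider(self, provider_id):
        # One query per call.
        self.query_count += 1
        return self._providers[provider_id]

    def get_providers(self, provider_ids):
        # One query for the whole batch.
        self.query_count += 1
        return {pid: self._providers[pid] for pid in provider_ids}

def convert_one_by_one(models, store):
    """Per-item conversion: issues one lookup for every model."""
    return [
        {"name": m["name"], "provider": store.get_provider(m["provider_id"])}
        for m in models
    ]

def convert_bulk(models, store):
    """Bulk conversion: fetches all needed providers once, then converts in memory."""
    providers = store.get_providers({m["provider_id"] for m in models})
    return [
        {"name": m["name"], "provider": providers[m["provider_id"]]}
        for m in models
    ]

providers = {1: "provider-a", 2: "provider-b"}
models = [
    {"name": "model-x", "provider_id": 1},
    {"name": "model-y", "provider_id": 2},
    {"name": "model-z", "provider_id": 1},
]

slow_store = FakeStore(providers)
fast_store = FakeStore(providers)
slow_result = convert_one_by_one(models, slow_store)
fast_result = convert_bulk(models, fast_store)
print(slow_store.query_count, fast_store.query_count)
```

Both functions return the same converted records; the difference is that the number of lookups grows with the number of models in the first and stays constant in the second.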

Updated the Fireworks Kimi model from 2.5 to 2.7 following the retirement of 2.5.
Removed an outdated Anthropic model and added missing model entries.
Fixed an incorrect color for the progress value in the risk policy status table's score column.
Fixed the task progress tracker stopping early during repeatability runs.
Fixed an invalid log status check.
Fixed model serialization that was dropping type information.