v3.9.1

Task repeatability, Trials, Solver postprocessor, Risk Dashboard.

What's new

  • Task repeatability helps you check whether an evaluation produces consistent metrics and scores across repeated runs.

  • Trials let you run each sample multiple times before the final metrics are computed.

  • Solver postprocessor: apply a Python postprocessing step to solver outputs before scoring.

  • "Used for metrics" indicator in evidence stats, so you can see at a glance which evidence contributes to metrics.

  • CLI: filter local datasets directly from the command line.

  • CLI: new lf integration list command to list your configured integrations.

  • CLI: load additional environment variables from extra env files.
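
To make the interplay of trials, the solver postprocessor, and repeatability concrete, here is a minimal sketch in plain Python. It does not use the product's actual API: the names postprocess, score, run_trials, and the fake solver are all hypothetical, and exact-match scoring is an assumption chosen for brevity. It only illustrates the idea of cleaning each solver output before scoring, repeating a sample several times, and looking at the spread of the resulting scores.

```python
import itertools
import re
import statistics
from typing import Callable, List


def postprocess(raw_output: str) -> str:
    """Hypothetical postprocessing step: keep the text after the first
    'Answer:' marker (case-insensitive), then trim and lowercase it."""
    match = re.search(r"answer:\s*(.+)", raw_output, flags=re.IGNORECASE | re.DOTALL)
    return (match.group(1) if match else raw_output).strip().lower()


def score(output: str, expected: str) -> float:
    """Assumed exact-match scorer: 1.0 on a match, otherwise 0.0."""
    return 1.0 if output == expected.strip().lower() else 0.0


def run_trials(solver: Callable[[str], str], prompt: str, expected: str, trials: int) -> List[float]:
    """Run one sample `trials` times, postprocessing each output before scoring."""
    return [score(postprocess(solver(prompt)), expected) for _ in range(trials)]


# A fake solver that cycles through canned outputs, standing in for a real model.
_canned = itertools.cycle(["Reasoning... Answer: Paris", "answer:  paris ", "I think Lyon"])


def fake_solver(prompt: str) -> str:
    return next(_canned)


scores = run_trials(fake_solver, "Capital of France?", "Paris", trials=3)
mean_score = statistics.mean(scores)
# A population standard deviation of 0 would mean every trial scored the same.
spread = statistics.pstdev(scores)
print(scores, mean_score, spread)
```

A spread near zero suggests the sample scores consistently across trials; a large spread suggests the final metric depends heavily on which run you happened to observe.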

Improvements

  • Empty risk policies page now links directly to the documentation to help you get started.
  • Faster model adapters thanks to bulk conversion, eliminating an N+1 query bottleneck.
  • Redesigned and rebranded AI platform login page, including refreshed labels, spacing, and styling.
  • Polished model configuration form in the UI.
  • Added a cancel button to editing forms so you can discard changes more easily.
  • Added a Beta tag to the risk overview.
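
For readers unfamiliar with the N+1 pattern mentioned in the model adapter improvement, the sketch below contrasts per-record lookups with a single batched lookup. It is an illustration only, not the actual adapter code: FakeStore, ModelRecord, and both conversion functions are hypothetical, and the "queries" are simulated with an in-memory dictionary and a counter.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ModelRecord:
    id: int
    provider_id: int
    name: str


class FakeStore:
    """Hypothetical stand-in for a database that counts how many lookups it serves."""

    def __init__(self, providers: Dict[int, str]) -> None:
        self._providers = providers
        self.queries = 0

    def get_provider(self, provider_id: int) -> str:
        self.queries += 1  # one simulated query per call
        return self._providers[provider_id]

    def get_providers(self, provider_ids: Iterable[int]) -> Dict[int, str]:
        self.queries += 1  # one simulated query for the whole batch
        return {pid: self._providers[pid] for pid in set(provider_ids)}


def convert_one_by_one(store: FakeStore, records: List[ModelRecord]) -> List[str]:
    """N+1 style: every record triggers its own provider lookup."""
    return [f"{store.get_provider(r.provider_id)}/{r.name}" for r in records]


def convert_bulk(store: FakeStore, records: List[ModelRecord]) -> List[str]:
    """Bulk style: fetch all needed providers once, then convert in memory."""
    providers = store.get_providers(r.provider_id for r in records)
    return [f"{providers[r.provider_id]}/{r.name}" for r in records]


records = [
    ModelRecord(1, 10, "model-a"),
    ModelRecord(2, 10, "model-b"),
    ModelRecord(3, 20, "model-c"),
]
providers = {10: "provider-x", 20: "provider-y"}

slow_store = FakeStore(providers)
slow_result = convert_one_by_one(slow_store, records)

fast_store = FakeStore(providers)
fast_result = convert_bulk(fast_store, records)

print(slow_store.queries, fast_store.queries)
```

Both functions return the same converted values; the difference is that the lookup count grows with the number of records in the first and stays constant in the second.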

Bug fixes

  • Updated the Fireworks Kimi model from 2.5 to 2.7 following the retirement of 2.5.
  • Removed an outdated Anthropic model and added missing model entries.
  • Fixed an incorrect color for the progress value in the risk policy status table's score column.
  • Fixed early stoppage of the task progress tracker during repeatability runs.
  • Fixed an invalid log status check.
  • Fixed model serialization that was dropping type information.