Text Similarity (BLEU Score)

Scores each sample by computing the n-gram overlap between the value to check and the ground truth, producing a bleu_score in the [0, 1] range. Use this scorer for translation or text generation tasks where the expected output is well-defined. Avoid it for single-word comparisons, since BLEU relies on multi-word n-grams and is unreliable on very short texts. For semantic similarity or paraphrased outputs, use the Model Scorer instead.

Output

  • bleu_score: n-gram overlap with the ground truth, in the [0, 1] range. 0.0 means no overlap; 1.0 means a perfect match.
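
To make the n-gram overlap concrete, here is a minimal sketch of an unsmoothed, single-reference BLEU in Python. It is an illustration only: the function names, whitespace tokenization, and the 4-gram limit are assumptions, and the scorer's actual implementation (tokenizer, smoothing, n-gram order) is not documented here and may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(value, ground_truth, max_n=4):
    """Unsmoothed single-reference BLEU (illustrative, not the product's code)."""
    cand = value.split()        # assumption: whitespace tokenization
    ref = ground_truth.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped overlap: an n-gram is credited at most as often as it
        # appears in the ground truth.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the ground truth.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


perfect = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
single_word = simple_bleu("YES", "YES")
```

In this sketch an identical six-word sentence scores 1.0, while an identical single word scores 0.0 because it contains no 2-grams. A real implementation may smooth this case, but it shows why BLEU is a poor fit for single-word comparisons.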

Configuration

Properties


type Literal "bleu" required

The type of the scorer.


ground_truth string, TemplateValue required

The ground truth against which the value is compared.

The ground truth can be:

  1. A hard-coded string (e.g. "YES")
  2. A reference to the sample data (e.g. "{{ sample.country }}")
  3. A mix of (1) and (2) (e.g. "The country is {{ sample.country }}")

sample represents the current row of the dataset (with a field for every dataset column).
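
As a rough illustration of how a template such as "The country is {{ sample.country }}" could resolve against a dataset row, the sketch below substitutes placeholders with a regular expression. The render function and the regex are hypothetical; the template engine the product actually uses is not specified here and likely supports more syntax than this.

```python
import re

# Hypothetical pattern: matches "{{ sample.<column> }}" with optional spaces.
_PLACEHOLDER = re.compile(r"\{\{\s*sample\.(\w+)\s*\}\}")


def render(template, sample):
    """Replace each {{ sample.<column> }} with that column's value from the row."""
    return _PLACEHOLDER.sub(lambda m: str(sample[m.group(1)]), template)


row = {"country": "France"}  # one dataset row, one field per column
hard_coded = render("YES", row)
reference = render("{{ sample.country }}", row)
mixed = render("The country is {{ sample.country }}", row)
```

A hard-coded string passes through unchanged, a pure reference becomes the column value, and a mixed template keeps its literal text around the substituted value.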


value string, TemplateValue

The value that is compared against the ground truth.

The value can be:

  1. A hard-coded string (e.g. "YES")
  2. A reference to the sample data (e.g. "{{ sample.country }}")
  3. (For model tasks) A reference to the solver output (e.g. "{{ solver_output.output }}")
  4. A mix of the above (e.g. "The country is {{ sample.country }}")

sample represents the current row of the dataset (with a field for every dataset column).

If value is None:

  • If the task has a solver and the solver output is a chat completion response, then the value is set to the output message content.
  • Otherwise, an error is produced.

Default: None
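
The fallback for a missing value can be sketched as follows. This is a hypothetical helper, not the product's code: resolve_value is an invented name, and the chat completion response is assumed to be a dict with an OpenAI-style "choices" list, which may not match the real response type.

```python
def resolve_value(value, solver_output=None):
    """Return the text to score, applying the documented fallback when value is None."""
    if value is not None:
        return value
    # Assumption: a chat completion response is a dict with an OpenAI-style
    # "choices" list; the real response type may differ.
    if isinstance(solver_output, dict) and solver_output.get("choices"):
        return solver_output["choices"][0]["message"]["content"]
    # No explicit value and no usable solver output: report an error.
    raise ValueError("value is None and no chat completion output is available")


explicit = resolve_value("YES")
from_solver = resolve_value(None, {"choices": [{"message": {"content": "Paris"}}]})
```

Calling resolve_value(None) with no solver output raises ValueError, mirroring the "an error is produced" branch.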

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None
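
Putting the properties together, a scorer definition might look like the Python dict below. This is only a sketch: the serialization format the product expects (YAML, JSON, or a Python object) is not stated here, and the reference_translation column, key, and display name are made-up example values.

```python
# Hypothetical BLEU scorer definition built from the properties listed above.
bleu_scorer = {
    "type": "bleu",                                       # required literal
    "ground_truth": "{{ sample.reference_translation }}",  # hypothetical dataset column
    "value": "{{ solver_output.output }}",                # solver output (model tasks)
    "key": "translation_bleu",                            # example identifier
    "purpose": "score",                                   # default purpose
    "display_name": "Translation BLEU",                   # example display name
}

# Simple sanity check that the two required properties are present.
required = ("type", "ground_truth")
missing = [name for name in required if name not in bleu_scorer]
```

The metrics property is omitted here, which leaves it at its default of None.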