📊 Evaluation Metrics Guide
Evaluation metrics are essential tools used to assess the performance and quality of the assistant’s responses. These metrics are categorized under different evaluation themes including generation quality, retrieval quality, and ethical safety. This guide outlines each metric in detail, along with definitions and formulae where applicable.
🔹 1. Generation Quality Metrics
1.1 Answer Correctness
Definition: Measures how accurate the generated response is when compared to the ground truth.
Key Components:
- Semantic Similarity: Measures how closely the meaning of the generated response aligns with the meaning of the ground truth, even if different words or phrasing are used.
- Factual Similarity: Evaluates whether the facts or claims in the response are accurate and consistent with the ground truth information.
Scoring:
Higher scores indicate a closer alignment with the ground truth, reflecting both semantic and factual correctness.
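How the two components are combined depends on the evaluation framework in use. The snippet below is a minimal sketch, assuming correctness is a weighted combination of a factual-overlap score (e.g. an F1 over verified claims) and an embedding-based semantic similarity; the `factual_weight` parameter and the separately computed `factual_f1` input are illustrative assumptions, not part of any specific library.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_correctness(
    factual_f1: float,             # F1 over claims judged correct against the ground truth
    answer_emb: np.ndarray,        # embedding of the generated answer
    truth_emb: np.ndarray,         # embedding of the ground-truth answer
    factual_weight: float = 0.75,  # illustrative weight; frameworks choose their own
) -> float:
    """Weighted combination of factual and semantic similarity (a sketch, not a standard)."""
    semantic = cosine_similarity(answer_emb, truth_emb)
    return factual_weight * factual_f1 + (1.0 - factual_weight) * semantic
```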
1.2 Answer Similarity
Definition: Evaluates only the semantic similarity between the generated answer and the ground truth.
Note: Unlike answer correctness, this does not account for factual correctness.
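As a sketch of how this semantic-only comparison can be computed, assuming the sentence-transformers library is available (the model name below is a common default, not a requirement of the metric):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence-embedding model works

def answer_similarity(generated: str, ground_truth: str) -> float:
    """Cosine similarity between the embeddings of the generated answer and the ground truth."""
    emb = model.encode([generated, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```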
1.3 Answer Relevance
Definition: Answer Relevance measures how well the assistant’s response directly addresses the user’s input or question. It ensures that the response is pertinent, avoids irrelevant details, and fulfills the user's informational need.
Range: 0 to 1. Higher values indicate stronger alignment between the response and the original query.
- A value close to 1 means the response is highly relevant to the input.
- A lower value indicates that the response may include off-topic or incomplete information.
How It’s Measured:
- Convert each sentence or response segment into a vector using an embedding model.
- Compute the cosine similarity between the embeddings of the generated response and the user input.
- Average the similarity scores across all segments.
Formula:

$$
\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\left(E_{g_i}, E_q\right)
$$

Where:
- N: number of segments (sentences) in the generated response
- E_{g_i}: embedding of the i-th response segment
- E_q: embedding of the user input
- cos: cosine similarity between two embedding vectors
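The measurement steps above can be put together as follows. This is a minimal sketch, assuming an `embed` callable that maps text to a vector (any sentence-embedding model) and a simple regex sentence splitter; both are illustrative assumptions rather than part of a specific framework.

```python
import re
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(response: str, user_input: str, embed) -> float:
    """Average cosine similarity between each response segment and the user input.

    `embed` is any callable mapping a string to a 1-D numpy vector
    (an assumption for this sketch; plug in your embedding model of choice).
    """
    segments = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    query_emb = embed(user_input)
    scores = [cosine(embed(seg), query_emb) for seg in segments]
    return float(np.mean(scores)) if scores else 0.0
```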
1.4 BLEU Score
Definition: A widely used metric based on n-gram precision and a brevity penalty, comparing generated text against reference responses.
Range: 0 (no match at all) to 1 (perfect match with the reference).
Formula:

$$
\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

Where:
- p_n: modified n-gram precision for n-grams of size n
- w_n: weight assigned to each n-gram level (typically 1/N, with N = 4)
- BP: brevity penalty, which penalizes candidates shorter than the reference
- c: length of the candidate (generated) text
- r: length of the reference text
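For computing BLEU in practice, the sketch below uses NLTK's sentence-level BLEU with smoothing (a common choice; SacreBLEU is another option). The whitespace tokenization here is a simplifying assumption for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU with up to 4-gram precision and smoothing."""
    ref_tokens = reference.split()       # naive whitespace tokenization (illustrative)
    cand_tokens = candidate.split()
    smoothing = SmoothingFunction().method1  # avoids zero scores on short texts
    return sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoothing)

# Example usage
print(bleu("the cat sat on the mat", "the cat is on the mat"))
```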