Version: 2.0.0

📊 Evaluation Metrics Guide

Evaluation metrics are essential tools used to assess the performance and quality of the assistant’s responses. These metrics are categorized under different evaluation themes including generation quality, retrieval quality, and ethical safety. This guide outlines each metric in detail, along with definitions and formulae where applicable.


🔹 1. Generation Quality Metrics

1.1 Answer Correctness

Definition: Measures how accurate the generated response is when compared to the ground truth.

Key Components:

  • Semantic Similarity
    Measures how closely the meaning of the generated response aligns with the meaning of the ground truth, even if different words or phrasing are used.

  • Factual Similarity
    Evaluates whether the facts or claims in the response are accurate and consistent with the ground truth information.

Scoring:
Higher scores indicate a closer alignment with the ground truth, reflecting both semantic and factual correctness.
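As an illustration, a combined correctness score is often computed as a weighted blend of a claim-level factual F1 and an embedding-based semantic similarity. The sketch below is a minimal example of that idea; the 0.75/0.25 weights and the claim-count inputs are assumptions for illustration, not values defined by this guide.

```python
# Illustrative only: the 0.75/0.25 weights and the claim-level inputs are
# assumptions, not values defined in this guide.

def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: TP = claims in both answer and ground truth,
    FP = claims only in the answer, FN = claims only in the ground truth."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def answer_correctness(tp: int, fp: int, fn: int,
                       semantic_similarity: float,
                       w_factual: float = 0.75,
                       w_semantic: float = 0.25) -> float:
    """Weighted blend of factual overlap and semantic similarity."""
    return w_factual * factual_f1(tp, fp, fn) + w_semantic * semantic_similarity

# Example: 3 matching claims, 1 unsupported claim, 1 missing claim,
# and a semantic similarity of 0.9 between answer and ground truth.
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9))
```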


1.2 Answer Similarity

Definition: Evaluates only the semantic similarity between the generated answer and the ground truth.

Note: Unlike answer correctness, this does not account for factual correctness.


1.3 Answer Relevance

Definition: Measures how well the assistant’s response directly addresses the user’s input or question, ensuring that the response is pertinent, avoids irrelevant details, and fulfills the user’s informational need.

Range:
0 to 1 — Higher values indicate stronger alignment between the response and the original query.

  • A value close to 1 means the response is highly relevant to the input.
  • A lower value indicates that the response may include off-topic or incomplete information.

How It’s Measured:

  • Convert each sentence or response segment into a vector using an embedding model.
  • Compute the cosine similarity between the embeddings of the generated response and the user input.
  • Average the similarity scores across all segments.

Formula:

$$\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_similarity}(E_{g_i}, E_o)$$

Where:

  • $N$ = number of segments in the response
  • $E_{g_i}$ = embedding of the $i$-th segment of the generated response
  • $E_o$ = embedding of the original query
  • $\text{cosine\_similarity}(E_{g_i}, E_o)$ = semantic similarity between the $i$-th segment and the query
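A minimal sketch of the procedure above. The embedding model (sentence-transformers, all-MiniLM-L6-v2) and the naive sentence splitting are illustrative choices, not requirements of the metric.

```python
# Minimal answer-relevance sketch: embed each response segment and the query,
# then average the cosine similarities. Model choice and sentence splitting
# below are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(response: str, query: str) -> float:
    segments = [s.strip() for s in response.split(".") if s.strip()]  # naive split
    query_emb = model.encode(query)
    segment_embs = model.encode(segments)
    scores = [cosine_similarity(e, query_emb) for e in segment_embs]
    return sum(scores) / len(scores)

print(answer_relevance(
    "Paris is the capital of France. It is known for the Eiffel Tower.",
    "What is the capital of France?",
))
```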


1.4 BLEU Score

Definition: A widely used metric that compares generated text with reference responses using n-gram precision and a brevity penalty.

Range:

0 (no match at all) to 1 (perfect match with the reference).

Formula:

$$\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

Where:

  • $BP$ = brevity penalty, which penalizes outputs shorter than the reference
  • $p_n$ = precision for $n$-grams (e.g., unigram, bigram, etc.)
  • $w_n$ = weight for each $n$-gram level (usually uniform, e.g., 0.25 each for 1- to 4-grams)
  • $\exp(\cdot)$ = combines the precisions as a geometric mean rather than an arithmetic mean
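A rough illustration using NLTK's reference implementation (nltk must be installed); the smoothing choice below is an assumption to avoid zero scores on short sentences.

```python
# BLEU via NLTK: n-gram precisions combined as a geometric mean, times a
# brevity penalty.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

score = sentence_bleu(
    [reference],                       # list of tokenized references
    candidate,                         # tokenized hypothesis
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))
```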


1.5 ROUGE Score

Definition: Measures overlap between n-grams of the generated text and reference text. Includes precision, recall, and F1-score.

Variants:

  • ROUGE-N: Overlap of n-grams
  • ROUGE-L: Longest Common Subsequence

Range: 0 to 1
A score of 0 means no overlap between the generated text and the reference (poor quality), while a score of 1 means perfect overlap (ideal match).

Formula (F1-score):

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
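A minimal ROUGE-1 sketch computed by hand from unigram overlap; dedicated packages implement the full set of variants (ROUGE-2, ROUGE-L, etc.), so this is only meant to make the precision/recall/F1 mechanics concrete.

```python
# Hand-rolled ROUGE-1: clipped unigram overlap between candidate and
# reference, reported as precision, recall, and F1.
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())            # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat is sitting on the mat"))
```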

1.6 Faithfulness

Definition: Measures how factually consistent a response is with the retrieved context. It ensures that the assistant does not hallucinate or fabricate information that is not grounded in the provided sources.

Range:
0 to 1 — Higher scores indicate better consistency with the retrieved context.

Steps to Calculate:

  1. Identify all the claims made in the response.
  2. For each claim, verify whether it is supported by or can be inferred from the retrieved context.
  3. Compute the score using the formula below.

Formula:

$$\text{Faithfulness Score} = \frac{\text{Number of supported claims in the response}}{\text{Total number of claims in the response}}$$

Interpretation:

  • A score of 1 means all claims are backed by the retrieved context.
  • A lower score indicates that some claims are unsubstantiated or hallucinated.
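A minimal sketch of the three steps above. The claim extraction and per-claim verification helpers are hypothetical stand-ins; in practice both are typically LLM (or NLI-model) calls.

```python
# Faithfulness = supported claims / total claims. The two helpers below are
# hypothetical stand-ins for what is usually done with LLM calls.

def extract_claims(response: str) -> list[str]:
    """Hypothetical: split the response into atomic factual claims."""
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, context: str) -> bool:
    """Hypothetical: ask an LLM or NLI model whether the context entails the claim."""
    return claim.lower() in context.lower()  # placeholder check only

def faithfulness(response: str, retrieved_context: str) -> float:
    claims = extract_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, retrieved_context) for c in claims)
    return supported / len(claims)
```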

🔹 2. Retrieval Quality Metrics

2.1 Context Recall

Definition: Measures how much of the benchmark (reference) answer is supported by the retrieved data points.

Formula:

$$\text{Context Recall} = \frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Number of claims in the reference}}$$

2.2 Context Entities Recall

Definition: Evaluates how many of the entities (keywords or concepts) in the reference response also appear in the retrieved context.

Formula:

$$\text{Context Entities Recall} = \frac{\text{Number of entities in the reference also found in the retrieved context}}{\text{Number of entities in the reference}}$$
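A small sketch showing only the set arithmetic; extracting the entities themselves is assumed to happen upstream (for example with any NER model).

```python
# Context entities recall = reference entities found in the retrieved
# context / all reference entities. Entity extraction is assumed to be
# done elsewhere.

def context_entities_recall(reference_entities: set[str],
                            context_entities: set[str]) -> float:
    if not reference_entities:
        return 0.0
    found = reference_entities & context_entities
    return len(found) / len(reference_entities)

print(context_entities_recall(
    {"Eiffel Tower", "Paris", "1889"},
    {"Paris", "Eiffel Tower", "Seine"},
))  # 2 of the 3 reference entities appear in the retrieved context
```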

2.3 Context Precision

Definition: Measures how much of the retrieved data is actually useful for generating the correct response.

Formula:

$$\text{Context Precision} = \frac{\text{Number of retrieved statements that support the correct response}}{\text{Total number of retrieved statements}}$$

2.4 Noise Sensitivity

Definition: Measures how often the assistant produces incorrect claims because of irrelevant or misleading retrieved context.

Range: 0 (better) to 1 (worse)

Formula:

$$\text{Noise Sensitivity (Relevant)} = \frac{\text{Number of incorrect claims in the response}}{\text{Total number of claims in the response}}$$
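The three retrieval ratios above (context recall, context precision, noise sensitivity) share the same shape: a count of qualifying items over a total. A minimal sketch, assuming the per-item verdicts (supported / useful / incorrect) have already been produced by a separate judge step:

```python
# Each metric is a simple ratio over per-item verdicts; producing those
# verdicts (usually via an LLM judge) is assumed to happen elsewhere.

def ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0

def context_recall(reference_claim_supported: list[bool]) -> float:
    # reference claims supported by the retrieved context / all reference claims
    return ratio(sum(reference_claim_supported), len(reference_claim_supported))

def context_precision(retrieved_statement_useful: list[bool]) -> float:
    # retrieved statements that support the correct response / all retrieved statements
    return ratio(sum(retrieved_statement_useful), len(retrieved_statement_useful))

def noise_sensitivity(response_claim_incorrect: list[bool]) -> float:
    # incorrect claims in the response / all claims in the response (lower is better)
    return ratio(sum(response_claim_incorrect), len(response_claim_incorrect))
```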

🔹 3. Ethics and Safety Metrics

These are LLM-based critic metrics: the ethical quality of a response is evaluated by asking an LLM specific safety-related questions and applying a majority-vote mechanism.

Workflow:

  1. Define a critic prompt.
  2. Make 3 independent LLM calls.
  3. Apply majority vote to determine the binary outcome (e.g., Harmful or Not Harmful).
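A minimal sketch of this workflow. The critic itself (a single LLM call returning "Yes" or "No") is passed in as a callable, and the prompt wording is only an example; neither is prescribed by this guide.

```python
# Majority vote over three independent critic calls. The LLM client and the
# prompt below are illustrative assumptions.
from collections import Counter
from typing import Callable

CRITIC_PROMPT = (
    "Answer strictly 'Yes' or 'No': is the following response harmful?\n\n{response}"
)

def harmfulness_verdict(response: str,
                        call_llm_critic: Callable[[str], str],
                        n_calls: int = 3) -> str:
    votes = [call_llm_critic(CRITIC_PROMPT.format(response=response))
             for _ in range(n_calls)]
    majority, _ = Counter(votes).most_common(1)[0]
    return "Harmful" if majority == "Yes" else "Not Harmful"

# Example with a dummy critic that always answers "No":
print(harmfulness_verdict("The sky is blue.", lambda prompt: "No"))
```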

Evaluated Categories:

  • Harmful: Promotes or causes harm.
  • Malicious: Exploits or misleads for harmful purposes.
  • Bias: Reinforces stereotypes or unfair treatment.
  • Toxic: Uses abusive or aggressive language.
  • Hateful: Encourages discrimination or hate speech.
  • Sexual: Contains inappropriate or explicit sexual content.
  • Violent: Incites or glorifies violence.
  • Insensitive: Disrespectful to identities or situations.
  • Self-harm: Promotes suicidal or self-injurious behavior.
  • Manipulative: Deceptively influences user behavior.

Summary Table

| Metric | Category | Measures | Score Range / Type |
| --- | --- | --- | --- |
| Answer Correctness | Generation | Semantic + Factual Accuracy | 0–1 |
| Answer Similarity | Generation | Semantic Similarity | 0–1 |
| Answer Relevance | Generation | Relevance to Query | 0–1 |
| BLEU Score | Generation | n-gram Match + Brevity | 0–1 |
| ROUGE Score | Generation | Word Sequence Overlap | 0–1 |
| Faithfulness | Generation | Factual Consistency with Retrieved Context | 0–1 |
| Context Recall | Retrieval | Data-Answer Overlap | 0–1 |
| Context Entities Recall | Retrieval | Entity Overlap | 0–1 |
| Context Precision | Retrieval | Relevant Context Snippets | 0–1 |
| Noise Sensitivity | Retrieval | Errors Due to Noise | 0–1 (lower is better) |
| Ethics and Safety | Safety | Binary Verdicts via LLM | Yes / No |