How P3Fusion Evaluates Every RAG System It Builds — The P3 RAG Evaluation Framework

Technical Case Study · RAG Evaluation Framework

AWS Gen AI Competency

InsightBot · FusionReport

LLM-as-Judge

RAGAS Metrics

Amazon Bedrock Evaluation

Faithfulness · Context Precision

Hallucination Detection

Continuous Monitoring

Framework at a Glance

Evaluation method: LLM-as-Judge + IR metrics · Primary framework: RAGAS + Amazon Bedrock · Judge model: Claude 3.5 Sonnet · Deployment gate: Automated CI/CD · Post-launch cadence: Monthly eval + live sampling

Executive Summary

P3Fusion builds production RAG systems for enterprise clients — not proofs-of-concept. The difference between a demo that impresses and a system that earns organisational trust is systematic, measurable evaluation applied before deployment and continuously maintained after it. Over multiple deployments across different industries and use cases, P3Fusion developed the P3 RAG Evaluation Framework: a 12-metric, 3-layer quality assurance methodology that validates every RAG system across retrieval quality, generation quality, hallucination detection, and system performance. This case study documents the framework — what we measure, why each metric matters, what thresholds we enforce, which failure modes each metric catches, and how evaluation continues in production. Every P3Fusion RAG deployment — whether InsightBot, FusionReport, or a custom enterprise RAG system — is validated against this framework before go-live.

The Problem with "It Seems to Work"

Most Enterprise RAG Systems Are Deployed Without Knowing If They Actually Work

The most common approach to RAG quality assurance is anecdotal: the team asks the system a handful of questions, the answers look reasonable, and the system goes live. This approach fails in production — systematically and predictably. A system that produces plausible-sounding answers during a demo is not the same as a system that produces accurate, grounded, cited answers at 10,000 queries per month across a corpus it has never seen during testing.

The failure modes of unvalidated RAG systems are well-documented. Stanford AI Lab research shows that poorly evaluated systems produce hallucinations — claims that are not supported by retrieved context — in up to 40% of responses, even when the correct source document was retrieved. The retriever may be returning documents that are semantically adjacent but factually irrelevant. The generator may be filling gaps from its training memory rather than the provided context. The chunking strategy may be truncating the most relevant information. None of these failures are visible in a casual demo. All of them are measurable.

We built InsightBot and FusionReport to production grade — and we apply the same evaluation rigour to every custom RAG system we build. A RAG system that has not been evaluated against its own knowledge base and real query patterns is not a production system. It is a prototype wearing a production interface.

— P3Fusion Engineering Team

The Foundation

The RAG Triad — Every Metric Connects to These Three Questions

Every metric in the P3 RAG Evaluation Framework maps to one of three fundamental questions — the RAG Triad. Each question targets a different failure mode. A system can pass two legs of the triad and still be unfit for production. All three must be validated.

The RAG Triad · P3Fusion Evaluation Foundation · ✓ Applied to every deployment

🔍

Did we retrieve the right context?

Query ↔ Context edge. Measures whether the retrieval system found the relevant information from the knowledge base. Failure here cascades into every downstream metric.

🛡

Did the LLM stay grounded in what it retrieved?

Context ↔ Response edge. Measures whether the generator used the retrieved context or reverted to parametric memory. This is the primary hallucination detection axis.

✅

Does the answer actually address the question?

Response ↔ Query edge. An answer can be perfectly faithful to context yet completely miss what the user asked. This metric catches tangential and incomplete responses.

Query → Retrieval → Context → Generation → Response → User
Each arrow is a potential failure point. Each is measured independently.

The 12 Metrics

What P3Fusion Measures — And Why Each Metric Exists

The P3 RAG Evaluation Framework validates 12 metrics across four categories. Each metric has a clear definition, a mathematical formula, a production threshold, and the specific failure mode it catches. The ratio-based metrics score 0–1 (higher is better, except Noise Sensitivity); Hallucination Rate is a percentage where lower is better, latency is measured in seconds, and satisfaction uses a 5-point scale. Thresholds are calibrated by deployment risk tier: internal tools (≥0.70), customer-facing systems (≥0.85), and high-stakes domains like healthcare or legal (≥0.90).

Layer 1 — Retrieval Quality · 4 metrics

Context Precision

≥ 0.85 production

Context Precision = Σ(Precision@k × v_k) / Σ(v_k)
v_k = 1 if chunk at rank k is relevant, 0 otherwise

Measures whether relevant chunks appear early in the retrieved context. Order-sensitive — shuffling the same chunks changes the score. Low Context Precision means the LLM must read through irrelevant material before reaching useful information, increasing hallucination risk and degrading generation quality. Catches chunking strategy failures and embedding model weaknesses.

Observed: 0.87 (InsightBot, financial deployment) · Scoring: LLM judge
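The Context Precision formula above can be illustrated with a minimal Python sketch. It assumes the relevance flags v_k have already been produced (in the framework that is the LLM judge's job); the function name `context_precision` is hypothetical and not part of RAGAS.

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 flags (v_k), one per chunk, in rank order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, v_k in enumerate(relevance, start=1):
        if v_k:
            hits += 1
            # Precision@k is only accumulated at ranks holding a relevant chunk
            weighted_sum += hits / rank
    # Normalise by the number of relevant chunks, i.e. sum of v_k
    return weighted_sum / hits if hits else 0.0


# Same chunks, different order: the score changes
early = context_precision([1, 1, 0, 0])
late = context_precision([0, 0, 1, 1])
```

With relevant chunks ranked first the score is 1.0; pushing the same two chunks to ranks 3 and 4 drops it to roughly 0.42, which is the order sensitivity described above.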

Context Recall

≥ 0.80 production

Context Recall = |Reference claims attributable to context| / |Total reference claims|

Measures retrieval completeness: did the system retrieve all the information needed to fully answer the question? A score of 0.80 means 80% of required facts are present; 20% are missing and the LLM will either refuse to answer or hallucinate the gap. Like Answer Correctness and Context Entity Recall, this metric requires ground-truth reference answers.

Observed: 0.83 (InsightBot, energy deployment) · Scoring: LLM judge

Hit Rate @ K

≥ 0.90 at K=8

Hit Rate@K = Σ(1 if any relevant doc in top-K) / |Q|
P3Fusion standard: K = 8 (optimal for enterprise RAG)

Binary metric: did at least one relevant document appear in the top-K retrieved results? A hit rate below 0.85 at K=8 indicates a fundamental retrieval failure — the correct document is not making it into the context window at all. P3Fusion uses K=8 as the production retrieval depth, balancing coverage against context window constraints.

Observed: 0.94 (InsightBot, construction deployment) · Scoring: IR metric
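A minimal sketch of the Hit Rate@K calculation follows, assuming each query comes with a ranked list of retrieved document IDs and a set of IDs labelled relevant. The function name and data layout are illustrative assumptions, not the framework's actual code.

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k=8):
    """Fraction of queries with at least one relevant document in the top K.

    retrieved_per_query: list of ranked doc-ID lists, one per query.
    relevant_per_query:  list of sets of relevant doc IDs, one per query.
    """
    if not retrieved_per_query:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        # Binary outcome per query: any overlap within the top K counts once
        if any(doc_id in relevant for doc_id in retrieved[:k]):
            hits += 1
    return hits / len(retrieved_per_query)


retrieved = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"x"}]
score = hit_rate_at_k(retrieved, relevant, k=8)  # one of two queries hits
```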

NDCG @ K (Ranking Quality)

≥ 0.80 at K=8

NDCG@K = DCG@K / IDCG@K
DCG@K = Σ(rel_i / log₂(i+1)) for i=1 to K

Normalized Discounted Cumulative Gain measures ranking quality with graded relevance — a highly relevant document at rank 1 scores better than a moderately relevant document at rank 1. Unlike Hit Rate, NDCG penalizes systems that bury the most relevant content. This is the default metric on the MTEB Leaderboard for retrieval benchmarking.

Observed: 0.82 (InsightBot, average across deployments) · Scoring: IR metric
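The NDCG@K formula can be sketched as below. One simplifying assumption is labelled here and in the code: the ideal ordering (IDCG) is built from the graded relevances of the retrieved list itself, whereas a full evaluation would normally draw it from all judged documents for the query. Function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K = sum of rel_i / log2(i + 1) for ranks i = 1..K."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=8):
    """NDCG@K = DCG@K / IDCG@K.

    Assumption: IDCG is computed by re-sorting the same relevance grades,
    not from a separate pool of all judged documents.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


best_first = ndcg_at_k([3, 2, 1])   # already ideally ordered
best_last = ndcg_at_k([1, 2, 3])    # most relevant document buried
```

The second call scores below 1.0 because the most relevant document is discounted by its lower rank, which is the "buried content" penalty that Hit Rate cannot express.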

Layer 2 — Generation Quality · 4 metrics

Faithfulness (Groundedness)

≥ 0.85 production · ≥ 0.90 high-stakes

Faithfulness = |Statements supported by context| / |Total statements in response|
Method: Response → atomic claims → NLI against context (2 LLM calls)

The most critical metric in the framework. P3Fusion decomposes every response into atomic factual statements, then verifies each statement against the retrieved context using natural language inference. A Faithfulness score of 0.85 means 85% of claims are grounded in retrieved context. Anything below 0.75 in production is a red flag — it indicates the LLM is generating from parametric memory, not from the knowledge base. In the original RAGAS paper, human annotators agreed with this metric 95% of the time.

Observed: 0.91 (InsightBot, financial deployment) · Scoring: LLM judge
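The scoring step of Faithfulness can be sketched as follows. The claim decomposition and the NLI check are both LLM calls in the framework; here they are replaced by a pre-split claim list and a deliberately naive substring check (`naive_substring_judge`, a hypothetical stand-in) purely so the example runs without a model.

```python
def faithfulness(claims, context, is_supported):
    """Share of atomic claims that the judge marks as supported by context.

    claims:       list of atomic statements extracted from the response.
    is_supported: callable(claim, context) -> bool; stands in for the NLI judge.
    """
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_substring_judge(claim, context):
    # Assumption for illustration only: a real judge uses LLM-based inference
    return claim.lower() in context.lower()


context = "Revenue grew 12% in 2023. The policy renews annually."
claims = [
    "Revenue grew 12% in 2023",
    "The policy renews annually",
    "The CEO resigned",          # not in context, so unsupported
]
score = faithfulness(claims, context, naive_substring_judge)
```

Two of three claims are grounded, so the score is about 0.67, which would fall below the 0.75 red-flag level described above.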

Answer Relevancy

≥ 0.80 production

Answer Relevancy = (1/N) × Σ cos(E_generated_q_i, E_original_q)
N=3 reverse-engineered questions generated from the response

Uses a reverse-engineering approach: the LLM generates 3 questions that the response would naturally answer, then computes cosine similarity between those generated questions and the original query. A high score means the response is directly addressing the question. An answer can be 100% faithful to context yet score 0.20 here — if the retrieved context was relevant but the response buried the actual answer. Penalizes "I don't know" responses automatically.

Observed: 0.88 (InsightBot, construction deployment) · Scoring: LLM + embeddings
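A minimal sketch of the Answer Relevancy arithmetic, assuming the N reverse-engineered questions have already been generated and embedded. The toy 2-dimensional vectors below are made up for illustration; real embeddings come from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(generated_question_embeddings, original_question_embedding):
    """Mean cosine similarity between generated questions and the original query."""
    sims = [cosine(e, original_question_embedding)
            for e in generated_question_embeddings]
    return sum(sims) / len(sims) if sims else 0.0


original = [1.0, 0.0]
# Two generated questions match the original direction, one is orthogonal
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = answer_relevancy(generated, original)
```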

Answer Correctness

≥ 0.75 (requires ground truth)

Correctness = 0.75 × F1_factual + 0.25 × Sim_semantic
F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))

The only metric that directly compares the generated answer against a ground-truth reference answer. Combines factual F1 (statement-level TP/FP/FN scoring) at 75% weight with semantic similarity at 25% weight. Used during the offline evaluation phase with P3Fusion's golden dataset. Not used in continuous production monitoring — ground truth cannot be provided at query time.

Observed: 0.79 (offline evaluation, golden dataset) · Scoring: LLM judge
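As a worked example of the weighting above, the following sketch combines statement-level counts with a semantic similarity value. The TP/FP/FN counts and the similarity score are assumed inputs (produced by the judge and an embedding comparison respectively); the function names are hypothetical.

```python
def factual_f1(tp, fp, fn):
    """F1 = TP / (TP + 0.5 * (FP + FN)) over statement-level counts."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity,
                       w_factual=0.75, w_semantic=0.25):
    """Weighted blend of factual F1 and semantic similarity."""
    return w_factual * factual_f1(tp, fp, fn) + w_semantic * semantic_similarity


# 3 statements match the reference, 1 is extra, 1 reference statement is missing
score = answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9)
```

Here F1 is 3 / (3 + 1) = 0.75, so the blended score is 0.75 × 0.75 + 0.25 × 0.9 = 0.7875, just above the 0.75 threshold.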

Noise Sensitivity

≤ 0.20 production (lower is better)

Noise Sensitivity = |Incorrect claims| / |Total claims|
⚠ This metric: lower score = better quality

The only 0–1 quality score in the framework where lower is better. Measures the fraction of incorrect claims generated when the system has access to both relevant and irrelevant documents, simulating real-world retrieval conditions where not every retrieved chunk is useful. A Noise Sensitivity of 0.20 means 1 in 5 claims are wrong when irrelevant context is present. Targets the specific failure mode of LLMs generating confident-sounding errors from noise in the context window.

Observed: 0.14 (InsightBot, insurance deployment) · Scoring: LLM judge

Layer 3 — Safety & Compliance · 2 metrics

Hallucination Rate

≤ 5% production · ≤ 2% high-stakes

Hallucination Rate = |Responses with ≥1 unsupported claim| / |Total responses|
Derived from Faithfulness: responses with Faithfulness < 0.80 flagged

P3Fusion tracks hallucination rate at the response level rather than the claim level — a response with even one unsupported claim is counted as a hallucination event, because enterprise users make decisions on whole answers, not individual sentences. The 5% threshold means no more than 1 in 20 responses may contain an unsupported claim. Amazon Bedrock Guardrails are configured to catch grounding failures before delivery.

Observed: 3.2% (InsightBot, average across all deployments) · Scoring: Bedrock Guardrails
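The response-level counting described above can be sketched as follows. The sketch follows the "at least one unsupported claim" definition in the formula and assumes the per-response count of unsupported claims is already available from the Faithfulness step; the function name is hypothetical.

```python
def hallucination_rate(unsupported_claims_per_response):
    """Share of responses containing one or more unsupported claims.

    unsupported_claims_per_response: list of ints, one count per response.
    """
    if not unsupported_claims_per_response:
        return 0.0
    # A response counts once, no matter how many unsupported claims it holds
    flagged = sum(1 for n in unsupported_claims_per_response if n >= 1)
    return flagged / len(unsupported_claims_per_response)


# 20 responses, one of which contains two unsupported claims
counts = [0] * 19 + [2]
rate = hallucination_rate(counts)
```

One flagged response out of 20 gives a rate of 0.05, exactly at the 5% production limit; the two unsupported claims in that response are not counted separately.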

Context Entity Recall

≥ 0.80 fact-critical domains

CER = |CE ∩ GE| / |GE|
CE = entities in context · GE = entities in ground truth

A lightweight, non-LLM metric particularly valuable for fact-heavy domains — financial services (product codes, regulatory references), construction (contractor names, specification numbers), insurance (policy numbers, claim types). Measures whether the retrieved context contains the specific named entities the ground-truth answer requires. Fast to compute at scale without LLM inference cost.

Observed: 0.84 (financial domain deployment) · Scoring: non-LLM
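Because this metric is a plain set computation, it fits in a few lines. The sketch below assumes entities have already been extracted from the context and the ground truth by an upstream step (not shown), and normalises them by case and whitespace only; the entity strings are invented examples.

```python
def context_entity_recall(context_entities, ground_truth_entities):
    """CER = |CE ∩ GE| / |GE| after simple case/whitespace normalisation."""
    ce = {e.strip().lower() for e in context_entities}
    ge = {e.strip().lower() for e in ground_truth_entities}
    return len(ce & ge) / len(ge) if ge else 0.0


context_entities = ["Policy 4471", "Acme Builders", "2021"]
ground_truth_entities = ["policy 4471", "Acme Builders", "Section 12"]
cer = context_entity_recall(context_entities, ground_truth_entities)
```

Two of the three required entities appear in the context, so the score is about 0.67, below the 0.80 threshold for fact-critical domains.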

Layer 4 — System Performance · 2 metrics

End-to-End Latency (p95)

< 3s chatbot · < 5s complex analysis

Total = Embedding(~200ms) + Retrieval(~400ms) + Generation(~1.5-2s)
P3Fusion target: p50 < 2s · p95 < 3s · p99 < 5s

RAG introduces additional latency stages absent in direct LLM calls. Research shows retrieval alone accounts for 35–47% of total Time to First Token — nearly doubling baseline LLM latency. P3Fusion measures p50, p95, and p99 latencies separately — averages hide the tail latencies that cause user frustration. pgvector HNSW indexing delivers sub-2ms retrieval at enterprise scale, making the LLM generation call the primary latency driver.

Observed: 2.4s p95 latency (InsightBot) · Measured via: CloudWatch
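To illustrate why percentiles are tracked instead of averages, here is a small sketch using the nearest-rank percentile method. That method is an assumption made for simplicity; CloudWatch computes its own percentile statistics. The sample latencies are invented.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (seconds)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer arithmetic first avoids float rounding pushing the rank up by one
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_s):
    return {p: percentile(samples_s, p) for p in (50, 95, 99)}


def meets_targets(report, targets=None):
    # Defaults mirror the targets above: p50 < 2s, p95 < 3s, p99 < 5s
    targets = targets or {50: 2.0, 95: 3.0, 99: 5.0}
    return all(report[p] < limit for p, limit in targets.items())


samples = [1.0] * 90 + [2.5] * 8 + [6.0] * 2
report = latency_report(samples)
mean_latency = sum(samples) / len(samples)
ok = meets_targets(report)
```

In this made-up sample the mean is 1.22s and both p50 and p95 are within target, yet the p99 of 6.0s breaches the 5s limit, so `ok` is False.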

User Satisfaction Score

≥ 4.0 / 5.0 pilot gate

Satisfaction = (thumbs-up count / total feedback count) × 5
Feedback collected on 100% of responses · minimum 50 ratings before gate

The only human-signal metric. P3Fusion deploys a thumbs-up/thumbs-down feedback mechanism on every response during the pilot phase. A 4.0/5.0 gate before full rollout ensures the evaluation framework's machine-scored metrics are correlated with real user perception. Cases where machine scores are high but satisfaction is low indicate a systematic disconnect between how the evaluation criteria and users define "good answers" — requiring prompt or threshold recalibration.

Observed: 4.6 / 5.0 (InsightBot, average across all deployments) · Scoring: human signal
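The pilot gate can be sketched as below, assuming feedback arrives as simple thumbs-up and total counts. The function names are hypothetical; the 50-rating minimum and 4.0 gate come from the definition above.

```python
def satisfaction_score(thumbs_up, total_feedback):
    """Scale the thumbs-up ratio onto a 0-5 range."""
    return 5.0 * thumbs_up / total_feedback if total_feedback else 0.0


def passes_pilot_gate(thumbs_up, total_feedback, min_ratings=50, gate=4.0):
    """Gate only applies once enough ratings have been collected."""
    if total_feedback < min_ratings:
        return False  # too little feedback to make a decision
    return satisfaction_score(thumbs_up, total_feedback) >= gate


score = satisfaction_score(46, 50)          # 92% thumbs-up
too_few = passes_pilot_gate(40, 40)         # perfect score, but under 50 ratings
```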

Real Evaluation Output

What a P3Fusion Pre-Deployment Evaluation Report Looks Like

Every RAG system built by P3Fusion generates an evaluation report against the 12-metric framework before deployment approval. The report below is representative of a financial services InsightBot deployment — showing all 12 metrics, their scores, and their pass/warn/fail status against production thresholds.

P3 RAG Evaluation Report · InsightBot · Financial Services Deploy · Pre-Production Gate

● Evaluation Run

Metric	Score	Status
Context Precision	0.87	PASS
Context Recall	0.83	PASS
Hit Rate @ K=8	0.93	PASS
NDCG @ K=8	0.82	PASS
Faithfulness	0.91	PASS
Answer Relevancy	0.88	PASS
Answer Correctness	0.79	PASS
Noise Sensitivity	0.14	PASS (lower is better)
Hallucination Rate	3.2%	WARN
Context Entity Recall	0.84	PASS
Latency p95	2.4s	PASS
Pilot Satisfaction	4.6/5	PASS

Judge model: Claude 3.5 Sonnet · Amazon Bedrock · Eval set: 150 queries · Framework: RAGAS v0.2 + Bedrock Eval

✓ 11/12 PASS · 1 WARN · Deployment approved with monitoring

The single WARN flag on Hallucination Rate (3.2% vs the 2% threshold for this particular financial services deployment) was logged, investigated, and found to relate to a specific document category — older policy documents formatted with unusual section breaks that confused the chunking pipeline. The chunking strategy was adjusted for that document category before go-live. The evaluation framework caught this before any user was affected.

What We Catch

The Eight Failure Modes Our Framework Is Built to Detect

Each metric in the framework targets one or more specific failure modes. The following are the eight most common failures P3Fusion has observed across enterprise RAG deployments — and the metrics that catch each one.

⚠️

Hallucination — Claims Not in Context

The LLM generates statements with confidence that are not supported by any retrieved document. Most common when context recall is low and the LLM fills gaps from training memory. The most dangerous failure in regulated industries.

Caught by: Faithfulness + Hallucination Rate

🎯

Irrelevant Retrieval — Right Topic, Wrong Facts

The retriever returns documents that are semantically adjacent but factually different from what the query needs. Dense vector search finds thematic similarity but misses factual specificity. A classic embedding model failure.

Caught by: Context Precision + NDCG

📍

Lost in the Middle — Critical Info Buried

Stanford/UC Berkeley research demonstrates that LLMs exhibit a U-shaped performance curve: accuracy is highest when critical information sits at the start or end of the retrieved context and degrades by over 30% when it sits in the middle. Context Precision measures whether the right chunks rank first.

Caught by: Context Precision + Answer Correctness

🔇

Context Neglect — Ignoring Retrieved Facts

Even when correct context is provided, the LLM generates from parametric memory rather than the retrieved documents. Without explicit instruction to use only provided context, models confidently fill gaps with plausible-but-wrong information.

Caught by: Faithfulness + Noise Sensitivity

📭

Incomplete Retrieval — Missing Critical Documents

The retriever fails to surface one or more documents that contain information essential to answering the question correctly. The LLM then produces a technically grounded but factually incomplete answer — which users often cannot distinguish from a complete one.

Caught by: Context Recall + Hit Rate@K

🔀

Tangential Answers — Faithful but Off-Topic

The system retrieves relevant documents and generates an answer faithful to their content — but the answer addresses a subtly different question than what was asked. High faithfulness, low answer relevancy. Common when context is dense with related but not directly responsive information.

Caught by: Answer Relevancy

🏷

Entity Errors — Wrong Names, Codes, References

The correct conceptual answer is retrieved but specific entities — contractor names, policy numbers, product codes, regulatory references — are incorrect. Particularly dangerous in financial and compliance contexts where precision on specific identifiers is non-negotiable.

Caught by: Context Entity Recall

⏱

Latency Degradation Under Load

RAG systems that pass quality metrics under single-query testing often degrade significantly under concurrent load — particularly when the vector database index is not warmed, or when the async execution pipeline is not properly sized for production query volumes.

Caught by: Latency p95/p99 under load test

Beyond Deployment

Three-Layer Evaluation — Offline, CI/CD Gate, and Continuous Production Monitoring

P3Fusion's framework applies evaluation at three distinct stages of the RAG system lifecycle. One-time pre-deployment evaluation is necessary but not sufficient — every production RAG system degrades over time as document corpora update, query distributions shift, and embedding models change. The framework treats evaluation as a continuous operational discipline, not a one-time quality gate.

Layer 1 — Offline Golden Dataset Evaluation

Before any deployment, P3Fusion constructs a golden dataset of 100–200 curated QA pairs from the client's actual document corpus — covering high-frequency query patterns, edge cases, and failure-prone question types. All 12 metrics are computed against this dataset using RAGAS with Claude 3.5 Sonnet as judge. This establishes the baseline and validates that every threshold is met before any user access is granted. Runs on every significant change to the knowledge base, chunking strategy, embedding model, or retrieval configuration.

100–200 QA pairs · All 12 metrics · RAGAS + Bedrock Eval · Full threshold validation

Layer 2 — Automated CI/CD Deployment Gate

Every code change, prompt update, or knowledge base modification triggers an automated evaluation run against the golden dataset. If any metric breaches its configured threshold (falling below a minimum, or exceeding a maximum for lower-is-better metrics such as Noise Sensitivity), the deployment is blocked. P3Fusion configures this gate with a subset of the fastest-running metrics (Faithfulness, Context Precision, Hit Rate@K, Noise Sensitivity) to keep the gate fast, typically under 8 minutes for 50 evaluation queries. Failed gates generate a diagnostic report pointing to which metric degraded and which query categories drove the regression.

Automated CI/CD · Deployment blocking · <8 min eval time · Regression diagnostics
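The blocking logic of such a gate might look like the sketch below. It is an assumption-laden illustration, not P3Fusion's pipeline code: the `THRESHOLDS` mapping uses the customer-facing values quoted in this case study, the metric keys are hypothetical, and the scores would come from an evaluation run that is not shown.

```python
# (direction, limit) per gated metric; "<=" marks lower-is-better metrics
THRESHOLDS = {
    "faithfulness": (">=", 0.85),
    "context_precision": (">=", 0.85),
    "hit_rate_at_k": (">=", 0.90),
    "noise_sensitivity": ("<=", 0.20),
}


def failed_metrics(scores, thresholds=THRESHOLDS):
    """Return the names of metrics that breach their threshold."""
    failures = []
    for name, (direction, limit) in thresholds.items():
        score = scores[name]
        ok = score >= limit if direction == ">=" else score <= limit
        if not ok:
            failures.append(name)
    return failures


def gate_passes(scores, thresholds=THRESHOLDS):
    return not failed_metrics(scores, thresholds)


passing = {"faithfulness": 0.91, "context_precision": 0.87,
           "hit_rate_at_k": 0.93, "noise_sensitivity": 0.14}
regressed = dict(passing, noise_sensitivity=0.25)
```

In a CI job, a non-empty result from `failed_metrics(regressed)` would typically be turned into a non-zero exit code so the pipeline stops the deployment; storing the comparison direction with each limit keeps lower-is-better metrics from being mis-gated.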

Layer 3 — Continuous Production Monitoring

In production, 10% of live queries are sampled and evaluated for Faithfulness and Answer Relevancy — the two most signal-rich metrics that do not require ground-truth labels. Scores are tracked via Amazon CloudWatch with automated alerts when either metric drifts more than 0.05 below its baseline. User thumbs-up/down feedback is collected on every response and fed into the monthly quality improvement cycle. Problematic queries identified in production monitoring are added to the golden dataset, creating a continuous improvement flywheel that makes each deployment measurably better over time.

10% live sampling · CloudWatch alerts · Monthly improvement cycle · Feedback flywheel
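The sampling and drift rules described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: random per-query sampling, a fixed baseline per metric, and hypothetical function names. Publishing the scores and raising alarms through CloudWatch is deliberately left out.

```python
import random


def should_sample(rng, rate=0.10):
    """Decide per query whether it joins the evaluated sample."""
    return rng.random() < rate


def drift_alerts(baseline, current, tolerance=0.05):
    """Metrics whose current score sits more than `tolerance` below baseline."""
    return [metric for metric, base in baseline.items()
            if base - current[metric] > tolerance]


rng = random.Random(0)  # seeded only to make the example repeatable
sampled = sum(should_sample(rng) for _ in range(10_000))

baseline = {"faithfulness": 0.91, "answer_relevancy": 0.88}
current = {"faithfulness": 0.84, "answer_relevancy": 0.86}
alerts = drift_alerts(baseline, current)
```

Faithfulness has dropped by 0.07 and triggers an alert, while the 0.02 dip in Answer Relevancy stays within tolerance.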

Outcomes

What the Framework Delivers Across All P3Fusion RAG Deployments

95%+

Average Faithfulness across all production deployments

3.2%

Average hallucination rate — well below 5% production threshold

4.6/5

Average pilot satisfaction score across all deployments

Production deployments requiring emergency recall due to quality failures

The framework's most important output is not any individual metric score — it is the confidence it gives to enterprise stakeholders that the RAG system they are deploying has been verified to work correctly on their actual documents, with their actual query patterns, at the quality level their use case demands. When a financial services firm deploys InsightBot across 500 employees, they do so knowing the system has passed 12 metrics across 150 evaluation queries, been validated against their specific document corpus, and has continuous monitoring in place to alert the team if quality drifts. That is not a chatbot. That is a production AI system.

Metric	Customer-facing	High-stakes
Faithfulness	≥ 0.85	≥ 0.90
Answer Relevancy	≥ 0.80	≥ 0.85
Context Precision	≥ 0.85	≥ 0.90
Context Recall	≥ 0.80	≥ 0.85
Hit Rate @K	≥ 0.90	≥ 0.95
NDCG @K	≥ 0.80	≥ 0.85
Noise Sensitivity	≤ 0.20	≤ 0.10
Hallucination Rate	≤ 5%	≤ 2%
Latency p95	<3s	<2s
Satisfaction	≥ 4.0/5	≥ 4.2/5

P3Fusion

AWS Generative AI Competency Partner. P3Fusion builds production RAG systems — InsightBot for unstructured data, FusionReport for structured databases, and custom RAG for any enterprise use case. Every deployment is validated against the P3 RAG Evaluation Framework.
