We built InsightBot and FusionReport to production grade — and we apply the same evaluation rigour to every custom RAG system we build. A RAG system that has not been evaluated against its own knowledge base and real query patterns is not a production system. It is a prototype wearing a production interface.
— P3Fusion Engineering Team
Query ↔ Context edge. Measures whether the retrieval system found the relevant information from the knowledge base. Failure here cascades into every downstream metric.
Context ↔ Response edge. Measures whether the generator used the retrieved context or reverted to parametric memory. This is the primary hallucination detection axis.
Response ↔ Query edge. An answer can be perfectly faithful to context yet completely miss what the user asked. This metric catches tangential and incomplete responses.
Each arrow is a potential failure point. Each is measured independently.
v_k = 1 if chunk at rank k is relevant, 0 otherwise
Financial Deploy
Energy Deploy
P3Fusion standard: K = 8 (optimal for enterprise RAG)
Binary metric: did at least one relevant document appear in the top-K retrieved results? A hit rate below 0.85 at K=8 indicates a fundamental retrieval failure — the correct document is not making it into the context window at all. P3Fusion uses K=8 as the production retrieval depth, balancing coverage against context window constraints.
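The binary definition above can be sketched in a few lines. This is a minimal illustration, not P3Fusion's implementation: the function names, the document IDs, and the toy data are all hypothetical, and relevance labels are assumed to come from a labelled evaluation set.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 8) -> int:
    """Return 1 if at least one relevant document is in the top-k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))


def hit_rate_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 8) -> float:
    """Average the per-query hits over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one evaluated query")
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical toy data: the first query has a hit, the second does not.
runs = [
    (["d3", "d7", "d1"], {"d1"}),
    (["d2", "d4", "d9"], {"d5"}),
]
print(hit_rate_at_k(runs, k=8))
```

On real data, the resulting rate would be compared against the 0.85 floor at K=8 described above.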
Construction
DCG@K = Σ(rel_i / log₂(i+1)) for i=1 to K
Normalized Discounted Cumulative Gain measures ranking quality with graded relevance — a highly relevant document at rank 1 scores better than a moderately relevant document at rank 1. Unlike Hit Rate, NDCG penalizes systems that bury the most relevant content. This is the default metric on the MTEB Leaderboard for retrieval benchmarking.
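A small sketch of the DCG@K formula above and its normalization. One simplifying assumption is labelled in the code: the ideal ranking is built by re-sorting the relevance grades of the retrieved list only, so relevant documents that were never retrieved do not lower the score. The function names and the grades are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@K = sum of rel_i / log2(i + 1) for ranks i = 1..K."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Divide DCG@K by the DCG@K of the best possible ordering.

    Assumption: the ideal ordering is the retrieved grades sorted descending.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance (0 = irrelevant, 3 = highly relevant) by rank.
well_ranked = [3, 2, 1, 0]
buried = [0, 1, 2, 3]
print(ndcg_at_k(well_ranked, k=4), ndcg_at_k(buried, k=4))
```

The second list contains the same documents as the first, but its score is lower because the most relevant one sits at the bottom.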
Avg. across deploys
Method: Response → atomic claims → NLI against context (2 LLM calls)
Financial Deploy
N=3 reverse-engineered questions generated from the response
Uses a reverse-engineering approach: the LLM generates 3 questions that the response would naturally answer, then computes cosine similarity between those generated questions and the original query. A high score means the response is directly addressing the question. An answer can be 100% faithful to context yet score 0.20 here — if the retrieved context was relevant but the response buried the actual answer. Penalizes "I don't know" responses automatically.
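The scoring step of this approach can be sketched as follows, assuming the reverse-engineered questions and the original query have already been embedded. The vectors are hypothetical placeholders, and averaging the similarities is an assumption about how the three scores are combined; the LLM and embedding calls are left out.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevancy(query_vec: Sequence[float],
                     generated_question_vecs: List[Sequence[float]]) -> float:
    """Mean similarity between the original query and each generated question."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    total = sum(cosine(query_vec, q) for q in generated_question_vecs)
    return total / len(generated_question_vecs)


# Hypothetical 2-d embeddings for N=3 generated questions.
query = [1.0, 0.0]
on_topic = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
off_topic = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(answer_relevancy(query, on_topic), answer_relevancy(query, off_topic))
```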
Construction
F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))
The only metric that directly compares the generated answer against a ground-truth reference answer. Combines factual F1 (statement-level TP/FP/FN scoring) at 75% weight with semantic similarity at 25% weight. Used during the offline evaluation phase with P3Fusion's golden dataset. Not used in continuous production monitoring — ground truth cannot be provided at query time.
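As a worked illustration of the formula and the 75/25 weighting described above, here is a minimal sketch. The statement counts and the semantic similarity value are hypothetical inputs; in practice they would come from an LLM judge and an embedding model.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN)), computed over answer statements."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator > 0 else 0.0


def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       w_factual: float = 0.75, w_semantic: float = 0.25) -> float:
    """Weighted blend of factual F1 and semantic similarity."""
    return w_factual * factual_f1(tp, fp, fn) + w_semantic * semantic_similarity


# Hypothetical example: 3 statements match the reference, 1 is unsupported,
# 1 reference statement is missing, and semantic similarity is 0.90.
print(factual_f1(3, 1, 1))               # 3 / (3 + 0.5 * 2) = 0.75
print(answer_correctness(3, 1, 1, 0.90))
```

With these inputs the blended score is 0.75 × 0.75 + 0.25 × 0.90 = 0.7875.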
golden dataset
⚠ This metric: lower score = better quality
Insurance deploy
Derived from Faithfulness: responses with Faithfulness < 0.80 flagged
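One way to read this derivation, sketched under stated assumptions: a response's Faithfulness is taken as the share of its atomic claims that the retrieved context supports, and the Hallucination Rate as the share of responses flagged below 0.80. Both definitions and all names here are assumptions for illustration; the claim verdicts would normally come from the LLM-based NLI step.

```python
from typing import List, Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of atomic claims supported by the retrieved context (assumed definition)."""
    if not claim_supported:
        raise ValueError("response produced no claims to check")
    return sum(claim_supported) / len(claim_supported)


def hallucination_rate(faithfulness_scores: List[float], threshold: float = 0.80) -> float:
    """Fraction of responses whose Faithfulness falls below the flagging threshold."""
    if not faithfulness_scores:
        raise ValueError("need at least one scored response")
    flagged = sum(1 for score in faithfulness_scores if score < threshold)
    return flagged / len(faithfulness_scores)


# Hypothetical verdicts for four responses.
scores = [
    faithfulness([True, True, True, True]),    # 1.00
    faithfulness([True, True, True, False]),   # 0.75 -> flagged
    faithfulness([True, True, True, True, False]),  # 0.80 -> not flagged
    faithfulness([True, False]),               # 0.50 -> flagged
]
print(hallucination_rate(scores))
```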
avg. all deploys
CE = entities in context · GE = entities in ground truth
A lightweight, non-LLM metric particularly valuable for fact-heavy domains — financial services (product codes, regulatory references), construction (contractor names, specification numbers), insurance (policy numbers, claim types). Measures whether the retrieved context contains the specific named entities the ground-truth answer requires. Fast to compute at scale without LLM inference cost.
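A minimal sketch of the computation, assuming the common definition of recall as |CE ∩ GE| / |GE| (the text above defines CE and GE but not the ratio itself). Entity extraction is out of scope here; the entity sets, the normalization rule, and the function names are hypothetical.

```python
from typing import Iterable, Set


def normalize(entities: Iterable[str]) -> Set[str]:
    """Case-fold and trim entities so 'POL-123' and ' pol-123 ' compare equal."""
    return {entity.strip().casefold() for entity in entities if entity.strip()}


def context_entity_recall(context_entities: Iterable[str],
                          ground_truth_entities: Iterable[str]) -> float:
    """Share of ground-truth entities (GE) that also appear in the context (CE)."""
    ce = normalize(context_entities)
    ge = normalize(ground_truth_entities)
    if not ge:
        raise ValueError("ground truth contains no entities")
    return len(ce & ge) / len(ge)


# Hypothetical insurance example: the clause reference is missing from the context.
ce = ["POL-123", "Acme Ltd", "water damage"]
ge = ["pol-123", "Acme Ltd", "Clause 4.2"]
print(context_entity_recall(ce, ge))
```

Because this is plain set arithmetic, it needs no LLM call at scoring time, which is the cost advantage the paragraph above describes.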
domain deploy
P3Fusion target: p50 < 2s · p95 < 3s · p99 < 5s
RAG introduces additional latency stages absent in direct LLM calls. Research shows retrieval alone accounts for 35–47% of total Time to First Token — nearly doubling baseline LLM latency. P3Fusion measures p50, p95, and p99 latencies separately — averages hide the tail latencies that cause user frustration. pgvector HNSW indexing delivers sub-2ms retrieval at enterprise scale, making the LLM generation call the primary latency driver.
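To illustrate why percentiles are tracked separately from the average, here is a short sketch using only the Python standard library. The latency samples are hypothetical, the targets are the p50/p95/p99 values quoted above, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics
from typing import Dict, Sequence

# Targets in seconds, taken from the text above.
TARGETS_S = {"p50": 2.0, "p95": 3.0, "p99": 5.0}


def latency_percentiles(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 from the 99 percentile cut points."""
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def meets_targets(latencies_s: Sequence[float],
                  targets: Dict[str, float] = TARGETS_S) -> bool:
    """True only if every observed percentile is strictly below its target."""
    observed = latency_percentiles(latencies_s)
    return all(observed[name] < limit for name, limit in targets.items())


# Hypothetical sample: 90 fast requests (1 s) and 10 slow ones (6 s).
sample = [1.0] * 90 + [6.0] * 10
print(statistics.mean(sample))       # the average looks healthy
print(latency_percentiles(sample))   # the tail does not
print(meets_targets(sample))
```

In this sample the mean is 1.5 s, comfortably under 2 s, while p95 and p99 are both 6 s and fail their targets.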
p95 latency
Feedback collected on 100% of responses · minimum 50 ratings before gate
The only human-signal metric. P3Fusion deploys a thumbs-up/thumbs-down feedback mechanism on every response during the pilot phase. A 4.0/5.0 gate before full rollout ensures the evaluation framework's machine-scored metrics are correlated with real user perception. Cases where machine scores are high but satisfaction is low indicate a systematic disconnect between how the evaluation criteria and users define "good answers" — requiring prompt or threshold recalibration.
across all deploys
The LLM generates statements with confidence that are not supported by any retrieved document. Most common when context recall is low and the LLM fills gaps from training memory. The most dangerous failure in regulated industries.
Caught by: Faithfulness + Hallucination Rate
The retriever returns documents that are semantically adjacent but factually different from what the query needs. Dense vector search finds thematic similarity but misses factual specificity. A classic embedding model failure.
Caught by: Context Precision + NDCG
Stanford/UC Berkeley research demonstrates LLMs exhibit a U-shaped attention curve, with over 30% accuracy degradation when critical information sits in the middle of retrieved context. Context Precision measures whether the right chunks rank first.
Caught by: Context Precision + Answer Correctness
Even when correct context is provided, the LLM generates from parametric memory rather than the retrieved documents. Without explicit instruction to use only provided context, models confidently fill gaps with plausible-but-wrong information.
Caught by: Faithfulness + Noise Sensitivity
The retriever fails to surface one or more documents that contain information essential to answering the question correctly. The LLM then produces a technically grounded but factually incomplete answer, which users often cannot distinguish from a complete one.
Caught by: Context Recall + Hit Rate@K
The system retrieves relevant documents and generates an answer faithful to their content, but the answer addresses a subtly different question than what was asked. High faithfulness, low answer relevancy. Common when context is dense with related but not directly responsive information.
Caught by: Answer Relevancy
The correct conceptual answer is retrieved but specific entities (contractor names, policy numbers, product codes, regulatory references) are incorrect. Particularly dangerous in financial and compliance contexts where precision on specific identifiers is non-negotiable.
Caught by: Context Entity Recall
RAG systems that pass quality metrics under single-query testing often degrade significantly under concurrent load, particularly when the vector database index is not warmed, or when the async execution pipeline is not properly sized for production query volumes.
Caught by: Latency p95/p99 under load test
Before any deployment, P3Fusion constructs a golden dataset of 100–200 curated QA pairs from the client's actual document corpus, covering high-frequency query patterns, edge cases, and failure-prone question types. All 12 metrics are computed against this dataset using RAGAS with Claude 3.5 Sonnet as judge. This establishes the baseline and validates that every threshold is met before any user access is granted. Runs on every significant change to the knowledge base, chunking strategy, embedding model, or retrieval configuration.
Every code change, prompt update, or knowledge base modification triggers an automated evaluation run against the golden dataset. If any metric falls below its configured threshold, the deployment is blocked. P3Fusion configures this gate with a subset of the fastest-running metrics (Faithfulness, Context Precision, Hit Rate@K, Noise Sensitivity) to keep the gate fast — typically under 8 minutes for 50 evaluation queries. Failed gates generate a diagnostic report pointing to which metric degraded and which query categories drove the regression.
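The blocking logic of such a gate might look like the sketch below. It is illustrative only: the metric keys, the threshold values, and the assumption that Noise Sensitivity is a lower-is-better metric are placeholders, with only the 0.85 Hit Rate@K floor taken from this page. Computing the scores themselves is left to the evaluation run.

```python
from typing import Dict, List, Tuple

# Hypothetical gate configuration: (direction, bound) per metric.
# "min" means the score must not fall below the bound, "max" that it must not exceed it.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "faithfulness": ("min", 0.80),        # placeholder value
    "context_precision": ("min", 0.80),   # placeholder value
    "hit_rate_at_k": ("min", 0.85),       # floor quoted for K=8
    "noise_sensitivity": ("max", 0.20),   # placeholder; assumed lower-is-better
}


def failing_metrics(scores: Dict[str, float],
                    thresholds: Dict[str, Tuple[str, float]] = THRESHOLDS) -> List[str]:
    """List every metric that is missing or on the wrong side of its bound."""
    failures = []
    for name, (direction, bound) in thresholds.items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif direction == "min" and value < bound:
            failures.append(f"{name}: {value:.2f} < {bound:.2f}")
        elif direction == "max" and value > bound:
            failures.append(f"{name}: {value:.2f} > {bound:.2f}")
    return failures


def gate_passes(scores: Dict[str, float]) -> bool:
    """The deployment is blocked if any metric fails."""
    return not failing_metrics(scores)


# Hypothetical evaluation run where retrieval regressed.
run = {"faithfulness": 0.91, "context_precision": 0.84,
       "hit_rate_at_k": 0.82, "noise_sensitivity": 0.12}
print(gate_passes(run), failing_metrics(run))
```

In a CI pipeline, a failing gate would typically end the job with a non-zero exit code and attach the failure list to the diagnostic report.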
In production, 10% of live queries are sampled and evaluated for Faithfulness and Answer Relevancy — the two most signal-rich metrics that do not require ground-truth labels. Scores are tracked via Amazon CloudWatch with automated alerts when either metric drifts more than 0.05 below its baseline. User thumbs-up/down feedback is collected on every response and fed into the monthly quality improvement cycle. Problematic queries identified in production monitoring are added to the golden dataset, creating a continuous improvement flywheel that makes each deployment measurably better over time.
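The sampling and drift rules above can be sketched without any cloud dependency. Hash-based sampling is one possible design, chosen here so that a given query is consistently in or out of the sample; the query ID format and function names are hypothetical, and publishing the scores to a monitoring service is omitted.

```python
import hashlib


def is_sampled(query_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of queries by hashing their ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate


def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Alert when the metric sits more than `tolerance` below its baseline."""
    return (baseline - current) > tolerance


# Hypothetical usage: the share of 10,000 synthetic query IDs that get evaluated.
sampled = sum(is_sampled(f"query-{i}") for i in range(10_000))
print(sampled / 10_000)

# Hypothetical baseline of 0.90 for Faithfulness.
print(has_drifted(current=0.84, baseline=0.90))  # 0.06 below baseline
print(has_drifted(current=0.88, baseline=0.90))  # 0.02 below baseline
```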
AWS Generative AI Competency Partner. P3Fusion builds production RAG systems — InsightBot for unstructured data, FusionReport for structured databases, and custom RAG for any enterprise use case. Every deployment is validated against the P3 RAG Evaluation Framework.
Bring the same evaluation discipline to your RAG initiative.





