Handling 15,000 Documents at Enterprise RAG Scale — The P3Fusion Architecture Blueprint

By the Numbers

Default chunk size: 512 tokens, 15% overlap

Retrieval: Hybrid (vector + BM25 + RRF)

Large doc threshold: 50+ pages → RAPTOR

Embedding model: Amazon Titan V2

Idempotency: SHA-256 content hash

Technical Blueprint Summary

The most common mistake in enterprise RAG is treating all documents the same: one parser and one chunking strategy, applied uniformly. A scanned insurance claim requires OCR before any parsing can happen. A JSON export embedded raw loses roughly 20% of retrieval quality to structural noise alone, and an Excel financial model embedded as a raw table degrades accuracy on table-heavy queries by about 45%. A 500-page procurement contract will overwhelm flat retrieval without hierarchical tree indexing. A code repository needs AST-based chunking that respects function boundaries. P3Fusion's ingestion pipeline routes every document through a format detection layer that dispatches each file to the right parser, classifies the content types it contains, and applies the chunking strategy that matches the document's size and structure.

01 · Why Format Matters

One Parser for Everything Is Not a Strategy. It Is a Source of Retrieval Failures.

The document corpus of a mid-sized enterprise is never a neat collection of PDFs. It is a living archive accumulated over years — contracts signed before digital workflows existed, Excel models built by analysts who never anticipated AI, PowerPoint presentations that are the only record of a strategic decision, email threads containing the actual context behind a policy document. Each format carries its own structure, its own failure modes, and its own rules for what constitutes a meaningful retrieval unit.

10+: Document formats in a typical enterprise corpus

~50: Chunks per document at the 512-token default chunk size

20%: Retrieval drop from embedding raw JSON without flattening

45%: Accuracy drop from embedding tables without the multi-vector pattern

These numbers explain why format-awareness comes first. Embedding raw JSON causes a 20% retrieval degradation because roughly a quarter of all tokens are structural markers (braces, colons, quotes) that pull embeddings away from their semantic centre. Embedding tables without a summary causes a 45% accuracy drop on table-heavy queries. Everything downstream is built on getting parsing right.
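To make the flattening idea concrete, here is a minimal sketch of one way to turn nested JSON into plain "path: value" lines before embedding. The function name `flatten` and the output format are illustrative assumptions; the source only says that structural tokens are removed.

```python
import json


def flatten(value, prefix=""):
    """Turn nested JSON into 'key path: value' lines with no braces or quotes.

    Hypothetical helper: the exact output format is an assumption.
    """
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.extend(flatten(child, f"{prefix} {key}".strip()))
    elif isinstance(value, list):
        # List items are numbered from 1 so the path stays readable.
        for index, child in enumerate(value, start=1):
            lines.extend(flatten(child, f"{prefix} {index}".strip()))
    else:
        # Scalar leaf: emit the accumulated path and the value.
        lines.append(f"{prefix}: {value}" if prefix else str(value))
    return lines


raw = '{"policy": {"id": "P-100", "limits": [500, 1000]}, "active": true}'
for line in flatten(json.loads(raw)):
    print(line)
```

The flattened lines keep every key and value but drop the braces, colons, and quotes that the paragraph above identifies as noise.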

02 · The Ingestion Pipeline

An Event-Driven Pipeline That Handles Every Document Correctly.

P3Fusion's ingestion pipeline is event-driven and fully parallelised. Documents land in Amazon S3, triggering SQS messages that feed parallel ECS Fargate workers. Each worker runs the full pipeline (format detection, parsing, content classification, chunking, embedding, indexing) independently. If a worker encounters a malformed document, it retries automatically and, if the retries also fail, routes the document to a dead-letter queue for review without affecting any other document in the batch.
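The retry-then-dead-letter behaviour, together with the SHA-256 content hash listed under "By the Numbers", can be sketched in plain Python. This is an in-memory simulation under stated assumptions, not the production worker: `process_batch`, `MAX_ATTEMPTS`, and the `seen` set are hypothetical names, and in a real SQS deployment retries and dead-letter routing are usually configured on the queue through a redrive policy rather than coded in the worker.

```python
import hashlib

MAX_ATTEMPTS = 3  # hypothetical retry budget; the source gives no number


def content_key(data: bytes) -> str:
    """SHA-256 of the raw bytes, used as an idempotency key."""
    return hashlib.sha256(data).hexdigest()


def process_batch(documents, pipeline, seen, max_attempts=MAX_ATTEMPTS):
    """Run `pipeline` on each (name, bytes) pair, isolating failures.

    Returns (indexed_names, dead_letter), where dead_letter holds
    (name, error) pairs for documents that failed every attempt.
    """
    indexed, dead_letter = [], []
    for name, data in documents:
        key = content_key(data)
        if key in seen:
            continue  # identical content already indexed: skip re-processing
        for attempt in range(1, max_attempts + 1):
            try:
                pipeline(data)
            except Exception as exc:
                if attempt == max_attempts:
                    dead_letter.append((name, repr(exc)))
            else:
                seen.add(key)
                indexed.append(name)
                break
    return indexed, dead_letter


def toy_pipeline(data: bytes) -> None:
    # Stand-in for parse, chunk, embed, index.
    if data == b"bad":
        raise ValueError("malformed document")


docs = [("a.pdf", b"ok"), ("b.pdf", b"bad"), ("c.pdf", b"ok"), ("d.pdf", b"fine")]
indexed, dead = process_batch(docs, toy_pipeline, seen=set())
print(indexed, [name for name, _ in dead])
```

One failing document ends up in the dead-letter list while the rest of the batch is indexed, and the duplicate content in `c.pdf` is skipped because its hash was already seen.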

// P3Fusion RAG Ingestion Pipeline · Event-Driven · Fully Parallelised · 20 concurrent ECS workers

S3 Upload (event notification → SQS) → Detect Format (MIME + magic bytes) → Parse (format-specific parser) → Classify Content (text · table · image · code) → Chunk (content-type-aware) → Embed (Titan V2 · Bedrock) → Index (vector + keyword written)

Format detection reads the first 512 bytes of each file and identifies the format from its magic bytes, the binary signature at the start of most binary file formats, then confirms the result against the MIME type. File extensions are never trusted: a PDF renamed to .docx will crash or silently mis-parse in any parser that relies on the extension. Incremental daily updates are event-driven on S3 PutObject notifications and complete within minutes without touching existing indexed documents.
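A minimal sketch of magic-byte detection, using only the standard library. The signatures shown (PDF, ZIP, OLE2) are well-known file headers; the function names, the returned labels, and the text-format heuristics are assumptions for illustration. DOCX, XLSX, and PPTX all share the ZIP signature, so telling them apart needs a look inside the archive, which in turn needs the whole file rather than the first 512 bytes.

```python
import io
import zipfile

# (signature, label) pairs checked against the start of the file.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # DOCX, XLSX and PPTX are ZIP archives
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),  # legacy Office, Outlook MSG
]


def detect_format(header: bytes) -> str:
    """Classify a file from its first bytes; the extension is never consulted."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    # Text formats have no magic bytes, so fall back to simple heuristics.
    stripped = header.lstrip()
    if stripped.startswith((b"{", b"[")):
        return "json"
    if stripped.startswith(b"<?xml"):
        return "xml"
    if stripped[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def classify_ooxml(data: bytes) -> str:
    """Distinguish Office formats by the top-level folder inside the ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    for folder, label in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
        if any(name.startswith(folder) for name in names):
            return label
    return "zip"


print(detect_format(b"%PDF-1.7\n"))  # a PDF is recognised whatever it is named
```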

03 · Document Format Parsing

Eight Formats. Eight Different Parsing Strategies. One Unified Pipeline.

No single parser handles every format optimally. The format routing table captures every dispatch decision — the library chosen, and why that specific library outperforms its alternatives for that document type.

// Document Parser Routing · Dispatched per document, not per deployment · Format detected via MIME + magic bytes

PDF (digital): pymupdf4llm → pdfplumber. Highest accuracy and structure preservation; pdfplumber fallback for complex multi-column tables.

PDF (scanned): Amazon Textract → Tesseract. Stronger OCR and table extraction for noisy scans and handwriting.

DOCX: Docling (IBM) → python-docx. Preserves hierarchy, sections, tables, and reading order.

XLSX: openpyxl → SQL Agent path. Rows are linearised into natural language; raw table embedding is avoided.

PPTX: python-pptx + VLM for visuals. One slide equals one chunk with notes and visual captions.

HTML: Trafilatura → BeautifulSoup. Boilerplate removed; JS-heavy pages rendered via Playwright first.

EML/MSG: Unstructured partition_email(). Recursive attachment parsing and thread reconstruction.

JSON/XML: custom flattener → Docling XML. Structural tokens are removed before embedding to avoid ~20% retrieval degradation.
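The routing table above can be expressed as a small dispatch map. This is a sketch, not the actual P3Fusion dispatcher: the format keys, the `Route` type, and `parse_document` are hypothetical, the parser names are plain labels rather than imports, and reading each "→" as "try the second option if the first fails" is an assumption.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Route(NamedTuple):
    primary: str
    secondary: Optional[str]  # assumed to be the fallback after "→"


ROUTES: Dict[str, Route] = {
    "pdf-digital": Route("pymupdf4llm", "pdfplumber"),
    "pdf-scanned": Route("Amazon Textract", "Tesseract"),
    "docx": Route("Docling", "python-docx"),
    "xlsx": Route("openpyxl", "SQL Agent path"),
    "pptx": Route("python-pptx", None),  # VLM captioning is a separate step
    "html": Route("Trafilatura", "BeautifulSoup"),
    "email": Route("Unstructured partition_email", None),
    "json-xml": Route("custom flattener", "Docling XML"),
}


def parse_document(fmt: str, data: bytes, parsers: Dict[str, Callable[[bytes], str]]) -> str:
    """Try the primary parser; on failure fall back to the secondary if one exists."""
    route = ROUTES[fmt]
    try:
        return parsers[route.primary](data)
    except Exception:
        if route.secondary is None or route.secondary not in parsers:
            raise
        return parsers[route.secondary](data)
```

Callers supply the real parser callables in `parsers`, keyed by the same labels, so the table stays a single place to review dispatch decisions.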

04 · Content-Type Indexing

One Document, Six Content Types. Each Needs Different Treatment.

A single regulatory PDF might contain prose sections, summary tables, embedded charts, API code snippets, and signature imagery. Applying one embedding strategy to all of these degrades retrieval accuracy. P3Fusion's content classifier routes each element to the processing path it needs.

Prose Text: Recursive split at 512 tokens with 15% overlap, respecting heading boundaries.

Tables: Multi-vector pattern: summary embedding for retrieval, raw table preserved for generation.

Images & Diagrams: Layout detection filters decorative elements; informational visuals are VLM-captioned.

Charts & Graphs: DePlot converts charts to markdown tables, then trends are summarised and indexed.

Code Blocks: Tree-sitter AST chunking by function and class boundaries; generated/vendor code excluded.

Math Formulas: VLM extraction to LaTeX plus natural-language descriptions for exact and conceptual lookups.
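To illustrate chunking code by function and class boundaries, here is a minimal sketch. The source names Tree-sitter; as a stand-in that needs no extra dependency, this example uses Python's built-in `ast` module, so it handles Python source only. The function name `chunk_python_source` is hypothetical.

```python
import ast


def chunk_python_source(source: str):
    """Return (name, source_text) for each top-level function and class.

    Module-level statements such as imports are skipped, and decorators
    are not included in the returned segment.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


example = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
for name, text in chunk_python_source(example):
    print(name, len(text.splitlines()))
```

Each chunk is a complete function or class, so no retrieval unit starts or ends in the middle of a definition.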

05 · Chunking Strategy

The Research Is Settled: Recursive at 512 Tokens Wins.

Benchmarks from 2025–2026 converge on the same recommendation. Recursive splitting at 512 tokens delivers the strongest end-to-end accuracy, while tiny semantic fragments often improve retrieval in isolation but degrade generation quality.
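A minimal sketch of recursive splitting at 512 tokens with 15% overlap. Two simplifications are assumed: tokens are approximated by whitespace-separated words (a real pipeline would count with the embedding model's tokenizer), and only paragraph, line, and word boundaries are tried, so heading-aware splitting is not shown. All function names are illustrative.

```python
SEPARATORS = ["\n\n", "\n", " "]  # coarsest boundary first


def count_tokens(text: str) -> int:
    """Approximate token count by whitespace-separated words (assumption)."""
    return len(text.split())


def split_recursive(text, max_tokens, separators=SEPARATORS):
    """Split on the coarsest separator, recursing only into oversized parts."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split by word count.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    pieces = []
    for part in text.split(separators[0]):
        pieces.extend(split_recursive(part, max_tokens, separators[1:]))
    return pieces


def chunk(text, max_tokens=512, overlap_ratio=0.15):
    """Greedily merge pieces into chunks of at most max_tokens words.

    Each chunk after the first starts with the last `overlap` words of the
    previous one. Whitespace is normalised to single spaces.
    """
    overlap = int(max_tokens * overlap_ratio)
    pieces = split_recursive(text, max_tokens - overlap)
    chunks, current = [], []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Pieces are capped at `max_tokens - overlap` so that carrying the overlap into the next chunk can never push it past the 512-token limit.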

Recursive, 512 tokens: 69% accuracy

Hierarchical parent-child: 67% accuracy

Late Chunking: +24.5% nDCG

Semantic, avg 43 tokens: poor end-to-end results

"Hierarchical parent-child chunking is where you go next, not where you start. Index 512-token children for precise retrieval. Return 2,048-token parents to the LLM for context."

— P3Fusion Engineering, InsightBot Chunking Architecture
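The parent-child pattern in the quote can be sketched as follows: 512-token children are what gets indexed, and each one carries a pointer to the 2,048-token parent that is handed to the LLM. Tokens are approximated by words, and the names `Child` and `build_parent_child` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Child:
    text: str
    parent_id: int  # index into the parents list


def build_parent_child(words: List[str], parent_size: int = 2048,
                       child_size: int = 512) -> Tuple[List[str], List[Child]]:
    """Cut the document into parents, then cut each parent into children."""
    parents: List[str] = []
    children: List[Child] = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_words))
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append(Child(" ".join(child_words), parent_id))
    return parents, children


words = [f"w{i}" for i in range(5000)]
parents, children = build_parent_child(words)
# Retrieval matches a child; generation receives that child's parent.
best_child = children[4]
context_for_llm = parents[best_child.parent_id]
print(len(parents), len(children), len(context_for_llm.split()))
```

Because children never cross a parent boundary, every retrieved child maps to exactly one parent.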

Hybrid retrieval is mandatory at this scale. Vector search captures conceptual similarity; BM25 captures exact terms like policy IDs and product codes. P3Fusion fuses both with Reciprocal Rank Fusion (RRF), yielding 15–30% retrieval gains over pure vector search.
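Reciprocal Rank Fusion is simple enough to show in full. Each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the value commonly used for RRF and is an assumption here, since the source does not state one. The document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # conceptual similarity
bm25_hits = ["c", "a", "d"]    # exact terms such as policy IDs
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Documents found by both retrievers ("a" and "c") rise above those found by only one, without any need to normalise the two retrievers' raw scores against each other.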

06 · Large Document Strategies

A 500-Page Contract Is Not Just a Long Document. It Is an Architecture Problem.

Flat chunking over very large documents creates too many competing chunks and amplifies lost-in-the-middle failure modes. P3Fusion applies specialised strategies once documents exceed 50 pages.

RAPTOR — Recursive Hierarchical Tree Indexing

+20% QuALITY benchmark accuracy

Late Chunking — Full Document Context Preserved in Every Chunk

+24.5% nDCG@10 vs naive chunking

Proposition Indexing — Atomic Factual Statements

High information density for legal and financial corpora

Two-Stage Retrieval with Cross-Encoder Reranking

+15–30% accuracy with no re-indexing required

"Reranking is the upgrade we recommend to every client after their first deployment stabilises. It changes selection quality without changing indexing infrastructure."

— P3Fusion Engineering, InsightBot Production Optimisation
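The two-stage shape can be sketched with a pluggable scoring function. In production the scorer would be a cross-encoder model that reads the query and passage together; the word-overlap scorer below is only a toy stand-in so the example runs without a model, and both function names are hypothetical.

```python
def rerank(query, candidates, score, top_k=5):
    """Stage two: re-order first-stage candidates with a stronger scorer."""
    ordered = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ordered[:top_k]


def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


first_stage = [
    "payment terms and invoices",
    "termination requires a notice period of 30 days",
    "notice of meeting",
]
top = rerank("termination notice period", first_stage, overlap_score, top_k=2)
print(top)
```

Only the selection step changes: the index that produced `first_stage` is untouched, which is why the quote describes reranking as an upgrade that needs no re-indexing.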

P3Fusion

AWS Generative AI Competency Partner. P3Fusion builds enterprise RAG systems — InsightBot for unstructured documents, FusionReport for structured databases, and custom RAG for any corpus. Every deployment uses format-aware ingestion and content-type-specific indexing from day one.
