15K Documents
Format-Aware · Hierarchical Indexing
How P3Fusion Ingests 15,000 Documents Across Every Format an Enterprise Throws at It
 
Enterprise clients bring P3Fusion documents in every format — scanned PDFs, Excel models, PowerPoint decks, email threads, internal HTML portals, JSON exports, and 500-page procurement contracts. This blueprint documents how P3Fusion's RAG ingestion pipeline handles each one: which parser, which chunking strategy, which embedding approach, and how large documents get hierarchical treatment so nothing gets lost.
 
Format-Aware Parsing
Content-Type Indexing
pymupdf4llm · Docling · Textract
Recursive · Hierarchical Chunking
RAPTOR · Late Chunking
Hybrid Search · BM25 + Vector + RRF
Amazon Bedrock · Titan V2
 
 
By the Numbers
 
Default chunk size: 512 tokens, 15% overlap
Retrieval: Hybrid (vector + BM25 + RRF)
Large doc threshold: 50+ pages → RAPTOR
Embedding model: Amazon Titan V2
Idempotency: SHA-256 content hash
 
 
 
Technical Blueprint Summary
The most common mistake in enterprise RAG is treating all documents the same: one parser and one chunking strategy, applied uniformly. A scanned insurance claim requires OCR before any parsing can happen. An Excel financial model embedded as raw tabular data causes a 45% accuracy drop on table-heavy queries. A 500-page procurement contract will overwhelm flat retrieval without hierarchical tree indexing. A code repository needs AST-based chunking that respects function boundaries. P3Fusion's ingestion pipeline routes every document through a format detection layer that dispatches each file to the right parser, classifies the content types it contains, and applies the chunking strategy that matches the document's size and structure.
01 · Why Format Matters
One Parser for Everything Is Not a Strategy. It Is a Source of Retrieval Failures.
The document corpus of a mid-sized enterprise is never a neat collection of PDFs. It is a living archive accumulated over years — contracts signed before digital workflows existed, Excel models built by analysts who never anticipated AI, PowerPoint presentations that are the only record of a strategic decision, email threads containing the actual context behind a policy document. Each format carries its own structure, its own failure modes, and its own rules for what constitutes a meaningful retrieval unit.
10+ document formats in a typical enterprise corpus
~50 chunks per document at the 512-token default chunk size
20% retrieval drop from embedding raw JSON without flattening
45% accuracy drop from embedding tables without the multi-vector pattern
These numbers explain why format-awareness is foundational. Embedding raw JSON causes a 20% retrieval degradation because roughly a quarter of all tokens are structural markers (braces, colons, quotes) that pull embeddings away from their semantic centre. Embedding tables without a summary causes a 45% accuracy drop on table-heavy queries. Every later stage, from chunking to retrieval, inherits whatever the parser gets wrong.
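To make the structural-marker problem concrete, here is a minimal sketch of a JSON flattener that turns nested data into plain "path: value" lines before embedding. The function name `flatten_json` and the dotted-path output format are assumptions for illustration; the source does not describe P3Fusion's custom flattener in this detail.

```python
from typing import Any, Iterator


def flatten_json(value: Any, path: str = "") -> Iterator[str]:
    """Yield one 'dotted.path: value' line per leaf, dropping braces and quotes."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from flatten_json(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{path}[{index}]")
    else:
        # Leaf value: emit it with the path that led here.
        yield f"{path}: {value}"


# Illustrative record (not taken from any client corpus).
record = {"policy": {"id": "P-1042", "limits": [50000, 100000]}, "active": True}
flat_text = "\n".join(flatten_json(record))
print(flat_text)
```

The flattened text keeps every key and value but none of the braces, quotes, or commas, so the tokens that reach the embedding model are the ones that carry meaning.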
 
02 · The Ingestion Pipeline
An Event-Driven Pipeline That Handles Every Document Correctly.
P3Fusion's ingestion pipeline is event-driven and fully parallelised. Documents land in Amazon S3, triggering SQS messages that feed 20 parallel ECS Fargate workers. Each worker runs the full pipeline (format detection, parsing, content classification, chunking, embedding, indexing) independently. If a worker encounters a malformed document, it retries up to five times and then routes the document to a dead-letter queue for review, without affecting any other document in the batch.
// P3Fusion RAG Ingestion Pipeline · Event-Driven · Fully Parallelised · 20 concurrent ECS workers
1. S3 Upload: Event notification → SQS
2. Detect Format: MIME + magic bytes
3. Parse: Format-specific parser
4. Classify Content: Text · Table · Image · Code
5. Chunk: Content-type-aware
6. Embed: Titan V2 · Bedrock
7. Index: Vector + keyword written
Format detection reads the first 512 bytes of each file to identify its format from magic bytes (the binary signature at the start of most binary file formats), then confirms with MIME type. File extensions are never trusted: a PDF renamed with a .docx extension will break any parser that relies on the extension. Incremental daily updates are event-driven on S3 PutObject notifications and complete within minutes without touching existing indexed documents.
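The magic-byte check can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not P3Fusion's detector: the function name `detect_format` and the returned labels are hypothetical, and because DOCX, XLSX, and PPTX all share the ZIP signature, the sketch opens the archive to tell them apart, which needs more than the first 512 bytes.

```python
import io
import zipfile

PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # DOCX, XLSX and PPTX are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office files and Outlook .msg

# Top-level folder inside the archive that identifies each OOXML format.
OOXML_ROOTS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def detect_format(data: bytes) -> str:
    """Guess a file format from its content, ignoring the file name entirely."""
    head = data[:512]
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                for name in archive.namelist():
                    for root, fmt in OOXML_ROOTS.items():
                        if name.startswith(root):
                            return fmt
        except zipfile.BadZipFile:
            pass  # truncated or corrupt archive: report it as a plain ZIP
        return "zip"
    text = head.lstrip()
    if text[:1] in (b"{", b"["):
        return "json"
    if text[:1] == b"<":
        return "xml-or-html"
    return "unknown"


print(detect_format(b"%PDF-1.7\n"))
```

Text formats such as JSON have no magic bytes, so the last checks are heuristics on the first non-whitespace character; a MIME confirmation step, as described above, would still be needed.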
 
03 · Document Format Parsing
Eight Formats. Eight Different Parsing Strategies. One Unified Pipeline.
No single parser handles every format optimally. The format routing table captures every dispatch decision — the library chosen, and why that specific library outperforms its alternatives for that document type.
// Document Parser Routing · Dispatched per document, not per deployment · Format detected via MIME + magic bytes
PDF (digital): pymupdf4llm → pdfplumber. Highest accuracy and structure preservation; pdfplumber fallback for complex multi-column tables.
PDF (scanned): Amazon Textract → Tesseract. Stronger OCR and table extraction for noisy scans and handwriting.
DOCX: Docling (IBM) → python-docx. Preserves hierarchy, sections, tables, and reading order.
XLSX: openpyxl → SQL Agent path. Rows are linearised into natural language; raw table embedding is avoided.
PPTX: python-pptx + VLM for visuals. One slide equals one chunk with notes and visual captions.
HTML: Trafilatura → BeautifulSoup. Boilerplate removed; JS-heavy pages rendered via Playwright first.
EML/MSG: Unstructured partition_email(). Recursive attachment parsing and thread reconstruction.
JSON/XML: custom flattener → Docling XML. Structural tokens are removed before embedding to avoid ~20% retrieval degradation.
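The routing table can be read as a primary parser with a fallback. The sketch below shows that dispatch pattern only. The parser names are copied from the table as plain strings, the arrow is assumed to mean "fall back to", and `parse_with_fallback` and the two stub functions are hypothetical stand-ins so the example runs without the real libraries.

```python
from typing import Callable, Sequence

Parser = Callable[[bytes], str]

# Primary and fallback parser names per format, copied from the routing table.
ROUTES: dict[str, tuple[str, str]] = {
    "pdf_digital": ("pymupdf4llm", "pdfplumber"),
    "pdf_scanned": ("Amazon Textract", "Tesseract"),
    "docx": ("Docling", "python-docx"),
    "html": ("Trafilatura", "BeautifulSoup"),
}


def parse_with_fallback(
    data: bytes, parsers: Sequence[tuple[str, Parser]]
) -> tuple[str, str]:
    """Try each parser in order; return (parser name, text) for the first success."""
    errors = []
    for name, parser in parsers:
        try:
            text = parser(data)
        except Exception as exc:  # any parser failure moves on to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the real parsers.
def primary_stub(data: bytes) -> str:
    raise ValueError("unsupported multi-column layout")


def fallback_stub(data: bytes) -> str:
    return data.decode("utf-8")


primary_name, fallback_name = ROUTES["pdf_digital"]
used, text = parse_with_fallback(
    b"Clause 4.2 Termination",
    [(primary_name, primary_stub), (fallback_name, fallback_stub)],
)
print(used, "->", text)
```

If every parser in the chain fails, the raised error is the point at which a worker would retry and eventually hand the document to the dead letter queue.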
 
04 · Content-Type Indexing
One Document, Six Content Types. Each Needs Different Treatment.
A single regulatory PDF might contain prose sections, summary tables, embedded charts, API code snippets, and signature imagery. Applying one embedding strategy to all of these degrades retrieval accuracy. P3Fusion's content classifier routes each element to the processing path it needs.
Prose Text: Recursive split at 512 tokens with 15% overlap, respecting heading boundaries.
Tables: Multi-vector pattern: summary embedding for retrieval, raw table preserved for generation.
Images & Diagrams: Layout detection filters decorative elements; informational visuals are VLM-captioned.
Charts & Graphs: DePlot converts charts to markdown tables, then trends are summarised and indexed.
Code Blocks: Tree-sitter AST chunking by function and class boundaries; generated/vendor code excluded.
Math Formulas: VLM extraction to LaTeX plus natural-language descriptions for exact and conceptual lookups.
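The multi-vector pattern for tables is easier to see in code: the text that is searched is a short summary, while the text handed to the LLM is the untouched table. The sketch below is a toy under loud assumptions. `TableEntry`, `build_entry`, and `retrieve_table` are hypothetical names, the summary is a fixed template rather than an LLM-written one, and word overlap stands in for vector similarity.

```python
import re
from dataclasses import dataclass


@dataclass
class TableEntry:
    table_id: str
    summary: str    # short text that would be embedded and searched
    raw_table: str  # original table, passed to the LLM unchanged


def to_markdown(rows: list[dict[str, object]]) -> str:
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


def build_entry(table_id: str, title: str, rows: list[dict[str, object]]) -> TableEntry:
    # Template summary; a production system would likely generate this with an LLM.
    columns = ", ".join(rows[0].keys())
    summary = f"{title}. Columns: {columns}. Rows: {len(rows)}."
    return TableEntry(table_id, summary, to_markdown(rows))


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def retrieve_table(query: str, entries: list[TableEntry]) -> str:
    """Match the query against summaries, but return the raw table."""
    best = max(entries, key=lambda e: len(words(query) & words(e.summary)))
    return best.raw_table


entries = [
    build_entry("t1", "Q3 premiums by region",
                [{"region": "EMEA", "premium": 1200}, {"region": "APAC", "premium": 950}]),
    build_entry("t2", "Claims by year", [{"year": 2023, "claims": 40}]),
]
print(retrieve_table("premium by region", entries))
```

The design point is the split between the two fields: retrieval never sees the cell grid, and generation never sees a lossy summary.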

 
05 · Chunking Strategy
The Research Is Settled: Recursive at 512 Tokens Wins.
Benchmarks from 2025–2026 converge on the same recommendation. Recursive splitting at 512 tokens delivers the strongest end-to-end accuracy. Tiny semantic fragments often improve retrieval scores in isolation, but they give the LLM too little context and hurt generation quality.
Recursive · 512 tok: 69% acc.
Hierarchical parent-child: 67% acc.
Late Chunking: +24.5% nDCG
Semantic · avg 43 tokens: Poor E2E
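As a rough illustration of the default strategy, the sketch below splits recursively on progressively finer separators and then merges pieces into chunks of at most 512 tokens with a 15% overlap. It makes two simplifying assumptions: whitespace-separated words stand in for model tokens, and separators are dropped at split points. The function names are hypothetical and this is not P3Fusion's splitter.

```python
SEPARATORS = ["\n\n", "\n", ". "]  # paragraph, line, sentence


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word counts as one token.
    return len(text.split())


def split_recursive(text: str, max_tokens: int, separators=SEPARATORS) -> list[str]:
    """Split into pieces of at most max_tokens, preferring coarse separators."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to fixed windows of words.
        tokens = text.split()
        return [" ".join(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), max_tokens)]
    head, *rest = separators
    pieces: list[str] = []
    for part in text.split(head):
        pieces.extend(split_recursive(part, max_tokens, rest))
    return pieces


def chunk_text(text: str, max_tokens: int = 512, overlap_ratio: float = 0.15) -> list[str]:
    overlap = int(max_tokens * overlap_ratio)  # 76 tokens at the defaults
    budget = max_tokens - overlap              # room left after the carried-over tail
    chunks: list[str] = []
    current: list[str] = []
    for piece in split_recursive(text, budget):
        piece_tokens = piece.split()
        if current and len(current) + len(piece_tokens) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else []
        current.extend(piece_tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = " ".join(f"w{i}" for i in range(1200))
sample_chunks = chunk_text(sample)
print([count_tokens(c) for c in sample_chunks])
```

Because every piece fits in the budget left after the overlap, no chunk can exceed the 512-token limit even after the tail of the previous chunk is prepended.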

"Hierarchical parent-child chunking is where you go next, not where you start. Index 512-token children for precise retrieval. Return 2,048-token parents to the LLM for context."

— P3Fusion Engineering, InsightBot Chunking Architecture
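The parent-child idea in the quote can be sketched as two linked collections: small children that are searched, and large parents that are returned. The sizes come from the quote; everything else is an assumption for illustration, including the function names, the use of whitespace words as tokens, and the use of list positions as chunk IDs.

```python
def build_parent_child(tokens: list[str], parent_size: int = 2048, child_size: int = 512):
    """Return (parents, children); each child records the ID of its parent."""
    parents: dict[int, str] = {}
    children: list[tuple[int, str]] = []
    for parent_id, start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[start:start + parent_size]
        parents[parent_id] = " ".join(parent_tokens)
        for child_start in range(0, len(parent_tokens), child_size):
            child = " ".join(parent_tokens[child_start:child_start + child_size])
            children.append((parent_id, child))
    return parents, children


def parents_for_hits(hit_indices: list[int], children, parents) -> list[str]:
    """Map retrieved child indices to their parents, deduplicated, in hit order."""
    seen: list[int] = []
    for index in hit_indices:
        parent_id = children[index][0]
        if parent_id not in seen:
            seen.append(parent_id)
    return [parents[parent_id] for parent_id in seen]


tokens = [f"w{i}" for i in range(5000)]
parents, children = build_parent_child(tokens)
context = parents_for_hits([0, 1, 5], children, parents)
print(len(parents), len(children), len(context))
```

Two hits in the same parent return that parent once, which keeps the context window from filling up with duplicated text.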

Hybrid retrieval is mandatory at this scale. Vector search captures conceptual similarity; BM25 captures exact terms like policy IDs and product codes. P3Fusion fuses both with Reciprocal Rank Fusion (RRF), yielding 15–30% retrieval gains over pure vector search.
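Reciprocal Rank Fusion itself is only a few lines. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below assumes k = 60, a commonly used default that the source does not confirm for P3Fusion, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]  # ranked by embedding similarity
bm25_hits = ["c2", "c4", "c1"]    # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

RRF uses ranks only, so the vector and BM25 scores never need to be normalised onto a common scale.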
 
06 · Large Document Strategies
A 500-Page Contract Is Not Just a Long Document. It Is an Architecture Problem.
Flat chunking over very large documents creates too many competing chunks and amplifies lost-in-the-middle failure modes. P3Fusion applies specialised strategies once documents exceed 50 pages.
1. RAPTOR (recursive hierarchical tree indexing): +20% QuALITY benchmark accuracy
2. Late Chunking (full document context preserved in every chunk): +24.5% nDCG@10 vs naive chunking
3. Proposition Indexing (atomic factual statements): high information density for legal and financial corpora
4. Two-Stage Retrieval with Cross-Encoder Reranking: +15–30% accuracy with no re-indexing required

"Reranking is the upgrade we recommend to every client after their first deployment stabilises. It changes selection quality without changing indexing infrastructure."

— P3Fusion Engineering, InsightBot Production Optimisation
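Two-stage retrieval follows a simple shape: a cheap scorer builds a shortlist, then a more expensive scorer reorders only that shortlist. The sketch below shows the control flow with toy scorers. `word_overlap` and `phrase_aware` are hypothetical stand-ins for the first-stage retriever and the cross-encoder, and the `recall_k` and `final_k` defaults are illustrative, not values from the source.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    passages: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    recall_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(passages, key=lambda p: first_stage(query, p), reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda p: reranker(query, p), reverse=True)[:final_k]


def word_overlap(query: str, passage: str) -> float:
    # Toy first stage: number of shared lowercase words.
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def phrase_aware(query: str, passage: str) -> float:
    # Toy reranker: rewards passages that contain the query as an exact phrase.
    bonus = 5.0 if query.lower() in passage.lower() else 0.0
    return word_overlap(query, passage) + bonus


passages = [
    "the clause on termination fees",
    "early termination clause penalties",
    "unrelated pricing schedule",
    "termination of employment",
]
top = retrieve_then_rerank("termination clause", passages, word_overlap, phrase_aware,
                           recall_k=3, final_k=2)
print(top)
```

Only the second scorer changes when a reranker is added, which is why the index itself does not need to be rebuilt.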

Pipeline Overview
Documents: 15,000+
Ingestion mode: Event-driven · SQS
Parallel workers: 20 ECS Fargate
Ingestion time: 4–6 hours
Idempotency: SHA-256 content hash
Default chunk: 512 tokens · 15% overlap
Retrieval: Hybrid: vector + BM25
Fusion method: Reciprocal Rank Fusion
Large doc threshold: 50+ pages → RAPTOR
Failure handling: SQS DLQ · 5 retries
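The SHA-256 idempotency check listed above can be sketched as a lookup of content hashes that have already been processed. This is a minimal in-memory illustration. The `IngestionLedger` class is hypothetical, and the source does not say where P3Fusion stores its hashes or whether they are keyed by content alone.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


class IngestionLedger:
    """Remembers which content hashes have already been ingested."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_ingest(self, data: bytes) -> bool:
        digest = content_hash(data)
        if digest in self._seen:
            return False  # identical bytes were processed before: skip
        self._seen.add(digest)
        return True


ledger = IngestionLedger()
print(ledger.should_ingest(b"contract v1"))  # first time: ingest
print(ledger.should_ingest(b"contract v1"))  # redelivered message: skip
print(ledger.should_ingest(b"contract v2"))  # changed content: ingest
```

Because SQS can deliver the same message more than once, a check like this lets a worker repeat a job safely without writing duplicate chunks to the index.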
 
Document Formats Supported
PDF digital
PDF scanned
DOCX
XLSX
PPTX
HTML
EML / MSG
JSON / XML
Code files
Markdown / TXT
 
P3Fusion

AWS Generative AI Competency Partner. P3Fusion builds enterprise RAG systems — InsightBot for unstructured documents, FusionReport for structured databases, and custom RAG for any corpus. Every deployment uses format-aware ingestion and content-type-specific indexing from day one.

