The Hidden Ceiling: How OCR Quality Limits RAG Performance

22 min read
May 14, 2025
Retrieval-Augmented Generation (RAG) has become the default way to connect Large Language Models (LLMs) with enterprise data. However, there's a critical flaw in this approach that's rarely discussed: nearly all production RAG pipelines rely on Optical Character Recognition (OCR) to process PDFs, scans, presentations, and other documents, with the silent assumption that the extracted text is "good enough" for downstream AI tasks.
Our comprehensive analysis shows that this assumption is fundamentally flawed. OCR quality creates an invisible ceiling that limits the performance of even the most advanced RAG systems. The gap between what's possible with perfect text extraction and what's achieved with current OCR technology represents one of the most significant yet overlooked challenges in enterprise AI today.
TLDR:
- OCR creates an invisible performance ceiling. Text extraction errors significantly limit both retrieval accuracy and generation quality in RAG systems.
- Benchmarks reveal a substantial gap. Even leading OCR solutions fall ~4.5% short (NDCG@5) of ground-truth text performance, particularly with complex document layouts.
- Vision-only generation is not ready yet. Despite rapid improvements, multimodal models still cannot reliably generate precise answers directly from multiple document images.
- Multimodal retrieval beats perfect text. Our vector stores outperform even perfect text by ~12% on retrieval accuracy (NDCG@5) and recover 70% of generation quality lost to OCR errors, while simultaneously simplifying architecture and enhancing future compatibility.
Why OCR is still critical for AI systems
Most enterprise knowledge is locked in unstructured formats like PDFs, scanned documents, invoices, presentations, images, and a plethora of other formats. Before a Large Language Model (LLM) can reason over this knowledge, it needs to be converted from its original visual or semi-structured format into plain text.
This text conversion step, typically handled by OCR engines, is crucial because it feeds two core components of a RAG system:
- The Retrieval System: Most retrieval systems depend on extracted text as their main search input. When OCR quality is poor, it produces inaccurate or "corrupted" text representations of your documents, making it difficult or impossible for the retrieval system to locate the relevant documents when a user asks a question. If the text doesn't accurately reflect the content, the search fails before it even begins.
- The Generation Model (LLM): LLMs generate answers based only on the context they are given. If the retrieved document snippets contain OCR errors (missing words, jumbled tables, incorrect numbers), the LLM receives flawed information. This directly leads to incomplete, nonsensical, or factually incorrect answers, even if the retrieval system managed to find the correct document pages.
In short, errors introduced by OCR don't just stay in the text; they cascade through the entire RAG pipeline, impacting both the ability to find information and the ability to generate accurate answers from it.
Putting OCR to the Test: Our Benchmark Setup
To quantify this "OCR ceiling" and understand its real-world impact, we needed a robust way to measure performance across diverse and challenging documents. We conducted extensive testing using the OHR (OCR hinders RAG) Benchmark v2.
This benchmark is specifically designed to evaluate how OCR performance affects RAG tasks and includes:
- Diverse & Challenging Documents: 8,500+ PDF pages across seven enterprise domains (Textbooks, Law, Finance, Newspapers, Manuals, Academic Papers, Administrative) featuring complex layouts, tables, formulas, charts, diagrams, and non-standard reading orders that are known to challenge OCR systems.
- Targeted Questions: 8,498 question-answer pairs specifically designed to test retrieval and understanding of information related to these OCR challenges. Each answer is grounded in specific evidence pages within the documents.
- Verified Ground Truth: Human-verified, perfect text extraction and curated answers provide a reliable "gold standard" for comparison.
Against this benchmark, we evaluated a range of OCR and retrieval approaches:
Tested OCR & Retrieval Solutions:
- Gemini 2.5 Flash: A frontier closed-source multimodal model capable of OCR.
- MinerU: A popular open-source library implementing state-of-the-art OCR methods from academic literature.
- Azure Document Intelligence: A widely used commercial OCR solution in the industry.
- Qwen-2.5-VL: A frontier open-source multimodal model capable of OCR.
- Unstructured: A popular open-source library with broad adoption for document parsing.
- Mixedbread Vector Store: Our core offering, using native multimodal retrieval (treating pages as images, not just text) powered by our internal multimodal model (mxbai-omni-v0.1). It bypasses traditional reliance on OCR for retrieval.
This comprehensive setup allowed us to isolate the impact of different OCR qualities and compare text-based approaches directly against our multimodal retrieval system.
Testing Retrieval: Setup and Results
First, we focused on retrieval - the task of finding the right information within the vast document set. If your RAG system can't surface the correct documents, the LLM has no chance of answering the user's query accurately.
Retrieval Setup
We transformed the OHR benchmark's question-answer pairs into a retrieval task: the question became the query, and the associated evidence pages were the target documents to retrieve.
For the text-based OCR methods, we used BM25, a standard and robust keyword-based ranking algorithm commonly used in search engines. (We tested embedding-based retrieval and rerankers too, but found they often degraded performance on this benchmark compared to the strong BM25 baseline, likely due to OCR noise corrupting the embeddings. You can find more details here.)
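To make the text-retrieval setup concrete, here is a minimal sketch of BM25 page retrieval using the open-source rank_bm25 package. The sample pages and whitespace tokenizer are illustrative stand-ins, not the benchmark's exact preprocessing:

```python
from rank_bm25 import BM25Okapi

# Stand-in corpus: in practice each entry is the OCR text of one PDF page.
ocr_pages = [
    "Total operating revenue for 2024 was $2,325,472 ...",
    "Portfolio allocation shifted toward equities during Q3 ...",
    "R&D expenses represented 9.9% of net revenue in 2025 ...",
]

def tokenize(text: str) -> list[str]:
    return text.lower().split()

bm25 = BM25Okapi([tokenize(page) for page in ocr_pages])

query = "What was total operating revenue in 2024?"
scores = bm25.get_scores(tokenize(query))

# Keep the highest-scoring pages as retrieval results (top 5 in the benchmark).
top5 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print([(i, round(float(scores[i]), 2)) for i in top5])
```

If the OCR garbles a key term (for example, a handwritten figure or a table header), that term simply never enters the BM25 index, which is exactly how extraction errors translate into retrieval misses.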
For the Mixedbread Vector Store, we leveraged our multimodal embedding model (mxbai-omni-v0.1), which directly processes screenshots of the document pages. This approach is inherently resilient to OCR errors because it "sees" the page layout, structure, and visual elements alongside the text.
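As an illustration of page-screenshot retrieval, the sketch below embeds page images and a text query with an openly available CLIP model from sentence-transformers. The CLIP model is a generic stand-in (mxbai-omni-v0.1 itself is served through our platform), and the folder layout and query are made up for the example:

```python
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Open CLIP model as a stand-in multimodal embedder (not mxbai-omni-v0.1 itself).
model = SentenceTransformer("clip-ViT-B-32")

page_paths = sorted(Path("pages").glob("*.png"))            # one screenshot per PDF page
page_embeddings = model.encode([Image.open(p) for p in page_paths], convert_to_tensor=True)

query_embedding = model.encode("What was total operating revenue in 2024?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, page_embeddings)[0].tolist()  # similarity to every page image

# Rank pages by similarity and keep the top 5 (or fewer if the corpus is small).
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: min(5, len(page_paths))]
for i in ranked:
    print(f"{page_paths[i]}  score={scores[i]:.3f}")
```

Because the embedding is computed from the rendered page, no text extraction step sits between the document and the index, so there is nothing for OCR errors to corrupt.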
We measured retrieval performance using two standard metrics:
Retrieval Metric Definitions (a short scoring sketch follows the list):
- NDCG@5 (Normalized Discounted Cumulative Gain @ 5): This metric evaluates the quality of the top 5 retrieved documents. It cares not only if the correct documents are found but also how highly ranked they are. Relevant documents ranked higher get more points. We chose K=5 because research shows LLMs are heavily influenced by the order of documents in their context window, with earlier documents having more impact.
- Recall@5: This metric measures whether at least one of the correct evidence pages was retrieved within the top 5 results. It tells us if the necessary information was surfaced at all, regardless of its exact ranking.
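Here is the short scoring sketch referenced above, using binary relevance; the page IDs are illustrative and this is not the benchmark's exact evaluation code:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # 1.0 if at least one relevant evidence page appears in the top k, else 0.0
    return float(any(doc in relevant for doc in retrieved[:k]))

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Binary relevance: each evidence page at rank r contributes 1 / log2(r + 1)
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the two evidence pages appear at ranks 1 and 4 of the top 5
print(ndcg_at_k(["p7", "p2", "p9", "p3", "p5"], {"p7", "p3"}))   # ≈ 0.88
print(recall_at_k(["p7", "p2", "p9", "p3", "p5"], {"p7", "p3"}))  # 1.0
```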
Retrieval Results: The OCR Ceiling is Real
Our retrieval benchmarks revealed stark differences between traditional OCR-dependent methods and our multimodal approach.
NDCG@5 Performance (Average across all 7 document domains)
This chart shows NDCG@5 scores for each retrieval method, averaged across seven document domains. NDCG@5 measures both the presence and ranking of relevant documents in the top 5—higher values mean more accurate retrieval, with extra weight for top-ranked relevant pages.
Full NDCG@5 Results by Domain:
Domain | Gemini 2.5 Flash | MinerU | Mixedbread OCR | Qwen-2.5-VL | Azure | Unstructured | Mixedbread Vector Store | Ground Truth OCR |
---|---|---|---|---|---|---|---|---|
academic | 0.805 | 0.786 | 0.795 | 0.822 | 0.797 | 0.693 | 0.923 | 0.845 |
administration | 0.861 | 0.776 | 0.842 | 0.853 | 0.854 | 0.672 | 0.920 | 0.895 |
finance | 0.656 | 0.576 | 0.636 | 0.666 | 0.664 | 0.517 | 0.773 | 0.722 |
law | 0.876 | 0.829 | 0.871 | 0.873 | 0.889 | 0.724 | 0.913 | 0.897 |
manual | 0.800 | 0.756 | 0.820 | 0.834 | 0.828 | 0.721 | 0.923 | 0.861 |
news | 0.442 | 0.438 | 0.454 | 0.415 | 0.460 | 0.111 | 0.686 | 0.467 |
textbook | 0.624 | 0.572 | 0.673 | 0.698 | 0.671 | 0.159 | 0.915 | 0.720 |
avg | 0.723 | 0.676 | 0.727 | 0.737 | 0.738 | 0.514 | 0.865 | 0.773 |
Recall@5 Performance (Average across all 7 document domains)
This chart shows Recall@5 for each method, averaged across domains. Recall@5 is the percentage of questions where at least one correct evidence page appeared in the top 5—higher is better.
Full Recall@5 Results by Domain:
Domain | Gemini 2.5 Flash | MinerU | Mixedbread OCR | Qwen-2.5-VL | Azure | Unstructured | Mixedbread Vector Store | Ground Truth OCR |
---|---|---|---|---|---|---|---|---|
academic | 0.902 | 0.885 | 0.896 | 0.911 | 0.902 | 0.789 | 0.982 | 0.937 |
administration | 0.930 | 0.857 | 0.920 | 0.930 | 0.931 | 0.735 | 0.967 | 0.959 |
finance | 0.778 | 0.677 | 0.760 | 0.781 | 0.783 | 0.625 | 0.883 | 0.836 |
law | 0.933 | 0.890 | 0.929 | 0.932 | 0.948 | 0.775 | 0.968 | 0.951 |
manual | 0.874 | 0.844 | 0.904 | 0.912 | 0.915 | 0.802 | 0.971 | 0.932 |
news | 0.479 | 0.468 | 0.489 | 0.458 | 0.493 | 0.115 | 0.767 | 0.499 |
textbook | 0.644 | 0.600 | 0.700 | 0.728 | 0.702 | 0.168 | 0.936 | 0.746 |
avg | 0.791 | 0.746 | 0.800 | 0.807 | 0.811 | 0.573 | 0.925 | 0.837 |
These results reveal several critical insights:
- OCR Creates a Performance Ceiling: Every single OCR solution tested underperformed the Ground Truth benchmark built on perfect text. The best OCR methods plateaued around 0.74 average NDCG@5, roughly 3.5 points (a ~4.5% relative gap) below the Ground Truth's 0.773. This confirms that OCR errors inherently limit retrieval effectiveness.
- Complexity Magnifies OCR Issues: The performance gap widens for documents with complex layouts (finance, textbooks, news). These domains often feature tables, formulas, multi-column text, etc., that challenge OCR.
- Multimodal Excels by Seeing the Whole Picture: Mixedbread Vector Store consistently outperformed all other methods, including the perfect text Ground Truth benchmark. Its average NDCG@5 of 0.865 is nearly 12% higher than Ground Truth text because it understands the visual context (layout, tables, charts) directly from the image, providing richer relevance cues.
The Recall@5 increases from 0.84 using Ground Truth text to 0.92 using the Mixedbread Vector Store. Let's put this in perspective:
- With Ground Truth (perfect OCR): Recall@5 = 84% → for 84 out of every 100 questions, at least one correct evidence page appears in the top 5.
- With Mixedbread Vector Store: Recall@5 = 92% → for 92 out of every 100 questions, at least one correct evidence page appears in the top 5.
This 8-point absolute improvement (~9.5% relative) in recall represents a substantial gain in retrieval performance. These retrieval benchmarks quantify the hidden ceiling imposed by relying solely on OCR. While better OCR helps, the results strongly indicate that a multimodal approach represents a fundamental leap forward.
Testing Generation: Setup and Results
Okay, so multimodal retrieval finds better documents, overcoming the OCR ceiling. But does this improved retrieval actually translate into more accurate final answers from the LLM? To find out, we tested the end-to-end RAG performance.
Generation Setup
We set up four scenarios, feeding the top 5 retrieved documents from each into the same powerful LLM (gemini-2.5-flash-preview-04-17) for answer generation:
- Perfect OCR & Perfect Retrieval (Ground Truth): Using the human-verified text for generation and the true evidence pages as input ('Perfect Retrieval'). This represents the theoretical maximum performance achievable with current models given perfect extraction and the correct context.
- Perfect OCR & BM25 Retrieval: Using the human-verified text both for BM25 retrieval of the top 5 passages and as the generation context. This is the quality you would get if your OCR were perfect with current technology.
- Mixedbread OCR (Text-Based RAG): Using text extracted by our high-quality OCR engine both for BM25 retrieval of the top 5 passages and as the generation context. This mirrors a standard, good-quality text-only RAG pipeline.
- Mixedbread Vector Store (Multimodal Retrieval): Using our multimodal model to retrieve the top 5 page images, but then using the corresponding clean text extracted by Mixedbread OCR as the generation context. This isolates the benefit of visual retrieval while keeping the generation input modality (text) consistent.
To measure success, we focused on the Correct Answers rate. We used GPT-4.1 as an impartial judge, providing it with the original question, the ground truth answer, the ground truth evidence text, and the answer generated by gemini-2.5-flash-preview-04-17 in each scenario. The final score is simply the number of correct answers divided by the total number of questions.
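A judging loop of this kind can be wired up roughly as follows. This sketch assumes the OpenAI Python client; the prompt wording and the judge() helper are illustrative, not the exact prompt used in our evaluation:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Ground-truth answer: {gold_answer}
Ground-truth evidence: {evidence}
Generated answer: {generated}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, gold_answer: str, evidence: str, generated: str) -> bool:
    # One judge call per question; temperature 0 keeps verdicts deterministic.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold_answer=gold_answer,
            evidence=evidence, generated=generated)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

# Correct Answers rate = correct verdicts / total questions, e.g.:
# correct_rate = sum(judge(*row) for row in eval_rows) / len(eval_rows)
```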
Generation Results: Better Retrieval = Better Answers
The generation tests confirmed our hypothesis: superior retrieval leads directly to more accurate answers.
Correct Answers Rate
This chart shows the percentage of correct answers from each generation setup, averaged across 7 domains and judged by GPT-4.1. Higher values mean the LLM produced more answers that match the ground truth.
Full Correct-Answer Rates by Domain:
Domain | Mixedbread OCR (ret & gen) | Perfect OCR + ret. | Mixedbread Vector Store (ret) + Mixedbread OCR (gen) | Perfect OCR + Perfect ret. |
---|---|---|---|---|
academic | 0.711 | 0.797 | 0.876 | 0.904 |
administration | 0.714 | 0.812 | 0.846 | 0.896 |
finance | 0.618 | 0.686 | 0.742 | 0.877 |
law | 0.866 | 0.898 | 0.909 | 0.950 |
manual | 0.782 | 0.825 | 0.888 | 0.914 |
news | 0.435 | 0.447 | 0.753 | 0.951 |
textbook | 0.607 | 0.715 | 0.885 | 0.896 |
avg | 0.676 | 0.740 | 0.843 | 0.912 |
Key takeaways from the generation tests:
- OCR Flaws Amplify During Generation: Relying on standard OCR for both retrieval and generation resulted in a 25.9% relative decrease in correct answers compared to perfect text with perfect retrieval (0.676 vs. 0.912). Flawed input context significantly degrades the LLM's ability to generate accurate answers.
- Better Retrieval Dramatically Boosts Correct Answers: Simply swapping standard OCR-based retrieval for Mixedbread Vector Store's multimodal retrieval – while still using the same potentially imperfect OCR text for generation – lifted the average correct answer rate from 0.676 to 0.843. This single change recovered 70% of the accuracy lost to the limitations of a standard OCR-based pipeline (see the worked calculation after this list).
- Finding the Right Pages is Paramount: The quality of retrieval is often more critical than perfect text in the generation context. Getting the correct documents into the LLM's view, even with minor OCR imperfections, is far more beneficial than feeding the LLM slightly cleaner text from the wrong documents.
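The 70% figure in the second takeaway follows directly from the averages in the table above (0.676 for the OCR-only pipeline, 0.843 with multimodal retrieval, 0.912 for the perfect-text, perfect-retrieval upper bound):

$$
\frac{0.843 - 0.676}{0.912 - 0.676} \;=\; \frac{0.167}{0.236} \;\approx\; 0.71
$$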
These generation benchmarks confirm that state-of-the-art multimodal retrieval can mitigate a large portion of the negative downstream effects of OCR errors.
Direct Image Generation: Is Vision-Only RAG Ready?
Given the success of using visual information for retrieval, a natural question arises: can we skip OCR entirely, even for the generation step? What if we feed the images of the retrieved pages directly to a powerful multimodal LLM like Gemini 2.5 Flash and ask it to generate the answer by "reading" the images? We tested this "Direct Image Understanding" approach:
Correct Answers Rate (Average across 3 document domains)
Retrieval Method | Generation Input | Avg. Correct Answers | Performance vs. Perfect OCR |
---|---|---|---|
Perfect OCR (Ground Truth) | Perfect OCR Text | 0.899 | ±0.0% (Baseline) |
Mixedbread Vector Store | Mixedbread OCR Text | 0.869 | -3.3% |
Mixedbread OCR | Mixedbread OCR Text | 0.678 | -24.6% |
Mixedbread Vector Store | Direct Image Input | 0.627 | -30.3% |
Full Direct Image Input Comparison by Domain:
Domain | Mixedbread OCR (ret. & gen.) | Mixedbread Vector Store (ret.) + Mixedbread OCR (gen.) | Mixedbread Vector Store (ret.) + Direct Image Input (gen.) | Perfect OCR + Retrieval |
---|---|---|---|---|
academic | 0.712 | 0.876 | 0.534 | 0.904 |
administration | 0.715 | 0.846 | 0.672 | 0.896 |
textbook | 0.607 | 0.885 | 0.675 | 0.896 |
avg | 0.678 | 0.869 | 0.627 | 0.899 |
The results were surprising:
- Direct Image Input Lags Significantly: Feeding page images directly to the LLM for generation yielded the lowest average correct answers (0.627).
- Visual Retrieval vs. Visual Generation: Multimodal models excel at using visual cues for retrieval, but current models still struggle with fine-grained extraction directly from pixels across multiple documents during generation, compared to working with pre-processed text.
- Quality OCR Text Still Best for Generation (For Now): Providing clean, explicit text to the LLM currently leads to the most accurate answers.
In essence: While fully visual RAG is an exciting possibility, today's reality is that combining the strengths of multimodal retrieval with high-quality OCR text for generation provides the best overall performance.
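For completeness, the direct-image scenario we tested looks roughly like the sketch below, here written against the google-generativeai SDK; the prompt, file names, and page selection are illustrative, not our exact harness:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash-preview-04-17")

question = "What is the total revenue of TSC Communications?"
# Screenshots of the top retrieved pages (file names are placeholders).
pages = [Image.open(p) for p in ["page_12.png", "page_13.png", "page_47.png"]]

# The model must "read" the answer directly from pixels, with no OCR text in the context.
response = model.generate_content(
    [f"Answer the question using only the attached document pages.\n\nQuestion: {question}", *pages]
)
print(response.text)
```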
Illustrative Examples: Where Standard OCR Falters
To make the impact of OCR limitations more concrete, let's examine a few specific scenarios from our benchmark data. These examples highlight common situations where traditional OCR-based systems can struggle and demonstrate how a multimodal approach to retrieval can lead to more accurate document interpretation.
Full Illustrative Examples:
Example 1: The Challenge of Handwritten Data in Regulatory Filings
The Scenario: Regulatory filings, such as a telecommunications company's PUCO annual report, frequently combine structured typed content with critical handwritten financial figures. This mixture presents a significant OCR challenge, as traditional systems often fail to accurately recognize handwritten entries, leading to potential compliance and analysis issues.
Typical OCR Output & Its Limitations: When processed by a standard OCR engine, the crucial handwritten financial data is often missed entirely or garbled.
Impact on RAG Systems: Consequently, if a query such as, "What is the total revenue of TSC Communications?" is posed, a RAG system relying on this flawed OCR output would likely respond: "Unable to determine revenue figures from the available document." This necessitates manual data review, delaying important reporting and analytical tasks.
The Multimodal Approach: In contrast, the multimodal system processes both the structured form and the handwritten financial figures by analyzing the document's visual layout and handwriting patterns. This holistic understanding allows it to correctly identify the total revenue as $2,775,060, along with component values ($2,325,472 for operating revenue and $449,588 for other revenue). This capability enables accurate, automated responses regarding the company's financial standing and regulatory obligations.
Example 2: Deciphering Visual Trends in Financial Charts
The Scenario: Quarterly investment reports often feature charts, like stacked area charts showing portfolio allocation, to convey critical trends. The OCR challenge here is that traditional OCR primarily extracts textual elements (titles, labels) but fails to capture the actual visual data representing the trends themselves.
Typical OCR Output & Its Limitations: A standard OCR tool might only extract the labels and title, leaving out the core data.
Impact on RAG Systems: When a client asks, "How has my equity exposure changed over the past year?", a RAG system using this limited OCR output might provide only generic information about portfolio components. It would completely miss the crucial visual trend, such as a 13 percentage point increase in equity exposure, which is essential for understanding investment risk.
The Multimodal Approach: The multimodal system, by directly analyzing the chart visually, recognizes both the allocation percentages at each time point and the overall trend patterns. This allows it to accurately respond: "Your equity allocation has increased significantly from 45% to 58% over the past year, representing the largest shift in your portfolio composition." The system can even extract specific quarterly changes to illustrate the gradual increase.
Example 3: Navigating Complex Financial Tables
The Scenario: Financial reports frequently contain multi-column tables detailing revenue breakdowns and operating expenses. The OCR challenge with such complex table structures lies in maintaining correct column and row alignment; failures here can lead to financial figures being associated with incorrect business units or categories.
Typical OCR Output & Its Limitations: Even if text is extracted, subtle misalignments or parsing errors by the OCR can corrupt the table's structure.
Impact on RAG Systems: If a financial analyst asks, "What percentage of revenue did R&D represent in 2025 compared to 2024?", a RAG system relying on poorly structured OCR output might misinterpret the relationships between figures. An erroneous response could be: "R&D was 49% of revenue in 2025 compared to 8,675% in 2024." Such nonsensical answers arise from the system's inability to correctly understand the visual and semantic structure of the table.
The Multimodal Approach: The multimodal system analyzes the visual structure of the table, correctly understanding the complex alignments and relationships between headers, dollar amounts, and percentage figures. This enables an accurate response: "R&D expenses represented 9.9% of net revenue in 2025, down from 14.2% in 2024, despite a 49% increase in absolute R&D spending." The system properly interprets both the spatial layout and the semantic connections within the financial data.
The Mixedbread Vector Store Approach: Functionality and Implications
The Vector Store is designed to address the observed limitations of OCR-dependent RAG systems. Its architecture is centered on leveraging multimodal information for retrieval through our mxbai-omni-v0.1 model. This model directly analyzes and creates embeddings from the visual content of page screenshots, videos, and other multimodal data, enabling an understanding of layout, structure, tables, and charts in their original context. As shown in our benchmarks, this improved retrieval accuracy (NDCG@5) by approximately 12% compared to even perfect text extraction.
Concurrently with visual analysis, documents are processed by our OCR engine. The extracted text is stored and made available alongside the visual embeddings. This dual-modality approach offers distinct advantages for RAG pipelines:
- Better Retrieval: Visual analysis helps locate the most relevant documents, particularly in cases where text-only search might falter due to OCR errors or the nature of the content (e.g., charts, complex tables).
- Optimized Generation Context: High-quality OCRed text remains available, which is beneficial for current Large Language Models that primarily operate on textual input for generation.
- Integrated Document Processing: The system handles both visual embedding and text extraction automatically, so users don't need to manage separate parsing and preparation steps when ingesting data for RAG.
- Adaptability for Future LLMs: By storing both visual representations and text, systems are better prepared for future advancements in multimodal LLMs that might directly leverage richer image data for generation.
This integrated system design aims to improve overall RAG performance, as evidenced by the benchmarked retrieval gains and the recovery of 70% of generation accuracy typically diminished by OCR issues in conventional pipelines, all within a unified framework.
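To illustrate the dual-modality flow described above (a generic sketch, not our production API), the snippet below indexes page screenshots for retrieval while keeping OCR text alongside for generation; the CLIP embedder and pytesseract are stand-ins for mxbai-omni-v0.1 and our OCR engine:

```python
from pathlib import Path
from PIL import Image
import pytesseract
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("clip-ViT-B-32")  # stand-in for a multimodal embedding model

# Ingestion: every page is stored twice — a visual embedding for retrieval,
# and OCR text kept alongside for the generation context.
page_paths = sorted(Path("pages").glob("*.png"))
visual_index = embedder.encode([Image.open(p) for p in page_paths], convert_to_tensor=True)
ocr_text = [pytesseract.image_to_string(Image.open(p)) for p in page_paths]

# Query time: retrieve with the visual index, then hand the stored text of the top pages to the LLM.
question = "How did my equity exposure change over the past year?"
hits = util.semantic_search(embedder.encode(question, convert_to_tensor=True),
                            visual_index, top_k=5)[0]
context = "\n\n".join(ocr_text[hit["corpus_id"]] for hit in hits)
# `question` and `context` are then passed to the generation model of your choice.
```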
Conclusion: Navigating the OCR Bottleneck with Multimodal Retrieval
The benchmark results presented indicate that Optical Character Recognition quality can be a significant limiting factor for RAG system performance, particularly with complex, real-world documents. Errors and omissions in text extraction can restrict both the ability to accurately retrieve relevant information and the quality of the final answers generated by an LLM.
An approach incorporating multimodal analysis for retrieval, such as that employed by the Mixedbread Vector Store, addresses some of these limitations. By directly interpreting visual information from page images, this method improved retrieval accuracy by approximately 12% (NDCG@5) compared to even perfect text extraction in our tests. This enhancement in retrieval subsequently contributed to recovering 70% of the generation accuracy that was otherwise diminished by OCR errors in more conventional pipelines.
While current Large Language Models generally perform optimally with high-quality text for the generation phase, the strong retrieval performance of multimodal systems highlights a path towards more robust document understanding. An integrated system that provides both visually-driven retrieval and high-quality OCR text offers a practical solution for current application needs. Furthermore, it establishes a foundation for adapting to future advancements in LLMs that may more directly leverage rich image data for generation tasks.
The findings suggest that for applications involving diverse and structurally complex documents, incorporating multimodal understanding into the retrieval process is a key consideration for improving the accuracy and reliability of RAG systems.
Join our Discord community to share feedback, ask questions, and connect with other developers and researchers working on the cutting edge of AI!
Citation