The Dirty Secret of RAG: Your Documents Are Not Ready

RAG is a data engineering problem first and an AI problem second. The quality of retrieval is entirely determined by how well the underlying document corpus has been cleaned, chunked, enriched, and indexed.

Retrieval-Augmented Generation is the most widely deployed enterprise AI architecture of 2024 and 2025. Every major enterprise software vendor has a RAG feature. Every consulting firm has a RAG implementation practice. Every board deck has a slide about the company's knowledge management AI powered by RAG. And in far too many of these deployments, the system produces answers that are confidently, specifically, verifiably wrong.

The failure is not in the retrieval algorithm. It is not in the language model. It is in the documents. The enterprise corpus that RAG systems are built to retrieve from is, in almost every organization, Bronze-layer data — unprocessed, inconsistently structured, unevenly maintained, and inadequately indexed. Connecting a sophisticated retrieval system to this corpus does not produce a sophisticated knowledge system. It produces a sophisticated mechanism for surfacing bad information at scale.

Why Enterprise Document Corpora Are Bronze by Default

Organizations accumulate documents the way they accumulate technical debt: gradually, then suddenly. A SharePoint repository that started with 200 documents in 2015 contains 47,000 documents in 2025, of which perhaps 12,000 are current, 8,000 are outdated versions of current documents, 15,000 are project artifacts that were never formally retired, and 12,000 are documents whose purpose and ownership are genuinely unknown. This is not a failure of document management. It is the natural accumulation pattern of any large organization.

For human users, this corpus is navigable because humans apply context, judgment, and familiarity. A legal team member knows that the contract template from 2019 was superseded by the 2022 version. A new employee does not, but they will ask a colleague. A RAG system knows neither. It will retrieve the 2019 template with full confidence if it is the most semantically similar document to the query.

The Five Document Quality Failure Modes That Break RAG

Stale content without retirement dates — documents that were accurate when created but have not been updated to reflect current policies, products, or procedures, with no metadata indicating their currency
Version proliferation without canonical designation — multiple versions of the same document existing simultaneously with no indication of which is authoritative
Inconsistent terminology — the same concept described using different terms in different documents, causing semantic search to fail to retrieve all relevant content for a given query
Chunking-hostile structure — documents with dense cross-references, footnotes, tables, and appendices that become meaningless when extracted as isolated text chunks
Missing provenance metadata — documents with no information about who created them, when, for what purpose, or whether they have been officially approved
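Some of these failure modes can be surfaced mechanically before any governance work begins. The sketch below targets version proliferation only: it groups filenames that differ solely by a trailing version marker. The suffix patterns (v2, a four-digit year, "final", "draft", "copy") and the function names are assumptions chosen for illustration, not a standard; a real corpus will need its own rules, and deciding which version is canonical remains the owner's call.

```python
import re
from collections import defaultdict

# Hypothetical trailing markers that often distinguish versions of one document.
_VERSION_NOISE = re.compile(
    r"(?:[_\-\s]*(?:v\d+(?:\.\d+)*|\d{4}|final|draft|copy))+$",
    re.IGNORECASE,
)


def family_key(filename: str) -> str:
    """Reduce a filename to a normalized key shared by its likely versions."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    stem = _VERSION_NOISE.sub("", stem)        # drop trailing version markers
    return re.sub(r"[_\-\s]+", " ", stem).strip().lower()


def version_families(filenames):
    """Return only the groups with more than one member, for human review."""
    groups = defaultdict(list)
    for name in filenames:
        groups[family_key(name)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

The output is a review queue, not a decision: each group is a set of candidates for which someone must designate the authoritative version.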

The Document Readiness Framework: Five Dimensions

Dimension 1: Currency

Every document in the corpus must have a review date and a designated owner.

Documents that have not been reviewed within a defined staleness threshold should be excluded from RAG indexing until they are either updated or formally retired. This is a governance decision, not a technology decision — and it is the most important decision in RAG implementation.
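Once the threshold has been set, enforcing it is simple. The following is a minimal sketch, assuming each document carries an owner and a last-reviewed date; the Document class, the function names, and the 548-day value (roughly 18 months) are illustrative assumptions, and the actual threshold is the governance decision described above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed value for illustration: roughly 18 months.
STALENESS_THRESHOLD = timedelta(days=548)


@dataclass
class Document:
    doc_id: str
    owner: Optional[str]
    last_reviewed: Optional[date]


def is_current(doc: Document, today: date,
               threshold: timedelta = STALENESS_THRESHOLD) -> bool:
    """Eligible only with an owner and a review date inside the threshold."""
    if doc.owner is None or doc.last_reviewed is None:
        return False
    return today - doc.last_reviewed <= threshold


def split_by_currency(docs, today: date):
    """Return (eligible, held_back); held-back documents go to their owners."""
    eligible = [d for d in docs if is_current(d, today)]
    held_back = [d for d in docs if not is_current(d, today)]
    return eligible, held_back
```

Documents with no owner or no review date are held back rather than assumed current, which matches the rule that exclusion lasts until a document is updated or formally retired.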

Dimension 2: Authority

Documents must be classified by authority level: official policy, approved guidance, reference material, or informal artifact. Only official policy and approved guidance should be indexed for RAG applications where accuracy is critical. This classification must be maintained by document owners, not inferred by algorithms.
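As an illustration, the four levels can be encoded as an enumeration and used as an allow-list at indexing time. This is a sketch under assumptions: the AuthorityLevel enum, the indexable function, and the (doc_id, level) input shape are hypothetical. A document with no owner-assigned level is excluded rather than guessed at.

```python
from enum import Enum


class AuthorityLevel(Enum):
    OFFICIAL_POLICY = "official policy"
    APPROVED_GUIDANCE = "approved guidance"
    REFERENCE_MATERIAL = "reference material"
    INFORMAL_ARTIFACT = "informal artifact"


# Levels allowed into an accuracy-critical index.
ACCURACY_CRITICAL_LEVELS = frozenset(
    {AuthorityLevel.OFFICIAL_POLICY, AuthorityLevel.APPROVED_GUIDANCE}
)


def indexable(docs, allowed=ACCURACY_CRITICAL_LEVELS):
    """docs is an iterable of (doc_id, level) pairs; level is None when the
    owner has not classified the document, in which case it is skipped."""
    return [doc_id for doc_id, level in docs if level in allowed]
```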

Dimension 3: Semantic Consistency

The terminology used across all indexed documents should be inventoried and standardized. Where multiple terms are used for the same concept, the corpus should be annotated with synonym mappings that allow retrieval systems to find relevant content regardless of which term appears in the query. This is the document equivalent of the semantic layer that governs structured data.
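One simple way to apply a synonym mapping is at query time: detect known variants in the query and append the canonical term before retrieval. The sketch below assumes a hand-maintained mapping; the entries in SYNONYM_MAP and the function names are hypothetical examples, and production systems may prefer to annotate documents at indexing time instead.

```python
import re

# Hypothetical mapping from variant terms to one canonical term.
SYNONYM_MAP = {
    "pto": "paid time off",
    "vacation": "paid time off",
    "annual leave": "paid time off",
}


def canonical_terms(text: str, synonyms=SYNONYM_MAP):
    """Return canonical terms for every variant found as a whole word or phrase."""
    lowered = text.lower()
    found = []
    for variant, canonical in synonyms.items():
        pattern = rf"\b{re.escape(variant)}\b"
        if re.search(pattern, lowered) and canonical not in found:
            found.append(canonical)
    return found


def expand_query(query: str, synonyms=SYNONYM_MAP) -> str:
    """Append canonical terms that the query does not already contain."""
    extra = [t for t in canonical_terms(query, synonyms) if t not in query.lower()]
    return query if not extra else f"{query} {' '.join(extra)}"
```

With this mapping, a query about "PTO" and a query about "vacation" both carry the phrase "paid time off" into retrieval, so they can reach the same documents regardless of which term the author used.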

Dimension 4: Chunk Architecture

Documents must be structured with RAG chunking in mind. Sections should be self-contained enough to be meaningful when extracted independently. Tables should be accompanied by enough contextual text that a chunk containing the table makes sense without the surrounding paragraphs. Cross-references should be explicit rather than implicit.
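To make the idea concrete, here is a minimal sketch of section-aware chunking. It assumes the document has already been split into (heading, body) pairs, never lets a chunk cross a section boundary, and prefixes every chunk with the document title and section heading so the chunk still makes sense in isolation. The Chunk class, the chunk_sections function, and the max_chars default are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    heading: str
    text: str


def chunk_sections(doc_title: str, sections, max_chars: int = 1200):
    """Split each section on blank lines, packing paragraphs up to max_chars.
    A single paragraph longer than max_chars is kept whole, not cut mid-sentence."""
    chunks = []
    for heading, body in sections:
        context = f"{doc_title} > {heading}\n"   # carried into every chunk
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(Chunk(doc_title, heading, context + current))
                current = para
            else:
                current = candidate
        if current:
            chunks.append(Chunk(doc_title, heading, context + current))
    return chunks
```

The context prefix is a partial remedy only. A table whose meaning depends on a paragraph in another section still needs an author to add the missing explanation.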

Dimension 5: Metadata Completeness

Every document in the corpus requires a metadata record containing at minimum: title, owner, creation date, last review date, authority level, subject taxonomy, and applicable business domains. This metadata is what allows RAG systems to filter retrieval by relevance criteria beyond semantic similarity — enabling a query to specify 'only current, approved documents in the HR domain' rather than retrieving all semantically similar content regardless of currency or authority.
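The minimum record described above might be sketched as follows, together with a completeness check and a filter for a request such as 'only current, approved documents in the HR domain'. The field names, the DocumentMetadata class, and both functions are assumptions made for illustration; in practice this filter would usually be expressed in the metadata-filtering syntax of whichever vector store is in use.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class DocumentMetadata:
    title: Optional[str] = None
    owner: Optional[str] = None
    created: Optional[date] = None
    last_reviewed: Optional[date] = None
    authority_level: Optional[str] = None
    subject_taxonomy: list = field(default_factory=list)
    business_domains: list = field(default_factory=list)


def missing_fields(meta: DocumentMetadata):
    """List every required field that is empty or unset."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


def matches(meta: DocumentMetadata, *, domain: str, allowed_authority,
            reviewed_on_or_after: date) -> bool:
    """True only for complete records that pass all three criteria."""
    return (
        not missing_fields(meta)
        and domain in meta.business_domains
        and meta.authority_level in allowed_authority
        and meta.last_reviewed >= reviewed_on_or_after
    )
```

Incomplete records fail the filter outright, so a document with missing metadata cannot slip into results on the strength of semantic similarity alone.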

The Bronze-to-Gold Pipeline for Documents

The Medallion Architecture applies to unstructured data exactly as it applies to structured data. Bronze is the document as it exists in its native repository: a PDF in SharePoint, a Word document on a network drive, an email attachment in an inbox. Silver is the document after extraction, cleaning, metadata enrichment, and taxonomy tagging. Gold is the indexed, chunked, embedded document corpus that the RAG system actually queries.
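The three layers can be sketched as plain data transformations. This is a simplified illustration, not a reference implementation: the class and function names are hypothetical, text extraction is assumed to have happened already, taxonomy tagging is reduced to keyword matching, and the embedding and indexing step is omitted because it depends entirely on the model and vector store chosen.

```python
import re
from dataclasses import dataclass


@dataclass
class BronzeDoc:            # the document as found in its repository
    source_path: str
    raw_text: str


@dataclass
class SilverDoc:            # cleaned, enriched, and tagged
    source_path: str
    clean_text: str
    metadata: dict
    tags: list


@dataclass
class GoldChunk:            # ready to be embedded and indexed
    source_path: str
    text: str
    metadata: dict


def to_silver(doc: BronzeDoc, metadata: dict, taxonomy: dict) -> SilverDoc:
    """Normalize whitespace, attach owner-supplied metadata, add taxonomy tags."""
    clean = re.sub(r"[ \t]+", " ", doc.raw_text)
    clean = re.sub(r"\n{3,}", "\n\n", clean).strip()
    lowered = clean.lower()
    tags = sorted(
        tag for tag, keywords in taxonomy.items()
        if any(keyword in lowered for keyword in keywords)
    )
    return SilverDoc(doc.source_path, clean, dict(metadata), tags)


def to_gold(doc: SilverDoc) -> list:
    """Split into paragraph chunks, each carrying the document's metadata."""
    meta = {**doc.metadata, "tags": doc.tags}
    return [
        GoldChunk(doc.source_path, paragraph, meta)
        for paragraph in doc.clean_text.split("\n\n") if paragraph
    ]
```

Note where the metadata enters: it is an input to the Silver step, supplied by content owners, and nothing in the code can manufacture it.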

Building this pipeline requires coordination between content owners, information architects, and data engineers that most organizations have never attempted. It is the most human-intensive step in RAG implementation — and it is the step that vendors, consultants, and technology teams consistently underinvest in because it is hard to demonstrate in a proof of concept.

THE OBT DOCUMENT READINESS CHECKLIST

Before deploying any RAG system, complete a document corpus audit:

(1) What percentage of documents have a designated owner and review date?
(2) What percentage have been reviewed within the past 18 months?
(3) Is there a canonical version designation for documents that exist in multiple versions?
(4) Is there a subject taxonomy that applies consistently across the corpus?
(5) Has the corpus been purged of formally retired content?

A corpus that cannot give a satisfactory answer to all five questions is not ready for production RAG.

The organizations that have deployed RAG systems that actually work, systems that produce accurate, trustworthy answers that colleagues rely on, invested heavily in document governance before they wrote a line of embedding code. The organizations that deployed RAG systems on top of unprocessed document corpora are the ones explaining to their boards why the AI investment hasn't delivered. The difference is not the model. It is the data.
