Why this matters
Government regulatory PDFs in Africa are dense, scanned, and effectively unsearchable. Building AI on top of them is harder than it sounds, and more important than most people realise.
The challenge
The BNR (National Bank of Rwanda) publishes directives as scanned PDFs. No text layer, no metadata, no structure. Standard PDF parsers return garbage or nothing.
On top of that, regulatory language references other directives constantly - "as defined in Article 3 of Directive 1/2019" - so retrieval without cross-document context fails badly.
My approach
Three-stage pipeline: OCR with GPT-4o Vision for scanned pages, layout-aware chunking at the article level (not page level), and hybrid BM25 + vector retrieval with directive-aware metadata filtering.
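The OCR stage amounts to one vision-model request per scanned page. A minimal sketch of building that request payload for GPT-4o Vision, using the standard chat-completions message format for image inputs (the prompt wording and the `build_ocr_request` helper name are illustrative, not the production code; the network call itself is omitted):

```python
import base64

def build_ocr_request(page_image_bytes, model="gpt-4o"):
    """Build a chat-completion payload asking the vision model to transcribe one scanned page."""
    b64 = base64.b64encode(page_image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Instruction text plus the page image as a base64 data URL.
                {"type": "text",
                 "text": "Transcribe this scanned regulatory page verbatim, preserving article and chapter headings."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Sending this payload through the OpenAI SDK (or raw HTTP) returns the page text, which then feeds the chunking stage.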
Article-level chunking was the key insight. Regulatory articles are the atomic unit of meaning. Splitting at the page boundary destroys context and kills retrieval quality.
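A minimal sketch of article-level chunking, assuming the OCR output marks articles with headings like "Article 12" at the start of a line (the regex and field names are illustrative, not the production code):

```python
import re

def chunk_by_article(text, directive_id):
    """Split directive text into one chunk per article, keeping structural metadata."""
    # Zero-width split: break the text wherever a line begins with "Article N",
    # so each heading stays attached to its own body.
    parts = re.split(r"(?m)^(?=Article \d+)", text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        match = re.match(r"Article (\d+)", part)
        chunks.append({
            "text": part,
            "directive_id": directive_id,
            "article": f"Article {match.group(1)}" if match else None,
        })
    return chunks

sample = "Article 1\nScope of this directive.\nArticle 2\nMinimum capital ratios shall apply."
for c in chunk_by_article(sample, "BNR-2021-04"):
    print(c["article"], "->", len(c["text"]), "chars")
```

Because each chunk is a whole article, a retrieved passage always carries its own heading and full context, which is exactly what page-boundary splitting destroys.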
Key code
The chunking logic preserves directive number, chapter, article title, and page as metadata on every chunk - giving the retriever structured filters to work with.
chunk_metadata = {
    "directive_id": "BNR-2021-04",
    "chapter": "Capital Requirements",
    "article": "Article 12",
    "page": 8
}
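With that metadata on every chunk, the retriever can pre-filter the candidate set before any lexical or vector scoring. A hypothetical filter step (the function name and fields are illustrative) might look like:

```python
def filter_chunks(chunks, directive_id=None, chapter=None):
    """Keep only chunks whose metadata matches the requested directive and/or chapter."""
    kept = []
    for chunk in chunks:
        if directive_id and chunk.get("directive_id") != directive_id:
            continue
        if chapter and chunk.get("chapter") != chapter:
            continue
        kept.append(chunk)
    return kept

corpus = [
    {"directive_id": "BNR-2021-04", "chapter": "Capital Requirements", "article": "Article 12"},
    {"directive_id": "BNR-2019-01", "chapter": "Licensing", "article": "Article 3"},
]
print(filter_chunks(corpus, directive_id="BNR-2021-04"))
```

This is what makes cross-references like "Article 3 of Directive 1/2019" tractable: the reference can be resolved into a metadata filter instead of hoping similarity search lands on the right directive.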
Takeaways
- Chunk at the semantic unit of the domain, not at fixed token counts
- GPT-4o Vision OCR is expensive but the only reliable option for scanned African gov docs
- Hybrid retrieval (BM25 + vector) consistently beats pure vector for regulatory text
- Metadata filtering is not optional - it's what separates toy RAG from production RAG
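The hybrid BM25 + vector takeaway needs a way to merge the two ranked lists. One common choice is reciprocal rank fusion; the source doesn't specify the fusion method used, so this is a sketch of that option, not the project's actual code:

```python
def reciprocal_rank_fusion(bm25_ranked, vector_ranked, k=60):
    """Fuse two ranked lists of chunk ids; ids ranked well by both lists score highest."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked):
            # Each list contributes 1/(k + rank); k=60 is the conventional default.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk near the top of both the lexical and the vector ranking wins overall.
print(reciprocal_rank_fusion(["a", "b", "c"], ["b", "c", "a"]))
```

Rank fusion sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales: only the positions matter, not the raw scores.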