Why this matters
Government regulatory PDFs in Africa are dense, scanned, and effectively unsearchable. Building AI on top of them is harder than it sounds, and more important than most people realise.
The challenge
The BNR (National Bank of Rwanda) publishes directives as scanned PDFs. No text layer, no metadata, no structure. Standard PDF parsers return garbage or nothing.
On top of that, regulatory language references other directives constantly - "as defined in Article 3 of Directive 1/2019" - so retrieval without cross-document context fails badly.
My approach
Three-stage pipeline: OCR with GPT-4o Vision for scanned pages, layout-aware chunking at the article level (not page level), and hybrid BM25 + vector retrieval with directive-aware metadata filtering.
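The OCR stage amounts to one vision-model request per scanned page. A minimal sketch of building that request payload for GPT-4o Vision, using the standard chat-completions message format for image inputs (the prompt wording and the `build_ocr_request` helper name are illustrative, not the production code; the network call itself is omitted):

```python
import base64

def build_ocr_request(page_image_bytes, model="gpt-4o"):
    """Build a chat-completion payload asking the vision model to transcribe one scanned page."""
    b64 = base64.b64encode(page_image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Instruction text plus the page image as a base64 data URL.
                {"type": "text",
                 "text": "Transcribe this scanned regulatory page verbatim, preserving article and chapter headings."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Sending this payload through the OpenAI SDK (or raw HTTP) returns the page text, which then feeds the chunking stage.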
Article-level chunking was the key insight. Regulatory articles are the atomic unit of meaning. Splitting at the page boundary destroys context and kills retrieval quality.
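A minimal sketch of article-level chunking, assuming the OCR output marks articles with headings like "Article 12" at the start of a line (the regex and field names are illustrative, not the production code):

```python
import re

def chunk_by_article(text, directive_id):
    """Split directive text into one chunk per article, keeping structural metadata."""
    # Zero-width split: break the text wherever a line begins with "Article N",
    # so each heading stays attached to its own body.
    parts = re.split(r"(?m)^(?=Article \d+)", text)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        match = re.match(r"Article (\d+)", part)
        chunks.append({
            "text": part,
            "directive_id": directive_id,
            "article": f"Article {match.group(1)}" if match else None,
        })
    return chunks

sample = "Article 1\nScope of this directive.\nArticle 2\nMinimum capital ratios shall apply."
for c in chunk_by_article(sample, "BNR-2021-04"):
    print(c["article"], "->", len(c["text"]), "chars")
```

Because each chunk is a whole article, a retrieved passage always carries its own heading and full context, which is exactly what page-boundary splitting destroys.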
Key code
The chunking logic preserves directive number, chapter, article title, and page as metadata on every chunk - giving the retriever structured filters to work with.
chunk_metadata = {
    "directive_id": "BNR-2021-04",
    "chapter": "Capital Requirements",
    "article": "Article 12",
    "page": 8
}
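With that metadata on every chunk, the retriever can pre-filter the candidate set before any lexical or vector scoring. A hypothetical filter step (the function name and fields are illustrative) might look like:

```python
def filter_chunks(chunks, directive_id=None, chapter=None):
    """Keep only chunks whose metadata matches the requested directive and/or chapter."""
    kept = []
    for chunk in chunks:
        if directive_id and chunk.get("directive_id") != directive_id:
            continue
        if chapter and chunk.get("chapter") != chapter:
            continue
        kept.append(chunk)
    return kept

corpus = [
    {"directive_id": "BNR-2021-04", "chapter": "Capital Requirements", "article": "Article 12"},
    {"directive_id": "BNR-2019-01", "chapter": "Licensing", "article": "Article 3"},
]
print(filter_chunks(corpus, directive_id="BNR-2021-04"))
```

This is what makes cross-references like "Article 3 of Directive 1/2019" tractable: the reference can be resolved into a metadata filter instead of hoping similarity search lands on the right directive.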
Takeaways
- Chunk at the semantic unit of the domain, not at fixed token counts
- GPT-4o Vision OCR is expensive but the only reliable option for scanned African gov docs
- Hybrid retrieval (BM25 + vector) consistently beats pure vector for regulatory text
- Metadata filtering is not optional - it's what separates toy RAG from production RAG
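The hybrid BM25 + vector takeaway needs a way to merge the two ranked lists. One common choice is reciprocal rank fusion; the source doesn't specify the fusion method used, so this is a sketch of that option, not the project's actual code:

```python
def reciprocal_rank_fusion(bm25_ranked, vector_ranked, k=60):
    """Fuse two ranked lists of chunk ids; ids ranked well by both lists score highest."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked):
            # Each list contributes 1/(k + rank); k=60 is the conventional default.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk near the top of both the lexical and the vector ranking wins overall.
print(reciprocal_rank_fusion(["a", "b", "c"], ["b", "c", "a"]))
```

Rank fusion sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales: only the positions matter, not the raw scores.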