RAG Knowledge Assistant
A production-style retrieval-augmented question-answering API over a private document corpus, with citations, evaluation, and cost/latency tracking.
Note: This is an example case study showing how each project is documented. Replace it with the real write-up once you build the project (see ROADMAP.md).
Problem
Teams sit on large internal knowledge bases (policies, contracts, runbooks) that are hard to search. Keyword search misses paraphrases; pasting documents into a chatbot doesn't scale and quickly runs into context-window limits. The goal: a backend service that answers natural-language questions over a private corpus and cites its sources, so answers are trustworthy and auditable.
Approach
A retrieval-augmented generation (RAG) pipeline exposed as a clean HTTP API:
- Ingestion — documents are chunked, embedded, and stored in a vector index.
- Retrieval — at query time, the most relevant chunks are fetched.
- Generation — an LLM answers using only the retrieved context, returning inline citations to the source chunks.
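The three stages above can be sketched end to end in plain Python. This is a minimal illustration, not the project's implementation: the `Chunk` type, the fixed-size word chunker, the bag-of-words `embed` function (a toy stand-in for a real embedding model), the in-memory list used as an index, and the prompt wording are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # hypothetical ID scheme: "<doc_id>#<chunk number>"
    text: str


def chunk_document(doc_id: str, text: str, max_words: int = 40) -> list[Chunk]:
    """Split a document into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [
        Chunk(f"{doc_id}#{i // max_words}", " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[Chunk, Counter]], k: int = 2) -> list[Chunk]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Label each chunk with its ID so the model can cite it inline."""
    context = "\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below and cite chunk IDs in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Ingestion: chunk and embed two made-up documents into an in-memory index.
docs = {
    "policy": "Employees may carry over five unused vacation days into the next year.",
    "runbook": "Restart the ingest worker before replaying the failed queue.",
}
index = [(c, embed(c.text)) for doc_id, text in docs.items() for c in chunk_document(doc_id, text)]

# Retrieval and prompt construction; the LLM call itself is left out.
question = "How many vacation days carry over?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
```

In a real build, `embed` would call an embedding model and `index` would be the vector store, but the data flow (chunk, embed, rank by similarity, label context with IDs) would plausibly keep this shape.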
Architecture
┌────────────── ingestion ──────────────┐
documents ──► chunk ──► embed ──► pgvector (Postgres)
└────────────────────────────────────────┘
client ──► FastAPI ──► retrieve top-k ──► build prompt ──► LLM ──► answer + citations
└► log: latency, tokens, cost, eval id
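The logging branch in the diagram could be captured with one record per request. The sketch below is an assumption about how that might look: `QueryLog`, `answer_with_logging`, the `llm_call` signature, and the per-token prices are hypothetical, and `fake_llm` stands in for the real model call.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical prices in USD per 1,000 tokens; substitute the provider's real rates.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


@dataclass
class QueryLog:
    request_id: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    eval_id: Optional[str] = None  # set when the request belongs to an evaluation run


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, given token counts and the assumed prices above."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT


def answer_with_logging(
    question: str,
    llm_call: Callable[[str], Tuple[str, int, int]],
    eval_id: Optional[str] = None,
) -> Tuple[str, QueryLog]:
    """Time the model call and record tokens and cost alongside the answer."""
    start = time.perf_counter()
    answer, input_tokens, output_tokens = llm_call(question)
    latency_ms = (time.perf_counter() - start) * 1000
    log = QueryLog(
        request_id=str(uuid.uuid4()),
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=estimate_cost(input_tokens, output_tokens),
        eval_id=eval_id,
    )
    return answer, log


def fake_llm(question: str) -> Tuple[str, int, int]:
    """Stand-in for the real LLM client: returns (answer, input tokens, output tokens)."""
    return "Five days [policy#0].", 120, 12


answer, log = answer_with_logging("How many vacation days carry over?", fake_llm, eval_id="eval-001")
```

Keeping the record as a plain dataclass makes it easy to write to a table or emit as structured JSON later without changing the call site.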
Key engineering decisions
- pgvector over a managed vector DB — keeps the stack to a single Postgres instance for the first version, which is cheaper and simpler to operate; revisit a dedicated vector DB only when recall or scale demands it.
- Citations as a first-class output — the LLM is constrained to answer from retrieved chunks and return their IDs, making answers auditable instead of "trust me".
- An evaluation harness from day one — a small labeled question set runs in CI so prompt/retrieval changes are measured, not guessed.
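One way to enforce the citation decision is to check, after generation, that every cited ID belongs to the set of chunks that was actually retrieved. The sketch below assumes citations appear inline as bracketed chunk IDs such as `[policy#0]`; that format, and the names `extract_citations` and `check_citations`, are illustrative choices rather than part of the original design.

```python
import re

# Assumed citation format: a chunk ID wrapped in square brackets, e.g. [policy#0].
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_citations(answer: str) -> list[str]:
    """Return cited chunk IDs in order of first appearance, without duplicates."""
    seen: list[str] = []
    for chunk_id in CITATION_PATTERN.findall(answer):
        if chunk_id not in seen:
            seen.append(chunk_id)
    return seen


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag answers that cite nothing or cite chunks that were never retrieved."""
    cited = extract_citations(answer)
    unknown = [chunk_id for chunk_id in cited if chunk_id not in retrieved_ids]
    return {
        "cited": cited,
        "unknown": unknown,
        "auditable": bool(cited) and not unknown,
    }


retrieved = {"policy#0", "policy#1"}
ok = check_citations("Up to five days carry over [policy#0].", retrieved)
bad = check_citations("Ten days carry over [handbook#9].", retrieved)
```

A check like this only confirms that the cited IDs are valid; whether the cited chunk supports the claim is a separate question for the faithfulness evaluation.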
Results
To be filled in once built — track these from the start:
| Metric              | Value |
| ------------------- | ----- |
| Retrieval recall@5  | —     |
| Answer faithfulness | —     |
| p95 latency         | —     |
| Cost / query        | —     |
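Two of the metrics above, retrieval recall@5 and p95 latency, are simple enough to compute with a few lines of Python. The sketch below assumes each evaluation question comes with a ranked list of retrieved chunk IDs and a labeled set of relevant IDs, and it uses the nearest-rank definition of a percentile. All inputs are made-up toy data that only exercises the functions; they are not results.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the labeled relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy evaluation set (hypothetical IDs and labels).
eval_set = [
    {"retrieved": ["a", "b", "c", "d", "e", "f"], "relevant": {"a", "f"}},
    {"retrieved": ["x", "y", "z"], "relevant": {"y"}},
]
mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"]) for q in eval_set) / len(eval_set)

# Toy per-request latencies in milliseconds.
latencies_ms = [120.0, 95.0, 310.0, 150.0, 101.0, 98.0, 130.0, 110.0, 99.0, 640.0]
p95 = percentile(latencies_ms, 95)
```

In the first toy question one of the two relevant chunks falls outside the top five, so its recall@5 is 0.5. Answer faithfulness and cost per query need a judge or token accounting and are not covered here.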
What I'd do next
Add hybrid search (keyword + vector), streaming responses, and per-tenant isolation for multi-customer deployments.
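For the hybrid search idea, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), where each chunk scores the sum of `1 / (k + rank)` over the rankings it appears in. The sketch below is one possible approach, not a committed design: the function name, the chunk IDs, and the conventional default `k = 60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; higher fused score means a better combined rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


keyword_hits = ["policy#2", "policy#0", "runbook#1"]   # hypothetical keyword ranking
vector_hits = ["policy#0", "contract#4", "policy#2"]   # hypothetical vector ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF works on ranks alone, so keyword scores and vector distances never have to be put on a common scale.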