All projects
RAGLLMFastAPIAWS

RAG Knowledge Assistant

A production-style retrieval-augmented question-answering API over a private document corpus, with citations, evaluation, and cost/latency tracking.

Planned

Note: This is an example case study showing how each project is documented. Replace it with the real write-up once you build the project (see ROADMAP.md).

Problem

Teams sit on large internal knowledge bases (policies, contracts, runbooks) that are hard to search. Keyword search misses paraphrases; pasting documents into a chatbot doesn't scale and leaks context limits. The goal: a backend service that answers natural-language questions over a private corpus and cites its sources, so answers are trustworthy and auditable.

Approach

A retrieval-augmented generation (RAG) pipeline exposed as a clean HTTP API:

  • Ingestion — documents are chunked, embedded, and stored in a vector index.
  • Retrieval — at query time, the most relevant chunks are fetched.
  • Generation — an LLM answers using only the retrieved context, returning inline citations to the source chunks.

Architecture

                 ┌────────────── ingestion ──────────────┐
   documents ──► chunk ──► embed ──► pgvector (Postgres)
                 └────────────────────────────────────────┘

   client ──► FastAPI ──► retrieve top-k ──► build prompt ──► LLM ──► answer + citations
                        └► log: latency, tokens, cost, eval id

Key engineering decisions

  1. pgvector over a managed vector DB — keeps the stack to a single Postgres instance for the first version, which is cheaper and simpler to operate; revisit a dedicated vector DB only when recall or scale demands it.
  2. Citations as a first-class output — the LLM is constrained to answer from retrieved chunks and return their IDs, making answers auditable instead of "trust me".
  3. An evaluation harness from day one — a small labeled question set runs in CI so prompt/retrieval changes are measured, not guessed.

Results

To be filled in once built — track these from the start:

| Metric | Value | | ----------------- | ----- | | Retrieval recall@5 | — | | Answer faithfulness | — | | p95 latency | — | | Cost / query | — |

What I'd do next

Add hybrid search (keyword + vector), streaming responses, and per-tenant isolation for multi-customer deployments.