Reference Implementation

See it in action.

Production-quality RAG that runs entirely on local hardware. No external APIs, no data leakage: just FastAPI, ChromaDB, Ollama, and sentence-transformers doing real work.

This is the architecture we'd actually ship. Not a notebook, not a demo with hardcoded answers — a real retrieval pipeline with streaming and citations.

FastAPI · ChromaDB · Ollama · sentence-transformers · Python

Architecture

The full stack, visualised.

Every component runs on your hardware. The browser talks to FastAPI, which orchestrates embedding, retrieval, and generation.

System overview

Browser UI · HTML / JS / SSE
FastAPI · Python · async · OpenAPI

AI Services

sentence-transformers · embedding · CPU
ChromaDB · vector store · local
Ollama · local LLM · streaming

Request flow

01 · PDF upload: multipart/form-data → FastAPI
02 · Text chunking: recursive split, 512-token windows
03 · Embedding: sentence-transformers, local CPU
04 · Vector store: ChromaDB, cosine similarity
05 · Query embed: same model, same dimensions
06 · Retrieval: top-k semantic search
07 · Generation: Ollama local LLM + context
08 · Stream answer: SSE with source citations
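The chunk, embed, and retrieve steps above can be sketched in plain Python. This is a toy stand-in, not the demo's code: a whitespace tokenizer, fixed-size windows instead of a recursive splitter, and bag-of-words vectors in place of sentence-transformers embeddings. The real pipeline swaps these pieces for the actual models but keeps the same shape.

```python
import math
from collections import Counter


def chunk(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token windows with overlap (stand-in for a recursive splitter)."""
    tokens = text.split()  # toy tokenizer; the real pipeline counts model tokens
    step = window - overlap
    return [" ".join(tokens[i:i + window])
            for i in range(0, max(len(tokens) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' (stand-in for a sentence-transformers model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity, the same metric ChromaDB uses here."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Top-k semantic search: embed the query with the same model, rank by cosine."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the demo itself, embedding is a sentence-transformers model and the ranking happens inside ChromaDB; the control flow is identical.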

Properties

What makes it production-grade.

Zero data leakage

Runs entirely on local hardware. No external API calls, no data leaving your machine. Your documents stay yours.

Source citations

Every answer is traced back to the specific document and page number it came from, so claims can be checked against the source instead of taken on faith.
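One way to implement this (a sketch under our own naming, not the demo's exact code): record the source file and page alongside each chunk at ingest time, carry that metadata through retrieval, and render it after the answer. In ChromaDB, the per-chunk `metadatas` field is the natural home for it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # original filename, captured at upload time
    page: int    # 1-based page number from the PDF parser


def format_citations(retrieved: list[Chunk]) -> str:
    """Render deduplicated citations in retrieval order, e.g. 'report.pdf, p. 4'."""
    seen: list[str] = []
    for c in retrieved:
        ref = f"{c.source}, p. {c.page}"
        if ref not in seen:
            seen.append(ref)
    return "; ".join(seen)
```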

Real-time streaming

Token-by-token answer delivery via Server-Sent Events. Feels responsive even with large context windows.
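The wire format behind this is simple: each Server-Sent Event is a `data:` line followed by a blank line. A sketch of the event generator, assuming the LLM client yields tokens one at a time; in a FastAPI app this generator would be handed to a `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], citations: list[str]) -> Iterator[str]:
    """Yield SSE frames: one per token, then the citations, then a done marker."""
    for tok in tokens:
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield f"data: {json.dumps({'citations': citations})}\n\n"
    yield "data: [DONE]\n\n"  # lets the browser close the EventSource cleanly
```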

Production architecture

FastAPI, ChromaDB, sentence-transformers — the same stack you'd run in production. Not a toy Jupyter notebook.

Context

Not a prototype. Not a tutorial.

Most RAG demos cut corners to look impressive. This one doesn't.

sovont-rag-demo
  • Real FastAPI app — OpenAPI docs, async endpoints, proper error handling
  • ChromaDB with persistent storage — data survives restarts
  • sentence-transformers running on CPU — no GPU required, reproducible embeddings
  • Ollama for LLM — swap models without changing code
  • Streaming responses via SSE — token-by-token, low perceived latency
  • Source citations with page numbers — every answer is grounded
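The model swap works because Ollama exposes the same HTTP API regardless of which model is loaded: a POST to `/api/generate` with the model name in the request body. A sketch of building that request with the model taken from configuration; the `LLM_MODEL` variable name and prompt template here are illustrative, not the demo's actual config.

```python
import json
import os

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(question: str, context: str) -> dict:
    """Payload for Ollama's /api/generate; change LLM_MODEL to swap models, no code changes."""
    return {
        "model": os.environ.get("LLM_MODEL", "llama3"),  # e.g. "mistral", "qwen2.5"
        "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        "stream": True,  # token-by-token output for the SSE layer
    }
```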
Typical RAG demo
  • Jupyter notebook — doesn't run as a service
  • In-memory vector store — data gone on restart
  • OpenAI embeddings — every query costs money, data leaves your machine
  • Hardcoded to one LLM provider
  • Synchronous, blocking responses — full wait before seeing output
  • No citations — answers appear credible but aren't traceable

Demo tech stack

FastAPI · ChromaDB · Ollama · sentence-transformers · Python

Want to see it live?

We'll walk you through the whole stack.

Book a demo session and we'll run it live — upload your documents, query the system, show you the retrieval logs, and answer every technical question you have.

Repo is currently private — reach out for access.