Reference Implementation

See it in action.

Production-quality RAG that runs entirely on local hardware. No external APIs, no data leakage: just FastAPI, ChromaDB, Ollama, and sentence-transformers doing real work.

This is the architecture we'd actually ship. Not a notebook, not a demo with hardcoded answers — a real retrieval pipeline with streaming and citations.

FastAPI · ChromaDB · Ollama · sentence-transformers · Python

Architecture

The full stack, visualised.

Every component runs on your hardware. The browser talks to FastAPI, which orchestrates embedding, retrieval, and generation.

System overview

Browser UI · HTML / JS / SSE
FastAPI · Python · async · OpenAPI

AI Services

sentence-transformers · embedding · CPU
ChromaDB · vector store · local
Ollama · local LLM · streaming

Request flow

01 · PDF upload: multipart/form-data → FastAPI
02 · Text chunking: recursive split, 512-token windows
03 · Embedding: sentence-transformers, local CPU
04 · Vector store: ChromaDB, cosine similarity
05 · Query embed: same model, same dimensions
06 · Retrieval: top-k semantic search
07 · Generation: Ollama local LLM + context
08 · Stream answer: SSE with source citations
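The chunk, embed, and retrieve steps above can be sketched in plain Python. This is a toy stand-in, not the demo's code: a whitespace tokenizer, fixed-size windows instead of a recursive splitter, and bag-of-words vectors in place of sentence-transformers embeddings. The real pipeline swaps these pieces for the actual models but keeps the same shape.

```python
import math
from collections import Counter


def chunk(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token windows with overlap (stand-in for a recursive splitter)."""
    tokens = text.split()  # toy tokenizer; the real pipeline counts model tokens
    step = window - overlap
    return [" ".join(tokens[i:i + window])
            for i in range(0, max(len(tokens) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' (stand-in for a sentence-transformers model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity, the same metric ChromaDB uses here."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Top-k semantic search: embed the query with the same model, rank by cosine."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the demo itself, embedding is a sentence-transformers model and the ranking happens inside ChromaDB; the control flow is identical.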

Properties

What makes it production-grade.

Zero data leakage

Runs entirely on local hardware. No external API calls, no data leaving your machine. Your documents stay yours.

Source citations

Every answer is traced back to the specific document and page number it came from, so claims can be checked against the source instead of taken on faith.
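One way to implement this (a sketch under our own naming, not the demo's exact code): record the source file and page alongside each chunk at ingest time, carry that metadata through retrieval, and render it after the answer. In ChromaDB, the per-chunk `metadatas` field is the natural home for it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # original filename, captured at upload time
    page: int    # 1-based page number from the PDF parser


def format_citations(retrieved: list[Chunk]) -> str:
    """Render deduplicated citations in retrieval order, e.g. 'report.pdf, p. 4'."""
    seen: list[str] = []
    for c in retrieved:
        ref = f"{c.source}, p. {c.page}"
        if ref not in seen:
            seen.append(ref)
    return "; ".join(seen)
```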

Real-time streaming

Token-by-token answer delivery via Server-Sent Events. Feels responsive even with large context windows.
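The wire format behind this is simple: each Server-Sent Event is a `data:` line followed by a blank line. A sketch of the event generator, assuming the LLM client yields tokens one at a time; in a FastAPI app this generator would be handed to a `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], citations: list[str]) -> Iterator[str]:
    """Yield SSE frames: one per token, then the citations, then a done marker."""
    for tok in tokens:
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield f"data: {json.dumps({'citations': citations})}\n\n"
    yield "data: [DONE]\n\n"  # lets the browser close the EventSource cleanly
```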

Production architecture

FastAPI, ChromaDB, sentence-transformers — the same stack you'd run in production. Not a toy Jupyter notebook.

Context

Not a prototype. Not a tutorial.

Most RAG demos cut corners to look impressive. This one doesn't.

sovont-rag-demo
  • Real FastAPI app — OpenAPI docs, async endpoints, proper error handling
  • ChromaDB with persistent storage — data survives restarts
  • sentence-transformers running on CPU — no GPU required, reproducible embeddings
  • Ollama for LLM — swap models without changing code
  • Streaming responses via SSE — token-by-token, low perceived latency
  • Source citations with page numbers — every answer is grounded
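The model swap works because Ollama exposes the same HTTP API regardless of which model is loaded: a POST to `/api/generate` with the model name in the request body. A sketch of building that request with the model taken from configuration; the `LLM_MODEL` variable name and prompt template here are illustrative, not the demo's actual config.

```python
import json
import os

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(question: str, context: str) -> dict:
    """Payload for Ollama's /api/generate; change LLM_MODEL to swap models, no code changes."""
    return {
        "model": os.environ.get("LLM_MODEL", "llama3"),  # e.g. "mistral", "qwen2.5"
        "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        "stream": True,  # token-by-token output for the SSE layer
    }
```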
Typical RAG demo
  • Jupyter notebook — doesn't run as a service
  • In-memory vector store — data gone on restart
  • OpenAI embeddings — every query costs money, data leaves your machine
  • Hardcoded to one LLM provider
  • Synchronous, blocking responses — full wait before seeing output
  • No citations — answers appear credible but aren't traceable

Demo tech stack

FastAPI · ChromaDB · Ollama · sentence-transformers · Python

Want to see it live?

We'll walk you through the whole stack.

Book a demo session and we'll run it live — upload your documents, query the system, show you the retrieval logs, and answer every technical question you have.

Repo is currently private — reach out for access.