Build a Private RAG System on VPS: Document Q&A AI for Your Knowledge Base

What Is RAG and Why Do You Need It?

RAG (Retrieval-Augmented Generation) is one of the most practical techniques in self-hosted AI. The core idea is simple:

When a user asks a question, first retrieve relevant document fragments from a knowledge base, then feed those fragments as context to an LLM to generate an answer.
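To make the retrieve-then-generate idea concrete, here is a minimal sketch in plain Python. It assumes a toy bag-of-words "embedding" as a stand-in for a real embedding model, and the function names (`embed`, `retrieve`, `build_prompt`) are illustrative rather than part of any library.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real setup would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, fragments: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k fragments most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(fragments, key=lambda f: cosine(query_vec, embed(f)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine retrieved fragments and the question into one LLM prompt."""
    joined = "\n".join(f"- {fragment}" for fragment in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"
```

The prompt returned by `build_prompt` is what would be sent to the LLM. The rest of this guide swaps the toy pieces for nomic-embed-text, ChromaDB, and a 7B model.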

Compare it to traditional approaches:

Approach	Pros	Cons
Ask LLM directly	Simple	Knowledge cutoff — can’t answer about private docs
Fine-tune a model	Deep customization	Expensive ($500+), retrain needed for doc updates
RAG	Zero training cost, real-time doc updates	Requires pipeline setup

RAG’s killer feature: index your PDFs, Markdown files, and web pages into a vector database, and the AI can answer questions about them as soon as indexing finishes. No training is needed, no GPU is required, and updating documents is as simple as re-indexing.

Architecture Overview

                      ┌──────────────┐
                      │   Your Docs   │
                      │ (PDF/MD/TXT)  │
                      └──────┬───────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ Text Splitter │
                      └──────┬───────┘
                             │
                             ▼
                      ┌──────────────┐
                      │  Embedding    │ ← Ollama (nomic-embed-text)
                      └──────┬───────┘
                             │
                             ▼
                      ┌──────────────┐
                      │  ChromaDB     │
                      │ (Vector Store)│
                      └──────┬───────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
         ┌──────────┐ ┌──────────┐ ┌──────────────┐
         │  Query   │ │ Semantic │ │ LLM Generates │
         │          │ │ Retrieval│ │   Answer      │
         └──────────┘ └──────────┘ └──────────────┘
                              │
                         Ollama + LLM
                         (qwen2.5 / mistral)

All components run on a single VPS — your data never leaves your server.

Requirements

Resource	Minimum	Recommended
VPS	2 vCPU / 2GB RAM	4 vCPU / 8GB RAM
Storage	10GB	20GB+
Docker	✅ Required	Docker Compose
OS	Ubuntu 22.04 / 24.04	Debian 12

No GPU needed. RAG runs fine on CPU — both embedding and LLM inference use the CPU. On a 4-core VPS with moderate document volume, responses arrive in 3-8 seconds.

Provider picks: Hetzner CX22 (€3.99/mo) is enough to start. CX32 (€6.95/mo) gives more headroom. Budget option: RackNerd $2.50/mo works but is slightly slower.

Step 1: Install Ollama (Local LLM + Embedding)

Ollama handles two jobs:

Runs nomic-embed-text — converts documents to vector embeddings
Runs qwen2.5 or mistral — generates answers from retrieved context

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the embedding model (only 274MB)
ollama pull nomic-embed-text

# Pull the LLM (7B class recommended)
ollama pull qwen2.5:7b    # Great for mixed Chinese/English, ~4.7GB
# or
ollama pull mistral:7b    # Great for English-only, ~4.1GB

# Verify
ollama list

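Ollama listens on localhost:11434 by default. Open WebUI (Step 2) and the Python script in Step 5 both call this API. On Linux, a container that reaches the host through host.docker.internal arrives on the Docker bridge interface rather than 127.0.0.1, so if Open WebUI cannot connect you may need to set OLLAMA_HOST=0.0.0.0 for the Ollama service and block port 11434 from the public internet with your firewall.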
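If you want to see what a client of this API looks like, the sketch below calls Ollama with only the standard library. It assumes a recent Ollama release that exposes `/api/embed` and `/api/generate`; the helper names (`build_request`, `post_json`, `embed`, `generate`) are my own, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request for an Ollama endpoint."""
    return urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read())


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Return the embedding vector for one piece of text."""
    return post_json("/api/embed", {"model": model, "input": text})["embeddings"][0]


def generate(prompt: str, model: str = "qwen2.5:7b") -> str:
    """Return the full (non-streamed) completion for a prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return post_json("/api/generate", payload)["response"]
```

With Ollama running, `len(embed("hello"))` should report the embedding dimension and `generate("Say hi")` should return a short reply. CPU inference can take a while, hence the generous timeout.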

Step 2: Deploy Web UI + Vector Database

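Use Docker Compose to deploy Open WebUI (with built-in RAG pipeline) and ChromaDB. Open WebUI falls back to an embedded Chroma store unless it is told where the external one lives, so the CHROMA_HTTP_* variables below point it at the chromadb container:

# docker-compose.yml
services:
  # ── Open WebUI (includes RAG pipeline) ──
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - WEBUI_SECRET_KEY=your-random-secret-key-here
      # Use the chromadb service below instead of the embedded store
      - VECTOR_DB=chroma
      - CHROMA_HTTP_HOST=chromadb
      - CHROMA_HTTP_PORT=8000
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - chromadb

  # ── ChromaDB (vector database) ──
  chromadb:
    # If Open WebUI reports API errors, pin a tag that matches its Chroma client
    image: chromadb/chroma:latest
    container_name: chromadb
    restart: unless-stopped
    ports:
      # Bind to loopback only: the Chroma API has no authentication by default
      - "127.0.0.1:8000:8000"
    volumes:
      - chroma-data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE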

volumes:
  open-webui-data:
  chroma-data:

Start it up:

docker compose up -d

Visit http://your-vps-ip:3000 and register an account. The first account created becomes the admin, so sign up before sharing the URL.
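Before moving on, it can help to confirm that all three services accept connections. The following is a small sketch that only checks whether the ports used in this guide are open on the VPS itself. It says nothing about whether the services are healthy, and the `SERVICES` mapping is simply the ports from the steps above.

```python
import socket

# Ports as used in this guide; adjust if you changed the mappings.
SERVICES = {"Ollama": 11434, "Open WebUI": 3000, "ChromaDB": 8000}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "127.0.0.1") -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Run `print(check_services())` on the server. Any `False` entry points at the container or service to inspect first.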

Step 3: Configure RAG in Open WebUI

3.1 Connect to Ollama

Go to Admin Panel → Settings → External Connections:

Ollama API URL: http://host.docker.internal:11434
Click “Verify Connection” — green check means success

3.2 Configure Embedding Model

Go to Admin Panel → Settings → Documents:

Select nomic-embed-text as the embedding model
This auto-vectorizes every document you upload

3.3 Configure RAG Pipeline

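Go to the retrieval settings (in recent Open WebUI builds these sit under Admin Panel → Settings → Documents, next to the embedding model; the menu location varies by version) and make sure: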

RAG Pipeline is enabled
Top-K retrieval: default is 3, set to 5 for better recall
Chunk Size: default 1000 characters, set to 500 for finer retrieval
Chunk Overlap: 100 to prevent context breaks
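To see what chunk size and overlap actually do, here is a minimal fixed-width chunker. It is a simplification: Open WebUI and LangChain split on separators such as paragraphs and sentences first, so treat `chunk_text` as an illustration of the overlap idea, not as their implementation.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

Each chunk repeats the last `overlap` characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.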

Step 4: Upload Documents & Test

4.1 Upload Documents

In the Open WebUI chat interface:

Click the + icon left of the input box
Select “Upload Documents”
Supported formats: PDF, TXT, Markdown, DOCX, CSV

The system automatically:

Splits documents into 500-char chunks
Generates embeddings via nomic-embed-text
Stores them in ChromaDB

4.2 Test Q&A

Upload a technical document (like your server ops manual), then ask:

“I changed my SSH port. Do I need to update my Caddy config too?”

The RAG system will:

Convert your question to a vector
Find the 5 most similar document chunks in ChromaDB
Send those chunks + your question to the LLM
Generate an answer with source references
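The third and fourth steps boil down to prompt assembly. The sketch below shows one way to do it, assuming each retrieved chunk arrives as a dict with `text`, `source`, and `score` keys. That shape and the prompt wording are hypothetical, not what Open WebUI uses internally.

```python
def build_rag_prompt(question: str, chunks: list[dict], top_k: int = 5) -> str:
    """Keep the top_k highest-scoring chunks and number them so the LLM can cite them."""
    best = sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_k]
    context = "\n\n".join(
        f"[{i}] ({chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(best, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources like [1]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the chunks is what makes source references possible: the model can only cite what the prompt labels.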

Step 5: Batch Import with Python (Advanced)

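Uploading through the web UI works for a few docs. For hundreds of documents, use this script to embed them in bulk and write them to a Chroma store on disk (./chroma_db). This store is separate from the collections Open WebUI manages, so query it from your own code rather than expecting the files to appear in the web UI: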

# batch_ingest.py
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

# 1. Load all docs from directory
#    Text formats go through TextLoader; PDFs need a real PDF parser (pypdf).
LOADERS = [
    ("**/*.md", TextLoader, {"encoding": "utf-8"}),
    ("**/*.txt", TextLoader, {"encoding": "utf-8"}),
    ("**/*.pdf", PyPDFLoader, {}),
]
docs = []
for pattern, loader_cls, loader_kwargs in LOADERS:
    loader = DirectoryLoader(
        "./docs/",
        glob=pattern,
        loader_cls=loader_cls,
        loader_kwargs=loader_kwargs,
    )
    docs.extend(loader.load())
# Note: PyPDFLoader yields one document per PDF page
print(f"Loaded {len(docs)} documents")

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")

# 3. Generate embeddings and store in ChromaDB
#    With persist_directory set, data is written to disk automatically,
#    so no explicit persist() call is needed.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print("✅ Batch import complete!")

Run it:

pip install langchain-community langchain-text-splitters langchain-ollama langchain-chroma pypdf
python batch_ingest.py
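One thing to watch with a batch script like this: running it twice is likely to store every chunk twice, because chunk IDs are generated randomly unless you supply them. A possible workaround is to derive a stable ID from each chunk's source, position, and text. The helpers below (`chunk_id`, `ids_for`) are hypothetical names for that idea, sketched with the standard library only.

```python
import hashlib


def chunk_id(source: str, index: int, text: str) -> str:
    """Deterministic ID: the same source, position, and text always hash the same."""
    digest = hashlib.sha256(f"{source}\n{index}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


def ids_for(chunks: list[tuple[str, str]]) -> list[str]:
    """Build IDs for (source, text) pairs, counting positions per source file."""
    counters: dict[str, int] = {}
    ids = []
    for source, text in chunks:
        index = counters.get(source, 0)
        counters[source] = index + 1
        ids.append(chunk_id(source, index, text))
    return ids
```

With LangChain documents, the pairs could come from `(doc.metadata["source"], doc.page_content)`, and the resulting list can be passed as the `ids` argument of `Chroma.from_documents`. That is expected to overwrite existing entries instead of duplicating them, but verify it against your installed version. Chunks whose text changed get new IDs, so stale entries still need deleting separately.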

Performance Tuning

1. Choose the Right LLM

Model	Size	Chinese	English	Best For
qwen2.5:7b	4.7GB	⭐⭐⭐	⭐⭐⭐	Mixed-language docs
mistral:7b	4.1GB	⭐⭐	⭐⭐⭐⭐	English-only docs
phi4:14b	9.1GB	⭐⭐	⭐⭐⭐⭐	High-end (16GB RAM)
llama3.2:3b	2.0GB	⭐	⭐⭐⭐	Low-spec VPS

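For a 2GB RAM VPS: llama3.2:3b + nomic-embed-text ≈ 2.3GB disk. The weights alone are about as large as total RAM, so add swap and expect slow responses.

For an 8GB RAM VPS: qwen2.5:7b or mistral:7b + nomic-embed-text ≈ 5GB disk. Budget at least the model file size (4.1-4.7GB) in RAM, plus room for the context window, Open WebUI, and ChromaDB.

For a 16GB RAM VPS: phi4:14b gives noticeably better answer quality, but at 9.1GB it will not fit in 8GB of RAM.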

2. Tune Chunk Strategy

Code docs: chunk_size=300, overlap=50 (short code segments, precise matching)
Tech manuals: chunk_size=500, overlap=100 (balance precision & context)
Long articles: chunk_size=1000, overlap=200 (paragraph integrity)
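To get a feel for how these presets affect index size, you can estimate the chunk count from document length. The formula below assumes plain fixed-width windows; real splitters break on separators, so actual counts will differ somewhat. The `PRESETS` names are just labels for the three strategies above.

```python
import math

# (chunk_size, overlap) pairs from the list above
PRESETS = {
    "code": (300, 50),
    "manual": (500, 100),
    "article": (1000, 200),
}


def estimate_chunks(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of chunks for a document of doc_chars characters."""
    if doc_chars <= 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    # After the first chunk, each new chunk covers (chunk_size - overlap) new characters.
    return math.ceil((doc_chars - overlap) / (chunk_size - overlap))
```

For a 10,000-character document this gives roughly 40 chunks with the code preset, 25 with the manual preset, and 13 with the article preset. Smaller chunks mean more vectors to store and search.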

3. Widen the Retrieval Window

In Open WebUI settings, raise Top-K from 3 to 5-7 to give the LLM more candidate fragments. Each extra chunk lengthens the prompt, so expect slower responses on CPU.

Advanced: Add Caddy Reverse Proxy

# /etc/caddy/Caddyfile
rag.yourdomain.com {
    reverse_proxy localhost:3000
}

Benefits:

Automatic HTTPS via Let’s Encrypt
Domain access instead of IP:port
Easy team sharing

apt install caddy
systemctl enable --now caddy

Disk Management

Vector database growth per document volume:

100 documents ~ 50MB (ChromaDB storage)
1,000 documents ~ 500MB
10,000 documents ~ 5GB
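The figures above work out to about 0.5MB per document. If you want to project your own numbers, a rough sketch follows. It assumes 768-dimensional float32 vectors (the output size of nomic-embed-text) and covers only the raw vectors, so the index structures, stored chunk text, and metadata that make up the rest of the footprint are not modeled.

```python
EMBED_DIMS = 768      # nomic-embed-text output size
BYTES_PER_FLOAT = 4   # assuming float32 storage


def raw_vector_mb(num_chunks: int, dims: int = EMBED_DIMS) -> float:
    """Size of the bare embedding vectors in MB (no index, text, or metadata)."""
    return num_chunks * dims * BYTES_PER_FLOAT / 1_000_000


def projected_store_mb(num_docs: int, mb_per_doc: float = 0.5) -> float:
    """Linear projection using this guide's ~0.5MB-per-document figure."""
    return num_docs * mb_per_doc
```

Comparing the two for your own corpus shows how much of the on-disk size comes from the vectors themselves and how much from everything stored around them.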

Periodic cleanup:

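# Check DB size (Compose prefixes volume names with the project name,
# so look up the real volume name and its size)
docker volume ls | grep chroma-data
docker system df -v | grep chroma-data

# Prune stopped containers, unused networks, and dangling images
# (this does not shrink the vector database itself)
docker system prune -f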

FAQ

RAG answers are inaccurate

Increase Top-K: 3 → 7, see if that helps
Decrease Chunk Size: 500 → 300 for finer retrieval
Upgrade LLM: qwen2.5:7b → phi4:14b is a big quality jump
Document format: Prefer Markdown; PDF quality depends on parsing

PDF upload produces garbled text

Open WebUI's default PDF loader is built on pypdf in recent builds, and extraction quality depends heavily on how the PDF was produced. Fixes:

Use Markdown instead when possible
Ensure PDFs are text-based (not scanned)
For scans, pre-process with ocrmypdf

Out of memory

# Check Ollama memory usage
ollama ps

# Unload a model you are not currently using
ollama stop <model-name>

# Or switch to a smaller model
ollama pull llama3.2:3b
ollama rm qwen2.5:7b
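Before pulling a larger model, you can check whether it is likely to fit. The sketch below reads `MemAvailable` from `/proc/meminfo` (Linux only) and compares it with the model size. The 1.2 headroom factor is an assumption of mine to leave room for the context window, not a figure from Ollama.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: value_in_kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def model_fits(model_gb: float, available_kb: int, headroom: float = 1.2) -> bool:
    """True if available memory covers the model size times the headroom factor."""
    needed_kb = model_gb * headroom * 1024 * 1024
    return available_kb >= needed_kb


def available_kb(path: str = "/proc/meminfo") -> int:
    """Read the kernel's estimate of memory available to new processes."""
    with open(path) as handle:
        return parse_meminfo(handle.read()).get("MemAvailable", 0)
```

On the VPS, `model_fits(4.7, available_kb())` gives a quick yes or no for qwen2.5:7b. Swap can stretch the limit, at a heavy cost in response time.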

ChromaDB connection failure

# Check container status
docker ps | grep chroma

# View logs
docker logs chromadb

# Restart
docker compose restart chromadb

What’s Next

Once your RAG system is running, you can:

Build a personal knowledge base — search and ask questions across all your docs
Team Wiki — deploy in your intranet so new members can ask “how do we deploy?”
Code documentation search — index your project READMEs and API docs
Automated workflows — combine with n8n to auto-crawl and index new docs daily

RAG offers the best ROI in self-hosted AI right now — no GPU, no training, just a €4/month VPS. The more documents you feed it, the more valuable it becomes.

Questions about your RAG deployment? Drop a comment below. Next up: automating knowledge updates with RAG + n8n.