Introduction: Why Run Local LLMs on Your VPS?
Large language models are shifting from cloud APIs to local deployment. Whether for data privacy, cost control, or low-latency responses, building your own LLM inference service on a VPS has become a popular choice for developers and enterprises alike.
Ollama, one of the most widely used local LLM runtimes, combined with GPU acceleration, can turn your VPS into a capable AI inference engine. This guide walks you through building a production-grade GPU-accelerated Ollama service from scratch.
The Game-Changer: GPU Acceleration vs. CPU Inference
| Metric | CPU Inference | GPU Inference (16GB VRAM) | GPU Inference (24GB VRAM) |
|---|---|---|---|
| Qwen3-8B tokens/s | ~3-5 | ~25-40 | ~60-80 |
| First token latency | 2-4s | 0.3-0.5s | 0.1-0.2s |
| Concurrent requests | 1-2 | 8-16 | 20-40 |
| Monthly cost (vs. cloud API) | 90%+ savings | 95%+ savings | 95%+ savings |
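The difference is dramatic: in the table above, GPU inference delivers roughly 8x to 20x the token throughput of CPU inference and cuts first-token latency from seconds to a fraction of a second. For scenarios requiring frequent LLM calls, CPU-only local inference is barely usable.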
Step 1: Choosing the Right VPS Configuration
Recommended GPU VPS Configurations
| Tier | GPU | VRAM | Suitable Models | Monthly Cost |
|---|---|---|---|---|
| Entry | NVIDIA T4 | 16GB | Qwen3-8B, Llama 3.1-8B | $30-55 |
| Advanced | NVIDIA A10 | 24GB | Qwen3-14B, Mistral-Nemo | $70-110 |
| High Performance | NVIDIA A100 | 40GB | Qwen3-32B, Llama 3.1-70B (quantized) | $200-400 |
Recommended providers: AutoDL, Vultr GPU, AWS EC2 G5/G6, Google Cloud A2, Alibaba Cloud PAI
💡 Money-saving tip: On-demand GPU platforms like AutoDL offer up to 70% discounts during off-peak hours, ideal for intermittent inference workloads.
Non-GPU Alternatives
If your VPS lacks a GPU:
- CPU + llama.cpp quantization: Q4_K_M quantized 7B model, ~3-5 tokens/s
- Intel Arc GPU: Some VPS providers offer Intel GPU instances with oneAPI support
- Hybrid cloud-edge: Cache popular responses in Redis, route others to cloud APIs
Step 2: Installing NVIDIA GPU Drivers
2.1 Check GPU Hardware
# View GPU information
nvidia-smi
# Sample output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# | 0 NVIDIA A10 Off | 00000000:00:1E.0 Off | 0 |
# | N/A 38C P0 42W / 150W | 2048MiB / 24576MiB | 5% Default |
# +-------------------------------+----------------------+----------------------+
2.2 Install Drivers (If Not Pre-installed)
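# Ubuntu systems (ubuntu-drivers is Ubuntu-specific; on Debian install the nvidia-driver package instead)
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
# Reboot so the new kernel module loads
sudo reboot
# Verify installation
nvidia-smi
# nvcc ships with the CUDA Toolkit, not the driver; this only works if the toolkit is installed (Ollama does not need it)
nvcc --version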
2.3 Install Docker + NVIDIA Container Toolkit
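# Assumes Docker Engine is already installed (see https://docs.docker.com/engine/install/)
# Add NVIDIA's apt repository and install the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit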
# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Step 3: Deploying Ollama Service
3.1 Docker Deployment (Recommended)
# Create persistent data directory
mkdir -p /opt/ollama/models
# Start Ollama container
docker run -d \
--name ollama \
--gpus all \
-v /opt/ollama/models:/root/.ollama \
-p 11434:11434 \
ollama/ollama:latest
# Verify service
curl http://localhost:11434/api/tags
3.2 Direct Installation (Without Docker)
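# One-line installer (on systemd-based Linux it normally creates and starts an ollama service for you)
curl -fsSL https://ollama.com/install.sh | sh
# Start the service manually only if the installer did not start one (two instances would compete for port 11434)
ollama serve &
# Set up systemd auto-start (skip if /etc/systemd/system/ollama.service already exists)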
sudo tee /etc/systemd/system/ollama.service <<EOF
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
3.3 Verify GPU Acceleration Is Working
# Pull and run Qwen3-8B model
ollama pull qwen3:8b
# Check model loading logs for GPU offloading
ollama run qwen3:8b "Introduce yourself"
# Monitor GPU usage
watch -n 1 nvidia-smi
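⚠️ Note: If the Ollama server logs show "GPU offload" related messages, GPU acceleration is active. If they say "CPU only", check your driver and Docker configuration. Running ollama ps while the model is loaded also shows, in its PROCESSOR column, whether the model sits on the GPU or the CPU.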
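The same check can be scripted against Ollama's /api/ps endpoint, which lists loaded models with their total size and the portion held in VRAM. The sketch below is a minimal illustration, assuming the server runs on localhost:11434; the helper names gpu_fraction and loaded_models are made up for this example.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from /api/ps (assumed local server)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])
```

Calling loaded_models() right after a request and printing gpu_fraction(m) for each entry should give values near 1.0 when the model is fully offloaded; a value well below 1.0 suggests part of the model spilled to system RAM.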
Step 4: Model Selection & Quantization Strategy
4.1 Recommended Models
| Model | Parameters | Recommended Quant | Min VRAM | Use Case |
|---|---|---|---|---|
| Qwen3-8B | 8B | Q4_K_M | 6GB | General chat, code generation |
| Qwen3-14B | 14B | Q4_K_M | 10GB | Complex reasoning, document analysis |
| Llama 3.1-8B | 8B | Q4_K_M | 6GB | Multilingual, creative writing |
| Mistral-Nemo | 12B | Q4_K_M | 8GB | Long context (128K) |
| Gemma3-4B | 4B | Q4_K_M | 4GB | Lightweight, embedded deployment |
4.2 Understanding Quantization Levels
Quantization balances accuracy against resource usage:
- FP16: Highest precision, maximum VRAM usage. Best for research and high-accuracy tasks
- Q8_0: 8-bit quantization, minimal quality loss, half the VRAM
- Q4_K_M: Recommended balance, with quality close to FP16 at roughly a quarter to a third of the VRAM
- Q3_K_S: Extreme compression, suitable for severely VRAM-constrained setups
# Download different quantizations
ollama pull qwen3:8b-q4_K_M # Recommended: balanced performance & VRAM
ollama pull qwen3:8b-q8_0 # High precision needs
ollama pull qwen3:8b-fp16 # Maximum precision
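To see where the VRAM figures come from, the weight footprint can be estimated as parameters times bits per weight divided by 8. The sketch below uses approximate, rounded bits-per-weight values for each format (an assumption, since real GGUF files mix tensor types), and it ignores the KV cache and runtime overhead, which is why the minimum VRAM in the table above is higher than the raw weight size.

```python
# Approximate bits per weight for common GGUF quantizations (assumed, rounded values).
BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85, "q3_k_s": 3.5}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for model weights only, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant.lower()]
    # params_billions * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billions * bits / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B @ {quant}: ~{weight_memory_gb(8, quant):.1f} GB")
```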
4.3 Browsing the Model Library
# List all pulled models
ollama list
# Search for models: the Ollama CLI has no search command, so browse https://ollama.com/library instead
# View detailed model information
ollama show qwen3:8b
Step 5: Performance Tuning
5.1 Ollama Environment Variables
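# Create a systemd drop-in for the Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Key parameters explained:
- OLLAMA_NUM_PARALLEL: Number of requests each loaded model serves concurrently. Every parallel slot needs its own context memory, so raise it only as far as your VRAM allows
- OLLAMA_MAX_LOADED_MODELS: Simultaneously loaded models, which reduces reload latency during model switching
- OLLAMA_KEEP_ALIVE: How long models stay in memory after the last request, which prevents frequent load/unload cycles
- OLLAMA_HOST: Listen address. 0.0.0.0 exposes the API on every interface, so pair it with the firewall and reverse proxy from Step 6, or keep the default 127.0.0.1 when only local clients connect
- GPU layers: Ollama offloads as many layers as fit in VRAM by default. To override this, set the num_gpu option per model (Modelfile PARAMETER num_gpu or the API options field); OLLAMA_NUM_GPU does not appear to be a documented Ollama environment variable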
5.2 VRAM Optimization Techniques
# Monitor VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
# Set the default context length (reduces peak VRAM); for a systemd service, put this in override.conf instead of exporting it
export OLLAMA_CONTEXT_LENGTH=8192
# Reserve VRAM headroom per GPU for other processes (value in bytes; 1 GiB shown)
export OLLAMA_GPU_OVERHEAD=1073741824
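For scripted monitoring, the nvidia-smi query above can be read from Python. This is a rough sketch that assumes the csv,noheader,nounits output format and purely numeric fields (some GPUs report [N/A] for certain values, which this parser does not handle); the function names are illustrative.

```python
import subprocess

QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        rows.append({
            "used_mib": used,
            "total_mib": total,
            "util_pct": util,
            "used_ratio": used / total,
        })
    return rows


def read_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

A used_ratio that stays close to 1.0 while requests are queued is a hint to lower the context length or the number of parallel slots.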
5.3 Batch Inference Optimization
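Ollama has no separate batch endpoint. For high-concurrency scenarios, send multiple /api/generate requests in parallel; the server processes up to OLLAMA_NUM_PARALLEL of them at once per model and queues the rest:
# Single non-streaming generation request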
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Explain the basics of quantum computing",
"stream": false,
"options": {
"num_ctx": 4096,
"temperature": 0.7
}
}'
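One way to drive several requests at once from a client is a small thread pool. The sketch below assumes a local server on port 11434 and the qwen3:8b model used throughout this guide; generate_many and its send parameter are illustrative names, with send made swappable so the fan-out logic can be exercised without a running server.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:11434/api/generate"  # assumed local endpoint


def generate(prompt: str, model: str = "qwen3:8b") -> str:
    """Send one non-streaming generation request and return the text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]


def generate_many(prompts, send=generate, max_workers: int = 4) -> list:
    """Run prompts concurrently; results keep the order of the input prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Keeping max_workers at or below OLLAMA_NUM_PARALLEL avoids piling requests into the server-side queue, where they would only add waiting time.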
5.4 Performance Benchmarking
# Make sure the model is pulled (no separate benchmark tool is needed)
ollama pull qwen3:8b
# Test inference speed
time curl -X POST http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Please describe the development history of deep learning in 500 words",
"stream": false
}'
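Wall-clock time from the shell includes model loading and network overhead. The non-streaming response also carries its own timing fields (eval_count, eval_duration, load_duration, with durations in nanoseconds), so generation speed can be derived directly. A minimal sketch, with throughput as an illustrative helper name:

```python
def throughput(result: dict) -> dict:
    """Derive speed metrics from a non-streaming /api/generate response.

    Durations in the response are reported in nanoseconds.
    """
    ns_per_s = 1e9
    return {
        # Generated tokens divided by the time spent generating them
        "tokens_per_s": result["eval_count"] / (result["eval_duration"] / ns_per_s),
        # Time spent loading the model before generation started
        "load_s": result.get("load_duration", 0) / ns_per_s,
    }
```

Feeding it the parsed JSON from the curl call above gives a tokens-per-second figure that can be compared against the table in the introduction. Run the request twice and use the second result, since the first usually includes the model load.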
Step 6: Security Hardening & Access Control
6.1 Reverse Proxy with Nginx + HTTPS
# /etc/nginx/sites-available/ollama
server {
listen 443 ssl http2;
server_name ollama.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;
location / {
proxy_pass http://127.0.0.1:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Streaming output uses chunked HTTP responses, not WebSockets
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
6.2 API Key Authentication
# Use Nginx basic authentication
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.ollama_auth ollama_user
# Add to Nginx config
location / {
auth_basic "Ollama API";
auth_basic_user_file /etc/nginx/.ollama_auth;
proxy_pass http://127.0.0.1:11434;
}
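Clients then need to send the credentials with every request. The sketch below shows how an HTTP Basic Authorization header can be built with the standard library; the domain, user name, and password are placeholders, and build_request is an illustrative helper rather than part of any Ollama client.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def build_request(base_url: str, user: str, password: str, payload: dict):
    """Prepare an authenticated POST to the proxied /api/generate endpoint."""
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )
```

Passing the prepared request to urllib.request.urlopen sends it. Basic credentials are only base64-encoded, not encrypted, so use them over the HTTPS endpoint from 6.1 and never over plain HTTP.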
6.3 Firewall Configuration
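# UFW firewall rules
# Keep SSH reachable before enabling the firewall (adjust if SSH uses another port)
sudo ufw allow 22/tcp comment 'SSH'
sudo ufw allow 443/tcp comment 'HTTPS'
# Leave 11434 closed to the internet: direct access would bypass the Nginx authentication and rate limiting
# Note: ports published with docker -p bypass UFW; bind the container to 127.0.0.1 (-p 127.0.0.1:11434:11434) if it must stay private
sudo ufw enable
# Or with firewalld
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --reload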
6.4 Rate Limiting
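# Nginx rate limiting
# limit_req_zone belongs in the http {} context (e.g. /etc/nginx/nginx.conf); limit_req goes in the server or location block
limit_req_zone $binary_remote_addr zone=ollama_limit:10m rate=10r/s;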
location / {
limit_req zone=ollama_limit burst=20 nodelay;
proxy_pass http://127.0.0.1:11434;
}
Step 7: Integrating Into Your Workflow
7.1 As a ChatGPT Alternative
# Deploy Open WebUI web interface
docker run -d \
--name open-webui \
-e OLLAMA_BASE_URL=http://your-vps:11434 \
-v open-webui:/app/backend/data \
-p 3000:8080 \
ghcr.io/open-webui/open-webui:main
# Visit http://your-vps:3000
# The backend URL is set by OLLAMA_BASE_URL above and can be changed later in the admin settings
7.2 Integration with LangChain / LlamaIndex
# LangChain integration example (pip install langchain-ollama)
from langchain_ollama import OllamaLLM
llm = OllamaLLM(
    base_url="http://your-vps:11434",
    model="qwen3:8b",
    temperature=0.7,
    num_predict=2048  # maximum tokens to generate (Ollama's name for max_tokens)
)
response = llm.invoke("Please summarize the core points of this article: ...")
print(response)
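For scripts where a framework is more than you need, the same endpoint can be consumed directly. With "stream": true, /api/generate returns one JSON object per line, each carrying a response fragment, and the final object has done set to true. The sketch below assembles those fragments; collect_stream is an illustrative name.

```python
import json


def collect_stream(lines) -> str:
    """Join the `response` fragments of a streamed /api/generate reply.

    `lines` is any iterable of JSON lines (str or bytes).
    """
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

The response object returned by urllib.request.urlopen is iterable line by line, so it can be passed to collect_stream as is. To display tokens as they arrive, print each fragment inside the loop instead of joining them at the end.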
7.3 As a Backend for RAG Systems
# Build a local RAG with ChromaDB
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# Embedding model
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://your-vps:11434"
)
# Vector store
vectorstore = Chroma(
collection_name="my_docs",
embedding_function=embeddings
)
# Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# LLM (OllamaLLM is the class imported above)
llm = OllamaLLM(
    model="qwen3:8b",
    base_url="http://your-vps:11434"
)
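Conceptually, the retriever with k=3 embeds the query and returns the three stored chunks whose vectors are most similar to it. The sketch below illustrates that idea with plain cosine similarity over in-memory vectors. It is not how Chroma is implemented (Chroma uses an approximate index, and its default distance metric may differ), and the function names are made up for illustration.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs: dict, k: int = 3) -> list:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]
```

In the real pipeline the vectors come from the nomic-embed-text model, and the retrieved chunks are pasted into the prompt that is sent to the LLM.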
7.4 Home Assistant Integration
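# Home Assistant's Ollama integration is set up from the UI, not from configuration.yaml:
# 1. Settings > Devices & Services > Add Integration > Ollama
# 2. URL: http://your-vps:11434
# 3. Model: qwen3:8b
# 4. Select the new conversation agent under Settings > Voice assistants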
Step 8: Monitoring & Alerting
8.1 GPU Monitoring Dashboard
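# Using Prometheus + Grafana
# Install node_exporter (host metrics, port 9100) + NVIDIA dcgm-exporter (GPU metrics, port 9400)
docker run -d \
--name gpu-exporter \
--gpus all \
--cap-add SYS_ADMIN \
-p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
# Replace <tag> with a current release from NVIDIA's NGC catalog
# Access GPU metrics at http://your-vps:9400/metrics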
8.2 Custom Health Check Script
#!/bin/bash
# /opt/ollama/health-check.sh
MODEL="qwen3:8b"
API_URL="http://localhost:11434"
ALERT_THRESHOLD=5000 # ms
# Check service status
STATUS=$(curl -s -o /dev/null -w "%{http_code}" ${API_URL}/api/tags)
if [ "$STATUS" != "200" ]; then
echo "ALERT: Ollama service is down!"
exit 1
fi
# Test inference latency
START_TIME=$(date +%s%N)
RESPONSE=$(curl -s -X POST ${API_URL}/api/generate \
-d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"stream\":false}" | jq -r '.done')
END_TIME=$(date +%s%N)
ELAPSED=$(( (END_TIME - START_TIME) / 1000000 ))
if [ $ELAPSED -gt $ALERT_THRESHOLD ]; then
echo "ALERT: Inference latency too high: ${ELAPSED}ms"
exit 1
fi
echo "OK: Latency ${ELAPSED}ms, Response: $RESPONSE"
exit 0
8.3 Scheduled Tasks
# Run health check every hour (append to the existing crontab instead of overwriting it)
(crontab -l 2>/dev/null; echo "0 * * * * /opt/ollama/health-check.sh >> /var/log/ollama-health.log 2>&1") | crontab -
Troubleshooting FAQ
Q1: Model loading fails with “out of memory”
# Solution 1: Use lower quantization
ollama pull qwen3:8b-q3_K_S
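# Solution 2: Reduce context length (set this in the service environment, then restart Ollama)
export OLLAMA_CONTEXT_LENGTH=4096
# Solution 3: Unload models that are still resident in VRAM (ollama rm only frees disk space)
ollama ps
ollama stop qwen3:14b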
Q2: GPU not being recognized
# Check Docker GPU configuration
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# Reconfigure NVIDIA Container Toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Restart Ollama
sudo systemctl restart ollama
Q3: Inference speed slower than expected
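# Confirm the model is fully on the GPU (PROCESSOR column should read 100% GPU)
ollama ps
# Inspect the server logs for layer offloading details
journalctl -u ollama --no-pager | grep -i "offload"
# Adjust parallelism (set in the service environment, then restart Ollama)
export OLLAMA_NUM_PARALLEL=4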
# Check for swap usage (severely impacts performance)
free -h
# If swap usage is high, consider adding physical memory
Q4: Multi-user concurrency conflicts
# Allow more models to stay loaded at once
export OLLAMA_MAX_LOADED_MODELS=3
# Serve more requests per model in parallel; extra requests wait in a queue (size set by OLLAMA_MAX_QUEUE)
export OLLAMA_NUM_PARALLEL=8
# Unload idle models sooner
export OLLAMA_KEEP_ALIVE=30m
Cost Comparison: Self-hosted vs. Cloud API
| Solution | Monthly Cost | Qwen3-8B Call Cost | Data Privacy | Latency |
|---|---|---|---|---|
| Alibaba Cloud DashScope | $0.001/token | Pay-per-use | Data leaves server | 200-500ms |
| OpenAI API | $0.10/million tokens | Pay-per-use | Data leaves server | 300-800ms |
| Self-hosted VPS (GPU) | $40-55 | Nearly free | Fully local | <50ms |
📊 Conclusion: The self-hosted option has a fixed monthly cost, so its advantage grows with volume. Once monthly usage exceeds roughly 1 million calls, self-hosting on a GPU VPS is likely the cheaper option, and it keeps your data on your own server.
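The break-even point can be worked out from the table: divide the fixed monthly server cost by the API price per million tokens. The sketch below uses the table's figures ($40-55 per month, $0.10 per million tokens) purely as inputs; actual API prices vary by model and provider, so treat the numbers as an illustration.

```python
def break_even_tokens(monthly_cost_usd: float, api_price_per_million_usd: float) -> float:
    """Tokens per month at which API spend equals the fixed server cost."""
    return monthly_cost_usd / api_price_per_million_usd * 1_000_000


for cost in (40, 55):
    tokens = break_even_tokens(cost, 0.10)
    print(f"${cost}/month breaks even at about {tokens / 1e6:.0f}M tokens")
```

With these inputs the break-even lands at 400 to 550 million tokens per month, which corresponds to about 1 million calls averaging 400 to 550 tokens each.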
Summary
This guide covered every step of building a GPU-accelerated Ollama service on your VPS:
- ✅ Choosing the right GPU VPS configuration
- ✅ Installing NVIDIA drivers and Docker environment
- ✅ Deploying Ollama and verifying GPU acceleration
- ✅ Model selection and quantization strategy
- ✅ Performance tuning and VRAM management
- ✅ Security hardening and access control
- ✅ Integration into existing workflows
- ✅ Monitoring and alerting system
With this setup, your VPS is no longer just a regular cloud server — it becomes a private AI inference center, supporting chatbots, RAG systems, coding assistants, and many other applications.
Next step: Pick a GPU VPS today and deploy your first local LLM service following this guide!
