GPU-Accelerated Local LLM Deployment: Complete Guide to Running Ollama on VPS

Introduction: Why Run Local LLMs on Your VPS?

Large language models are shifting from cloud APIs to local deployment. Whether for data privacy, cost control, or low-latency responses, building your own LLM inference service on a VPS has become a popular choice for developers and enterprises alike.

Ollama, one of the most popular local LLM runtimes, combined with GPU acceleration, can turn your VPS into a capable AI inference engine. This guide walks you through building a production-grade GPU-accelerated Ollama service from scratch.

The Game-Changer: GPU Acceleration vs. CPU Inference

Metric	CPU Inference	GPU Inference (4GB VRAM)	GPU Inference (24GB VRAM)
Qwen3-8B tokens/s	~3-5	~25-40	~60-80
First token latency	2-4s	0.3-0.5s	0.1-0.2s
Concurrent requests	1-2	8-16	20-40
Monthly cost (vs. cloud API)	90%+ savings	95%+ savings	95%+ savings

The difference is dramatic. For workloads that call the model frequently, GPU acceleration is effectively a requirement: at ~3-5 tokens/s, CPU-only local inference is barely usable.

Step 1: Choosing the Right VPS Configuration

Recommended GPU VPS Configurations

Tier	GPU	VRAM	Suitable Models	Monthly Cost
Entry	NVIDIA T4	16GB	Qwen3-8B, Llama 3.1-8B	$30-55
Advanced	NVIDIA A10	24GB	Qwen3-14B, Mistral-Nemo	$70-110
High Performance	NVIDIA A100	40GB	Qwen3-32B, Llama 3.1-70B (quantized)	$200-400

Recommended providers: AutoDL, Vultr GPU, AWS EC2 G5/G6, Google Cloud A2, Alibaba Cloud PAI

💡 Money-saving tip: On-demand GPU platforms like AutoDL offer up to 70% discounts during off-peak hours, ideal for intermittent inference workloads.

Non-GPU Alternatives

If your VPS lacks a GPU:

CPU + llama.cpp quantization: Q4_K_M quantized 7B model, ~3-5 tokens/s
Intel Arc GPU: Some VPS providers offer Intel GPU instances with oneAPI support
Hybrid cloud-edge: Cache popular responses in Redis, route others to cloud APIs

Step 2: Installing NVIDIA GPU Drivers

2.1 Check GPU Hardware

# View GPU information
nvidia-smi

# Sample output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.120      Driver Version: 550.120      CUDA Version: 12.4   |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |   0  NVIDIA A10        Off  | 00000000:00:1E.0 Off |                    0 |
# | N/A   38C    P0    42W / 150W |   2048MiB / 24576MiB |      5%      Default |
# +-------------------------------+----------------------+----------------------+

2.2 Install Drivers (If Not Pre-installed)

# Ubuntu/Debian systems
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

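# Verify installation (reboot first if the driver was just installed)
nvidia-smi
# nvcc ships with the CUDA toolkit, not the driver; it is only present if the toolkit is installed
nvcc --version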

2.3 Install Docker + NVIDIA Container Toolkit

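# Install NVIDIA Container Toolkit from NVIDIA's apt repository (assumes Docker Engine is already installed)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit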

# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU in containers
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Step 3: Deploying Ollama Service

3.1 Docker Deployment (Recommended)

# Create persistent data directory
mkdir -p /opt/ollama/models

# Start Ollama container
docker run -d \
  --name ollama \
  --gpus all \
  -v /opt/ollama/models:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest

# Verify service
curl http://localhost:11434/api/tags

3.2 Direct Installation (Without Docker)

# One-line installer
curl -fsSL https://ollama.com/install.sh | sh

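# The installer normally registers and starts a systemd service on Linux; check it first
systemctl status ollama

# If no service exists (manual install), create the systemd unit yourself.
# Do not also run "ollama serve" in the background: both would compete for port 11434.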
sudo tee /etc/systemd/system/ollama.service <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama

3.3 Verify GPU Acceleration Is Working

# Pull and run Qwen3-8B model
ollama pull qwen3:8b

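# Run a prompt so the model gets loaded
ollama run qwen3:8b "Introduce yourself"

# Check where the model was loaded: the PROCESSOR column should read "100% GPU"
ollama ps

# GPU offload messages appear in the server logs, not in the "ollama run" output
journalctl -u ollama --no-pager | grep -i "offload"   # systemd install
docker logs ollama 2>&1 | grep -i "offload"           # Docker install

# Monitor GPU usage
watch -n 1 nvidia-smi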

⚠️ Note: If you see GPU offload related logs, GPU acceleration is active. If it says CPU only, check your driver and Docker configuration.
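The same check can be scripted against the HTTP API. The sketch below assumes the `/api/ps` endpoint returns a `models` list whose entries carry `size` and `size_vram` fields; the helper names are hypothetical and only illustrate the idea.

```python
import json
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(models: list) -> list:
    """Format one line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM" for m in models]


# Usage (requires a running Ollama server):
#   for line in report(loaded_models()):
#       print(line)
```

A value well below 100% suggests that part of the model spilled into system RAM, which usually explains slow generation.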

Step 4: Model Selection & Quantization Strategy

4.1 Recommended Models

Model	Parameters	Recommended Quant	Min VRAM	Use Case
Qwen3-8B	8B	Q4_K_M	6GB	General chat, code generation
Qwen3-14B	14B	Q4_K_M	10GB	Complex reasoning, document analysis
Llama 3.1-8B	8B	Q4_K_M	6GB	Multilingual, creative writing
Mistral-Nemo	12B	Q4_K_M	8GB	Long context (128K)
Gemma3-4B	4B	Q4_K_M	4GB	Lightweight, embedded deployment

4.2 Understanding Quantization Levels

Quantization balances accuracy against resource usage:

FP16: Highest precision, maximum VRAM usage. Best for research and high-accuracy tasks
Q8_0: 8-bit quantization, minimal quality loss, roughly half the VRAM of FP16
Q4_K_M: Recommended balance: quality close to FP16 for most tasks at roughly 30% of the VRAM
Q3_K_S: Aggressive compression with noticeable quality loss, suitable only for severely VRAM-constrained setups
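To make the trade-off concrete, here is a rough estimate of weight memory per quantization level. The bits-per-weight figures are assumed, rounded values for GGUF formats, and the result covers weights only (KV cache and runtime overhead come on top).

```python
# Approximate bits per weight for common GGUF formats (assumed, rounded figures).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q3_K_S": 3.5}


def weight_gib(params_billion: float, quant: str) -> float:
    """Estimate the memory needed for model weights alone, in GiB."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


# Print an estimate for an 8B-parameter model at each level.
for quant in BITS_PER_WEIGHT:
    print(f"8B @ {quant}: ~{weight_gib(8, quant):.1f} GiB")
```

Leave a few GiB of headroom above these numbers when matching a model to a GPU.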

# Download different quantizations
ollama pull qwen3:8b-q4_K_M    # Recommended: balanced performance & VRAM
ollama pull qwen3:8b-q8_0       # High precision needs
ollama pull qwen3:8b-fp16       # Maximum precision

4.3 Browsing the Model Library

# List all pulled models
ollama list

# Search for models: the CLI has no search command, so browse https://ollama.com/library

# View detailed model information
ollama show qwen3:8b

Step 5: Performance Tuning

5.1 Ollama Environment Variables

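# Create a systemd drop-in for the Ollama service
# (Docker deployments pass the same variables with "docker run -e NAME=value")
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=24h"
# 0.0.0.0 listens on all interfaces and the API has no built-in authentication:
# restrict port 11434 with a firewall, or use 127.0.0.1 when Nginx fronts the service (Step 6)
Environment="OLLAMA_HOST=0.0.0.0"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

Key parameters explained:

OLLAMA_NUM_PARALLEL: Concurrent requests per loaded model. Each slot needs its own context memory, so raise it only as far as VRAM allows
OLLAMA_MAX_LOADED_MODELS: Simultaneously loaded models. Reduces reload latency during model switching
OLLAMA_KEEP_ALIVE: How long models stay in memory. Prevents frequent load/unload cycles
OLLAMA_HOST: Address the server binds to
GPU layers: OLLAMA_NUM_GPU is not among the documented server variables. Ollama offloads as many layers as fit by default; the num_gpu option (per request, or PARAMETER num_gpu in a Modelfile) overrides the layer count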
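Parallel slots are not free: each one needs its own KV cache, so memory grows with the number of parallel requests times the context length. The sketch below estimates that cost. The layer count, KV head count and head dimension are assumptions for a Qwen3-8B-like model with an FP16 cache, not values read from Ollama.

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_len, parallel, bytes_per_value=2):
    """Estimate KV cache size in GiB: keys and values for every layer, token and slot."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * context_len * parallel / 2**30


# Assumed shape: 36 layers, 8 KV heads, head dimension 128, 8192-token context.
one_slot = kv_cache_gib(36, 8, 128, 8192, parallel=1)
four_slots = kv_cache_gib(36, 8, 128, 8192, parallel=4)
print(f"1 slot: {one_slot:.3f} GiB, 4 slots: {four_slots:.3f} GiB")
```

Under these assumptions, going from one to four parallel slots adds several GiB on top of the model weights, which is why the parallelism setting should follow available VRAM rather than a fixed rule.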

5.2 VRAM Optimization Techniques

# Monitor VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv

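# Set the default context length (smaller values reduce peak VRAM).
# These must be set in the service environment (systemd override or docker run -e);
# a shell export only affects an "ollama serve" started from that same shell.
export OLLAMA_CONTEXT_LENGTH=8192

# Reserve VRAM headroom per GPU for other workloads (value in bytes; 2 GiB here)
export OLLAMA_GPU_OVERHEAD=2147483648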

5.3 Batch Inference Optimization

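Ollama has no dedicated batch endpoint. For high-concurrency scenarios, send requests in parallel; the server processes up to OLLAMA_NUM_PARALLEL of them at once per model and queues the rest:

# A single non-streaming generation request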
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Explain the basics of quantum computing",
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "temperature": 0.7
  }
}'
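One way to push several prompts through this endpoint at once is to issue the requests in parallel from the client. The following is a minimal sketch using only the standard library; `generate` and `generate_many` are hypothetical helper names, and the worker count is an assumption that should not exceed the server's parallelism setting.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def generate(prompt, model="qwen3:8b", base_url="http://localhost:11434"):
    """Send one non-streaming /api/generate request and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]


def generate_many(prompts, send=generate, max_workers=4):
    """Run prompts concurrently; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# Usage (requires a running Ollama server):
#   answers = generate_many(["Explain TCP", "Explain UDP", "Explain QUIC"])
```

The `send` parameter exists so the fan-out logic can be exercised without a server, for example by passing a stub function.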

5.4 Performance Benchmarking

# Make sure the model is available locally
ollama pull qwen3:8b

# Test inference speed
time curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Please describe the development history of deep learning in 500 words",
    "stream": false
  }'
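Wall-clock time from `time curl` also includes model loading and network overhead. The JSON reply of a non-streaming request carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which give the generation speed directly. A small sketch, with a hypothetical helper name:

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed from an Ollama response: eval_count tokens over eval_duration ns."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / (duration_ns / 1e9)


# Usage: parse the curl output with json.loads() and pass the dict in.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Run the request a second time before recording numbers, so the model is already loaded and the first-load delay does not skew the result.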

Step 6: Security Hardening & Access Control

6.1 Reverse Proxy with Nginx + HTTPS

# /etc/nginx/sites-available/ollama
server {
    listen 443 ssl http2;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Streaming output is chunked HTTP, not WebSocket: disable buffering so tokens arrive as generated
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
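Streamed replies from /api/generate arrive as newline-delimited JSON objects over a regular HTTP response, each carrying a `response` fragment and a `done` flag. The sketch below shows how a client behind the proxy might reassemble them; `collect_stream` is a hypothetical helper, not part of any Ollama client library.

```python
import json


def collect_stream(lines):
    """Join the 'response' fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)


# Usage: the object returned by urllib.request.urlopen() can be iterated line by line
# and passed straight in as `lines`.
```

If the text only shows up in one burst at the end, the proxy is most likely still buffering the response.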

6.2 API Authentication (HTTP Basic Auth)

# Ollama has no built-in API keys; use Nginx basic authentication
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.ollama_auth ollama_user

# Add to the existing location block in the Nginx config
location / {
    auth_basic "Ollama API";
    auth_basic_user_file /etc/nginx/.ollama_auth;
    proxy_pass http://127.0.0.1:11434;
}
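Clients then have to send the credentials with every request. A minimal standard-library sketch of building such a request follows; the domain, user name and password are placeholders, and the helper names are hypothetical.

```python
import base64
import json
import urllib.request


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def build_request(base_url, user, password, payload):
    """Prepare an authenticated POST to /api/generate (not sent yet)."""
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )


# Usage (placeholder host and credentials):
#   req = build_request("https://ollama.yourdomain.com", "ollama_user", "secret",
#                       {"model": "qwen3:8b", "prompt": "hi", "stream": False})
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["response"])
```

Basic auth sends the password in an easily decoded form, so it should only be used over the HTTPS endpoint, never over plain HTTP.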

6.3 Firewall Configuration

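# UFW firewall rules
sudo ufw allow ssh                          # keep SSH reachable before enabling the firewall
sudo ufw allow 443/tcp comment 'HTTPS'
# Do not open 11434 to everyone: that bypasses Nginx authentication and rate limiting.
# For direct access from a trusted host only (203.0.113.10 is a placeholder address):
sudo ufw allow from 203.0.113.10 to any port 11434 proto tcp comment 'Ollama API'
sudo ufw enable
# Note: Docker-published ports bypass UFW; use "-p 127.0.0.1:11434:11434" to keep the container private

# Or with firewalld
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --reload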

6.4 Rate Limiting

# Nginx rate limiting (limit_req_zone goes in the http {} context, limit_req in the location block)
limit_req_zone $binary_remote_addr zone=ollama_limit:10m rate=10r/s;

location / {
    limit_req zone=ollama_limit burst=20 nodelay;
    proxy_pass http://127.0.0.1:11434;
}
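With this limit in place, clients that exceed the rate receive an error status (503 by default in Nginx, unless limit_req_status is changed). A well-behaved client can retry with exponential backoff. The sketch below assumes a `send` callable that returns a `(status_code, body)` tuple; that interface is an illustration, not an existing API.

```python
import time


def call_with_backoff(send, retries=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-throttled status or retries run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(retries + 1):
        status, body = send()
        if status not in (429, 503) or attempt == retries:
            return status, body
        sleep(base_delay * 2**attempt)  # wait longer after each throttled attempt
```

The `sleep` parameter is injectable so the retry schedule can be checked without actually waiting.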

Step 7: Integrating Into Your Workflow

7.1 As a ChatGPT Alternative

# Deploy the Open WebUI web interface (the :main image does not need GPU access)
docker run -d \
  --name open-webui \
  -e OLLAMA_BASE_URL=http://your-vps:11434 \
  -v open-webui:/app/backend/data \
  -p 3000:8080 \
  ghcr.io/open-webui/open-webui:main

# Visit http://your-vps:3000
# The backend URL is taken from OLLAMA_BASE_URL (http://your-vps:11434)

7.2 Integration with LangChain / LlamaIndex

# LangChain integration example (pip install langchain-ollama)
from langchain_ollama import OllamaLLM

llm = OllamaLLM(
    base_url="http://your-vps:11434",
    model="qwen3:8b",
    temperature=0.7,
    num_predict=2048,  # Ollama's name for the maximum number of generated tokens
)

response = llm.invoke("Please summarize the core points of this article: ...")
print(response)

7.3 As a Backend for RAG Systems

# Build a local RAG with ChromaDB
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Embedding model
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://your-vps:11434"
)

# Vector store
vectorstore = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings
)

# Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# LLM (OllamaLLM is the class imported above)
llm = OllamaLLM(
    model="qwen3:8b",
    base_url="http://your-vps:11434"
)
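The snippet above stops before the retrieval and generation steps are connected. One simple way to wire them together is to format the retrieved chunks into a prompt by hand; the prompt wording and the `build_rag_prompt` helper below are illustrative assumptions rather than a LangChain API.

```python
def build_rag_prompt(question: str, chunks: list) -> str:
    """Combine retrieved text chunks and the user question into a single prompt."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

With the objects defined earlier, usage would look like `docs = retriever.invoke(question)` followed by `llm.invoke(build_rag_prompt(question, [d.page_content for d in docs]))`, once documents have been added to the vector store.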

7.4 Home Assistant Integration

# The Ollama integration is configured through the Home Assistant UI, not configuration.yaml:
# Settings > Devices & services > Add integration > Ollama
#   URL:   http://your-vps:11434
#   Model: qwen3:8b

Step 8: Monitoring & Alerting

8.1 GPU Monitoring Dashboard

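# Using Prometheus + Grafana
# node_exporter covers host metrics; NVIDIA's dcgm-exporter covers GPU metrics.
# Replace <tag> with a current release tag from the NGC catalog.
docker run -d \
  --name gpu-exporter \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:<tag>

# Access GPU metrics at http://your-vps:9400/metrics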

8.2 Custom Health Check Script

#!/bin/bash
# /opt/ollama/health-check.sh

MODEL="qwen3:8b"
API_URL="http://localhost:11434"
ALERT_THRESHOLD=5000  # ms

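# Check service status
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "${API_URL}/api/tags")
if [ "$STATUS" != "200" ]; then
    echo "ALERT: Ollama service is down!"
    exit 1
fi

# Test inference latency (num_predict caps the reply so the check stays short)
START_TIME=$(date +%s%N)
RESPONSE=$(curl -s --max-time 60 -X POST "${API_URL}/api/generate" \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"hi\",\"stream\":false,\"options\":{\"num_predict\":16}}" | jq -r '.done')
END_TIME=$(date +%s%N)

ELAPSED=$(( (END_TIME - START_TIME) / 1000000 ))

# A failed or timed-out request must not be reported as healthy
if [ "$RESPONSE" != "true" ]; then
    echo "ALERT: Inference request failed"
    exit 1
fi

if [ "$ELAPSED" -gt "$ALERT_THRESHOLD" ]; then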
    echo "ALERT: Inference latency too high: ${ELAPSED}ms"
    exit 1
fi

echo "OK: Latency ${ELAPSED}ms, Response: $RESPONSE"
exit 0

8.3 Scheduled Tasks

# Run health check every hour (appends to the existing crontab instead of overwriting it)
(crontab -l 2>/dev/null; echo "0 * * * * /opt/ollama/health-check.sh >> /var/log/ollama-health.log 2>&1") | crontab -

Troubleshooting FAQ

Q1: Model loading fails with “out of memory”

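# Solution 1: Use a smaller model or a lower-bit quantization (check the model's tag list for what is published)
ollama pull qwen3:4b

# Solution 2: Reduce context length (set in the service environment, then restart)
export OLLAMA_CONTEXT_LENGTH=4096

# Solution 3: Unload models that are occupying VRAM ("ollama rm" only frees disk space)
ollama ps
ollama stop <model-name>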

Q2: GPU not being recognized

# Check Docker GPU configuration
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Reconfigure NVIDIA Container Toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Restart Ollama
sudo systemctl restart ollama

Q3: Inference speed slower than expected

# Confirm the model is fully on the GPU (the PROCESSOR column should show "100% GPU")
ollama ps

# Adjust parallelism (set in the service environment; higher values use more VRAM)
export OLLAMA_NUM_PARALLEL=4

# Check for swap usage (severely impacts performance)
free -h
# If swap usage is high, consider adding physical memory

Q4: Multi-user concurrency conflicts

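# Allow more models to stay loaded at once
export OLLAMA_MAX_LOADED_MODELS=3

# Handle more requests per model in parallel
export OLLAMA_NUM_PARALLEL=8

# Cap the queue of waiting requests (requests beyond the limit are rejected)
export OLLAMA_MAX_QUEUE=512

# Unload idle models sooner
export OLLAMA_KEEP_ALIVE=30m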

Cost Comparison: Self-hosted vs. Cloud API

Solution	Pricing	Billing Model	Data Privacy	Latency
Alibaba Cloud DashScope	$0.001/token	Pay-per-use	Data leaves server	200-500ms
OpenAI API	$0.10/million tokens	Pay-per-use	Data leaves server	300-800ms
Self-hosted VPS (GPU)	$40-55/month	Flat fee, near-zero marginal cost	Fully local	<50ms

📊 Conclusion: Self-hosting trades per-token fees for a fixed monthly cost, so its advantage grows with volume. Once monthly calls exceed roughly 1 million, a self-hosted GPU is usually the cheaper option for high-frequency use cases.
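The break-even point behind that conclusion is simple arithmetic. The sketch below uses the table's figures ($40/month VPS, $0.10 per million tokens) and assumes an average of 400 tokens per call; that last number is a hypothetical value chosen for illustration, and real prices and call sizes vary.

```python
def break_even_tokens(monthly_vps_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which a flat VPS fee equals pay-per-token API spend."""
    return monthly_vps_cost / api_price_per_million * 1_000_000


def break_even_calls(monthly_vps_cost, api_price_per_million, tokens_per_call):
    """The same break-even point expressed as a number of calls per month."""
    return break_even_tokens(monthly_vps_cost, api_price_per_million) / tokens_per_call


tokens = break_even_tokens(40, 0.10)
calls = break_even_calls(40, 0.10, tokens_per_call=400)
print(f"Break-even: {tokens:,.0f} tokens/month, about {calls:,.0f} calls/month")
```

Under these assumptions the break-even sits at about 400 million tokens, or about 1 million calls, per month. Longer prompts or higher API prices move it lower, while shorter calls move it higher.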

Summary

This guide covered every step of building a GPU-accelerated Ollama service on your VPS:

✅ Choosing the right GPU VPS configuration
✅ Installing NVIDIA drivers and Docker environment
✅ Deploying Ollama and verifying GPU acceleration
✅ Model selection and quantization strategy
✅ Performance tuning and VRAM management
✅ Security hardening and access control
✅ Integration into existing workflows
✅ Monitoring and alerting system

With this setup, your VPS is no longer just a regular cloud server — it becomes a private AI inference center, supporting chatbots, RAG systems, coding assistants, and many other applications.

Next step: Pick a GPU VPS today and deploy your first local LLM service following this guide!