Running DeepSeek V4 on Your VPS: Complete Local Deployment Guide

In the 2026 LLM landscape, DeepSeek V4 is the dark horse nobody can ignore. It matches or even surpasses GPT-4o and Claude 4 on multiple benchmarks — and crucially, it’s open-source.

This means you can deploy it on your own VPS. All data stays local, you pay a flat monthly rate instead of per-token fees, and you never have to worry about API providers changing their pricing overnight.

This guide covers the entire deployment journey: VPS specs, model quantization selection, three deployment methods (Ollama / vLLM / llama.cpp), API integration, and frontend setup.

Why Run DeepSeek V4 on Your Own VPS?

Let’s do the math.

Option	Monthly Cost	Data Privacy	Availability
ChatGPT / Claude API	$20-200/mo + usage fees	❌ Data passes through third party	✅ High
DeepSeek Official API	~$0.50/1M tokens	⚠️ Data passes through third party	✅ High
Self-hosted on VPS	$5-30/mo (fixed)	✅ 100% private	✅ 24/7

The core philosophy of self-hosting: fixed cost for unlimited usage. Whether your AI assistant handles 100 requests or 10,000 requests per day, your VPS bill stays the same.

More importantly: data privacy. When you run the model locally, no code, document, or conversation ever leaves your server. This is critical for developers, startups, and enterprises handling sensitive data.

DeepSeek V4 Model Overview

DeepSeek V4 is the next-generation LLM released by DeepSeek in early 2026:

MoE Architecture: Mixture of Experts — ~600B total parameters, but only ~37B activated per inference
Ultra-long Context: Native 128K tokens, extendable to 1M tokens
Multimodal: Supports text, code, and image understanding
Quantization-friendly: INT4/INT8 quantized versions run on consumer GPUs
License: MIT — free for commercial use
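
To make the "only ~37B activated" point concrete, here is a toy sketch of top-k expert routing in plain Python. The expert count, the value of k, and the gate scores are made-up illustration values, not DeepSeek V4's actual configuration; only the 600B and 37B figures come from the list above.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest gate logits (hypothetical router output).
    chosen = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

# Illustration: 8 experts, 2 active per token (assumed values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]
routing = top_k_route(logits, k=2)
print(routing)  # only experts 1 and 4 run for this token
print(f"active share of total parameters: {37 / 600:.1%}")
```

Every expert still has to be stored, so total parameter count drives memory while the active count drives per-token compute.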

For VPS deployment, we focus on quantized versions:

Quantization	VRAM Required	Speed	Quality Loss
FP16 (full)	~120GB	Fastest	None
INT8	~60GB	Fast	Minimal
INT4	~30GB	Medium	Acceptable
Q2_K (llama.cpp)	~16GB	Slow	Moderate
IQ1_S (extreme)	~8GB	Very slow	Noticeable

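For most VPS scenarios, INT4 quantization hits the sweet spot: decent inference speed and quality loss that’s nearly imperceptible. At ~30GB it does not fit entirely on a single 24GB GPU, so either offload the remaining layers to the CPU (llama.cpp supports this) or use a 48GB setup.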
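
The table's figures halve with each halving of bit width, which matches the usual rule of thumb that weight memory is roughly parameter count times bits per weight divided by 8. The sketch below applies that rule. The 60B parameter count is an assumption chosen only because it reproduces the ~120GB FP16 row; it is not an official figure, and the estimate ignores KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits / 8 (weights only)."""
    return params_billion * bits_per_weight / 8

def fits_in_vram(params_billion, bits_per_weight, vram_gb):
    """True if the weights alone fit in the given VRAM."""
    return weight_memory_gb(params_billion, bits_per_weight) <= vram_gb

ASSUMED_PARAMS_B = 60  # assumption: matches the ~120GB FP16 row, not an official number

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_memory_gb(ASSUMED_PARAMS_B, bits)
    print(f"{name}: ~{size:.0f} GB, fits in 24 GB: {fits_in_vram(ASSUMED_PARAMS_B, bits, 24)}")
```

Running the same function with your own parameter count and GPU size is a quick sanity check before renting hardware.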

Recommended VPS Specs

Minimum (Personal Use)

GPU: NVIDIA RTX 3090 / 4090 (24GB VRAM)
CPU: 4+ cores
RAM: 16GB+
Storage: 50GB+ SSD
Performance: Q4_K_M quant, ~5-8 tokens/s
Cost: ~$20-30/mo (Hetzner / Netcup auction)

Recommended (Production)

GPU: Dual RTX 4090 or A6000 (48GB)
CPU: 8+ cores
RAM: 32GB+
Storage: 100GB+ NVMe
Performance: INT4 quant, ~15-20 tokens/s
Cost: ~$60-100/mo

CPU-Only (Budget)

CPU: 16+ cores (AMD EPYC or Intel Xeon)
RAM: 64GB+
Storage: 50GB+ SSD
Performance: Q2_K quant, ~1-3 tokens/s
Cost: ~$15-30/mo

[Figure: DeepSeek V4 deployment architecture on a VPS]

Option 1: Deploy with Ollama (Simplest)

Ollama is the friendliest tool for running local LLMs. One command is all it takes to get DeepSeek V4 running.

Install Ollama

# One-click install
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Download and Run DeepSeek V4

# Pull INT4 quantized version (recommended, ~18GB)
ollama pull deepseek-v4:14b-int4

# Pull Q4_K_M version (fits 24GB VRAM)
ollama pull deepseek-v4:q4_k_m

# Run the model
ollama run deepseek-v4:q4_k_m

First pull may take 10-30 minutes depending on your connection speed. Subsequent starts take only seconds.

Enable Remote Access

By default, Ollama only listens on localhost. To access it from other devices:

# Edit systemd service
sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

Restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

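Access at http://your-vps-ip:11434. Ollama’s API has no built-in authentication, so restrict this port with a firewall or a reverse proxy rather than exposing it to the whole internet.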
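
To confirm the remote endpoint is reachable from another machine, one option is to list the installed models through Ollama's /api/tags endpoint. This is a minimal sketch using only the standard library; the helper names are made up for this example, and the address is the same placeholder used above.

```python
import json
import urllib.request

def model_names(tags_payload):
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_remote_models(base_url, timeout=5):
    """Query an Ollama instance; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))

# Usage (replace the placeholder with your server's address):
# print(list_remote_models("http://your-vps-ip:11434"))
```

If the call times out, check the firewall and that the OLLAMA_HOST override was picked up after the restart.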

Test the API

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-v4:q4_k_m",
  "prompt": "Write a quick sort in Python",
  "stream": false
}'
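
With "stream" left at its default of true, Ollama answers with newline-delimited JSON chunks instead of a single object. The sketch below shows one way to assemble such a stream in Python. The function names are hypothetical, and the sample chunks are hand-written for illustration rather than captured server output.

```python
import json
import urllib.request

def collect_stream(lines):
    """Join the "response" fragments from a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def generate_streaming(prompt, model="deepseek-v4:q4_k_m", base_url="http://localhost:11434"):
    """POST to /api/generate with streaming on and return the assembled text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return collect_stream(resp)

# Offline demonstration with two hand-written chunks.
sample = ['{"response": "def ", "done": false}', '{"response": "quicksort", "done": true}']
print(collect_stream(sample))  # prints "def quicksort"
```

In an interactive client you would print each fragment as it arrives instead of joining them at the end.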

Option 2: Deploy with vLLM (High Throughput)

vLLM is designed for production environments. Its PagedAttention and continuous batching deliver high-throughput inference — ideal for multi-user scenarios.

Install vLLM

# Use a Python virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (CUDA 12.1+)
pip install vllm

Start DeepSeek V4 API Server

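# FP16 weights need ~120GB (see the quantization table), so set
# --tensor-parallel-size to your GPU count or serve a quantized checkpoint
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --dtype float16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000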
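
Loading a large model can take several minutes, so scripts that depend on the server usually need to wait for it. The sketch below polls the server's /health endpoint until it answers. The helper names, timeout, and polling interval are assumptions for illustration; the probe and sleep functions are injectable so the loop can be tried without a running server.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=3):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(url, max_wait_s=600, interval_s=5, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds or max_wait_s elapses; True on success."""
    waited = 0
    while waited <= max_wait_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False

# Usage (needs the server started with the command above):
# wait_until_ready("http://localhost:8000/health")
```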

vLLM vs Ollama

Feature	Ollama	vLLM
Installation Complexity	⭐ Minimal	⭐⭐⭐ Python environment needed
Inference Throughput	Medium	High (continuous batching)
Concurrency	Limited	Excellent
API Compatibility	Ollama native	OpenAI-compatible
Memory Efficiency	Good	Better (PagedAttention)
Best For	Personal / Dev	Production / Multi-user
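
The concurrency row is easy to check on your own hardware. The sketch below sends a batch of prompts through a thread pool and times it. The `fake_send` function is a stand-in that only sleeps; swapping in a real client call against either backend is left as an assumption about your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(send, prompts, workers=8):
    """Send all prompts with `workers` threads; return (replies in order, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        replies = list(pool.map(send, prompts))
    return replies, time.perf_counter() - start

def fake_send(prompt):
    """Stand-in for a real HTTP call; sleeps to mimic latency."""
    time.sleep(0.05)
    return prompt.upper()

prompts = [f"question {i}" for i in range(8)]
_, serial = run_concurrent(fake_send, prompts, workers=1)
_, parallel = run_concurrent(fake_send, prompts, workers=8)
print(f"1 worker: {serial:.2f}s, 8 workers: {parallel:.2f}s")
```

Against a real server, total time that stays close to the single-request time as workers increase indicates the backend is batching requests effectively.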

Option 3: Deploy with llama.cpp (CPU + GPU Hybrid)

llama.cpp shines at hardware adaptability — it runs on pure CPU machines, GPU machines, or anything in between.

Build from Source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GPU-accelerated build (with CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Download Quantized Model

pip install huggingface-hub
huggingface-cli download \
  unsloth/DeepSeek-V4-GGUF \
  deepseek-v4-q4_k_m.gguf \
  --local-dir ./models/

Start Server

# Current llama.cpp builds name the server binary llama-server
./build/bin/llama-server \
  -m ./models/deepseek-v4-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 35 \
  --ctx-size 32768

The --n-gpu-layers parameter is flexible — reduce it if VRAM is tight, and the CPU handles the remaining layers.
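
A rough starting value for --n-gpu-layers can be estimated from the model file size and available VRAM. The sketch below assumes all layers are the same size and reserves a fixed amount of headroom for the KV cache and CUDA buffers; the layer count, file size, and reserve are hypothetical values, so treat the result as a first guess to tune from.

```python
import math

def gpu_layers_that_fit(model_size_gb, total_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is headroom for the KV cache and CUDA buffers (a guessed value).
    """
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(total_layers, math.floor(usable_gb / per_layer_gb))

# Hypothetical example: a 30 GB file with 60 layers on a 24 GB card.
print(gpu_layers_that_fit(30, 60, 24))  # 44
```

If the server still runs out of memory at the estimated value, lower it a few layers at a time.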

Frontend Integration: Open WebUI

Open WebUI (formerly Ollama WebUI) is the most popular frontend choice.

Docker Deployment

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui-data:/app/backend/data \
  -e OLLAMA_BASE_URL=http://your-vps-ip:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access http://your-vps-ip:3000 for the chat interface.

Docker Compose (Full Stack)

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ./ollama-data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: always

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - ./webui-data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: always

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/nginx/ssl
    depends_on:
      - open-webui
    restart: always

API Usage Examples

Once deployed, you can call the model with standard HTTP requests:

import requests
from openai import OpenAI  # only needed for query_vllm; pip install openai

# Ollama native API
def query_ollama(prompt, model="deepseek-v4:q4_k_m"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # large models can take a while to answer
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["response"]

# vLLM (OpenAI-compatible API)
def query_vllm(prompt, model="deepseek-ai/DeepSeek-V4"):
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",  # ignored unless the server was started with --api-key
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
    )
    return response.choices[0].message.content

print(query_ollama("Explain how MoE architecture works"))
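
For multi-turn conversations, Ollama's /api/chat endpoint accepts the full message history on every request. The class below is a minimal sketch of keeping that history on the client side. The class name and the injectable `transport` argument are inventions for this example, added so the logic can be exercised without a running server.

```python
import json
import urllib.request

class ChatSession:
    """Keeps message history and sends it to /api/chat on every turn."""

    def __init__(self, model="deepseek-v4:q4_k_m",
                 base_url="http://localhost:11434", transport=None):
        self.model = model
        self.base_url = base_url
        self.messages = []
        # transport is injectable so the class can be tested offline.
        self._transport = transport or self._post

    def _post(self, payload):
        req = urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            return json.load(resp)

    def ask(self, text):
        """Append the user turn, send the history, store and return the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self._transport(
            {"model": self.model, "messages": self.messages, "stream": False}
        )
        self.messages.append(reply["message"])
        return reply["message"]["content"]

# Usage (needs a running Ollama server):
# chat = ChatSession()
# print(chat.ask("What is MoE?"))
# print(chat.ask("How does that affect VRAM needs?"))
```

Because the whole history is resent each turn, long sessions grow the prompt and eventually hit the context limit, so trimming old turns is worth adding for real use.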

Performance Optimization Tips

1. Choose the Right Quantization

Format	Use Case
Q4_K_M	General recommendation, minimal quality loss
Q5_K_M	Quality-first, needs more VRAM
Q2_K	Low-end machines (~16GB VRAM, per the table above)
IQ4_NL	Newer format, better efficiency

2. Manage Context Length

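Limit vLLM’s --max-model-len or llama.cpp’s --ctx-size to 8192 when you don’t need long context. The KV cache grows linearly with context length, so this significantly reduces VRAM usage and caps the prompt length that drives first-token latency.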
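
To see why context length matters, the sketch below applies the standard KV cache formula for one sequence: two tensors (keys and values) per layer, per KV head, per token. The layer count, head count, and head dimension are hypothetical illustration values, not DeepSeek V4's real architecture, and models with compressed attention caches will need less than this formula suggests.

```python
def kv_cache_gb(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB for one sequence at FP16 by default."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value
    return total_bytes / 1024**3

# Hypothetical architecture values for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for ctx in (8192, 32768):
    print(f"ctx={ctx}: ~{kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM):.2f} GiB")
```

Under these assumptions, dropping from 32768 to 8192 tokens cuts the per-sequence cache to a quarter, and the saving multiplies with the number of concurrent sequences.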

3. Batch Processing (vLLM)

--max-num-batched-tokens 8192 \
--max-num-seqs 8

4. System-Level Tuning

# Enable persistence mode (keeps the driver loaded between jobs)
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 250  # Cap power draw at 250W; adjust for your GPU

# Disable swap
sudo swapoff -a

# Enable huge pages
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

5. Pre-warm Models

Ensure your deployment script pre-loads the model on boot:

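# systemd drop-in example (sudo systemctl edit ollama.service)
# "ollama pull" only downloads weights and needs a running server,
# so load the model after the server has started instead
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
ExecStartPost=/bin/sleep 5
ExecStartPost=-/usr/bin/ollama run deepseek-v4:q4_k_m ""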

Cost Comparison: Self-Hosted vs API

Assuming 1M tokens processed daily:

Option	Pricing	Monthly Total	Annual
GPT-4o API	$20 + $10/1M × 30	$320	$3,840
Claude 4 API	$20 + $15/1M × 30	$470	$5,640
DeepSeek Official API	$0.50/1M × 30	~$15	$180
VPS Self-Hosted	$30 (VPS) + $0 inference	$30	$360

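At the prices above, a $30 VPS breaks even against GPT-4o at about 1M tokens/month (($30 - $20 base fee) / $10 per 1M) and against Claude 4 at under 1M. Against DeepSeek’s official API at $0.50/1M, break-even is about 60M tokens/month, so at 1M tokens/day the official API is still cheaper and self-hosting wins on privacy rather than price. Beyond the break-even point, the more you use it, the more you save.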
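
The break-even point depends on which API you compare against. The sketch below computes it from the prices in the table; the function names are made up for this example, and real API pricing usually distinguishes input and output tokens, which this ignores.

```python
def break_even_tokens_m(vps_monthly, api_per_million, api_base_fee=0.0):
    """Monthly volume (millions of tokens) at which API cost equals the VPS bill."""
    return max(vps_monthly - api_base_fee, 0) / api_per_million

def monthly_api_cost(tokens_m, api_per_million, api_base_fee=0.0):
    """API bill for tokens_m million tokens in a month."""
    return api_base_fee + tokens_m * api_per_million

VPS = 30  # $/month, from the table
for name, rate, base in [("GPT-4o", 10, 20), ("Claude 4", 15, 20), ("DeepSeek API", 0.5, 0)]:
    print(f"{name}: 30M tokens cost ${monthly_api_cost(30, rate, base):.0f}, "
          f"break-even at {break_even_tokens_m(VPS, rate, base):.2f}M tokens/month")
```

Plugging in your own VPS price and expected volume gives a quick answer on whether cost alone justifies self-hosting.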

Conclusion

Running DeepSeek V4 on your VPS is the optimal balance between cost, privacy, and control.

Individual users: Start with Ollama + Open WebUI — done in 30 minutes
Small teams: Go with vLLM + Open WebUI for production-grade concurrency
Budget machines: Use llama.cpp for flexible CPU + GPU hybrid inference

Key takeaways:

INT4 / Q4_K_M quantization is the sweet spot for VPS deployment
Ollama is the fastest way to get started
vLLM provides production-grade throughput for multi-user setups
llama.cpp works even without a GPU
Pair with Open WebUI for a ChatGPT-like experience
Self-hosting’s long-term cost is far below GPT-4o or Claude API calls at sustained volume

Local deployment of open-source LLMs isn’t the future — it’s already here. And your VPS is ready to run it.