Running DeepSeek V4 on Your VPS: Complete Local Deployment Guide

From model quantization to API setup — deploy DeepSeek V4 privately on your own VPS. Covers Ollama, vLLM, and llama.cpp deployment with cost comparison vs cloud APIs.

In the 2026 LLM landscape, DeepSeek V4 is the dark horse nobody can ignore. It matches or even surpasses GPT-4o and Claude 4 on multiple benchmarks — and crucially, it’s open-source.

This means you can deploy it on your own VPS. All data stays local, you pay a flat monthly rate instead of per-token fees, and you never have to worry about API providers changing their pricing overnight.

This guide covers the entire deployment journey: VPS specs, model quantization selection, three deployment methods (Ollama / vLLM / llama.cpp), API integration, and frontend setup.

Why Run DeepSeek V4 on Your Own VPS?

Let’s do the math.

| Option | Monthly Cost | Data Privacy | Availability |
|---|---|---|---|
| ChatGPT / Claude API | $20-200/mo + usage fees | ❌ Data passes through third party | ✅ High |
| DeepSeek Official API | ~$0.50/1M tokens | ⚠️ Data passes through third party | ✅ High |
| Self-hosted on VPS | $5-30/mo (fixed) | 100% private | ✅ 24/7 |

The core philosophy of self-hosting: a fixed cost that does not scale with usage. Whether your AI assistant handles 100 requests or 10,000 requests per day, your VPS bill stays the same, up to the throughput your hardware can sustain.

More importantly: data privacy. When you run the model locally, no code, document, or conversation ever leaves your server. This is critical for developers, startups, and enterprises handling sensitive data.

DeepSeek V4 Model Overview

DeepSeek V4 is the next-generation LLM released by DeepSeek in early 2026:

  • MoE Architecture: Mixture of Experts — ~600B total parameters, but only ~37B activated per inference
  • Ultra-long Context: Native 128K tokens, extendable to 1M tokens
  • Multimodal: Supports text, code, and image understanding
  • Quantization-friendly: INT4/INT8 quantized versions run on consumer GPUs
  • License: MIT — free for commercial use

For VPS deployment, we focus on quantized versions:

| Quantization | VRAM Required | Speed | Quality Loss |
|---|---|---|---|
| FP16 (full) | ~120GB | Fastest | None |
| INT8 | ~60GB | Fast | Minimal |
| INT4 | ~30GB | Medium | Acceptable |
| Q2_K (llama.cpp) | ~16GB | Slow | Mild |
| IQ1_S (extreme) | ~8GB | Very slow | Noticeable |

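For most VPS scenarios, 4-bit quantization (INT4 / Q4_K_M) hits the sweet spot: decent inference speed and quality loss that is hard to notice in everyday use. At the ~30GB listed above it does not fit entirely in a single 24GB GPU, so on a 24GB card expect to offload some layers to system RAM.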
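A rough way to sanity-check the table is that weight memory is approximately parameter count × bits per weight ÷ 8. The sketch below is only that arithmetic; it ignores the KV cache and runtime overhead. The 60B figure is an assumption, chosen because it reproduces the table's numbers. A checkpoint with all ~600B parameters resident would need roughly ten times more, so check the actual download size of the variant you pull.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


# 60 and 600 are illustrative parameter counts (in billions), not published specs.
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    small = weight_memory_gb(60, bits)
    large = weight_memory_gb(600, bits)
    print(f"{label}: 60B -> {small:.0f} GB, 600B -> {large:.0f} GB")
```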

VPS Hardware Requirements

Minimum (Personal Use)

  • GPU: NVIDIA RTX 3090 / 4090 (24GB VRAM)
  • CPU: 4+ cores
  • RAM: 16GB+
  • Storage: 50GB+ SSD
  • Performance: Q4_K_M quant, ~5-8 tokens/s
  • Cost: ~$20-30/mo (Hetzner / Netcup auction)

Recommended

  • GPU: Dual RTX 4090 or A6000 (48GB)
  • CPU: 8+ cores
  • RAM: 32GB+
  • Storage: 100GB+ NVMe
  • Performance: INT4 quant, ~15-20 tokens/s
  • Cost: ~$60-100/mo

CPU-Only (Budget)

  • CPU: 16+ cores (AMD EPYC or Intel Xeon)
  • RAM: 64GB+
  • Storage: 50GB+ SSD
  • Performance: Q2_K quant, ~1-3 tokens/s
  • Cost: ~$15-30/mo
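
To put the tokens/s figures in perspective, they can be converted into a daily ceiling. This is a back-of-the-envelope sketch that assumes a single stream generating continuously for 24 hours, so real-world capacity will be lower; the tier labels are just shorthand for the three configurations above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_tokens_per_day(tokens_per_second: float) -> int:
    """Upper bound on tokens generated in a day at a constant rate."""
    return int(tokens_per_second * SECONDS_PER_DAY)


# (low, high) tokens/s ranges quoted in the hardware tiers above
tiers = {
    "24GB GPU": (5, 8),
    "48GB GPU": (15, 20),
    "CPU-only": (1, 3),
}

for name, (low, high) in tiers.items():
    print(f"{name}: {max_tokens_per_day(low):,} to {max_tokens_per_day(high):,} tokens/day")
```
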

[Figure: Complete DeepSeek V4 deployment architecture on VPS]

Option 1: Deploy with Ollama (Simplest)

Ollama is the friendliest tool for running local LLMs. One command is all it takes to get DeepSeek V4 running.

Install Ollama

# One-click install
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Download and Run DeepSeek V4

# Tag names here are illustrative; check the Ollama model library for the tags actually published
# Pull INT4 quantized version (recommended, ~18GB)
ollama pull deepseek-v4:14b-int4

# Pull Q4_K_M version (fits 24GB VRAM)
ollama pull deepseek-v4:q4_k_m

# Run the model
ollama run deepseek-v4:q4_k_m

First pull may take 10-30 minutes depending on your connection speed. Subsequent starts take only seconds.

Enable Remote Access

By default, Ollama only listens on localhost. To access it from other devices:

# Edit systemd service
sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

Restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

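Access at http://your-vps-ip:11434. Ollama has no built-in authentication, so restrict this port with a firewall rule or put it behind an authenticated reverse proxy rather than leaving it open to the internet. OLLAMA_ORIGINS=* likewise allows requests from any web origin and is best narrowed to the origins you actually use.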

Test the API

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-v4:q4_k_m",
  "prompt": "Write a quick sort in Python",
  "stream": false
}'
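
With "stream": true, Ollama returns newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The sketch below shows one way to consume that stream from Python. The helper names iter_fragments and stream_generate are made up for this example, and the model tag is the one assumed throughout this guide.

```python
import json
from typing import Iterable, Iterator


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):  # final object of the stream
            return


def stream_generate(prompt: str, model: str = "deepseek-v4:q4_k_m",
                    host: str = "http://localhost:11434") -> None:
    """Print the model's answer as it arrives."""
    import requests  # third-party: pip install requests

    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(f"{host}/api/generate", json=payload,
                       stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for fragment in iter_fragments(resp.iter_lines()):
            print(fragment, end="", flush=True)
    print()
```

Calling stream_generate("Write a quick sort in Python") prints tokens as they are produced instead of waiting for the full completion.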

Option 2: Deploy with vLLM (High Throughput)

vLLM is designed for production environments. Its PagedAttention and continuous batching deliver high-throughput inference — ideal for multi-user scenarios.

Install vLLM

# Use a Python virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (CUDA 12.1+)
pip install vllm

Start DeepSeek V4 API Server

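# Unquantized FP16 weights need the ~120GB listed in the quantization table;
# on smaller GPUs, point --model at a quantized checkpoint or raise --tensor-parallel-size
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4 \
  --dtype float16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --port 8000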

vLLM vs Ollama

| Feature | Ollama | vLLM |
|---|---|---|
| Installation Complexity | ⭐ Minimal | ⭐⭐⭐ Python environment needed |
| Inference Throughput | Medium | High (continuous batching) |
| Concurrency | Limited | Excellent |
| API Compatibility | Ollama native | OpenAI-compatible |
| Memory Efficiency | Good | Better (PagedAttention) |
| Best For | Personal / Dev | Production / Multi-user |
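
Continuous batching only pays off when requests actually overlap. The following is a minimal client-side sketch for issuing overlapping requests with a thread pool. The send argument is a placeholder for any function that sends one prompt and returns the reply (for instance the query_vllm helper from the API Usage Examples section); run_concurrently itself is a name made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(prompts: List[str], send: Callable[[str], str],
                     max_workers: int = 8) -> List[str]:
    """Send all prompts from parallel threads; results keep the prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Stand-in for a real network call so the sketch runs without a server.
    fake_send = str.upper
    print(run_concurrently(["first prompt", "second prompt"], fake_send))
```

Keeping max_workers at or below the server's --max-num-seqs setting is a reasonable starting point, since extra requests would simply queue on the server side.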

Option 3: Deploy with llama.cpp (CPU + GPU Hybrid)

llama.cpp shines at hardware adaptability — it runs on pure CPU machines, GPU machines, or anything in between.

Build from Source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GPU-accelerated build (with CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Download Quantized Model

pip install huggingface-hub
huggingface-cli download \
  unsloth/DeepSeek-V4-GGUF \
  deepseek-v4-q4_k_m.gguf \
  --local-dir ./models/

Start Server

./build/bin/llama-server \
  -m ./models/deepseek-v4-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 35 \
  --ctx-size 32768

The --n-gpu-layers parameter is flexible — reduce it if VRAM is tight, and the CPU handles the remaining layers.
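
As a starting point for choosing --n-gpu-layers, you can estimate how many layers fit in VRAM. The sketch below assumes all layers are the same size and reserves some VRAM for the KV cache and runtime. The layer count of 60 and the 2GB reserve are placeholder values, not DeepSeek V4's actual configuration; llama.cpp prints the real layer count when it loads a model.

```python
def gpu_layers_that_fit(model_size_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in the available VRAM."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep headroom for KV cache etc.
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical example: a 30GB quantized file with 60 layers on a 24GB card
print(gpu_layers_that_fit(model_size_gb=30, n_layers=60, vram_gb=24))
```

Treat the result as an upper bound and step down from there if the server reports an out-of-memory error on startup.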

Frontend Integration: Open WebUI

Open WebUI (formerly Ollama WebUI) is one of the most popular frontend choices for self-hosted models.

Docker Deployment

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui-data:/app/backend/data \
  -e OLLAMA_BASE_URL=http://your-vps-ip:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access http://your-vps-ip:3000 for the chat interface.

Docker Compose (Full Stack)

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ./ollama-data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: always

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - ./webui-data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: always

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/nginx/ssl
    depends_on:
      - open-webui
    restart: always

API Usage Examples

Once deployed, you can call the model with standard HTTP requests:

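# Requires: pip install requests openai
import requests
from openai import OpenAI

# Ollama native API
def query_ollama(prompt, model="deepseek-v4:q4_k_m"):
    """Send one prompt to Ollama and return the generated text."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # local generation can be slow; avoid hanging forever
    )
    response.raise_for_status()  # surface HTTP errors before reading the body
    return response.json()["response"]

# vLLM (OpenAI-compatible API)
def query_vllm(prompt, model="deepseek-ai/DeepSeek-V4"):
    """Send one chat message to vLLM's OpenAI-compatible endpoint."""
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",  # only checked if vLLM was started with --api-key
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(query_ollama("Explain how MoE architecture works"))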

Performance Optimization Tips

1. Choose the Right Quantization

| Format | Use Case |
|---|---|
| Q4_K_M | General recommendation, minimal quality loss |
| Q5_K_M | Quality-first, needs more VRAM |
| Q2_K | Low-end machines (8-12GB VRAM, with the remaining layers offloaded to CPU) |
| IQ4_NL | Newer format, better efficiency |

2. Manage Context Length

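Limit --max-model-len (vLLM) or --ctx-size (llama.cpp) to 8192 when you don’t need long context. The KV cache grows linearly with context length, so this significantly reduces VRAM usage, and shorter prompts also mean lower first-token latency.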
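
To see why context length matters, here is the standard KV cache size formula for a generic transformer with grouped-query attention. Every dimension below (60 layers, 8 KV heads, head size 128, 2 bytes per value) is a placeholder rather than DeepSeek V4's real configuration, and earlier DeepSeek models use a compressed attention cache (MLA) that stores less per token. The linear scaling with context length is the point of the sketch.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence with standard (uncompressed) attention."""
    # Factor 2: one key vector and one value vector per head, per layer, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1024**3


# Placeholder model dimensions, for illustration only
for ctx in (8192, 32768):
    size = kv_cache_gib(ctx, n_layers=60, n_kv_heads=8, head_dim=128)
    print(f"{ctx} tokens -> {size:.2f} GiB")
```

With these placeholder numbers, quadrupling the context from 8192 to 32768 tokens quadruples the cache, and that memory is needed per concurrent sequence.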

3. Batch Processing (vLLM)

# Append to the vLLM api_server command from Option 2
--max-num-batched-tokens 8192 \
--max-num-seqs 8

4. System-Level Tuning

# Enable persistence mode (keeps the NVIDIA driver loaded between jobs)
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 250  # Cap power draw at 250 W (adjust to your GPU's supported range)

# Disable swap
sudo swapoff -a

# Enable huge pages
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

5. Pre-warm Models

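Ensure your deployment pre-loads the model on boot. Note that ollama pull only downloads the model and needs the server to be running already; loading it into memory is done by sending an empty prompt once the server is up:

# systemd service example (a sketch; adjust the wait time to your machine)
[Service]
# Keep loaded models in memory instead of unloading them after an idle period
Environment="OLLAMA_KEEP_ALIVE=-1"
ExecStart=/usr/bin/ollama serve
# After the server starts, wait briefly, then load the model with an empty prompt
ExecStartPost=/bin/sh -c 'sleep 5 && /usr/bin/ollama run deepseek-v4:q4_k_m ""'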

Cost Comparison: Self-Hosted vs API

Assuming 1M tokens processed daily (about 30M per month):

| Option | Cost Formula | Monthly | Annual |
|---|---|---|---|
| GPT-4o API | $20 + $10/1M × 30 | $320 | $3,840 |
| Claude 4 API | $20 + $15/1M × 30 | $470 | $5,640 |
| DeepSeek Official API | $0.50/1M × 30 | ~$15 | $180 |
| VPS Self-Hosted | $30 (VPS) + $0 inference | $30 | $360 |

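At these prices, a $30/mo VPS breaks even against GPT-4o at roughly 1M tokens/month (3M if you leave the $20 subscription out of the comparison) and against Claude 4 at under 1M (2M without the subscription). Against the DeepSeek official API it only breaks even at about 60M tokens/month, so at the assumed 30M tokens/month the official API is still cheaper on price alone. Beyond the break-even point, the more you use it, the more you save.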
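
The break-even volume follows directly from the table's prices: it is the monthly token count at which the API bill equals the fixed VPS bill. The sketch below just performs that division. The function name is made up for this example, and the prices are the ones assumed in the table, not current list prices.

```python
def break_even_million_tokens(vps_monthly_usd: float,
                              api_price_per_million_usd: float,
                              api_fixed_monthly_usd: float = 0.0) -> float:
    """Monthly volume (in millions of tokens) where API cost equals VPS cost."""
    remaining = max(vps_monthly_usd - api_fixed_monthly_usd, 0.0)
    return remaining / api_price_per_million_usd


# (name, price per 1M tokens, fixed monthly fee) taken from the table above
apis = [("GPT-4o", 10.0, 20.0), ("Claude 4", 15.0, 20.0), ("DeepSeek API", 0.5, 0.0)]

for name, price, fixed in apis:
    volume = break_even_million_tokens(30.0, price, fixed)
    print(f"{name}: break-even at about {volume:.1f}M tokens/month")
```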

Conclusion

Running DeepSeek V4 on your VPS offers a strong balance between cost, privacy, and control.

  • Individual users: Start with Ollama + Open WebUI — done in 30 minutes
  • Small teams: Go with vLLM + Open WebUI for production-grade concurrency
  • Budget machines: Use llama.cpp for flexible CPU + GPU hybrid inference

Key takeaways:

  1. INT4 / Q4_K_M quantization is the sweet spot for VPS deployment
  2. Ollama is the fastest way to get started
  3. vLLM provides production-grade throughput for multi-user setups
  4. llama.cpp works even without a GPU
  5. Pair with Open WebUI for a ChatGPT-like experience
  6. At sustained volume, self-hosting costs far less than GPT-4o or Claude-class API calls

Local deployment of open-source LLMs isn’t the future — it’s already here. And your VPS is ready to run it.
