Build a Private AI Coding Copilot on VPS: Complete Continue.dev + Ollama Guide

Why Build Your Own AI Coding Assistant?

The AI coding assistant landscape has shifted dramatically since late 2025:

GitHub Copilot raised prices to $10/mo (individual) / $19/mo (business), and your code is sent to Microsoft-operated servers for processing
Cursor went partially closed-source at $20/mo, sparking controversy
More companies are banning employees from uploading code to third-party AI services

The solution: Run Ollama with open-source LLMs on your own VPS, paired with Continue.dev (open-source IDE extension) — a fully self-hosted AI coding experience.

Your code never leaves machines you control. Model inference runs on your own VPS, so no third-party AI service ever sees your code.

Architecture Overview

Your Machine (VS Code / JetBrains)
       │
       │ Continue.dev Extension
       │
       ▼
Your VPS (Ollama Service)
       │
       ├── deepseek-coder-v2 (code completion)
       ├── codestral (completion backup)
       ├── qwen2.5-coder:7b (lightweight option)
       └── llama3.1:8b (chat/explain)

Cost Comparison

Option	Monthly Cost	Privacy	Model Choice
GitHub Copilot	$10-19	❌ Code uploaded to Microsoft	Single model
Cursor Pro	$20	⚠️ Partially closed-source	Limited
This Setup	$4-8	✅ Fully private	Any model

Minimum VPS specs:

Starter: Hetzner CX22 (€3.99/mo, 2 vCPU / 4GB / 40GB) — runs 7B models
Recommended: Hetzner CAX21 (€5.99/mo, 4 vCPU / 8GB / 40GB ARM) — runs 14B models
Advanced: Netcup RS1000 (€5.50/mo, 4 vCPU / 8GB) or a GPU-equipped VPS

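💡 CPU inference on 7B models (e.g., qwen2.5-coder:7b) takes 2-5 seconds per completion, which is perfectly usable for daily development. On a 4GB VPS the ~4.5GB qwen2.5-coder:7b download is a tight fit, so enable swap or fall back to a smaller variant such as qwen2.5-coder:1.5b. For 14B models, 8GB+ RAM is recommended.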
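
As a back-of-the-envelope check on those RAM figures, the sketch below estimates how much memory a quantized model needs. The 4.5 bits per weight and the 1 GB allowance for context and runtime are assumptions chosen for illustration, not measured values; real usage depends on the quantization and context length.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.5,   # assumed: typical 4-bit quantization
                    overhead_gb: float = 1.0) -> float:  # assumed: context + runtime
    """Rough RAM estimate: weight storage plus a fixed allowance."""
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


if __name__ == "__main__":
    for label, size in [("7B", 7), ("14B", 14), ("16B", 16)]:
        print(f"{label}: ~{estimate_ram_gb(size)} GB")
```

By this estimate a 7B model sits slightly above 4GB, which is why swap is a sensible precaution on the starter tier.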

Step 1: Deploy Ollama on VPS

1. Install Ollama

# SSH into your VPS
ssh root@your-vps-ip

# One-command install
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version

2. Pull Code-Specialized Models

# DeepSeek Coder V2 (recommended, best code quality)
ollama pull deepseek-coder-v2:16b

# Qwen2.5 Coder 7B (lightweight, fast responses)
ollama pull qwen2.5-coder:7b

# Llama 3.1 8B (general chat assistance)
ollama pull llama3.1:8b

# List downloaded models
ollama list

Initial model pulls take a few minutes depending on VPS bandwidth. deepseek-coder-v2:16b is ~9GB, qwen2.5-coder:7b is ~4.5GB.
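
To confirm the pulls from a script rather than by eye, you can query Ollama's /api/tags endpoint, which lists local models. The sketch below is a minimal example, assuming it runs on the VPS itself against the default local address; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumed: script runs on the VPS itself


def summarize_models(tags: dict) -> list[str]:
    """Turn an /api/tags response into 'name: size' lines."""
    lines = []
    for model in tags.get("models", []):
        size_gb = model.get("size", 0) / 1e9  # size is reported in bytes
        lines.append(f"{model['name']}: {size_gb:.1f} GB")
    return lines


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print("\n".join(summarize_models(fetch_tags())))
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Could not reach Ollama: {exc}")
```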

3. Allow Remote Access to Ollama

Ollama listens on localhost by default. We need it to listen on all interfaces for your local IDE to connect.

# Edit systemd service config
systemctl edit ollama

# Add the following content:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

# Restart Ollama
systemctl daemon-reload
systemctl restart ollama

# Verify it's listening
ss -tlnp | grep 11434
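
The ss check only proves the port is open on the VPS. To test reachability from your own machine, a small script like the following can help. It is a sketch that relies on Ollama answering a plain GET on its root URL with a short status message; pass your own base URL as the first argument.

```python
import http.client
import sys
import urllib.error
import urllib.request


def is_ollama_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama" in body
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, malformed reply, or invalid URL
        return False


if __name__ == "__main__":
    # Example: python check.py http://your-vps-ip:11434
    url = sys.argv[1] if len(sys.argv) > 1 else "http://127.0.0.1:11434"
    print("reachable" if is_ollama_up(url) else "not reachable")
```

If this reports "not reachable" from your machine while ss shows the port open on the VPS, a firewall rule is the likely cause.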

4. Security Hardening (Important!)

Never expose Ollama directly to the public internet. Use ufw to restrict access:

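# Install ufw
apt install -y ufw

# Keep SSH reachable before enabling the firewall, or you will lock yourself out
# (assumes SSH runs on the default port 22)
ufw allow 22/tcp

# Only allow your home/office IP
ufw allow from your-local-ip to any port 11434 proto tcp
ufw enable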

# Or use WireGuard VPN (recommended)
# After connecting via VPN, let Ollama listen on the VPN interface only

Best practice: Deploy an API-key-authenticated reverse proxy in front of Ollama (covered below).

Step 2: Configure Continue.dev

1. Install the Continue Extension

VS Code: Search for “Continue” in the extension marketplace and install
JetBrains: Search for “Continue” in the plugin marketplace

2. Configure Ollama Connection

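After installation, click the Continue icon in the sidebar and open the config file at ~/.continue/config.json (newer Continue releases use ~/.continue/config.yaml instead; the example here uses the JSON format):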

{
  "models": [
    {
      "title": "DeepSeek Coder V2 (VPS)",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b",
      "apiBase": "http://your-vps-ip:11434",
      "completionOptions": {
        "temperature": 0.1,
        "topP": 0.9,
        "maxTokens": 4096
      }
    },
    {
      "title": "Qwen2.5 Coder 7B (VPS)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://your-vps-ip:11434",
      "completionOptions": {
        "temperature": 0.1,
        "topP": 0.9,
        "maxTokens": 2048
      }
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder (Tab Auto)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://your-vps-ip:11434"
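  },
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Add comments to code"
    }
  ],
  "customCommands": [
    {
      "name": "optimize",
      "prompt": "{{{ input }}}\n\nOptimize the selected code for performance and explain each change.",
      "description": "Optimize code performance"
    },
    {
      "name": "explain",
      "prompt": "{{{ input }}}\n\nExplain the logic of the selected code step by step.",
      "description": "Explain code logic"
    }
  ]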
}
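
Before relying on the IDE, it can be worth sending one request to the model directly. The sketch below posts to Ollama's /api/generate endpoint with sampling options similar to the config. It assumes Continue's topP and maxTokens correspond to Ollama's top_p and num_predict options; the function names are illustrative.

```python
import json
import sys
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str,
                           temperature: float = 0.1, top_p: float = 0.9,
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build a non-streaming POST for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": temperature,
            "top_p": top_p,
            "num_predict": max_tokens,  # Ollama's name for the token limit
        },
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    # Pass your own base URL, e.g. http://your-vps-ip:11434
    base = sys.argv[1] if len(sys.argv) > 1 else "http://127.0.0.1:11434"
    req = build_generate_request(base, "qwen2.5-coder:7b",
                                 "Write a Python function that reverses a string.")
    try:
        print(generate(req))
    except OSError as exc:  # covers connection errors and HTTP error statuses
        print(f"Request failed: {exc}")
```

If this returns text, any remaining problem is in the Continue configuration rather than on the VPS.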

3. Tab Autocomplete Setup

Continue’s inline tab autocomplete is one of its best features. In VS Code:

Open command palette (Cmd+Shift+P / Ctrl+Shift+P)
Type “Continue: Toggle Tab Autocomplete”
Enable it — suggestions will appear as inline gray text
Press Tab to accept suggestions

For tabAutocompleteModel, choose the fastest model. qwen2.5-coder:7b returns completions in about 2-4 seconds on a 4GB VPS — smooth enough for daily use.
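
To measure completion latency on your own VPS instead of trusting the figures above, you can time a fill-in-the-middle request. This sketch uses the suffix field of Ollama's /api/generate endpoint and assumes the chosen model supports it. It only approximates an autocomplete request; the exact request Continue sends may differ.

```python
import json
import sys
import time
import urllib.request


def build_fim_payload(model: str, prefix: str, suffix: str, max_tokens: int = 64) -> dict:
    """Fill-in-the-middle request: the model writes the text between prefix and suffix."""
    return {
        "model": model,
        "prompt": prefix,
        "suffix": suffix,
        "stream": False,
        "options": {"temperature": 0.1, "num_predict": max_tokens},
    }


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def complete(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to /api/generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    base = sys.argv[1] if len(sys.argv) > 1 else "http://127.0.0.1:11434"
    payload = build_fim_payload("qwen2.5-coder:7b",
                                "def add(a, b):\n    return ", "\n")
    try:
        text, seconds = timed(complete, base, payload)
        print(f"{seconds:.1f}s: {text!r}")
    except OSError as exc:
        print(f"Request failed: {exc}")
```

Run it a few times; the first call includes model load time, so later calls are more representative.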

Step 3 (Advanced): API Key Auth + Reverse Proxy

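Exposing port 11434 directly isn’t secure enough. Use Nginx as a reverse proxy with API key authentication. The config assumes a TLS certificate for your domain already exists at the Let's Encrypt paths shown (for example, issued with certbot):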

# Install Nginx
apt install -y nginx

# Create configuration
cat > /etc/nginx/sites-available/ollama-proxy << 'EOF'
server {
    listen 443 ssl;
    server_name ollama.your-domain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.your-domain.com/privkey.pem;

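    location / {
        # API key validation
        if ($http_x_api_key != "your-secret-key") {
            return 401;
        }

        proxy_pass http://127.0.0.1:11434;
        # Present a local Host header so Ollama accepts the request when bound to 127.0.0.1
        proxy_set_header Host localhost:11434;
        proxy_set_header X-Real-IP $remote_addr;
        # Stream tokens as they are generated and allow for slow CPU inference
        proxy_buffering off;
        proxy_read_timeout 300s;
    }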
}
EOF

# Enable config
ln -s /etc/nginx/sites-available/ollama-proxy /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

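Then point apiBase in your Continue config at the HTTPS URL and send the X-Api-Key header with each request (in config.json this goes under requestOptions.headers on each model entry). Once the proxy works, set OLLAMA_HOST back to 127.0.0.1:11434, allow port 443 in ufw, and remove the rule for 11434 so Ollama is reachable only through Nginx.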
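
A quick way to verify the proxy is to call it with and without the key and compare status codes. The sketch below does this with the standard library only. The script name check_proxy.py is hypothetical; the domain and key are the placeholders used in this guide.

```python
import sys
import urllib.error
import urllib.request


def build_authenticated_request(base_url: str, api_key: str) -> urllib.request.Request:
    """GET /api/tags through the proxy, carrying the X-Api-Key header."""
    return urllib.request.Request(f"{base_url}/api/tags",
                                  headers={"X-Api-Key": api_key})


def proxy_status(base_url: str, api_key: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the proxy answers with."""
    try:
        with urllib.request.urlopen(build_authenticated_request(base_url, api_key),
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 401 when the key is missing or wrong


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: check_proxy.py https://ollama.your-domain.com your-secret-key")
    else:
        try:
            print(proxy_status(sys.argv[1], sys.argv[2]))
        except OSError as exc:
            print(f"Could not reach the proxy: {exc}")
```

A correct key should yield 200 and a wrong one 401, matching the if block in the Nginx config.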

Model Selection Guide

Model	Parameters	RAM Needed	Code Quality	Speed (4-core CPU)
`qwen2.5-coder:1.5b`	1.5B	< 2GB	⭐⭐	⚡ Very Fast
`qwen2.5-coder:7b`	7B	~4GB	⭐⭐⭐⭐	⚡ Fast
`deepseek-coder-v2:16b`	16B	~9GB	⭐⭐⭐⭐⭐	🐢 Slower
`codestral:22b`	22B	~12GB	⭐⭐⭐⭐⭐	🐢 Slow
`starcoder2:15b`	15B	~8GB	⭐⭐⭐⭐	⏳ Medium

Beginner recommendation: Start with qwen2.5-coder:7b. It’s the sweet spot for 4GB VPS, offering the best balance of speed and quality.

Advanced Optimization Tips

GPU Acceleration (if you have a GPU VPS)

If your VPS has an NVIDIA GPU, Ollama automatically uses CUDA:

# Install NVIDIA drivers on GPU VPS
apt install -y nvidia-driver-545 nvidia-utils-545

# Verify GPU
nvidia-smi

# Ollama auto-detects GPU — no extra config needed
# GPU inference on 16B models is 5-10x faster than CPU

Multi-VPS Load Balancing

With multiple VPS, use Nginx for load balancing:

upstream ollama_backend {
    server vps1:11434 weight=3;
    server vps2:11434 weight=1;
}
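
To see what weight=3 and weight=1 mean in practice, the sketch below simulates a weighted round-robin scheduler. It uses the "smooth" variant of the algorithm as an illustration of the resulting 3:1 ratio; it is not a copy of Nginx's implementation, and the exact ordering Nginx produces may differ.

```python
from collections import Counter


def smooth_weighted_round_robin(weights: dict[str, int], picks: int) -> list[str]:
    """Return the order in which servers are chosen over `picks` requests."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        # Every server earns its weight each round
        for name, weight in weights.items():
            current[name] += weight
        # The server with the highest running score wins (ties go to the first listed)
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order


if __name__ == "__main__":
    sequence = smooth_weighted_round_robin({"vps1": 3, "vps2": 1}, 8)
    print(sequence)
    print(Counter(sequence))
```

Over any four consecutive requests, vps1 receives three and vps2 one, so give the higher weight to the machine with more CPU and RAM.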

RAM Disk for Faster Model Loading

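# Create 16GB RAM disk (requires enough RAM; contents are lost on reboot)
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=16G tmpfs /mnt/ramdisk

# Copy model blobs to the RAM disk, then symlink the original path to the copy
systemctl stop ollama
cp -a /usr/share/ollama/.ollama/models/blobs /mnt/ramdisk/
mv /usr/share/ollama/.ollama/models/blobs /usr/share/ollama/.ollama/models/blobs.disk
ln -s /mnt/ramdisk/blobs /usr/share/ollama/.ollama/models/blobs
systemctl start ollama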
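
Before moving models into RAM, it helps to check that they actually fit. The sketch below totals the size of the blob directory and compares it with the RAM disk size. The 10% headroom is an assumption for illustration, and the helper names are made up for this example.

```python
from pathlib import Path


def directory_size_bytes(path) -> int:
    """Total size of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def fits_in_ramdisk(blob_dir, ramdisk_gb: float, headroom: float = 0.9) -> bool:
    """True if the blobs fit in the RAM disk with some space to spare."""
    capacity = ramdisk_gb * 1024**3 * headroom  # assumed: keep 10% free
    return directory_size_bytes(blob_dir) <= capacity


if __name__ == "__main__":
    blobs = "/usr/share/ollama/.ollama/models/blobs"  # path used in this guide
    if Path(blobs).is_dir():
        print("fits" if fits_in_ramdisk(blobs, 16) else "too large for a 16GB RAM disk")
    else:
        print(f"{blobs} not found")
```

Keep in mind that the RAM disk and the loaded model both consume memory, so the VPS needs room for both.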

Real-World Comparison

Testing with a Python Fibonacci generator:

DeepSeek Coder V2 (16B):

def fibonacci(n: int) -> list[int]:
    """Generate first n Fibonacci numbers."""
    if n <= 0:
        return []
    fib = [0, 1]
    for _ in range(2, n):
        fib.append(fib[-1] + fib[-2])
    return fib[:n]

Quality: ⭐⭐⭐⭐⭐ Proper type hints, clean code
Speed: 4-6 seconds (4-core VPS)

Qwen2.5 Coder (7B):

Quality: ⭐⭐⭐⭐ Mostly correct, occasional minor issues
Speed: 2-3 seconds (4-core VPS)

For daily development, 7B models are surprisingly capable. 16B models shine for complex algorithms and architecture decisions.

FAQ

Q: Will network latency be noticeable?

A: Depends on VPS location. If your VPS is geographically close (e.g., you’re in Europe with a Hetzner VPS), latency is ~10-30ms. With streaming output (default in Continue), it’s barely noticeable while typing.

Q: Can I use multiple IDEs simultaneously?

A: Yes. Ollama handles concurrent requests, although on a small CPU-only VPS simultaneous completions will queue or slow each other down. Configure each IDE instance with the same apiBase pointing to your VPS.

Q: Does the VPS store my code?

A: Ollama processes inference in RAM only — input/output data is discarded immediately after processing. For maximum security, enable full-disk encryption on your VPS.

Q: How do I switch models?

A: Just ollama pull the new model on the VPS, then update the model field in Continue config. No services need restarting.

Summary

Building a private AI coding assistant with VPS + Ollama + Continue.dev gives you:

✅ Full data privacy — code never leaves your network
✅ Free AI completions — only pay for the VPS
✅ Model freedom — switch between any open-source models
✅ Team sharing — one VPS serves your entire team
✅ Auditable — fully open-source, no black boxes

Compared to GitHub Copilot at $10-19/mo or Cursor at $20/mo, a €3.99/mo VPS delivers comparable or better results — with the peace of mind that your code stays yours.

Next steps: SSH into your VPS right now, run curl -fsSL https://ollama.com/install.sh | sh, then ollama pull qwen2.5-coder:7b. In ten minutes, you’ll have your very own private AI coding copilot.