AI-Powered VPS Backup & Disaster Recovery: Never Lose Your Data Again

Step-by-step guide to building an AI-driven backup and disaster recovery system with LLM Agents — intelligent backup strategies, automated DR drills, integrity verification, and anomaly detection for your VPS.

Data loss is every VPS owner’s nightmare. One accidental rm -rf, one ransomware attack, one disk failure — a few seconds of misfortune can destroy months of work. Traditional backup solutions are either too complex to configure or too simple to be reliable.

The good news: LLM Agents can turn backup and disaster recovery (DR) into an automated, verifiable, and intelligent system. This guide walks you through building an AI-driven backup system that handles:

  • Automated backup strategy planning (full + incremental)
  • Smart backup integrity verification
  • Scheduled disaster recovery drills
  • Anomaly detection and proactive defense

Why You Need an AI Backup Agent

Traditional backup approaches all have tradeoffs:

Approach                | Pros               | Cons
Manual rsync / scp      | Simple, direct     | Easy to forget, no verification, unreliable
cron + shell scripts    | Automated schedule | Silent failures, nobody tests recovery
Commercial backup tools | Full-featured      | Expensive, over-engineered, vendor lock-in

An AI Agent aims to combine the strengths of both ends: the lightweight automation of scripts and the reliability checks of commercial tools. Crucially, an Agent can use context: when a backup fails, it doesn’t just retry blindly but adjusts its strategy based on the failure reason.
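To make "adjusts its strategy based on the failure reason" concrete, here is a minimal sketch of a failure classifier. The action names are hypothetical labels for this example. The exit codes follow restic's scripting documentation for recent releases (10 to 12 arrived in 0.17), so treat the table as an assumption and check it against your installed version.

```python
# Assumed restic exit codes; verify against your restic version.
EXIT_ACTIONS = {
    3: "report_unreadable_files",  # snapshot created, some source files unreadable
    10: "init_repository",         # repository does not exist
    11: "wait_for_lock",           # repository is locked by another process
    12: "alert_bad_password",      # wrong repository password
}

# Fallback hints matched against stderr when the exit code is generic.
STDERR_HINTS = [
    ("no space left on device", "prune_then_retry"),
    ("connection refused", "switch_target"),
    ("timeout", "switch_target"),
]


def choose_recovery_action(exit_code: int, stderr: str) -> str:
    """Map a failed restic run to a (hypothetical) follow-up action."""
    if exit_code == 0:
        return "none"
    if exit_code in EXIT_ACTIONS:
        return EXIT_ACTIONS[exit_code]
    lowered = stderr.lower()
    for needle, action in STDERR_HINTS:
        if needle in lowered:
            return action
    # Nothing recognised: hand the problem to a human.
    return "alert_operator"


print(choose_recovery_action(1, "write failed: no space left on device"))
```

A rule table like this can also act as a safety net: the LLM proposes an action, and the agent only accepts it if it is one of the known labels.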

Architecture Overview

Our backup Agent is built on four core layers:

┌─────────────────────────────────────┐
│          Agent Orchestrator          │  ← LLM-powered decision core
├─────────────────────────────────────┤
│  Strategy Engine │ Exec Engine │ Verify │
├─────────────────────────────────────┤
│  Restic  │  Borg  │  rsync  │  S3   │  ← Underlying tools
├─────────────────────────────────────┤
│  Local Storage │ Remote VPS │ Object │  ← Storage backends
└─────────────────────────────────────┘

The Agent Orchestrator is the brain. Based on predefined policies and real-time state, it decides:

  • When to run backups (full vs incremental)
  • Where to send them (local, remote, S3)
  • How to verify after completion
  • What to do when anomalies arise

Step 1: Choose Your Backup Tool

The AI Agent doesn’t reinvent the wheel — it orchestrates proven backup tools. Here are three options:

Option A: Restic

Features: Encrypted, deduplicated, multi-backend (local/S3/SFTP)
Install: apt install restic

Option B: BorgBackup

Features: Superior compression, high dedup efficiency
Install: apt install borgbackup

Option C: rsync

Features: Simple, universal, no extra services to run
Install: Preinstalled on most Linux distributions (otherwise: apt install rsync)

This guide uses Restic — it offers the best balance of encryption, deduplication, and multi-backend support.

Step 2: Build Your AI Backup Agent

Let’s create the core backup script and wrap it with LLM-driven decision logic.

2.1 Backup Script Foundation

#!/usr/bin/env python3
"""backup_agent.py — AI-driven backup agent"""

import os
import json
import subprocess
import datetime
from typing import Dict, List

# ── Configuration ──────────────────────────
REPOSITORIES = {
    "local": "/mnt/backup/vps",
    "s3": "s3:https://s3.amazonaws.com/my-bucket/vps-backup",
}
PATHS = ["/var/www", "/etc", "/opt/docker", "/home"]
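# Read the repository password from the environment and fail fast if it is
# missing; never ship a hard-coded fallback password.
RESTIC_PASSWORD = os.environ["RESTIC_PASSWORD"]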


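def run_restic(args: List[str]) -> str:
    """Execute a restic command against the local repository and return stdout.

    Raises RuntimeError if restic exits with a non-zero status.
    """
    cmd = ["restic", "-r", REPOSITORIES["local"]] + args
    env = {**os.environ, "RESTIC_PASSWORD": RESTIC_PASSWORD}
    result = subprocess.run(cmd, capture_output=True, text=True, env=env)
    if result.returncode != 0:
        raise RuntimeError(
            f"restic {args[0]} failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout


def get_last_backup_time(snapshots: List[Dict]) -> str | None:
    """Return the timestamp of the newest snapshot, or None if there are none."""
    # restic reports "time" as an RFC 3339 string; comparing the strings is
    # only a rough ordering if snapshots were taken in different time zones.
    return max((s["time"] for s in snapshots), default=None)


def check_backup_health() -> Dict:
    """Check the health status of existing backups."""
    snapshots = json.loads(run_restic(["snapshots", "--json"]) or "[]") or []
    stats = json.loads(run_restic(["stats", "--json"]) or "{}") or {}
    return {
        "snapshot_count": len(snapshots),
        "stats": stats,
        "last_backup": get_last_backup_time(snapshots),
    }


def create_backup(paths: List[str], tag: str = "auto") -> str:
    """Create a new backup snapshot and return restic's output."""
    return run_restic(["backup", "--tag", tag] + paths)


def verify_backup() -> bool:
    """Verify repository integrity, reading a 10% sample of the stored data."""
    # `restic check` validates the whole repository, not a single snapshot,
    # so this function takes no snapshot ID.
    try:
        run_restic(["check", "--read-data-subset", "10%"])
    except RuntimeError:
        return False
    return True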

2.2 AI Agent Decision Layer

The real intelligence comes from LLM Agent decision-making. We wrap the backup logic with an Agent framework that supports tool calling:

# agent_decision.py: LLM-driven backup decisions

import datetime
import json
import shutil
from typing import Callable, Dict

from backup_agent import REPOSITORIES, check_backup_health

agent_instruction = """
You are a VPS backup management Agent. Your responsibilities:
1. Decide whether to execute backups based on health reports
2. Choose the optimal strategy (full/incremental/differential)
3. Verify backup integrity
4. Execute recovery plans when anomalies occur

Available tools:
- check_health: Check backup status
- create_backup: Execute backup
- verify_backup: Verify backup integrity
- list_snapshots: List all snapshots
- prune_snapshots: Clean up expired snapshots

Decision rules:
- If last backup > 24h ago → run incremental backup
- If last full backup > 7 days ago → run full backup
- If verification finds errors → alert and switch backup target
- If disk remaining < 10% → trigger prune and alert
"""


def get_disk_usage(path: str = REPOSITORIES["local"]) -> Dict:
    """Return total/used/free bytes and the free percentage for the backup volume."""
    usage = shutil.disk_usage(path)
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "free_percent": round(usage.free / usage.total * 100, 1),
    }


def agent_decision_cycle(llm_call: Callable[[str], str]) -> str:
    """Run one decision cycle and return the LLM's raw reply.

    `llm_call` is whatever function sends a prompt to your LLM provider and
    returns its text response; wiring it up is left to the caller.
    """
    # Build context for the LLM
    context = {
        "health_report": check_backup_health(),
        "disk_usage": get_disk_usage(),
        "current_time": datetime.datetime.now().isoformat(),
    }
    prompt = agent_instruction + "\nCurrent state:\n" + json.dumps(context)
    return llm_call(prompt)

For a full implementation, you’ll need to connect to an LLM API (OpenAI / DeepSeek / Claude) or use a tool-calling Agent framework like Hermes Agent.
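Whatever provider you connect, it helps to keep the model's authority bounded. The sketch below assumes the LLM is asked to reply with a small JSON object of the form {"tool": ..., "args": {...}}. That reply format and the demo tools are illustrative conventions for this example, not something the frameworks above mandate.

```python
import json
from typing import Any, Callable, Dict


def execute_decision(reply: str, tools: Dict[str, Callable[..., Any]]) -> Any:
    """Parse an LLM reply of the assumed form {"tool": name, "args": {...}}
    and run the matching tool. Malformed replies and tools that are not in
    the allowlist are rejected instead of being executed."""
    try:
        decision = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM reply is not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise ValueError("LLM reply must be a JSON object")
    name = decision.get("tool")
    if name not in tools:
        raise ValueError(f"tool {name!r} is not allowlisted")
    args = decision.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    return tools[name](**args)


# Stand-in tools for illustration; a real agent would register
# create_backup, verify_backup, and so on here.
demo_tools = {
    "create_backup": lambda tag="auto": f"backup started with tag {tag}",
    "check_health": lambda: "ok",
}

result = execute_decision(
    '{"tool": "create_backup", "args": {"tag": "daily"}}', demo_tools
)
print(result)
```

Because the dispatcher only calls functions from the table it is given, a confused or manipulated model cannot reach anything beyond the tools listed in the agent instruction.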

Step 3: Automated Backup Strategy

A reliable backup strategy follows the 3-2-1 rule:

  • 3 copies (1 original + 2 backups)
  • 2 different media types (local + remote)
  • 1 off-site copy (S3 / another VPS)
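The rule is simple enough to check mechanically. As a minimal sketch, the agent could describe each copy with a small record and validate the set before trusting it. The `Copy` type and the medium labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Copy:
    name: str
    medium: str    # e.g. "local-disk", "backup-disk", "object-storage"
    offsite: bool  # True if stored outside the VPS's data centre


def satisfies_3_2_1(copies: List[Copy]) -> bool:
    """Check the 3-2-1 rule: 3 copies, 2 media types, 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


plan = [
    Copy("live data", "local-disk", offsite=False),
    Copy("local restic repo", "backup-disk", offsite=False),
    Copy("S3 restic repo", "object-storage", offsite=True),
]
print(satisfies_3_2_1(plan))
```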

Our AI Agent operates on this schedule:

Frequency          | Type        | Target              | Retention
Daily at 03:00     | Incremental | Local               | 30 days
Sunday at 03:00    | Full        | Local + S3          | 12 weeks
1st of month 03:00 | Full        | Local + S3 + Remote | 12 months

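The Agent dynamically adjusts this plan. For example, if it detects an unusually large volume of file changes, it triggers a full backup instead of the scheduled incremental. Note that Restic itself has no separate full and incremental modes: every run uploads only data the repository does not already hold, yet each snapshot can be restored on its own. With Restic, "full" is best read as a run that re-reads all source files (restic backup --force) and is also sent to the additional targets.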
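The schedule table can be turned into a small lookup that the agent consults before each run. This is a sketch under assumptions: the function names, tags, and target labels are made up for the example, the "type" field is only a label for the agent's own bookkeeping, and the single `restic forget` policy is an approximation of the per-tier retention column.

```python
import datetime
from typing import Dict, List


def plan_for(day: datetime.date) -> Dict:
    """Map a calendar day to the backup tier from the schedule table."""
    if day.day == 1:
        return {"type": "full", "targets": ["local", "s3", "remote"], "tag": "monthly"}
    if day.weekday() == 6:  # Monday is 0, so 6 is Sunday
        return {"type": "full", "targets": ["local", "s3"], "tag": "weekly"}
    return {"type": "incremental", "targets": ["local"], "tag": "daily"}


def forget_args() -> List[str]:
    """restic arguments approximating the retention column:
    30 daily, 12 weekly, and 12 monthly snapshots are kept."""
    return [
        "forget",
        "--keep-daily", "30",
        "--keep-weekly", "12",
        "--keep-monthly", "12",
        "--prune",
    ]


print(plan_for(datetime.date(2024, 9, 10)))
print(" ".join(["restic"] + forget_args()))
```

When the first of the month falls on a Sunday, the monthly tier wins because it is checked first, which matches the table's widest target set.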

Step 4: Automated Disaster Recovery

Backups only matter if you can restore from them. Our AI Agent runs automated, unattended DR drills:

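import shutil  # add alongside the other imports in backup_agent.py


def disaster_recovery_drill():
    """
    Automated DR drill process:
    1. Select latest backup
    2. Restore critical files to a temp directory
    3. Verify the restored files exist and are non-empty
    4. Clean up the temp directory
    5. Generate recovery report
    """
    restore_test_dir = "/tmp/dr-drill-" + datetime.datetime.now().strftime("%Y%m%d")
    os.makedirs(restore_test_dir, exist_ok=True)

    critical_paths = [
        "/etc/nginx/nginx.conf",
        "/etc/ssh/sshd_config",
        "/var/www/html/index.html",
    ]

    # 1. Restore only the critical files from the latest snapshot.
    #    (`restic mount` blocks in the foreground and needs FUSE, so a
    #    targeted restore is simpler for an unattended drill.)
    restore_args = ["restore", "latest", "--target", restore_test_dir]
    for path in critical_paths:
        restore_args += ["--include", path]
    try:
        run_restic(restore_args)
    except RuntimeError as exc:
        print(f"❌ restore failed: {exc}")

    # 2. Verify critical files (restic recreates the original path under the target)
    ok_count = 0
    for path in critical_paths:
        local_path = restore_test_dir + path
        if os.path.isfile(local_path) and os.path.getsize(local_path) > 0:
            ok_count += 1
            print(f"✅ {path} restored")
        else:
            print(f"❌ {path} missing or empty!")

    # 3. Clean up
    shutil.rmtree(restore_test_dir, ignore_errors=True)

    # 4. Generate report
    total_count = len(critical_paths)
    all_ok = ok_count == total_count
    report = f"""
Disaster Recovery Drill Report
Date: {datetime.datetime.now()}
Status: {'✅ PASS' if all_ok else '❌ FAIL'}
Critical files: {ok_count}/{total_count}
"""
    return report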
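Checking that a file exists is a weak integrity test. A stronger drill could compare checksums of restored files against the live copies. The sketch below uses only the standard library; the function names and status labels are assumptions for this example, and paths are given relative to the two roots.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restored(
    live_root: Path, restored_root: Path, relative_paths: Iterable[str]
) -> Dict[str, str]:
    """Return a status per relative path:
    'match', 'differs', 'missing' (not restored), or 'no_live_copy'."""
    results = {}
    for rel in relative_paths:
        live, restored = live_root / rel, restored_root / rel
        if not restored.is_file():
            results[rel] = "missing"
        elif not live.is_file():
            results[rel] = "no_live_copy"
        elif sha256_of(live) == sha256_of(restored):
            results[rel] = "match"
        else:
            results[rel] = "differs"
    return results
```

A "differs" result is not automatically a failure, since the live file may have changed after the snapshot was taken. The agent can weigh it against the snapshot time before raising an alert.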

Quick Recovery Guide

When real disaster strikes, the Agent follows this process:

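1. Assess damage → 2. Select recovery point → 3. Provision new VPS → 4. Restore data → 5. Verify services

# These commands assume RESTIC_REPOSITORY and RESTIC_PASSWORD are set
# (or pass -r <repository> explicitly, as in the S3 example).

# Quick full system restore
restic restore latest --target /mnt/restore/

# Restore specific directories
restic restore latest --target /mnt/restore/ \
  --include /etc \
  --include /var/www

# Remote restore from S3 (needs AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
# Restoring to / overwrites live files, so run it only on a freshly provisioned VPS.
restic -r s3:https://s3.amazonaws.com/my-bucket/vps-backup \
  restore latest --target /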

Step 5: Smart Monitoring & Proactive Defense

The backup Agent does more than scheduled backups — it prevents data disasters:

Anomaly Detection

def anomaly_detection():
    """
    Detect abnormal behavior that could lead to data loss:
    - Mass file deletions
    - Unusual file change patterns
    - Encryption activity (sudden file extension changes)
    """
    changes = get_recent_file_changes(hours=24)
    
    alerts = []
    if changes.get("deleted_count", 0) > 100:
        alerts.append("⚠️ Mass deletions detected — immediate full backup recommended")
    
    if changes.get("encrypted_pattern"):
        alerts.append("🚨 Possible ransomware activity! Initiating emergency backup")
    
    return alerts
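The function above relies on a change summary with `deleted_count` and `encrypted_pattern` keys. One way to produce such a summary, sketched here under assumptions, is to compare two file listings (for example the path sets of the two most recent snapshots). The helper name and the suffix list are hypothetical and would need tuning for a real system.

```python
from pathlib import PurePosixPath
from typing import Dict, Set

# Hypothetical examples of suffixes that ransomware appends to file names.
SUSPICIOUS_SUFFIXES = {".locked", ".encrypted", ".crypt"}


def summarize_changes(previous: Set[str], current: Set[str]) -> Dict:
    """Compare two sets of file paths and summarise what changed."""
    deleted = previous - current
    added = current - previous

    # Flag the pattern "report.txt vanished, report.txt.locked appeared".
    encrypted = False
    for path in added:
        suffix = PurePosixPath(path).suffix
        if suffix in SUSPICIOUS_SUFFIXES and path[: -len(suffix)] in deleted:
            encrypted = True
            break

    return {
        "deleted_count": len(deleted),
        "added_count": len(added),
        "encrypted_pattern": encrypted,
    }


before = {"/var/www/index.html", "/var/www/report.txt"}
after = {"/var/www/index.html", "/var/www/report.txt.locked"}
print(summarize_changes(before, after))
```

A name-based heuristic like this is easy to evade, so it is best treated as one signal among several rather than a ransomware detector in its own right.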

Automated Defense Response

When the Agent detects anomalies, it immediately triggers protective measures:

  1. Instant snapshot: Create an emergency snapshot of current state
  2. Data locking: Make recent backups immutable, for example with an append-only repository server or S3 Object Lock
  3. Target failover: Switch to backup destination if primary is unreachable
  4. Alert dispatch: Notify via Telegram/Email/Slack
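For the alert step, the Telegram Bot API's sendMessage method is enough for a first version. The sketch below separates building the request from sending it, so the payload can be inspected without network access. The function names are assumptions for this example, and the token and chat ID would come from environment variables such as TELEGRAM_BOT_TOKEN.

```python
import json
import urllib.request


def build_telegram_alert(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a sendMessage request for the Telegram Bot API."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Send the request; return False instead of raising on network errors,
    so a failed alert never aborts the emergency backup itself."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


alert = build_telegram_alert("123:ABC", "42", "Mass deletions detected")
print(alert.get_method(), alert.full_url)
```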

Real-World Deployment

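Here’s a docker-compose sketch that runs the backup Agent as a long-running service. Two caveats: the stock python:3.11-slim image does not include restic, so in practice you would build a small derived image that installs it, and PATHS in the script must point at the container-side mount points (/data/www, /data/etc) rather than the host paths.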

# docker-compose.yml
version: '3.8'

services:
  backup-agent:
    image: python:3.11-slim
    volumes:
      - ./backup_agent.py:/app/agent.py
      - /var/www:/data/www:ro
      - /etc:/data/etc:ro
      - /mnt/backup:/backup
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - RESTIC_PASSWORD=${RESTIC_PASSWORD}
      - LLM_API_KEY=${LLM_API_KEY}
      - TELEGRAM_BOT_TOKEN=${TELEGRAM_BOT_TOKEN}
      - BACKUP_SCHEDULE=0 3 * * *
    command: python /app/agent.py
    restart: unless-stopped
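The compose file passes BACKUP_SCHEDULE as a cron expression, but nothing interprets it unless the agent does. As a minimal sketch, the helper below handles only the daily form "M H * * *" used above and computes the next run time. The function name is an assumption, and a full cron parser (or a scheduling library) would be needed for anything more elaborate.

```python
import datetime


def next_daily_run(schedule: str, now: datetime.datetime) -> datetime.datetime:
    """Return the next run time for a cron string of the form 'M H * * *'.

    Only fixed minute/hour daily schedules are supported in this sketch.
    """
    fields = schedule.split()
    if len(fields) != 5 or fields[2:] != ["*", "*", "*"]:
        raise ValueError(f"unsupported schedule: {schedule!r}")
    minute, hour = int(fields[0]), int(fields[1])
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError(f"minute/hour out of range: {schedule!r}")

    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += datetime.timedelta(days=1)
    return candidate


print(next_daily_run("0 3 * * *", datetime.datetime(2024, 5, 1, 2, 0)))
```

Inside agent.py, the main loop could read os.environ["BACKUP_SCHEDULE"], sleep until the returned time, and then run one decision cycle.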

Backup & Recovery Best Practices

Hard-won lessons from production deployments:

  1. Test before you trust: Any backup strategy must pass full restore tests. Our Agent runs them monthly.
  2. Encrypt everything: Restic encrypts every repository by default (AES-256 in counter mode, authenticated with Poly1305-AES), so your data stays protected even if the storage is compromised.
  3. Geographic redundancy: At least two geographically separate backup targets. We use local + S3 + remote VPS.
  4. Immutable backups: Serve the repository through restic’s rest-server with --append-only (or use S3 Object Lock) so that a compromised host cannot tamper with or delete existing backups.
  5. Back up metadata too: Include database exports, Docker volumes, and container configurations alongside files.
  6. Monitor everything: Agent decision logs and backup status go to your monitoring system via webhooks.

Conclusion

Data disasters aren’t a matter of if, but when. Using an AI Agent to drive your backup and disaster recovery means you’re no longer relying on manual operations or fragile scripts — you have a 24/7 backup engineer that thinks, decides, and acts autonomously.

Backups shouldn’t be a burden — they should give you peace of mind. Let the AI Agent carry that weight so you can sleep soundly.


Tools used: Restic backup tool, Python 3.11, Hermes Agent / OpenAI API. Full example implementations available on GitHub.
