Distributed Systems Lessons from Building VPN Infrastructure
What co-founding Vexonik taught me about high-availability distributed systems, anti-censorship protocols, and the engineering tradeoffs of running infrastructure that handles thousands of concurrent connections.
At Vexonik, I designed and deployed distributed VPN infrastructure with 99.9% uptime across multiple nodes. Here's what building production infrastructure for adversarial network conditions taught me about distributed systems.
The Problem Is Harder Than It Looks
Building a VPN for standard privacy use cases is relatively straightforward. Building one that works reliably against active censorship infrastructure — specifically China's Great Firewall (GFW) — is a different engineering problem entirely.
The GFW uses deep packet inspection (DPI) to:
- Identify VPN protocols by their traffic fingerprint
- Block or throttle connections that match known patterns
- Probe suspicious endpoints to confirm they're proxies
This means the system has to be both functionally correct and traffic-pattern-invisible.
The Technology Stack
We used Xray-core (a successor to V2Ray) as the protocol engine, written in Go. It supports XTLS, VLESS, Reality, and TUIC — protocols and transport layers designed to be progressively harder to fingerprint.
```json
// Example: VLESS over Reality configuration
{
  "inbounds": [{
    "protocol": "vless",
    "settings": {
      "clients": [{
        "id": "user-uuid",
        "flow": "xtls-rprx-vision"
      }],
      "decryption": "none"
    },
    "streamSettings": {
      "network": "tcp",
      "security": "reality",
      "realitySettings": {
        "dest": "www.microsoft.com:443",      // Real site whose TLS handshake is borrowed
        "serverNames": ["www.microsoft.com"], // Impersonate legitimate TLS
        "privateKey": "...",
        "shortIds": ["..."]
      }
    }
  }]
}
```

Reality is particularly clever: the TLS fingerprint is borrowed from a legitimate HTTPS connection to a major website (like Microsoft). The censorship infrastructure sees what appears to be normal HTTPS traffic to a CDN.
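The client side of the handshake matters just as much: the ClientHello has to look like a real browser's rather than Go's stock crypto/tls. Xray-core handles this internally by building hellos with the uTLS library (roughly what the browser `fingerprint` option in the stream settings controls). Here is a standalone sketch of the same idea using `github.com/refraction-networking/utls` directly; the function name and target host are illustrative, not part of our codebase:

```go
// Sketch: dial a TLS connection whose ClientHello mimics Chrome,
// using the library Xray-core builds on for fingerprint mimicry.
// import utls "github.com/refraction-networking/utls"
func dialWithBrowserFingerprint(addr, sni string) (*utls.UConn, error) {
    raw, err := net.Dial("tcp", addr)
    if err != nil {
        return nil, err
    }
    uconn := utls.UClient(raw, &utls.Config{ServerName: sni}, utls.HelloChrome_Auto)
    if err := uconn.Handshake(); err != nil {
        raw.Close()
        return nil, err
    }
    return uconn, nil
}
```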
Multi-Node Architecture
A single server is a single point of failure and a single IP to block. We deployed across multiple nodes with automatic failover:
```
Client App
    │
    ├─── Node 1 (Primary)  — Hong Kong
    ├─── Node 2 (Failover) — Japan
    ├─── Node 3 (Failover) — Singapore
    └─── Node 4 (Failover) — Germany

Health Check Service (Go)
    ├── Polls each node every 30s
    ├── Tests actual protocol connectivity (not just ping)
    └── Updates client config endpoint on failure
```
The health checker tests actual VPN connectivity, not just TCP reachability:
```go
func testNodeConnectivity(node Node) (bool, time.Duration) {
    start := time.Now()

    // Attempt a real connection through the VPN protocol
    client, err := createVlessClient(node)
    if err != nil {
        return false, 0
    }
    defer client.Close()

    // Make an HTTP request through the tunnel
    resp, err := client.Get("https://www.gstatic.com/generate_204")
    if err != nil {
        return false, 0
    }
    defer resp.Body.Close()

    return resp.StatusCode == 204, time.Since(start)
}
```

A node that accepts TCP connections but has a broken VPN protocol looks healthy to a simple ping check. Testing the actual protocol is the only reliable health signal.
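That check feeds a small control loop in the health service. A simplified sketch of its shape follows; `publishClientConfig` and the `Name` field on `Node` are placeholder names here, and the production loop also debounced flapping nodes before rewriting client configs:

```go
// Poll every node on an interval and republish the client-facing config
// with whatever set of nodes currently passes the protocol check.
func runHealthChecks(nodes []Node, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for range ticker.C {
        healthy := make([]Node, 0, len(nodes))
        for _, node := range nodes {
            ok, latency := testNodeConnectivity(node)
            if !ok {
                log.Printf("node %s failed protocol check", node.Name)
                continue
            }
            log.Printf("node %s healthy (%dms)", node.Name, latency.Milliseconds())
            healthy = append(healthy, node)
        }
        // Clients fetch their config from an HTTPS endpoint; serving only
        // healthy nodes there is what makes failover automatic.
        publishClientConfig(healthy)
    }
}
```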
Handling Thousands of Concurrent Connections
Go's goroutine model is perfectly suited for this workload. Each client connection is a goroutine. The Xray-core engine handles the multiplexing internally, but we still needed to monitor and limit resources:
```go
var (
    activeConnections = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "vpn_active_connections",
        Help: "Number of active VPN connections",
    })
    bytesTransferred = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "vpn_bytes_transferred_total",
        Help: "Total bytes transferred through the tunnel",
    }, []string{"direction", "user_tier"})
)

func init() {
    prometheus.MustRegister(activeConnections, bytesTransferred)
}

// Middleware that tracks connections
func withMetrics(handler ConnHandler) ConnHandler {
    return func(conn net.Conn) {
        activeConnections.Inc()
        defer activeConnections.Dec()

        tracked := &trackedConn{Conn: conn}
        handler(tracked)

        bytesTransferred.WithLabelValues("upstream", "free").Add(
            float64(tracked.bytesRead),
        )
    }
}
```

Prometheus + Grafana gave us real-time visibility into connection counts, latency, and bandwidth per node.
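The `trackedConn` referenced above is a thin wrapper that counts bytes as they pass through the connection. A minimal sketch, assuming the counters are only read after the handler returns so plain fields suffice:

```go
// trackedConn wraps a net.Conn and counts bytes in each direction.
// If the counters were read while the connection is live, these
// fields would need to be atomic.
type trackedConn struct {
    net.Conn
    bytesRead    int64
    bytesWritten int64
}

func (c *trackedConn) Read(p []byte) (int, error) {
    n, err := c.Conn.Read(p)
    c.bytesRead += int64(n)
    return n, err
}

func (c *trackedConn) Write(p []byte) (int, error) {
    n, err := c.Conn.Write(p)
    c.bytesWritten += int64(n)
    return n, err
}
```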
The Anti-Detection Problem
The GFW runs active probers — when it suspects an IP hosts a proxy, it sends various probes to confirm. A naive Xray setup responds to these probes in a way that confirms suspicion.
The solution is a fallback destination: if an incoming connection doesn't present a valid authentication credential (as our legitimate clients do), fall back to proxying it to a legitimate website:
```json
{
  "fallbacks": [{
    "dest": "80",
    "xver": 0
  }],
  "inboundTag": "fallback-nginx"
}
```

The fallback destination is a plain nginx vhost listening on that port on the same host:

```nginx
server {
    listen 80;
    server_name _;
    return 301 https://www.bing.com$request_uri;
}
```

A GFW probe gets redirected to Bing. A legitimate client gets through.
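Xray-core does this routing for us, but the idea is easy to sketch in plain Go. The following is a deliberately simplified illustration rather than how Xray actually implements fallbacks; `looksLikeOurClient` and `serveVPN` are stand-ins for the real VLESS authentication and tunnel handling:

```go
// Conceptual sketch of fallback routing. Anything that doesn't
// authenticate as one of our clients is spliced to the local nginx,
// so a prober just sees an ordinary web server.
func handleConn(conn net.Conn) {
    defer conn.Close()

    br := bufio.NewReader(conn)
    head, err := br.Peek(16) // inspect the first bytes without consuming them
    if err != nil {
        return
    }

    if looksLikeOurClient(head) { // stand-in for the real credential check
        serveVPN(br, conn)
        return
    }

    // Not one of ours: forward everything to the decoy web server.
    decoy, err := net.Dial("tcp", "127.0.0.1:80")
    if err != nil {
        return
    }
    defer decoy.Close()

    go io.Copy(decoy, br) // client -> nginx (buffered bytes included)
    io.Copy(conn, decoy)  // nginx -> client
}
```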
99.9% Uptime in Practice
99.9% uptime allows roughly 8.8 hours of downtime per year. Achieving this required:
- Automated node provisioning — Ansible playbooks that can spin up a new node in under 5 minutes
- Graceful failover — clients get a configuration endpoint; when a node is unhealthy, the endpoint serves updated configs pointing to healthy nodes
- Zero-downtime deployments — rolling restarts with connection draining (sketched after this list)
- Monitoring with pages — Grafana alert rules that notify instantly when a node drops below SLA
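Connection draining is the part that makes rolling restarts invisible: the node stops accepting new tunnels, the config endpoint shifts new clients to other nodes, and existing sessions get a grace period to finish. A minimal sketch using the same `ConnHandler` shape as earlier; the 30-second grace period is illustrative:

```go
// Rolling-restart shutdown: stop accepting new connections on SIGTERM,
// then give in-flight tunnels a grace period to drain before exiting.
func serveWithDraining(ln net.Listener, handle ConnHandler) {
    var wg sync.WaitGroup

    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM)
    go func() {
        <-stop
        ln.Close() // new connections now land on other nodes
    }()

    for {
        conn, err := ln.Accept()
        if err != nil {
            break // listener closed: begin draining
        }
        wg.Add(1)
        go func() {
            defer wg.Done()
            handle(conn)
        }()
    }

    // Wait for in-flight connections, but never block a deploy forever.
    done := make(chan struct{})
    go func() { wg.Wait(); close(done) }()
    select {
    case <-done:
    case <-time.After(30 * time.Second):
    }
}
```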
Building infrastructure that's both high-availability and adversarially resilient was the most technically demanding thing I've worked on. The constraints imposed by operating in hostile network conditions force engineering discipline that benefits any distributed system.