
Load Balancing

Definition

Load balancing distributes AI inference requests across multiple model servers to maximize throughput, minimize latency, and ensure high availability of AI applications.

Why It Matters

Single-server deployments can’t handle production traffic for most AI applications. A single GPU server might handle 10-50 concurrent LLM requests before latency degrades unacceptably. Load balancing lets you scale horizontally, adding more servers to handle more traffic while appearing as a single endpoint to clients.

Beyond scaling, load balancing provides fault tolerance. If one server fails or becomes unresponsive, the load balancer routes traffic to healthy servers automatically. This resilience is essential for production AI systems where downtime has business impact.

For AI workloads specifically, load balancing interacts with model loading and GPU memory. Unlike stateless web servers, AI inference servers often keep models warm in GPU memory. Intelligent load balancing considers not just current load but whether servers have the required models loaded.

Implementation Basics

Load balancing strategies:

  • Round-robin: Distribute requests evenly across servers in rotation
  • Least connections: Route to server with fewest active requests
  • Weighted: Route more traffic to higher-capacity servers
  • Sticky sessions: Route a user's requests to the same server (useful for conversation context)
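
To make these strategies concrete, here is a minimal Python sketch of the selection logic behind each one. The server names, weights, and in-flight counters are illustrative placeholders; in production a reverse proxy such as Nginx or HAProxy implements these policies for you.

```python
import hashlib
import itertools

class Balancer:
    """Toy request-routing strategies; not production code."""

    def __init__(self, servers, weights=None):
        self.servers = servers                       # e.g. ["gpu-1:8000", "gpu-2:8000"]
        self.weights = weights or {s: 1 for s in servers}
        self.active = {s: 0 for s in servers}        # in-flight requests, updated by the caller
        self._rotation = itertools.cycle(servers)

    def round_robin(self):
        # Hand out servers in a fixed rotation, ignoring current load.
        return next(self._rotation)

    def least_connections(self):
        # Pick the server with the fewest in-flight requests.
        return min(self.servers, key=lambda s: self.active[s])

    def weighted(self):
        # Favor higher-capacity servers: fewest active requests per unit of weight.
        return min(self.servers, key=lambda s: self.active[s] / self.weights[s])

    def sticky(self, session_id):
        # Hash the session id so a user's requests keep landing on the same server.
        digest = hashlib.sha256(session_id.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]
```

For inference traffic, least-connections or weighted selection usually fits better than plain round-robin because per-request cost varies so widely.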

AI-specific considerations:

Heterogeneous servers: Not all GPUs are equal. An RTX 4090 delivers different throughput than an A100, so weighted balancing accounts for these capacity differences.

Model routing: Different servers may host different models. Route requests to servers with the requested model already loaded, avoiding cold-start latency.
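
As a rough sketch, model-aware routing amounts to "prefer warm servers, fall back to the least-loaded one." The server records used here (host, loaded_models, active) are assumptions about what your serving stack's registry or metrics would expose.

```python
def route_by_model(servers, requested_model):
    """Prefer servers that already hold the requested model in GPU memory.

    Each entry in `servers` is assumed to look like:
    {"host": "gpu-1:8000", "loaded_models": {"llama-3-8b"}, "active": 3}
    """
    warm = [s for s in servers if requested_model in s["loaded_models"]]
    if warm:
        # Warm pool: least-loaded server that avoids a cold start.
        return min(warm, key=lambda s: s["active"])["host"]
    # No warm server: any host will do, but it pays the model-load cost.
    return min(servers, key=lambda s: s["active"])["host"]
```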

Long-running requests: LLM inference can take seconds. Health checks must distinguish between “busy processing” and “actually failed.”

Implementation options:

  • Nginx: Reverse proxy with basic load balancing
  • HAProxy: Advanced load balancing with health checks
  • Cloud load balancers: AWS ALB, GCP Load Balancer with managed SSL
  • Kubernetes Ingress: Native load balancing for K8s deployments

Health check patterns for AI:

  • TCP checks verify the server is reachable
  • HTTP checks verify the API responds
  • Custom checks verify the model is loaded and inference works (see the sketch after this list)
  • Gradual rollout routes a small percentage of traffic to new deployments
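
As a sketch of the custom-check pattern, assuming a FastAPI-based inference server: the /healthz and /readyz paths and the two helper functions are placeholders for whatever your serving stack actually exposes.

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

def model_is_loaded() -> bool:
    # Placeholder: ask your serving stack whether the model sits in GPU memory.
    return True

def smoke_test_inference() -> None:
    # Placeholder: generate a single token to prove the model can actually serve.
    pass

@app.get("/healthz")
def health():
    # Liveness: process and HTTP stack respond, even while long generations are running.
    return {"status": "ok"}

@app.get("/readyz")
def ready():
    # Readiness: only report healthy if the model is loaded and a trivial inference works,
    # so the load balancer never routes traffic to a server that cannot serve it.
    if not model_is_loaded():
        return JSONResponse(status_code=503, content={"status": "model not loaded"})
    try:
        smoke_test_inference()
    except Exception:
        return JSONResponse(status_code=503, content={"status": "inference failing"})
    return {"status": "ready"}
```

Keeping these endpoints cheap and separate from the inference path helps the load balancer tell "busy with a long generation" apart from "actually failed."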

Start with managed cloud load balancers since they handle SSL, DDoS protection, and basic health checks. Add complexity (custom routing, model-aware balancing) only when standard approaches prove insufficient.

Source

Load balancing is the method of distributing network traffic equally across a pool of resources that support an application.

https://aws.amazon.com/what-is/load-balancing/