
Autoscaling

Definition

Autoscaling automatically adjusts the number of AI inference servers based on demand, scaling up during traffic spikes and down during quiet periods to optimize cost and performance.

Why It Matters

AI inference costs can spiral without autoscaling. Running 10 GPU servers continuously costs the same whether they’re processing requests or sitting idle. Autoscaling matches capacity to demand, spinning up servers when traffic increases and scaling down when traffic subsides.

The challenge with AI workloads is scaling speed. GPU servers take minutes to start, load models, and warm up. By the time new capacity is ready, the traffic spike may have passed. This makes predictive scaling and a sensible capacity buffer especially important for AI workloads.

Cost optimization through autoscaling can be substantial. Production AI traffic often varies 10x between peak and off-peak hours. Scaling to match means paying for 10 servers during peak but only 1-2 during off-peak, potentially reducing costs by 70-80% compared to fixed capacity.
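
As a rough illustration of that arithmetic (the hourly price and the peak/off-peak split below are assumptions, not figures from any provider):

```python
# Illustrative cost comparison for a fixed fleet vs. an autoscaled fleet.
GPU_PRICE_PER_HOUR = 3.00      # assumed on-demand price for one GPU instance
PEAK_HOURS, OFF_PEAK_HOURS = 4, 20
PEAK_REPLICAS, OFF_PEAK_REPLICAS = 10, 1.5   # averaging between 1 and 2 off-peak

# Fixed capacity: sized for peak, running 24/7.
fixed_cost = PEAK_REPLICAS * 24 * GPU_PRICE_PER_HOUR

# Autoscaled: capacity follows the traffic pattern.
autoscaled_cost = (PEAK_REPLICAS * PEAK_HOURS
                   + OFF_PEAK_REPLICAS * OFF_PEAK_HOURS) * GPU_PRICE_PER_HOUR

savings = 1 - autoscaled_cost / fixed_cost
print(f"fixed: ${fixed_cost:.0f}/day, autoscaled: ${autoscaled_cost:.0f}/day, "
      f"savings: {savings:.0%}")   # ~71% with these assumptions, in line with 70-80%
```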

Implementation Basics

Autoscaling approaches:

Reactive scaling responds to current metrics (CPU, memory, request queue depth). When metrics exceed thresholds, add capacity. When metrics drop, remove capacity.
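
A minimal reactive policy is just a small control loop. The sketch below scales on request queue depth; get_queue_depth, get_replica_count, and set_replica_count are hypothetical stand-ins for a metrics source and an orchestrator API, and the thresholds are illustrative.

```python
import time

# Thresholds and limits are assumptions for illustration.
HIGH_QUEUE_PER_REPLICA = 20   # scale up when backlog per replica exceeds this
LOW_QUEUE_PER_REPLICA = 5     # scale down when backlog per replica drops below this
MIN_REPLICAS, MAX_REPLICAS = 1, 20

def reactive_loop(get_queue_depth, get_replica_count, set_replica_count, interval_s=30):
    """Poll the queue and nudge capacity up or down one step at a time."""
    while True:
        replicas = get_replica_count()
        per_replica = get_queue_depth() / max(replicas, 1)
        if per_replica > HIGH_QUEUE_PER_REPLICA and replicas < MAX_REPLICAS:
            set_replica_count(replicas + 1)
        elif per_replica < LOW_QUEUE_PER_REPLICA and replicas > MIN_REPLICAS:
            set_replica_count(replicas - 1)
        time.sleep(interval_s)
```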

Predictive scaling uses historical patterns to anticipate demand. If traffic peaks at 9am daily, start scaling up at 8:45am rather than waiting for metrics to spike.

Scheduled scaling sets minimum capacity for known events (marketing campaigns, product launches) without waiting for autoscaling triggers.
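
Predictive and scheduled scaling often reduce to a capacity floor computed ahead of demand. The sketch below is one way to combine them, assuming a learned hourly profile and a list of known events; the numbers, helper names, and 15-minute lead time are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical predictive/scheduled capacity floor; all values are illustrative.
HOURLY_PEAK_REPLICAS = {8: 6, 9: 10, 10: 8}   # learned from historical traffic, by hour of day
SCHEDULED_EVENTS = [                           # known events with a fixed minimum
    (datetime(2024, 6, 1, 12), datetime(2024, 6, 1, 18), 12),
]
COLD_START = timedelta(minutes=15)             # lead time: provisioning + model load + warmup

def capacity_floor(now: datetime, default_min: int = 2) -> int:
    """Return the minimum replica count to hold right now, scaling ahead of demand."""
    lookahead = now + COLD_START               # pre-scale so capacity is warm when demand arrives
    floor = HOURLY_PEAK_REPLICAS.get(lookahead.hour, default_min)
    for start, end, minimum in SCHEDULED_EVENTS:
        if start - COLD_START <= now <= end:   # raise the floor shortly before the event begins
            floor = max(floor, minimum)
    return floor

print(capacity_floor(datetime(2024, 6, 1, 8, 50)))  # 10: pre-scales for the 9am peak at 8:45
```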

Metrics for AI autoscaling (a queue-depth scaling calculation follows this list):

  • Request queue depth: Most relevant for AI, measures requests waiting for inference
  • GPU utilization: Measures compute usage, but can be misleading for memory-bound workloads
  • Latency percentiles: Scale when p95/p99 latency exceeds targets
  • Custom metrics: Model-specific throughput (tokens/second, images/second)
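
The Kubernetes HorizontalPodAutoscaler linked in the source computes desired replicas as roughly ceil(currentReplicas × currentMetric / targetMetric), skipping changes within a small tolerance (0.1 by default). A queue-depth version of that calculation might look like the sketch below; the target of 10 queued requests per replica is an assumed value.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA-style calculation: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target; hold steady
    return math.ceil(current_replicas * ratio)

# Example: 4 replicas, 15 queued requests per replica, target of 10 per replica.
print(desired_replicas(4, current_metric=15, target_metric=10))  # -> 6
```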

AI-specific challenges (see the buffer-sizing sketch after this list):

  • Cold start: New instances need time to load models (30s-5min for large models)
  • GPU provisioning: Cloud GPU instances may not be immediately available
  • Minimum viable capacity: Some applications can’t scale to zero due to cold start impact
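
One way to reason about cold start is to size a warm buffer against how fast traffic can grow while a new replica is still loading. The sketch below is a rough sizing aid; the cold-start time, ramp rate, and per-replica capacity are assumptions to replace with your own measurements.

```python
import math

# Rough sizing: how much headroom covers a traffic ramp while new replicas cold-start?
COLD_START_S = 180             # assumed: provisioning + model load + warmup for one replica
RAMP_RPS_PER_MIN = 20          # assumed: how fast traffic can grow, requests/sec per minute
CAPACITY_PER_REPLICA_RPS = 50  # assumed: sustained requests/sec one warm replica can serve

def buffer_replicas() -> int:
    """Replicas to keep warm beyond current demand so the ramp is absorbed during cold start."""
    growth_during_cold_start = RAMP_RPS_PER_MIN * (COLD_START_S / 60)
    return math.ceil(growth_during_cold_start / CAPACITY_PER_REPLICA_RPS)

print(buffer_replicas())  # 2 extra warm replicas with these assumptions
```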

Implementation patterns (an HPA example follows this list):

  • Kubernetes HPA: Horizontal Pod Autoscaler with custom metrics
  • Cloud autoscaling: AWS Auto Scaling Groups, GCP Managed Instance Groups
  • Serverless: AWS Lambda, Cloud Run (limited GPU support currently)
  • Knative: Kubernetes-native scale-to-zero with configurable warmup
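
As one concrete pattern, an HPA on a custom Pods metric can be created programmatically with the official kubernetes Python client. The deployment name, namespace, and metric name below are placeholders, the metric is assumed to be exposed through a metrics adapter, and the V2 model names assume a reasonably recent client version.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# Placeholders: an "inference-server" Deployment and a custom Pods metric
# "inference_queue_depth" exposed via a metrics adapter.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-server-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-server"),
        min_replicas=1,            # keep one warm replica to avoid cold starts
        max_replicas=20,
        metrics=[client.V2MetricSpec(
            type="Pods",
            pods=client.V2PodsMetricSource(
                metric=client.V2MetricIdentifier(name="inference_queue_depth"),
                target=client.V2MetricTarget(type="AverageValue", average_value="10"),
            ),
        )],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```

Setting min_replicas to 1 keeps one instance warm at all times, which matches the warm-instance advice below.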

Start with conservative scaling, over-provisioning slightly rather than under-provisioning. Track scaling events and latency to tune thresholds. Consider keeping one warm instance always running to eliminate cold-start latency for the first request.

Source

The HorizontalPodAutoscaler automatically updates a workload resource with the aim of automatically scaling the workload to match demand.

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/