MLOps

Serverless

Definition

Serverless computing runs AI inference functions on-demand without managing servers, automatically scaling from zero to handle variable workloads and charging only for actual compute used.

Why It Matters

Serverless eliminates infrastructure management entirely. No servers to provision, no capacity planning, no patching. You deploy code, and the platform handles everything else. For AI applications with variable traffic, serverless can dramatically reduce operational complexity and costs.

The appeal for AI workloads is pay-per-request pricing. If your AI feature gets 1,000 requests per day, you pay for those 1,000 invocations, not for a server running 24/7 waiting for requests. This makes serverless ideal for sporadic AI features, internal tools, or early-stage products with unpredictable usage.
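A rough back-of-the-envelope comparison of the two pricing models makes the trade-off concrete; the rates below are assumptions for illustration only, not current vendor pricing:

```python
# Illustrative cost comparison; all rates are assumed, not quoted from any provider
requests_per_day = 1_000
seconds_per_request = 0.5
gb_memory = 1

# Assumed serverless rate: ~$0.0000167 per GB-second plus a small per-request fee
serverless_monthly = requests_per_day * 30 * (
    seconds_per_request * gb_memory * 0.0000167 + 0.0000002
)

# Assumed always-on alternative: a small CPU instance at ~$30/month regardless of traffic
server_monthly = 30.0

print(f"Serverless: ~${serverless_monthly:.2f}/month")  # roughly $0.26 at these assumptions
print(f"Always-on:  ~${server_monthly:.2f}/month")
```

At low, sporadic volumes the per-request model is far cheaper; as traffic grows and stays consistent, the always-on server eventually wins, which is the crossover discussed below.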

The challenges are cold starts and GPU support. Serverless platforms traditionally lacked GPU access, limiting them to CPU inference or API calls to external AI services. This is changing: providers are adding GPU-enabled serverless options, but constraints remain around cold-start latency, memory, and execution time limits.

Implementation Basics

Serverless patterns for AI:

API gateway pattern: Serverless function receives an HTTP request, calls an external AI API (e.g., OpenAI or Anthropic), and returns the response. Simple and scalable, but the external API call adds latency.
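A minimal sketch of this pattern as an AWS Lambda handler behind API Gateway, assuming the OpenAI Python SDK is bundled with the function and the API key is supplied via an environment variable; the model name is illustrative:

```python
import json
import os

from openai import OpenAI  # assumes the openai package is packaged with the function

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def handler(event, context):
    # API Gateway proxy integration delivers the request body as a JSON string
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")

    # Forward the prompt to the external AI API; this round trip adds latency
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )

    return {
        "statusCode": 200,
        "body": json.dumps({"completion": response.choices[0].message.content}),
    }
```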

CPU inference pattern: Package a small model with the function code. Works for embedding models, small classifiers, or quantized models that run acceptably on CPU.
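A sketch of CPU inference inside a function handler, assuming a small sentence-transformers embedding model; the model name is an example, and because the dependencies usually exceed zip-package size limits, a container image deployment is typical:

```python
import json

from sentence_transformers import SentenceTransformer  # packaged in the deployment image

# Load once at module scope so warm invocations reuse the model across requests
model = SentenceTransformer("all-MiniLM-L6-v2")  # example small embedding model

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    texts = body.get("texts", [])

    # Small embedding models run acceptably on CPU within typical function time limits
    embeddings = model.encode(texts).tolist()

    return {"statusCode": 200, "body": json.dumps({"embeddings": embeddings})}
```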

GPU serverless pattern: Emerging platforms (Modal, Banana, Replicate) offer GPU-enabled serverless. Functions can load and run large models, though cold starts are significant because gigabytes of model weights must load before the first request is served.
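A sketch of the GPU pattern using Modal's Python SDK as one example; the decorator names, GPU identifier, and model shown are assumptions that may differ by SDK version:

```python
import modal

app = modal.App("gpu-inference")

# Container image with the inference dependencies preinstalled
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    # These imports run inside the remote GPU container, not on the client
    from transformers import pipeline

    # Loading weights here is part of the cold start; larger models make it much longer
    pipe = pipeline("text-generation", model="distilgpt2")  # example model
    return pipe(prompt, max_new_tokens=64)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(generate.remote("Serverless GPUs in one sentence:"))
```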

When serverless works for AI:

  • Variable, unpredictable traffic patterns
  • Infrequent usage that doesn’t justify always-on servers
  • Simple transformations or lightweight models
  • Proxying to external AI APIs

When serverless doesn’t work:

  • High-volume, consistent traffic (always-on servers more economical)
  • Large models requiring GPU (cold start latency unacceptable)
  • Latency-sensitive applications (cold starts add seconds)
  • Complex multi-step workflows (orchestration becomes complicated)

Platform options:

  • AWS Lambda: Mature, broad ecosystem, no GPU support
  • Google Cloud Functions/Run: Good Python support, some GPU options
  • Azure Functions: Microsoft ecosystem integration
  • Modal/Banana/Replicate: GPU-first serverless for AI

Start with serverless for prototypes and variable-traffic features. Migrate to persistent servers when traffic stabilizes or latency requirements tighten. The flexibility of serverless for experimentation is valuable even if production deployments move elsewhere.

Source

AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you.

https://aws.amazon.com/lambda/