Batching

Definition

Processing multiple LLM requests together to maximize GPU utilization and throughput, trading individual latency for overall system efficiency.

Batching in LLM inference combines multiple requests into a single forward pass through the model. Because each decoding step is dominated by loading the model's weights from GPU memory, running several sequences through the same pass amortizes that cost, improving GPU utilization and overall throughput at the price of some per-request latency.

Why It Matters

GPUs are massively parallel processors, but processing one request at a time wastes most of their capacity. Batching matters because:

  • GPU efficiency: Utilize more of the available compute capacity
  • Higher throughput: Serve more requests per second
  • Lower cost: More efficient use of expensive GPU resources
  • Better scaling: Handle traffic spikes more effectively

The trade-off: individual requests wait slightly longer as they’re grouped with others. This is acceptable for most production workloads but may not suit real-time applications.

Implementation Basics

Two main batching strategies:

Static Batching:

  • Collect N requests before processing
  • Simple but inefficient, as short prompts wait for long ones (see the toy calculation after this list)
  • Suitable for offline/batch processing jobs
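
To make the inefficiency concrete, here is a toy calculation of how much of a static batch is spent on padding when sequence lengths vary; the lengths are made up purely for illustration:

# Toy illustration of padding waste in a static batch: every request is
# padded to the longest sequence and returns only when that sequence
# finishes. (Sequence lengths below are invented for illustration.)
sequence_lengths = [12, 840, 35, 910]          # tokens per request in one batch
longest = max(sequence_lengths)
padding_waste = sum(longest - n for n in sequence_lengths)
utilization = sum(sequence_lengths) / (longest * len(sequence_lengths))
print(f"Wasted slots: {padding_waste}, batch utilization: {utilization:.0%}")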

Continuous Batching (used by vLLM, TGI):

  • Process requests as they arrive
  • Add new requests to running batches
  • Remove completed requests without stopping
  • Much higher throughput for interactive workloads (a simplified scheduler loop is sketched below)
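
A simplified sketch of that iteration-level scheduling idea; the request queue, `step()`, and `is_finished` names are illustrative stand-ins, not any framework's real API:

import queue

MAX_BATCH = 32
request_queue = queue.Queue()    # newly arrived requests
active = []                      # requests currently in the running batch

def scheduler_loop(model):
    while True:
        # Admit new requests until the batch limit is reached
        while len(active) < MAX_BATCH and not request_queue.empty():
            active.append(request_queue.get())
        if not active:
            continue
        # Run one decode step for every active sequence
        model.step(active)
        # Evict finished sequences without stopping the others
        active[:] = [r for r in active if not r.is_finished]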

Key parameters:

  • Batch size: Maximum requests per batch (see the vLLM example below)
  • Wait time: How long to collect requests before processing
  • Dynamic scheduling: Prioritize based on sequence length
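
As a concrete example, vLLM exposes a batch-size cap through its engine arguments; the model name and values below are placeholders, and argument names should be checked against the current vLLM documentation:

from vllm import LLM, SamplingParams

# max_num_seqs caps how many sequences the engine will batch together
# per step; the model name and values here are placeholders.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_seqs=256)
outputs = llm.generate(
    ["Prompt one", "Prompt two"],
    SamplingParams(max_tokens=64),
)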

OpenAI Batch API example for high-volume processing:

# Submit a batch job against a previously uploaded JSONL input file
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
batch = client.batches.create(
    input_file_id="file-abc123",        # ID of the uploaded requests file
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
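
Batch jobs run asynchronously; roughly, you poll the job and download the output file once it completes:

# Check job status later and fetch the results once the job has finished
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)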

For serving frameworks like vLLM and TGI, continuous batching is automatic. For custom deployments, implement request queuing and batch scheduling based on your latency/throughput requirements.
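
A minimal sketch of such a scheduler, which flushes a batch either when it is full or when the oldest request has exceeded a wait-time budget; all names and thresholds here are illustrative:

import time

MAX_BATCH_SIZE = 16
MAX_WAIT_SECONDS = 0.05          # latency budget for collecting a batch
pending = []                     # (arrival_time, request) tuples

def maybe_flush():
    # Flush when the batch is full OR the oldest request has waited too long
    if not pending:
        return None
    full = len(pending) >= MAX_BATCH_SIZE
    overdue = time.monotonic() - pending[0][0] > MAX_WAIT_SECONDS
    if full or overdue:
        batch = [req for _, req in pending]
        pending.clear()
        return batch             # hand off to the model as one batch
    return None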

Source

Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (the vLLM paper), in which continuous batching is a key technique for high-throughput LLM serving.

https://arxiv.org/abs/2309.06180