Batching
Definition
Processing multiple LLM requests together to maximize GPU utilization and throughput, trading individual latency for overall system efficiency.
In practice, the serving engine stacks the prompts from several requests and runs them through the model in a single forward pass, so each layer performs one large batched matrix multiplication instead of many small ones.
Why It Matters
GPUs are massively parallel processors, but processing one request at a time wastes most of their capacity. Batching matters because:
- GPU efficiency: Single-request decoding is memory-bound; batching reuses each weight load across many sequences, so more of the available compute is used
- Higher throughput: Serve more requests per second
- Lower cost: More efficient use of expensive GPU resources
- Better scaling: Handle traffic spikes more effectively
The trade-off: individual requests wait slightly longer as they're grouped with others. This is acceptable for most production workloads but may not suit real-time applications.
Implementation Basics
Two main batching strategies:
Static Batching:
- Collect N requests before processing
- Simple but inefficient, as short prompts wait for long ones
- Suitable for offline/batch processing jobs (see the sketch below)
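A minimal sketch of static batching, assuming a local Hugging Face transformers model (the gpt2 checkpoint, prompts, and generation settings are illustrative, not from the source): collect a fixed group of prompts, pad them to a common length, and serve them with one batched generate call.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Decoder-only models need left padding so generation continues from real tokens
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Static batch: wait until N prompts have been collected, then run them together
prompts = [
    "Explain batching in one sentence:",
    "List two benefits of GPU batching:",
    "What is continuous batching?",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

# One generate call serves the whole batch; shorter prompts are padded,
# so they effectively wait for the longest sequence in the batch
outputs = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))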
Continuous Batching (used by vLLM, TGI):
- Process requests as they arrive
- Add new requests to running batches
- Remove completed requests without stopping
- Much higher throughput for interactive workloads (see the vLLM sketch below)
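With an engine like vLLM, continuous batching happens inside the scheduler and the caller simply submits requests. A sketch using vLLM's offline Python API (the model name is a placeholder):

from vllm import LLM, SamplingParams

# The engine batches continuously: new sequences join the running batch
# and finished sequences free their slots without stopping the others
llm = LLM(model="facebook/opt-125m")  # placeholder model
sampling = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Summarize continuous batching in one sentence:",
    "Why does batching improve GPU utilization?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)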
Key parameters:
- Batch size: Maximum requests per batch
- Wait time: How long to collect requests before processing
- Dynamic scheduling: Prioritize requests based on sequence length or arrival time (a toy batch-collection loop illustrating these knobs follows)
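To make the batch size and wait time knobs concrete, here is a toy request-collection loop (purely illustrative, not taken from any framework): it releases a batch as soon as it holds MAX_BATCH_SIZE requests or the wait-time budget expires.

import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 8   # batch size: cap on requests per forward pass
MAX_WAIT_S = 0.05    # wait time: how long to keep collecting requests

request_queue: Queue = Queue()

def collect_batch() -> list:
    """Block for the first request, then fill the batch until it is
    full or the wait-time budget is spent."""
    batch = [request_queue.get()]  # wait for at least one request
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

# Serving loop (run_model is a placeholder for the batched forward pass):
# while True:
#     results = run_model(collect_batch())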
The OpenAI Batch API targets offline, high-volume processing: you upload a JSONL file of requests and receive the results within a completion window. For example:
from openai import OpenAI

client = OpenAI()

# Submit a batch job; input_file_id points to a previously uploaded
# JSONL file of requests (uploaded with purpose="batch")
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
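A possible follow-up, continuing from the snippet above and assuming the standard OpenAI Python SDK: poll the batch until it reaches a terminal status, then download the output file.

import time

# Poll until the batch reaches a terminal status
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Each line of the output file is the JSON response for one request
if batch.status == "completed":
    result = client.files.content(batch.output_file_id)
    print(result.text)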
For serving frameworks like vLLM and TGI, continuous batching is automatic. For custom deployments, implement request queuing and batch scheduling based on your latency/throughput requirements.
Source
Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (the vLLM paper; continuous batching is a key technique for high-throughput LLM serving)
https://arxiv.org/abs/2309.06180