Batching
Definition
Processing multiple LLM requests together to maximize GPU utilization and throughput, trading individual latency for overall system efficiency.
In practice, the serving engine stacks the prompts from several requests and runs them through the model in a single forward pass, so each layer performs one large batched matrix multiplication instead of many small ones.
Why It Matters
GPUs are massively parallel processors, but processing one request at a time wastes most of their capacity. Batching matters because:
- GPU efficiency: Single-request decoding is memory-bound; batching reuses each weight load across many sequences, so more of the available compute is used
- Higher throughput: Serve more requests per second
- Lower cost: More efficient use of expensive GPU resources
- Better scaling: Handle traffic spikes more effectively
The trade-off: individual requests wait slightly longer as they're grouped with others. This is acceptable for most production workloads but may not suit real-time applications.
Implementation Basics
Two main batching strategies:
Static Batching:
- Collect N requests before processing
- Simple but inefficient, as short prompts wait for long ones
- Suitable for offline/batch processing jobs (see the sketch below)
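A minimal sketch of static batching, assuming a local Hugging Face transformers model (the gpt2 checkpoint, prompts, and generation settings are illustrative, not from the source): collect a fixed group of prompts, pad them to a common length, and serve them with one batched generate call.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Decoder-only models need left padding so generation continues from real tokens
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Static batch: wait until N prompts have been collected, then run them together
prompts = [
    "Explain batching in one sentence:",
    "List two benefits of GPU batching:",
    "What is continuous batching?",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

# One generate call serves the whole batch; shorter prompts are padded,
# so they effectively wait for the longest sequence in the batch
outputs = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))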
Continuous Batching (used by vLLM, TGI):
- Process requests as they arrive
- Add new requests to running batches
- Remove completed requests without stopping
- Much higher throughput for interactive workloads (see the vLLM sketch below)
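With an engine like vLLM, continuous batching happens inside the scheduler and the caller simply submits requests. A sketch using vLLM's offline Python API (the model name is a placeholder):

from vllm import LLM, SamplingParams

# The engine batches continuously: new sequences join the running batch
# and finished sequences free their slots without stopping the others
llm = LLM(model="facebook/opt-125m")  # placeholder model
sampling = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Summarize continuous batching in one sentence:",
    "Why does batching improve GPU utilization?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)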
Key parameters:
- Batch size: Maximum requests per batch
- Wait time: How long to collect requests before processing
- Dynamic scheduling: Prioritize requests based on sequence length or arrival time (a toy batch-collection loop illustrating these knobs follows)
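To make the batch size and wait time knobs concrete, here is a toy request-collection loop (purely illustrative, not taken from any framework): it releases a batch as soon as it holds MAX_BATCH_SIZE requests or the wait-time budget expires.

import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 8   # batch size: cap on requests per forward pass
MAX_WAIT_S = 0.05    # wait time: how long to keep collecting requests

request_queue: Queue = Queue()

def collect_batch() -> list:
    """Block for the first request, then fill the batch until it is
    full or the wait-time budget is spent."""
    batch = [request_queue.get()]  # wait for at least one request
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

# Serving loop (run_model is a placeholder for the batched forward pass):
# while True:
#     results = run_model(collect_batch())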
The OpenAI Batch API targets offline, high-volume processing: you upload a JSONL file of requests and receive the results within a completion window. For example:
from openai import OpenAI

client = OpenAI()

# Submit a batch job; input_file_id points to a previously uploaded
# JSONL file of requests (uploaded with purpose="batch")
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
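A possible follow-up, continuing from the snippet above and assuming the standard OpenAI Python SDK: poll the batch until it reaches a terminal status, then download the output file.

import time

# Poll until the batch reaches a terminal status
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Each line of the output file is the JSON response for one request
if batch.status == "completed":
    result = client.files.content(batch.output_file_id)
    print(result.text)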
For serving frameworks like vLLM and TGI, continuous batching is automatic. For custom deployments, implement request queuing and batch scheduling based on your latency/throughput requirements.
Source
Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (the vLLM paper; continuous batching is a key technique for high-throughput LLM serving)
https://arxiv.org/abs/2309.06180