Context Length
Definition
The maximum number of tokens an LLM can process in a single request, including both the input prompt and the generated output. Also called the context window, this limit determines how much information the model can consider at once.
Why It Matters
Context length determines what you can accomplish in a single LLM call:
- Document processing: How much text can be analyzed at once
- Conversation history: How much chat context can be retained
- RAG effectiveness: How many retrieved chunks can be included (see the budget sketch after this list)
- Code understanding: How much codebase context fits in one prompt
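For example, in a RAG pipeline the retrieved chunks must fit in whatever context remains after the system prompt, the user question, and the tokens reserved for the answer. A minimal back-of-the-envelope sketch, where all the numbers (window size, overheads, chunk size) are illustrative assumptions rather than measurements:
# How many retrieved chunks fit in the remaining context budget (illustrative numbers)
CONTEXT_LENGTH = 128_000       # assumed model context window, in tokens
RESERVED_OUTPUT = 4_000        # tokens held back for the generated answer
PROMPT_OVERHEAD = 1_000        # assumed system prompt + user question
CHUNK_TOKENS = 500             # assumed size of one retrieved chunk

available = CONTEXT_LENGTH - RESERVED_OUTPUT - PROMPT_OVERHEAD
max_chunks = available // CHUNK_TOKENS
print(f"Roughly {max_chunks} chunks of {CHUNK_TOKENS} tokens each will fit")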
Modern context lengths have expanded dramatically:
- GPT-3 (2020): 2K tokens
- GPT-4 (2023): 128K tokens
- GPT-5 (2026): 400K+ tokens
- Claude 4.5 (2026): 200K-1M tokens
- Gemini 3 (2026): 2M tokens
Longer contexts enable new use cases but come with trade-offs in cost, latency, and attention quality.
Implementation Basics
Working with context length:
- Token counting: Use tiktoken (OpenAI) or model-specific tokenizers
- Budget allocation: Reserve space for output (typically 4K-8K tokens)
- Truncation strategies: Remove oldest messages, summarize, or chunk
Context management patterns:
# Estimate token usage before sending a request
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # tokenizer matching the target model
text = "Your prompt text here..."
token_count = len(enc.encode(text))         # tokens this text will consume from the window
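A second common pattern is the "remove oldest messages" truncation strategy mentioned above: walk the conversation from newest to oldest and keep only what fits within the input budget. This is a minimal sketch; the budget numbers and message format are assumptions, and real chat formats add a few tokens of per-message overhead:
# Keep the most recent messages that fit within the input token budget (sketch)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def truncate_messages(messages, context_length=8_000, reserved_output=1_000):
    budget = context_length - reserved_output
    kept, used = [], 0
    for message in reversed(messages):              # newest first
        tokens = len(enc.encode(message["content"]))
        if used + tokens > budget:
            break                                   # everything older is dropped
        kept.append(message)
        used += tokens
    return list(reversed(kept))                     # back to chronological order

history = [
    {"role": "user", "content": "First question..."},
    {"role": "assistant", "content": "First answer..."},
    {"role": "user", "content": "Latest question..."},
]
trimmed = truncate_messages(history)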
Typical context tiers:
- Short context (4K-8K): Basic Q&A, simple tasks
- Medium context (32K-64K): Document analysis, multi-turn conversations
- Long context (128K+): Large document processing, codebase understanding
Considerations:
- Cost: You are billed for both input and output tokens, often at different per-token rates (see the estimate after this list)
- Latency: Longer prompts increase prefill time
- Attention degradation: Models may lose focus with very long contexts
- Lost in the middle: Information in the middle of long contexts is often missed
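To make the cost consideration concrete, a rough estimate multiplies input and output token counts by their per-million-token rates. The prices below are placeholder assumptions; check your provider's current pricing:
# Back-of-the-envelope request cost (placeholder prices per 1M tokens)
INPUT_PRICE_PER_M = 3.00     # assumed $ per 1M input tokens
OUTPUT_PRICE_PER_M = 15.00   # assumed $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A 100K-token prompt with a 2K-token answer at the assumed rates:
print(f"${estimate_cost(100_000, 2_000):.2f}")  # -> $0.33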
For most applications, design for efficient context use rather than relying on maximum length. Focused, relevant context beats raw volume.