Context Window
Definition
The context window is the maximum number of tokens an LLM can process in a single request, including both input (prompt, documents, history) and output, typically ranging from 4K to 1M+ tokens.
Why It Matters
Context windows define what’s possible. A 4K token limit means your LLM can only “see” about 3,000 words at once, not enough for long documents. A 200K window changes everything: entire codebases, book chapters, or hours of conversation history can fit in a single request.
But bigger isn’t always better. Longer contexts cost more, run slower, and models can lose track of information (“lost in the middle”). The right context size balances capability against cost and reliability.
For AI engineers, context window management is a daily consideration. How much history to keep? How many RAG documents to include? When to summarize vs. include full content? These decisions shape your application’s behavior and costs.
Implementation Basics
Current Landscape (2026)
- GPT-5: 400K+ tokens
- Claude 4.5: 200K-1M tokens (with extended thinking)
- Gemini 3: 2M+ tokens
- Llama 4: 256K tokens
- Open source: Widely variable (32K-256K typical)
Token Budget Planning
Allocate your context window deliberately (a minimal budgeting sketch follows this list):
- System prompt: 500-2000 tokens
- Retrieved documents: 2000-8000 tokens
- Conversation history: Variable
- Current input: Variable
- Output buffer: Reserve 1000-4000 tokens
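One way to turn these allocations into code, assuming a 128K-token model, the tiktoken `cl100k_base` encoding, and the budget numbers above; all three are illustrative choices, not requirements of any particular API:

```python
# A minimal token-budgeting sketch: reserve an output buffer, cap documents,
# and fill remaining space with the most recent conversation history.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

MODEL_LIMIT = 128_000      # assumed context window for the target model
OUTPUT_BUFFER = 4_000      # reserved for the model's response
SYSTEM_BUDGET = 2_000      # system prompt allocation
DOC_BUDGET = 8_000         # retrieved-document allocation


def count_tokens(text: str) -> int:
    return len(ENC.encode(text))


def build_context(system: str, docs: list[str],
                  history: list[str], user_input: str) -> list[str]:
    """Assemble prompt parts without exceeding MODEL_LIMIT - OUTPUT_BUFFER."""
    available = MODEL_LIMIT - OUTPUT_BUFFER

    assert count_tokens(system) <= SYSTEM_BUDGET, "system prompt over budget"
    parts = [system]
    used = count_tokens(system)

    # Retrieved documents, most relevant first, up to DOC_BUDGET.
    doc_used = 0
    for doc in docs:
        t = count_tokens(doc)
        if doc_used + t > DOC_BUDGET or used + t > available:
            break
        parts.append(doc)
        doc_used += t
        used += t

    # The current input is non-negotiable; account for it before history.
    used += count_tokens(user_input)

    # Fill whatever remains with the most recent conversation turns.
    kept = []
    for msg in reversed(history):
        t = count_tokens(msg)
        if used + t > available:
            break
        kept.append(msg)
        used += t

    parts.extend(reversed(kept))
    parts.append(user_input)
    return parts
```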
Strategies for Limited Context
- Summarization: Compress old messages instead of dropping them
- Selective retrieval: Only include most relevant documents
- Chunking: Process long inputs in segments
- History pruning: Keep recent messages plus the most relevant older ones (a pruning-with-summarization sketch follows this list)
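A sketch that combines the first and last strategies: keep recent turns verbatim and compress everything older into a single summary message instead of dropping it. The `summarize` callable is a placeholder for whatever summarizer you use (an LLM call, an extractive method), and `count_tokens` is the helper from the budgeting sketch above:

```python
def prune_history(history: list[str], max_tokens: int,
                  summarize, count_tokens) -> list[str]:
    """Keep the most recent messages that fit; summarize the rest."""
    # Walk backwards, keeping the newest messages within the token budget.
    kept, used = [], 0
    for msg in reversed(history):
        t = count_tokens(msg)
        if used + t > max_tokens:
            break
        kept.append(msg)
        used += t
    kept.reverse()

    # Older messages are compressed into one summary turn rather than vanishing.
    dropped = history[: len(history) - len(kept)]
    if dropped:
        kept.insert(0, "Summary of earlier conversation: " + summarize(dropped))
    return kept
```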
Common Mistakes
- Filling context completely (leaves no room for output)
- Including irrelevant documents that dilute important context
- Not testing how models handle information placement
- Ignoring the cost implications of long contexts
Practical Tips
Track your token usage. Build monitoring to see how much context you’re actually using (a minimal logging sketch appears below). If you’re consistently at 10% of capacity, you’re likely overpaying for context you don’t need. If you’re hitting limits, you need smarter context management.
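A minimal monitoring sketch. The `usage.prompt_tokens` / `usage.completion_tokens` fields follow the OpenAI-style response shape; for other providers, swap in their equivalent fields. The context limit is again an assumed value:

```python
import logging

CONTEXT_LIMIT = 128_000  # assumed context window for the target model
logger = logging.getLogger("context_usage")


def log_context_usage(response) -> float:
    """Record what fraction of the context window a request actually used."""
    used = response.usage.prompt_tokens + response.usage.completion_tokens
    ratio = used / CONTEXT_LIMIT
    logger.info("context usage: %d/%d tokens (%.1f%%)",
                used, CONTEXT_LIMIT, 100 * ratio)
    return ratio
```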
Source
Long-context models show degraded performance on retrieval tasks when the relevant information sits in the middle of the input context (the 'Lost in the Middle' phenomenon).
https://arxiv.org/abs/2307.03172