
Context Compression

Definition

Techniques for reducing token count while preserving essential information, enabling more content to fit within context windows and reducing API costs.

In practice, this means condensing verbose context (especially retrieved documents) before it reaches the model, improving both cost efficiency and processing speed.

Why It Matters

Even with expanding context windows, compression matters:

  • Cost reduction: with token-based pricing, fewer input tokens mean lower per-call costs
  • Faster responses: Shorter prefill times improve latency
  • Capacity increase: Fit more relevant information in fixed windows
  • RAG optimization: Include more retrieved documents

A 50% compression rate effectively doubles your context capacity or halves your API costs.
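
As a rough illustration of the arithmetic (the per-token price below is a made-up placeholder, not any provider's real pricing):

PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical pricing

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

original = 100_000          # tokens before compression
compressed = original // 2  # 50% compression rate

print(input_cost(original))    # $0.30 per call
print(input_cost(compressed))  # $0.15 per call: cost halved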

Implementation Basics

Compression approaches:

Extractive methods:

  • Remove redundant sentences
  • Extract key phrases and entities
  • Filter low-relevance passages
  • Tools: LangChain extractors, custom NLP pipelines (a minimal sketch follows this list)
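
As a minimal sketch of a custom extractive pipeline, the function below scores sentences by word overlap with the query and keeps the highest-scoring ones; the splitting regex, scoring heuristic, and keep count are all illustrative choices:

import re

def extractive_compress(document: str, query: str, keep: int = 5) -> str:
    # Naive sentence splitting; swap in a proper tokenizer for production use
    sentences = re.split(r"(?<=[.!?])\s+", document)
    query_words = set(query.lower().split())

    def score(sentence: str) -> int:
        return len(query_words & set(sentence.lower().split()))

    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    # Re-emit in original order so the compressed text still reads coherently
    return " ".join(s for s in sentences if s in top)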

Abstractive methods:

  • Summarize verbose sections
  • Use an LLM to compress before the main call (sketched below)
  • Trade compression time for token savings
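
A sketch of that preliminary LLM call, here using the OpenAI chat completions client; the model name and prompt wording are illustrative, and any cheap summarization model would do:

from openai import OpenAI

client = OpenAI()

def abstractive_compress(document: str, target_words: int = 150) -> str:
    # One inexpensive summarization call before the main, costlier call
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of a small, cheap model
        messages=[{
            "role": "user",
            "content": f"Summarize in at most {target_words} words, "
                       f"preserving names, numbers, and key facts:\n\n{document}",
        }],
    )
    return response.choices[0].message.content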

Learned compression (LLMLingua, AutoCompressor):

  • Train models to identify essential tokens
  • 10-20x compression with minimal quality loss
  • Requires an additional inference step

Prompt optimization:

  • Remove filler words and redundancy
  • Use abbreviations and dense formatting
  • Use structured data instead of prose (see the example below)
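
For example, the same invented record as prose versus dense structured formatting (token counts are rough estimates):

Before (~25 tokens): The customer, whose name is Jane Doe, opened her account in March of 2021 and is currently on the premium plan.
After (~14 tokens): customer: Jane Doe | account_opened: 2021-03 | plan: premium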

A sketch of the LLMLingua approach (exact parameters vary between versions; rate is the fraction of tokens to keep, and compress_prompt returns a dict):

from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads the default compression model on first use

long_document = "..."  # the verbose text to compress

result = compressor.compress_prompt(
    context=long_document,
    rate=0.5,  # keep roughly 50% of the tokens
)
compressed = result["compressed_prompt"]

Best practices:

  • Measure quality degradation at different compression rates (see the sketch after this list)
  • Preserve critical information (instructions, key facts) verbatim
  • Consider domain-specific compression for technical content
  • Use compression for context, not for the actual question
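
A minimal sketch of such a rate sweep, assuming a hypothetical answer_with_context function that calls your LLM, a score function (e.g. exact match), and a small set of (question, context, reference) triples:

from llmlingua import PromptCompressor

compressor = PromptCompressor()

def sweep_compression_rates(qa_triples, answer_with_context, score):
    for rate in (1.0, 0.75, 0.5, 0.25):
        results = []
        for question, context, reference in qa_triples:
            if rate < 1.0:  # rate=1.0 is the uncompressed baseline
                context = compressor.compress_prompt(
                    context=context, rate=rate
                )["compressed_prompt"]
            results.append(score(answer_with_context(question, context), reference))
        print(f"rate={rate}: mean quality {sum(results) / len(results):.2f}")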

Context compression is particularly valuable for RAG systems where retrieved documents are verbose. Compress context, not queries.
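
Reusing the compressor from the example above, a RAG integration might look like the following; retrieve is a placeholder for your retriever, and the prompt template is illustrative:

def compressed_rag_prompt(query: str, retrieve) -> str:
    # retrieve(query) -> list of document strings (hypothetical retriever)
    chunks = retrieve(query)
    compressed = [
        compressor.compress_prompt(context=chunk, rate=0.5)["compressed_prompt"]
        for chunk in chunks  # compress the retrieved context, never the query itself
    ]
    context_block = "\n\n".join(compressed)
    return f"Context:\n{context_block}\n\nQuestion: {query}"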

Source

LLMLingua demonstrates effective prompt compression for improved efficiency

https://arxiv.org/abs/2310.06839