
Context Compression

Definition

Techniques for reducing token count while preserving essential information, enabling more content to fit within context windows and reducing API costs.

In practice, this means condensing verbose context (especially retrieved documents) before it reaches the model, improving both cost efficiency and processing speed.

Why It Matters

Even with expanding context windows, compression matters:

  • Cost reduction: with token-based pricing, fewer input tokens mean lower per-call costs
  • Faster responses: Shorter prefill times improve latency
  • Capacity increase: Fit more relevant information in fixed windows
  • RAG optimization: Include more retrieved documents

A 50% compression rate effectively doubles your context capacity or halves your API costs.
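
As a rough illustration of the arithmetic (the per-token price below is a made-up placeholder, not any provider's real pricing):

PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical pricing

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

original = 100_000          # tokens before compression
compressed = original // 2  # 50% compression rate

print(input_cost(original))    # $0.30 per call
print(input_cost(compressed))  # $0.15 per call: cost halved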

Implementation Basics

Compression approaches:

Extractive methods:

  • Remove redundant sentences
  • Extract key phrases and entities
  • Filter low-relevance passages
  • Tools: LangChain extractors, custom NLP pipelines (a minimal sketch follows this list)
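
As a minimal sketch of a custom extractive pipeline, the function below scores sentences by word overlap with the query and keeps the highest-scoring ones; the splitting regex, scoring heuristic, and keep count are all illustrative choices:

import re

def extractive_compress(document: str, query: str, keep: int = 5) -> str:
    # Naive sentence splitting; swap in a proper tokenizer for production use
    sentences = re.split(r"(?<=[.!?])\s+", document)
    query_words = set(query.lower().split())

    def score(sentence: str) -> int:
        return len(query_words & set(sentence.lower().split()))

    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    # Re-emit in original order so the compressed text still reads coherently
    return " ".join(s for s in sentences if s in top)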

Abstractive methods:

  • Summarize verbose sections
  • Use an LLM to compress before the main call (sketched below)
  • Trade compression time for token savings
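
A sketch of that preliminary LLM call, here using the OpenAI chat completions client; the model name and prompt wording are illustrative, and any cheap summarization model would do:

from openai import OpenAI

client = OpenAI()

def abstractive_compress(document: str, target_words: int = 150) -> str:
    # One inexpensive summarization call before the main, costlier call
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of a small, cheap model
        messages=[{
            "role": "user",
            "content": f"Summarize in at most {target_words} words, "
                       f"preserving names, numbers, and key facts:\n\n{document}",
        }],
    )
    return response.choices[0].message.content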

Learned compression (LLMLingua, AutoCompressor):

  • Train models to identify essential tokens
  • 10-20x compression with minimal quality loss
  • Requires an additional inference step

Prompt optimization:

  • Remove filler words and redundancy
  • Use abbreviations and dense formatting
  • Use structured data instead of prose (see the example below)
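
For example, the same invented record as prose versus dense structured formatting (token counts are rough estimates):

Before (~25 tokens): The customer, whose name is Jane Doe, opened her account in March of 2021 and is currently on the premium plan.
After (~14 tokens): customer: Jane Doe | account_opened: 2021-03 | plan: premium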

A sketch of the LLMLingua approach (exact parameters vary between versions; rate is the fraction of tokens to keep, and compress_prompt returns a dict):

from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads the default compression model on first use

long_document = "..."  # the verbose text to compress

result = compressor.compress_prompt(
    context=long_document,
    rate=0.5,  # keep roughly 50% of the tokens
)
compressed = result["compressed_prompt"]

Best practices:

  • Measure quality degradation at different compression rates (see the sketch after this list)
  • Preserve critical information (instructions, key facts) verbatim
  • Consider domain-specific compression for technical content
  • Use compression for context, not for the actual question
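
A minimal sketch of such a rate sweep, assuming a hypothetical answer_with_context function that calls your LLM, a score function (e.g. exact match), and a small set of (question, context, reference) triples:

from llmlingua import PromptCompressor

compressor = PromptCompressor()

def sweep_compression_rates(qa_triples, answer_with_context, score):
    for rate in (1.0, 0.75, 0.5, 0.25):
        results = []
        for question, context, reference in qa_triples:
            if rate < 1.0:  # rate=1.0 is the uncompressed baseline
                context = compressor.compress_prompt(
                    context=context, rate=rate
                )["compressed_prompt"]
            results.append(score(answer_with_context(question, context), reference))
        print(f"rate={rate}: mean quality {sum(results) / len(results):.2f}")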

Context compression is particularly valuable for RAG systems where retrieved documents are verbose. Compress context, not queries.
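
Reusing the compressor from the example above, a RAG integration might look like the following; retrieve is a placeholder for your retriever, and the prompt template is illustrative:

def compressed_rag_prompt(query: str, retrieve) -> str:
    # retrieve(query) -> list of document strings (hypothetical retriever)
    chunks = retrieve(query)
    compressed = [
        compressor.compress_prompt(context=chunk, rate=0.5)["compressed_prompt"]
        for chunk in chunks  # compress the retrieved context, never the query itself
    ]
    context_block = "\n\n".join(compressed)
    return f"Context:\n{context_block}\n\nQuestion: {query}"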

Source

LLMLingua demonstrates effective prompt compression for improved efficiency

https://arxiv.org/abs/2310.06839