HyDE (Hypothetical Document Embeddings)
Definition
HyDE is a retrieval technique that generates a hypothetical answer to a query using an LLM, then embeds that answer to find similar real documents, bridging the gap between question and document embeddings.
Why It Matters
Questions and answers live in different embedding spaces. When you embed “What causes inflation?” and embed a document explaining inflation, the vectors might not be as close as you’d expect. The question is short and interrogative; the document is long and declarative. HyDE bridges this gap.
The key insight: instead of searching with the question embedding, generate what a good answer might look like, then search with that. A hypothetical answer shares more semantic structure with real documents than the original question does.
For AI engineers, HyDE is a technique to try when your RAG system retrieves tangentially related documents instead of directly relevant ones. It’s particularly effective for complex questions whose wording shares little surface similarity with the documents that answer them.
How It Works
HyDE adds an LLM generation step before retrieval (a code sketch of the full pipeline follows the steps below):
1. Generate Hypothetical Document: Send your query to an LLM with a prompt like “Write a short passage that answers this question.” The LLM generates a plausible (but possibly inaccurate) answer.
2. Embed the Hypothetical: Create an embedding of the generated text, not the original query. This embedding represents what a good answer looks like.
3. Retrieve Similar Documents: Search your vector database using the hypothetical document’s embedding. Real documents with similar content score highly.
4. Generate Final Answer: Use the retrieved documents (not the hypothetical) to generate the actual response. The hypothetical was only for finding better documents.
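A minimal end-to-end sketch of these four steps, assuming the OpenAI Python SDK (v1+) and an in-memory corpus with precomputed document embeddings; the model names, prompt wording, and helper names here are illustrative choices, not prescribed by HyDE itself:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_hypothetical(query: str) -> str:
    # Step 1: ask an LLM for a plausible answer passage; factual accuracy
    # is not required, only document-like phrasing.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers this question:\n{query}",
        }],
    )
    return resp.choices[0].message.content

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hyde_retrieve(query: str, doc_texts: list[str],
                  doc_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Steps 2-3: embed the hypothetical (not the query) and run a
    # cosine-similarity search over the precomputed document embeddings.
    hyde_vec = embed(generate_hypothetical(query))
    sims = doc_vecs @ hyde_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(hyde_vec))
    top = np.argsort(-sims)[:k]
    # Step 4 happens downstream: the retrieved documents, not the
    # hypothetical, go into the final answer-generation prompt.
    return [doc_texts[i] for i in top]
```

Here `doc_vecs` would be a `(num_docs, dim)` array built by embedding each document with the same embedding model; a production system would use a vector database rather than brute-force similarity.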
Implementation Basics
Adding HyDE to your retrieval pipeline:
Prompt Design: Instruct the LLM to write a document-style answer, not a conversational response. A prompt like “Write a Wikipedia-style paragraph answering this question” works well.
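One illustrative way to phrase such a template; the exact wording is a suggestion, not part of the HyDE method:

```python
# A prompt template for the hypothetical-generation step; wording is
# an illustrative suggestion only.
HYDE_PROMPT = (
    "Write a Wikipedia-style paragraph answering this question. "
    "Use a neutral, declarative tone; do not address the reader.\n\n"
    "Question: {query}\n\nPassage:"
)

# Usage: HYDE_PROMPT.format(query="What causes inflation?")
```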
Model Choice: Faster, smaller models work fine for hypothetical generation, since factual accuracy doesn’t matter; only semantic similarity to real answers does.
Multiple Hypotheticals: Generate 2-3 hypothetical answers, embed each, and aggregate the retrieval results. This covers different ways to answer the question.
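One simple aggregation, reusing the `generate_hypothetical` and `embed` helpers from the sketch above: embed each hypothetical and average the vectors before searching. The choice of `n=3` is arbitrary:

```python
def hyde_vector(query: str, n: int = 3) -> np.ndarray:
    # Embed several independently generated hypotheticals and average them,
    # so one off-target generation doesn't dominate the search direction.
    vecs = [embed(generate_hypothetical(query)) for _ in range(n)]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)  # renormalize for cosine search
```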
Cost Consideration: HyDE adds an LLM call before every retrieval. For high-volume applications, that cost adds up; consider caching hypotheticals for common query patterns.
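A minimal in-process cache, assuming queries repeat verbatim after light normalization; a production system might instead key a shared store such as Redis on a hash of the normalized query:

```python
from functools import lru_cache

def cached_hypothetical(query: str) -> str:
    # Normalize trivial variants (case, whitespace) so they share a cache entry.
    return _generate_cached(" ".join(query.lower().split()))

@lru_cache(maxsize=10_000)
def _generate_cached(normalized_query: str) -> str:
    return generate_hypothetical(normalized_query)
```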
When to Use: HyDE works best for complex, conceptual queries where question-to-document similarity is weak. For simple factual lookups, standard retrieval often suffices.
Hybrid Approach: Combine HyDE results with standard query-embedding results, then rerank. This captures both direct matches and semantically related documents.
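A sketch of one such hybrid, built on the earlier helpers: retrieve with both the raw query embedding and the averaged HyDE vector, then fuse the two rankings with reciprocal rank fusion (RRF). RRF and its conventional constant of 60 are this sketch’s choice of reranker, not something the HyDE paper specifies:

```python
def hybrid_retrieve(query: str, doc_texts: list[str],
                    doc_vecs: np.ndarray, k: int = 5) -> list[str]:
    def ranking(vec: np.ndarray) -> list[int]:
        # Rank all documents by cosine similarity to the given vector.
        sims = doc_vecs @ vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(vec))
        return list(np.argsort(-sims))

    # Reciprocal rank fusion: documents ranked high by either method win.
    scores: dict[int, float] = {}
    for ranked in (ranking(embed(query)), ranking(hyde_vector(query))):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank + 1)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [doc_texts[i] for i in top]
```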
Source
Precise Zero-Shot Dense Retrieval without Relevance Labels - HyDE outperforms standard query embedding by generating hypothetical documents that better match document embedding space.
https://arxiv.org/abs/2212.10496