RAG with Structured Data: Integrating Databases, APIs, and Knowledge Graphs
Traditional RAG tutorials focus on documents: PDFs, markdown files, web pages. But real enterprise systems need to query structured data too: customer databases, product catalogs, transaction histories, and CRM records. Through building RAG systems that bridge structured and unstructured data, I’ve developed patterns that unlock both simultaneously.
The challenge isn’t just technical. Structured and unstructured data have fundamentally different query patterns. “What does our refund policy say?” searches documents. “What’s John Smith’s order history?” queries a database. Production systems need both, and users shouldn’t have to know which data source answers their question.
The Structured Data Challenge
Standard RAG assumes text retrieval: embed documents, store vectors, retrieve by similarity. Structured data breaks these assumptions:
Databases have schemas. Relationships between tables matter. Customer data links to orders links to products. Vector similarity doesn’t capture relational queries.
Aggregations require computation. “What’s our total revenue this quarter?” needs SUM operations, not document retrieval.
Exact matches matter. “Find customer ID 12345” isn’t a similarity search; it’s a lookup.
Real-time data changes constantly. A customer’s order status updates hourly. Embedding-based RAG can’t keep up with transactional data.
Forcing structured queries through document RAG produces terrible results. You need architectures designed for both paradigms.
Architecture Patterns for Structured RAG
Several patterns address structured data integration:
Pattern 1: Query Routing
Route queries to appropriate backends based on intent:
Query classifier determines whether a query needs documents, structured data, or both.
Document path uses standard RAG for unstructured content.
Structured path translates queries to SQL, API calls, or graph queries.
Combined path retrieves from both and merges results.
Implementation:
- Train a classifier on query examples labeled by data source
- Route to appropriate retrieval pipeline
- Format results consistently regardless of source
- Generate responses from unified context
This works well when query types are distinguishable. It struggles with ambiguous queries that could go either way.
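The routing step above can be sketched with simple keyword heuristics. This is a minimal illustration; the hint sets below are hypothetical, and a production router would use a trained classifier or an LLM call instead of substring matching.

```python
from enum import Enum

class Route(Enum):
    DOCUMENTS = "documents"
    STRUCTURED = "structured"
    COMBINED = "combined"

# Hypothetical hint lists; crude substring matching stands in
# for a real classifier trained on labeled query examples.
STRUCTURED_HINTS = {"how many", "total", "top", "revenue", "order"}
DOCUMENT_HINTS = {"policy", "explain", "what does", "describe"}

def route_query(query: str) -> Route:
    q = query.lower()
    wants_structured = any(h in q for h in STRUCTURED_HINTS)
    wants_documents = any(h in q for h in DOCUMENT_HINTS)
    if wants_structured and wants_documents:
        return Route.COMBINED
    if wants_structured:
        return Route.STRUCTURED
    return Route.DOCUMENTS
```

Whatever classifier you use, keeping the output a small closed enum makes the downstream pipeline easy to dispatch on.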
Pattern 2: Text-to-SQL for Database Access
Convert natural language to SQL queries:
Schema understanding teaches the model your database structure.
Query generation produces SQL from natural language.
Execution runs queries against your database.
Result formatting converts tabular results to context for generation.
This enables queries like:
- “How many orders did we process last month?”
- “Who are our top 10 customers by revenue?”
- “What products have low inventory?”
Text-to-SQL works best with well-designed schemas and clear column naming. Complex joins and multi-step queries remain challenging.
Pattern 3: Knowledge Graph Integration
Knowledge graphs capture entity relationships explicitly:
Entity extraction identifies entities in queries (people, products, concepts).
Graph traversal finds related entities and their properties.
Context construction assembles relevant subgraph as RAG context.
Hybrid retrieval combines graph context with document retrieval.
Knowledge graphs excel at relationship queries:
- “Who reports to the VP of Engineering?”
- “What products are compatible with X?”
- “How are these two customers related?”
Building and maintaining knowledge graphs requires significant effort, but they enable query types that neither documents nor SQL handle well.
Pattern 4: API-Based Retrieval
For data behind APIs (CRMs, external services, real-time systems):
API catalog documents available endpoints and their capabilities.
Query decomposition breaks complex queries into API calls.
Orchestration executes API calls, potentially in sequence.
Result aggregation combines API responses into coherent context.
This pattern handles:
- CRM data (Salesforce, HubSpot)
- Real-time metrics (monitoring systems)
- External data (weather, stocks, third-party services)
Implementation: Text-to-SQL Deep Dive
Text-to-SQL is the most common structured data integration. Here’s how to implement it well:
Schema Representation
The LLM needs to understand your database:
Schema as context includes table and column definitions in the prompt:
Tables:
- customers (id, name, email, created_at)
- orders (id, customer_id, total, status, order_date)
- order_items (id, order_id, product_id, quantity, price)
Schema descriptions add meaning to column names. “created_at: timestamp when customer account was created.”
Example queries show how to query common patterns. Few-shot examples dramatically improve SQL quality.
Relationship mapping explains foreign keys and joins. “orders.customer_id references customers.id.”
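The schema-as-context idea can be automated by introspecting the database. Here’s a minimal sketch using SQLite’s catalog tables; a real version would also attach the hand-written column descriptions and relationship notes discussed above.

```python
import sqlite3

def schema_as_context(conn: sqlite3.Connection) -> str:
    """Render table and column definitions for inclusion in the prompt."""
    lines = ["Tables:"]
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, ...) per column.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        lines.append(f"- {table} ({', '.join(cols)})")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
print(schema_as_context(conn))
```

Generating the schema block from the live database keeps the prompt in sync as tables evolve.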
Query Generation
Generate SQL safely and accurately:
Iterative refinement generates SQL, validates syntax, retries if needed.
Explain before query asks the model to explain its approach before writing SQL. This catches misunderstandings early.
Query validation checks generated SQL for syntax errors, unauthorized tables, and dangerous operations.
Parameterization prevents SQL injection by using parameters rather than string interpolation.
Read-only enforcement ensures queries can’t modify data.
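The validation, parameterization, and read-only checks can be sketched as a small gate in front of query execution. The allowed-table list is an assumption from the example schema; a production system might also use database-level read-only credentials as a second line of defense.

```python
import re
import sqlite3

ALLOWED_TABLES = {"customers", "orders", "order_items"}  # assumed allowlist
WRITE_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "create"}

def validate_sql(sql: str) -> None:
    tokens = re.findall(r"[a-z_]+", sql.lower())
    # Read-only enforcement: only SELECT, no write keywords anywhere.
    if not tokens or tokens[0] != "select":
        raise ValueError("only SELECT statements are allowed")
    if WRITE_KEYWORDS & set(tokens):
        raise ValueError("write operation rejected")
    # Unauthorized-table check on FROM/JOIN targets.
    for kw, nxt in zip(tokens, tokens[1:]):
        if kw in ("from", "join") and nxt not in ALLOWED_TABLES:
            raise ValueError(f"table not allowed: {nxt}")

def run_readonly(conn: sqlite3.Connection, sql: str, params: tuple = ()):
    validate_sql(sql)
    # Parameters, never string interpolation, to block SQL injection.
    return conn.execute(sql, params).fetchall()
```

Token-level checks like this are coarse; they complement, rather than replace, database permissions.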
Handling Query Failures
Text-to-SQL doesn’t always work:
Syntax errors trigger regeneration with error feedback.
Empty results prompt clarification: “No customers found with that name. Did you mean…?”
Ambiguous queries ask for clarification rather than guessing.
Capability limits acknowledge when the database can’t answer the query.
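The error-feedback loop for syntax failures can be sketched as follows. Here `generate_sql` is a hypothetical stand-in for the LLM call; on each retry it receives the previous error message as feedback.

```python
import sqlite3

def generate_with_retry(generate_sql, conn, question, max_attempts=3):
    """Regenerate SQL with error feedback until it executes or we give up."""
    last_error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, last_error)  # LLM call in practice
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            last_error = str(exc)  # fed back into the next generation
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

Capping attempts matters: an unbounded loop burns tokens on queries the model will never get right, which is exactly when you should fall back to asking the user for clarification.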
Result Presentation
Database results need formatting for LLM context:
Tabular summaries work for small result sets.
Aggregation for large results summarizes rather than including thousands of rows.
Highlighting key findings emphasizes relevant data points.
Source attribution notes that data comes from the database, not documents.
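A minimal formatter covering these points might look like this; the `[source: database]` label and the 20-row cutoff are illustrative choices, not fixed rules.

```python
def format_results(columns, rows, max_rows=20):
    """Render database rows as LLM context with source attribution."""
    if not rows:
        return "[source: database] The query returned no rows."
    if len(rows) > max_rows:
        # Summarize rather than dump thousands of rows into context.
        return (f"[source: database] {len(rows)} rows matched; "
                f"showing the first: {dict(zip(columns, rows[0]))}")
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in rows)
    return f"[source: database]\n{header}\n{body}"
```

For large result sets, a better summary than “first row” is usually an aggregate computed in SQL (COUNT, SUM, GROUP BY) before results ever reach the formatter.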
For more on structured data handling, see my hybrid database solutions guide.
Implementation: Knowledge Graph Integration
Knowledge graphs provide structured relationship data:
Graph Schema Design
Design graphs that support your query patterns:
Entity types represent the things you care about: customers, products, employees, concepts.
Relationship types capture how entities connect: purchased, manages, relates_to.
Properties store attributes on nodes and edges.
Design for queries prioritizes relationships you’ll actually traverse.
Query Translation
Convert natural language to graph queries:
Entity recognition identifies entities in queries.
Intent classification determines the relationship being queried.
Traversal planning determines how to navigate the graph.
Query generation produces Cypher, SPARQL, or your graph query language.
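For common intents, query generation can start as a template library rather than free-form generation. The intent names, labels, and relationship types below are hypothetical; a real system would constrain an LLM with the actual graph schema.

```python
# Hypothetical intent -> Cypher templates with parameterized entities.
CYPHER_TEMPLATES = {
    "reports_to": (
        "MATCH (e:Employee)-[:REPORTS_TO]->(m:Employee {name: $name}) "
        "RETURN e.name"
    ),
    "compatible_with": (
        "MATCH (p:Product)-[:COMPATIBLE_WITH]->(:Product {name: $name}) "
        "RETURN p.name"
    ),
}

def to_graph_query(intent: str, entity: str) -> tuple[str, dict]:
    """Map a classified intent and extracted entity to a Cypher query."""
    if intent not in CYPHER_TEMPLATES:
        raise ValueError(f"unsupported intent: {intent}")
    return CYPHER_TEMPLATES[intent], {"name": entity}
```

Templates trade coverage for reliability: they can’t answer arbitrary questions, but they never produce malformed traversals for the intents they do cover.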
Hybrid Document-Graph Retrieval
Combine graph and document retrieval:
- Extract entities from query
- Retrieve relevant subgraph around those entities
- Use graph context to focus document retrieval
- Combine graph facts and document content as context
This hybrid approach answers questions that neither alone could handle: “What’s our policy on refunds for enterprise customers?” needs the refund policy (document) filtered by enterprise customer context (graph).
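The four steps above can be wired together as a small pipeline. All four callables here are stand-ins for real components (an entity extractor, a graph client, a vector search), so this is shape, not implementation.

```python
def hybrid_retrieve(query, extract_entities, graph_neighbors, search_docs):
    """Combine graph facts with graph-focused document retrieval."""
    entities = extract_entities(query)
    graph_facts = [fact for e in entities for fact in graph_neighbors(e)]
    # Use graph context to focus document retrieval.
    focused_query = query + " " + " ".join(graph_facts)
    documents = search_docs(focused_query)
    return {"graph_facts": graph_facts, "documents": documents}
```

The key move is the query expansion: facts pulled from the graph (e.g. the customer’s tier) steer document retrieval toward the right policy text.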
Implementation: Multi-Source Orchestration
Real systems query multiple sources for single questions:
Query Decomposition
Break complex queries into components:
“What was the revenue trend for the products mentioned in last week’s marketing email?”
Decomposes to:
- Find last week’s marketing email (document search)
- Extract mentioned products (text processing)
- Query revenue by product over time (SQL)
- Analyze trend (computation)
Execution Orchestration
Coordinate multi-source queries:
Dependency analysis determines execution order. Step 2 depends on Step 1.
Parallel execution runs independent queries simultaneously.
Result passing feeds outputs from earlier steps to later steps.
Error handling manages partial failures gracefully.
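The decomposition and orchestration steps above can be sketched with the standard library’s topological sorter, which handles dependency ordering for free. The step names and stub functions mirror the marketing-email example and are purely illustrative.

```python
from graphlib import TopologicalSorter

def run_plan(steps: dict[str, dict]) -> dict:
    """Execute steps in dependency order.

    steps maps a name to {"fn": callable, "deps": [names]}; each step
    receives its dependencies' results as keyword arguments.
    """
    order = TopologicalSorter(
        {name: step["deps"] for name, step in steps.items()}
    ).static_order()
    results = {}
    for name in order:
        step = steps[name]
        results[name] = step["fn"](**{d: results[d] for d in step["deps"]})
    return results
```

`static_order` serializes everything; for parallel execution of independent steps, `TopologicalSorter` also exposes a `get_ready`/`done` loop you can drive from a thread pool.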
Context Assembly
Combine results from multiple sources:
Source labeling identifies where each piece of information came from.
Deduplication removes redundant information across sources.
Relevance ranking prioritizes most relevant content when context is limited.
Coherent formatting presents multi-source context in digestible form.
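A minimal assembler covering labeling, deduplication, and ranking under a context budget; the character budget and exact-match dedup are simplifying assumptions (real systems often dedupe on near-duplicates and budget in tokens).

```python
def assemble_context(chunks, max_chars=2000):
    """chunks: list of (source, text, score). Rank, dedupe, and label."""
    seen = set()
    ranked = sorted(chunks, key=lambda c: c[2], reverse=True)
    parts = []
    used = 0
    for source, text, score in ranked:
        if text in seen:  # drop redundant content across sources
            continue
        seen.add(text)
        labeled = f"[{source}] {text}"  # source attribution
        if used + len(labeled) > max_chars:
            break  # relevance ranking decides what survives the budget
        parts.append(labeled)
        used += len(labeled)
    return "\n".join(parts)
```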
Data Freshness and Synchronization
Structured data changes. Keep your RAG system current:
Real-Time vs. Indexed Data
Different data has different freshness requirements:
Transactional data (orders, inventory, tickets) needs real-time queries. Don’t embed this; query it live.
Slowly changing data (product catalog, employee directory) can be indexed but needs update mechanisms.
Static reference data (policies, documentation) works with standard RAG indexing.
Synchronization Strategies
Keep indexed data current:
Change data capture detects database changes and triggers reindexing.
Scheduled refresh reindexes on cadence (hourly, daily).
Hybrid freshness queries real-time for critical fields, uses cached data for others.
Invalidation removes outdated indexed content when source data changes.
Staleness Handling
When data might be stale:
Timestamp display shows when information was last updated.
Freshness warnings alert users when data might be outdated.
Live verification double-checks critical information against live sources.
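Timestamp display and freshness warnings together are a few lines; the 24-hour threshold here is an assumed default you would tune per data source.

```python
from datetime import datetime, timedelta, timezone

STALENESS_LIMIT = timedelta(hours=24)  # assumed threshold, tune per source

def with_freshness(answer: str, last_updated: datetime) -> str:
    """Attach a timestamp and warn when the data may be stale."""
    age = datetime.now(timezone.utc) - last_updated
    stamped = f"{answer} (data as of {last_updated:%Y-%m-%d %H:%M} UTC)"
    if age > STALENESS_LIMIT:
        stamped += " Warning: this data may be outdated."
    return stamped
```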
Security and Access Control
Structured data often has access restrictions:
Query-Level Authorization
Enforce data access at query time:
User context includes permissions in the query pipeline.
Row-level security filters database results by user access.
Field-level security redacts sensitive columns.
Query auditing logs data access for compliance.
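Row- and field-level security can be sketched as a wrapper around execution. The region-based row filter and the sensitive-column set are hypothetical policies; real deployments usually push row-level security into the database itself (e.g. Postgres RLS) and keep this layer as defense in depth.

```python
import sqlite3

SENSITIVE_COLUMNS = {"email"}  # hypothetical field-level policy

def run_authorized(conn: sqlite3.Connection, user: dict, table: str):
    # Row-level security: only rows in the user's own region.
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE region = ?", (user["region"],)
    )
    columns = [d[0] for d in cur.description]
    redacted = []
    for row in cur.fetchall():
        # Field-level security: redact sensitive columns before the
        # values can reach the LLM context (or the audit log).
        redacted.append({
            col: "<redacted>" if col in SENSITIVE_COLUMNS else value
            for col, value in zip(columns, row)
        })
    return redacted
```

Redacting before context assembly matters: once a sensitive value enters the prompt, you can no longer guarantee it stays out of the generated answer.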
Cross-Source Authorization
Different sources have different permissions:
Permission intersection ensures users have access to all sources a query touches.
Fallback handling provides partial answers when full access isn’t available.
Clear communication explains what the user can and can’t see.
Quality Considerations
Structured RAG has unique quality challenges:
Answer Accuracy
Database queries must return correct data:
SQL verification tests generated queries against known results.
Calculation validation checks aggregations and computations.
Type checking ensures numeric and date handling is correct.
Explanation Quality
Help users understand data-driven answers:
Show your work includes the query or calculation that produced results.
Confidence indicators distinguish precise database answers from interpreted document content.
Source transparency clarifies where numbers come from.
Handling Limitations
Be honest about what the system can and can’t do:
Query complexity limits acknowledge when queries exceed capability.
Data coverage gaps explain when relevant data doesn’t exist.
Uncertainty communication distinguishes facts from inferences.
Building Your Structured RAG System
Start with clear use cases:
- Inventory your data sources and their query patterns
- Identify common queries that need structured data
- Start with one integration (probably SQL) and prove value
- Add sources incrementally as needs emerge
- Build query routing to direct traffic appropriately
For related implementation patterns, see my production RAG guide and RAG implementation tutorial.
The systems that deliver the most value combine unstructured knowledge with structured data. Documents tell users how and why. Databases tell them what and how much. Together, they answer questions neither could handle alone.
Ready to build RAG systems that bridge structured and unstructured data? Join the AI Engineering community where engineers share integration patterns and help each other build production-grade systems.