AI Logging and Observability: See Inside Your Systems


While everyone celebrates their AI demo working, few engineers can explain what happens when it doesn’t. Through building production AI systems, I’ve discovered that observability separates systems that scale from systems that fail mysteriously, and that AI systems need different observability than traditional applications.

You can’t debug what you can’t see. When an AI response is slow, wrong, or expensive, you need to know why. Was it the model? The retrieval? The prompt? The network? Without proper observability, you’re guessing. This guide covers what actually matters for AI systems.

Why AI Observability Is Different

Traditional observability tracks request counts, error rates, and latency. AI observability needs more:

Quality must be measured. AI response quality varies from request to request. You need to track whether outputs are actually good, not just whether they were produced.

Costs are significant. Every request has material cost. You need visibility into spending by feature, user, and request type.

Behavior is non-deterministic. The same input can produce different outputs. You need to track variations and their causes.

Debugging requires context. Understanding why an AI response was poor requires seeing the full context: prompt, retrieved documents, model response, post-processing.

For foundational architecture patterns, see my guide to AI system design.

Logging Strategy

What to Log

  • Request context: User ID, session ID, request ID, timestamp, source
  • Input data: Query, conversation history (sanitized), request parameters
  • Processing details: Model used, prompt template version, retrieval results
  • Output data: Response content, token counts, finish reason
  • Performance: Latency breakdown, queue wait time, processing time
  • Costs: Tokens consumed, estimated cost, model tier
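
To make that concrete, here is a minimal sketch of what a single request’s log record might look like as a Python dict. The field names and values are illustrative, not a fixed schema.

```python
log_record = {
    # Request context
    "request_id": "req-8f3a",
    "user_id": "u-123",
    "session_id": "s-456",
    "timestamp": "2025-01-15T10:32:07Z",
    "source": "web",
    # Input (sanitized) and processing details
    "query_chars": 142,
    "model": "large-model",
    "prompt_template": "answer_with_context.v3",
    "documents_retrieved": 5,
    # Output, performance, and cost
    "finish_reason": "stop",
    "input_tokens": 1200,
    "output_tokens": 310,
    "latency_ms": {"queue": 12, "retrieval": 85, "generation": 910},
    "estimated_cost_usd": 0.021,
}
```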

What Not to Log

  • Sensitive user content: PII, credentials, health information
  • Full conversation history in production: Summarize or hash instead
  • Binary data: Log references, not file contents
  • Excessive detail in hot paths: High-volume paths need minimal logging

Log Levels

  • DEBUG: Full prompt content, retrieval details, model parameters (useful for development, expensive in production)
  • INFO: Request summary, response summary, key metrics (standard production logging)
  • WARN: Degraded service, fallback activation, approaching limits
  • ERROR: Failed requests, unexpected model behavior, validation failures
  • CRITICAL: Service outages, data corruption, security events
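
As a small sketch of how the DEBUG/INFO split plays out in code (assuming Python’s standard logging; the logger name and fields are illustrative):

```python
import logging

logger = logging.getLogger("ai.generation")

def log_generation(prompt: str, response: str, total_tokens: int) -> None:
    # DEBUG carries the full prompt: invaluable in development, and dropped
    # entirely when production runs at INFO or above.
    logger.debug("full prompt: %s", prompt)
    # INFO carries only the compact summary that production dashboards need.
    logger.info("generation complete: tokens=%d response_chars=%d",
                total_tokens, len(response))
```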

Structured Logging

Always log structured data, not strings:
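
Here is a minimal sketch using Python’s standard logging with a JSON formatter; in practice you might reach for a library such as structlog, and the service and field names below are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered and aggregated."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "chat-api",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ai")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured: every field is individually queryable.
logger.info("llm_request", extra={"fields": {
    "request_id": "req-123", "user_id": "u-42", "model": "large-model", "latency_ms": 840,
}})
# Avoid string concatenation such as: logger.info(f"request req-123 took 840ms")
```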

Structured logs enable filtering, aggregation, and analysis. String concatenation produces logs that are hard to search and impossible to aggregate.

Include standard fields in every log entry: timestamp, service, request_id, user_id, level. This enables correlation across your system.

Metrics That Matter

Request Metrics

  • Throughput: Requests per second, by endpoint and model
  • Latency: P50, P95, P99 response times (averages hide problems)
  • Error rate: Failures by type (model error, validation error, timeout)
  • Queue depth: Jobs waiting for processing
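
A minimal sketch of instrumenting these with the prometheus_client library (metric and label names are my own; P50/P95/P99 come from the histogram on the metrics backend):

```python
from prometheus_client import Counter, Histogram

# Request metrics labeled by endpoint and model so dashboards can slice both ways.
REQUESTS = Counter("ai_requests_total", "AI requests", ["endpoint", "model", "status"])
LATENCY = Histogram(
    "ai_request_latency_seconds",
    "End-to-end request latency",
    ["endpoint", "model"],
    buckets=(0.25, 0.5, 1, 2, 5, 10, 30),
)

def record_request(endpoint: str, model: str, status: str, seconds: float) -> None:
    REQUESTS.labels(endpoint=endpoint, model=model, status=status).inc()
    LATENCY.labels(endpoint=endpoint, model=model).observe(seconds)
```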

AI-Specific Metrics

  • Token consumption: Input tokens, output tokens, total by model
  • Cost: Dollars spent, by model, by feature, by user tier
  • Quality scores: User feedback, automated evaluation scores
  • Retrieval metrics: Documents retrieved, relevance scores, hit rates
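
Cost is typically derived from token counts at log time. A minimal sketch, with hypothetical per-1K-token prices (real prices come from your provider’s price sheet):

```python
# Hypothetical prices per 1K tokens; substitute your provider's actual rates.
PRICE_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Emit this alongside the request log so spend can be aggregated by feature and user tier.
cost = estimate_cost_usd("large-model", input_tokens=1200, output_tokens=300)  # 0.021
```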

Resource Metrics

  • API quota utilization: Percentage of rate limits consumed
  • Cache performance: Hit rate, miss rate, eviction rate
  • Infrastructure: CPU, memory, GPU utilization

Business Metrics

  • Conversion from AI features: Do AI features drive business outcomes?
  • User engagement: Time spent, interactions, return rate
  • Support tickets: Are AI features generating support load?

Distributed Tracing

AI systems are distributed. Requests touch multiple services. Tracing shows the complete picture.

Trace Structure

  • Trace: Complete request journey from user to response
  • Span: Individual operation within a trace (embedding, retrieval, generation)
  • Context: Metadata that flows through the trace

What to Trace

  • API request span: Total request handling time
  • Embedding span: Time to generate embeddings, model used
  • Retrieval span: Vector search time, documents returned
  • LLM span: Model inference time, token counts
  • Post-processing span: Response formatting, validation

Trace Context

Include AI-specific context in traces:

  • Prompt template version: Which prompt template generated this response?
  • Model parameters: Temperature, max_tokens, stop sequences
  • Retrieval configuration: Top-k, filters, reranking settings
  • Feature flags: Which experimental features were active?
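
A minimal sketch of attaching this context with the OpenTelemetry Python API (the span and attribute names are illustrative, and retrieve/call_model are placeholders for your own retrieval and LLM clients):

```python
from opentelemetry import trace

tracer = trace.get_tracer("ai-service")

def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]        # placeholder for your vector search

def call_model(query: str, docs: list[str]) -> str:
    return "generated answer"        # placeholder for your LLM client

def generate_answer(query: str) -> str:
    # One parent span for the request, child spans for retrieval and generation.
    with tracer.start_as_current_span("handle_request") as request_span:
        request_span.set_attribute("prompt.template_version", "v3")
        request_span.set_attribute("feature_flags", "rerank_experiment")

        with tracer.start_as_current_span("retrieval") as retrieval_span:
            docs = retrieve(query)
            retrieval_span.set_attribute("retrieval.top_k", 5)
            retrieval_span.set_attribute("retrieval.documents_returned", len(docs))

        with tracer.start_as_current_span("llm_generation") as llm_span:
            llm_span.set_attribute("llm.temperature", 0.2)
            llm_span.set_attribute("llm.max_tokens", 512)
            return call_model(query, docs)
```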

Implementation

Popular tracing solutions work for AI systems:

  • OpenTelemetry: Vendor-neutral, comprehensive, growing ecosystem
  • Jaeger/Zipkin: Open source, self-hosted options
  • Datadog/New Relic: Managed solutions with AI-specific features

Choose based on your existing infrastructure. Consistency matters more than specific tool choice.

Debugging AI Issues

When things go wrong, how do you diagnose?

The Debugging Framework

Step 1: Identify the symptom

  • Response quality poor?
  • Latency too high?
  • Costs too high?
  • Errors occurring?

Step 2: Locate the component

  • Use traces to identify which span has problems
  • Check metrics to see if issue is systemic or isolated

Step 3: Examine the details

  • Pull logs for specific requests
  • Review prompts, retrievals, model responses
  • Compare to working requests

Step 4: Form hypotheses

  • Is the prompt unclear?
  • Is retrieval returning irrelevant documents?
  • Is the model appropriate for this task?

Step 5: Test fixes

  • Make targeted changes
  • Compare new behavior to old
  • Monitor for improvement

Common AI Debugging Scenarios

Poor response quality:

  • Check retrieved documents: are they relevant?
  • Review the prompt: is the instruction clear?
  • Examine the model output: what went wrong?
  • Compare to successful responses: what’s different?

High latency:

  • Check trace spans: which operation is slow?
  • Review queue metrics: is there a backlog?
  • Check external API latency: is the provider slow?
  • Review batch sizes: are requests too large?

Cost spikes:

  • Check token consumption: which requests are expensive?
  • Review model routing: are expensive models being used unnecessarily?
  • Check for retry storms: are failures causing repeat requests?
  • Review caching: are cache hits declining?

Intermittent failures:

  • Check error logs: what’s the failure mode?
  • Review rate limit status: are you hitting limits?
  • Check circuit breaker state: is a service degraded?
  • Review timing: do failures correlate with load patterns?

Alerting Strategy

What to Alert On

Alert on symptoms, not causes:

  • “Response latency P95 > 5 seconds” (symptom)
  • Not “CPU usage > 80%” (cause that might be fine)

Alert on user impact:

  • “Error rate > 5%” (users are affected)
  • “Response quality score declining” (users getting worse results)

Alert on cost anomalies:

  • “Daily spend 2x normal” (potential runaway costs)
  • “Cost per request increasing” (efficiency degradation)
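
As a sketch of the “daily spend 2x normal” check above (the threshold and the wiring to your alerting tool are assumptions, not a prescription):

```python
def daily_spend_anomalous(today_usd: float, trailing_7day_avg_usd: float, factor: float = 2.0) -> bool:
    """Return True when today's spend exceeds `factor` times the recent baseline."""
    return today_usd > factor * trailing_7day_avg_usd

if daily_spend_anomalous(today_usd=240.0, trailing_7day_avg_usd=95.0):
    # In a real system this would page or notify through your alerting tool.
    print("ALERT: daily AI spend is more than 2x the recent baseline")
```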

Alert Severity

Page (immediate response):

  • Complete service outage
  • Security incidents
  • Data corruption

Notify (respond within hours):

  • Degraded service quality
  • Approaching rate limits
  • Circuit breaker trips

Track (respond within days):

  • Gradual performance degradation
  • Cost trend changes
  • Quality metric decline

Avoiding Alert Fatigue

  • Alert on actionable conditions: if there’s no clear action, it’s not an alert.
  • Set appropriate thresholds: alerts that fire constantly get ignored.
  • Group related alerts: one incident, one alert, not dozens.
  • Include context in alerts: what happened, what’s the impact, where to start investigating.

Log Aggregation and Analysis

Log Storage

  • Short-term (days to weeks): Fast storage for active debugging. This is your primary interface.
  • Medium-term (months): Cheaper storage for trend analysis and incident review.
  • Long-term (years): Cold storage for compliance and historical analysis.

Implement automatic tiering to manage costs while maintaining access.

Log Analysis Patterns

  • Anomaly detection: Automatically identify unusual patterns
  • Trend analysis: Track metrics over time
  • Correlation: Link logs across services
  • Search: Find specific requests by attributes

Tools

  • ELK Stack (Elasticsearch, Logstash, Kibana): Powerful, flexible, self-hosted
  • Splunk: Enterprise-grade, expensive, comprehensive
  • Datadog/Sumo Logic: Managed solutions, easier to operate
  • Loki: Lightweight, Prometheus-native, cost-effective

Choose based on team expertise, budget, and existing infrastructure.

AI-Specific Observability

Prompt Observability

Log prompt templates separately from populated prompts. Storing the template version plus the variable values enables reconstruction without logging full prompts.
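
A minimal sketch of that pattern (the template, its version key, and the fields convention from the structured-logging example above are all illustrative):

```python
import logging

logger = logging.getLogger("ai.prompts")

PROMPT_TEMPLATES = {
    # Hypothetical template; the version lives in the key.
    "answer_with_context.v3": (
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}

def render_and_log(template_id: str, **variables: str) -> str:
    prompt = PROMPT_TEMPLATES[template_id].format(**variables)
    # Log the template ID and the variables; the populated prompt can be
    # reconstructed later from these two pieces.
    logger.info("prompt_rendered", extra={"fields": {
        "template_id": template_id,
        "variables": variables,
    }})
    return prompt
```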

Track prompt performance by template version. When you change prompts, compare quality and cost metrics.

Store prompt examples for debugging. When quality degrades, compare current prompts to working examples.

Retrieval Observability

Log retrieval queries and results. What did the system search for? What did it find?

Track relevance scores. Are retrieved documents actually relevant?

Monitor retrieval latency independently. Separate retrieval from generation timing.
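
A minimal sketch of a retrieval log entry covering these three points (the result shape and the fields convention are assumptions; adapt to your vector store client):

```python
import hashlib
import logging
import time

logger = logging.getLogger("ai.retrieval")

def log_retrieval(query: str, top_k: int, results: list[dict], latency_ms: float) -> None:
    # Hash the query so raw user text never lands in the logs.
    query_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
    logger.info("retrieval", extra={"fields": {
        "query_hash": query_hash,
        "top_k": top_k,
        "documents_returned": len(results),
        "relevance_scores": [round(r["score"], 3) for r in results],
        "latency_ms": round(latency_ms, 1),
    }})

start = time.perf_counter()
results = [{"id": "doc-1", "score": 0.82}, {"id": "doc-2", "score": 0.47}]  # placeholder search results
log_retrieval("example query", top_k=5, results=results, latency_ms=(time.perf_counter() - start) * 1000)
```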

Model Observability

Track model behavior by version. Model updates can change behavior. Track metrics separately.

Monitor for model drift. The same prompts might produce different results over time.

Log model parameters with responses. Temperature, max_tokens, and other settings affect output.

My guide to AI system monitoring covers additional patterns.

Privacy and Compliance

Log Sanitization

Remove PII before logging. Names, emails, phone numbers, addresses should be redacted or hashed.
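
A minimal redaction sketch using regular expressions; real systems typically rely on a dedicated PII detection library, and these patterns are illustrative rather than exhaustive:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace obvious email addresses and phone numbers before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

redact("Contact me at jane.doe@example.com or +1 (555) 123-4567")
# -> 'Contact me at [EMAIL] or [PHONE]'
```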

Be careful with user content. User queries might contain sensitive information.

Implement retention policies. Don’t keep logs longer than necessary.

Access Controls

Restrict access to detailed logs. Not everyone needs to see full prompts and responses.

Audit log access. Track who views sensitive logs.

Implement role-based access. Different roles need different log visibility.

Compliance Considerations

  • GDPR/CCPA: Users can request their data. Can you identify and export their logs?
  • HIPAA: Health information has specific logging requirements.
  • SOC 2: Audit requirements affect log retention and access.

Build compliance into your logging strategy from the start. Retrofitting is painful.

Implementation Roadmap

If you’re starting from scratch:

Week 1-2: Basic logging infrastructure

  • Structured logging library
  • Log aggregation (ELK, managed service)
  • Basic dashboards

Week 3-4: AI-specific logging

  • Token counting and cost tracking
  • Prompt template logging
  • Retrieval metrics

Week 5-6: Tracing

  • Distributed tracing setup
  • AI span instrumentation
  • Trace visualization

Week 7-8: Alerting

  • Critical alerts (outages, errors)
  • Cost alerts
  • Quality alerts

Ongoing: Refinement

  • Alert threshold tuning
  • Dashboard improvement
  • Additional metrics

Build incrementally. Basic visibility early beats comprehensive observability never.

The Observability Mindset

Good observability isn’t about logging everything. It’s about being able to answer questions quickly when things go wrong.

When a user reports a problem, can you find their request? When latency spikes, can you identify the cause? When costs increase, can you trace the source?

If you can answer these questions quickly, your observability is working. If you can’t, keep building.

Ready to build observable AI systems? Watch implementation tutorials on my YouTube channel for hands-on guidance. And join the AI Engineering community to discuss observability patterns with other engineers building production AI systems.

Zen van Riel

Senior AI Engineer at GitHub | Ex-Microsoft

I grew from intern to Senior Engineer at GitHub, previously working at Microsoft. Now I teach 22,000+ engineers on YouTube, reaching hundreds of thousands of developers with practical AI engineering tutorials. My blog posts are generated from my own video content, focusing on real-world implementation over theory.
