NGINX for AI API Serving: Configuration and Best Practices


While AI engineers often focus on application code and model optimization, the infrastructure layer determines whether your APIs survive production traffic. NGINX sits between users and your AI services, handling concerns that application code shouldn’t manage directly.

Through deploying AI applications at scale, I’ve learned that proper NGINX configuration can double your effective capacity, reduce user-facing latency, and protect your inference services from traffic spikes.

Why NGINX for AI Applications

NGINX solves problems specific to AI API serving:

Long-running connections during streaming LLM responses. Default server configurations often time out before responses complete.

High memory usage by inference services means you can’t run many workers. NGINX handles connection management externally.

Uneven request costs, where one prompt might take 10 ms and another 30 seconds. Load balancing must account for this variance.

Rate limiting to protect expensive inference resources from abuse.

Basic Reverse Proxy Setup

The foundation is proxying requests to your AI service.

Upstream Configuration

Define your backend AI service:

Keepalive connections reduce latency by reusing connections. Configure keepalive count based on expected concurrency.

Timeout settings must accommodate long inference times. Default timeouts are too short for LLM responses.

Health checks ensure traffic only routes to healthy backends. NGINX Plus offers active health checks; open source uses passive.
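
Here’s a minimal sketch of what that upstream block might look like. The addresses, port, and keepalive count are placeholders for your own deployment:

    upstream ai_backend {
        server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;   # passive health checking
        server 10.0.0.12:8000 max_fails=3 fail_timeout=30s;
        keepalive 32;   # idle upstream connections kept open for reuse
    }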

Location Blocks

Route requests to your AI endpoints:

Path-based routing separates inference from other APIs. Different endpoints might need different timeout configurations.

Header management passes necessary context to backends. Client IP, request IDs, and content types.

Buffer settings control how responses are handled. Streaming requires specific buffer configuration.
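
A location block for the inference API might look roughly like this, assuming the ai_backend upstream above; the path and custom header are examples, not requirements:

    location /v1/ {
        proxy_pass http://ai_backend;
        proxy_http_version 1.1;                        # required for upstream keepalive
        proxy_set_header Connection "";                # clear Connection so keepalive works
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Request-ID $request_id;     # correlation ID for tracing
    }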

Streaming Response Configuration

LLM responses stream over seconds or minutes. NGINX must neither buffer these responses nor time out the connections.

Disabling Buffering

For streaming responses:

proxy_buffering off prevents NGINX from buffering the response. Tokens flow directly to clients.

proxy_cache off disables caching for streaming endpoints. Each response is unique and shouldn’t be cached.

chunked_transfer_encoding on keeps chunked responses enabled (it is the default), which Server-Sent Events rely on.
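
Putting those directives together, a streaming endpoint could be configured like this (the path is illustrative):

    location /v1/chat/completions {
        proxy_pass http://ai_backend;
        proxy_buffering off;              # pass tokens to the client as they arrive
        proxy_cache off;                  # never cache streamed responses
        chunked_transfer_encoding on;     # the default, shown here for clarity
    }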

Timeout Configuration

Streaming connections need generous timeouts:

proxy_read_timeout must exceed maximum generation time. 5-10 minutes is common for long contexts.

proxy_send_timeout covers transmitting the request to the backend; slow clients downloading the response are governed by send_timeout. Set both with your slowest expected consumers in mind.

keepalive_timeout maintains connections between requests. Important for conversational applications.
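
Example values only; tune them to your longest generations and slowest clients:

    proxy_connect_timeout 10s;     # backends should accept connections quickly
    proxy_send_timeout    60s;     # transmitting the request to the backend
    proxy_read_timeout    600s;    # allow up to 10 minutes of generation
    send_timeout          600s;    # slow clients receiving the stream
    keepalive_timeout     75s;     # keep client connections open between requests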

SSE-Specific Settings

Server-Sent Events have additional requirements:

Connection: keep-alive header must be preserved.

Cache-Control: no-cache prevents intermediate caching.

Content-Type: text/event-stream identifies SSE responses.
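
A sketch of an SSE location. The Content-Type and Cache-Control headers normally come from your backend, which can also send X-Accel-Buffering: no to disable buffering per response:

    location /v1/events {
        proxy_pass http://ai_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # keep the upstream connection alive
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;          # SSE connections stay open a long time
    }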

Load Balancing Strategies

AI workloads benefit from intelligent load balancing.

Round Robin Limitations

Simple round robin fails for AI workloads:

Request costs vary dramatically. A simple classification takes milliseconds; document summarization takes minutes.

Worker availability matters. Sending requests to a busy worker queues them unnecessarily.

Least Connections

Least connections works better:

Routes to least busy server based on active connections. Naturally balances uneven request costs.

Requires connection tracking by NGINX. Minimal overhead in practice.

Works well with the variable latency of AI inference.
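
Enabling it is a single directive at the top of the upstream block:

    upstream ai_backend {
        least_conn;                       # route new requests to the least-loaded server
        server 10.0.0.11:8000;
        server 10.0.0.12:8000;
        keepalive 32;
    }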

Weighted Distribution

When servers have different capabilities:

Weight by GPU capacity. A server with 2 GPUs should receive twice the traffic.

Adjust for model differences. Smaller models on some servers handle more requests.

Monitor and tune. Initial weights often need adjustment based on actual performance.
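
Weights combine with least_conn; the hostnames here are placeholders:

    upstream ai_backend {
        least_conn;
        server gpu-2x.internal:8000 weight=2;   # 2-GPU box gets twice the traffic
        server gpu-1x.internal:8000 weight=1;
        keepalive 32;
    }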

IP Hash

For stateful AI applications:

Same client routes to same server. Useful for session-based applications.

Conversation continuity when servers cache conversation state.

Consider sticky sessions for multi-turn chat applications.
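
Open source NGINX offers ip_hash; cookie-based sticky sessions are an NGINX Plus feature:

    upstream ai_backend {
        ip_hash;                          # same client IP maps to the same server
        server 10.0.0.11:8000;
        server 10.0.0.12:8000;
    }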

Rate Limiting

Protect expensive inference resources from abuse.

Request Rate Limiting

Limit requests per client:

limit_req_zone defines the rate limit parameters. Key by IP, API key, or user identifier.

limit_req applies the limit to specific locations. Different endpoints might need different limits.

Burst handling allows temporary spikes. Configure burst size for legitimate usage patterns.
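
A sketch keyed by client IP; keying by API key would use a variable such as $http_x_api_key instead:

    # in the http block: 10 requests per second per client IP
    limit_req_zone $binary_remote_addr zone=ai_rl:10m rate=10r/s;

    # in the inference location
    location /v1/ {
        limit_req zone=ai_rl burst=20 nodelay;   # absorb short spikes without queueing delay
        proxy_pass http://ai_backend;
    }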

Connection Limiting

Limit concurrent connections:

limit_conn_zone tracks connections per key. Prevents single clients from monopolizing resources.

limit_conn sets the maximum concurrent connections. Balance between legitimate use and protection.
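
For example, limiting each client IP to a handful of concurrent streams:

    # in the http block
    limit_conn_zone $binary_remote_addr zone=ai_conn:10m;

    # in the inference location
    limit_conn ai_conn 5;    # at most 5 concurrent connections per client IP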

Cost-Based Limiting

For AI, request count isn’t the best metric:

Consider token limits rather than request limits. A 100-token request shouldn’t count the same as 10,000 tokens.

Application-level limiting often works better. NGINX handles basic protection; your app handles nuanced limits.

SSL/TLS Termination

Handle HTTPS at the NGINX layer.

Certificate Configuration

Standard SSL setup:

Full certificate chain in ssl_certificate. Include intermediates.

Modern TLS versions. TLS 1.2 minimum, prefer TLS 1.3.

Strong cipher suites. Let NGINX choose modern defaults.
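
A minimal TLS server block; the hostname and certificate paths are placeholders:

    server {
        listen 443 ssl;
        server_name api.example.com;

        ssl_certificate     /etc/nginx/certs/fullchain.pem;   # leaf plus intermediates
        ssl_certificate_key /etc/nginx/certs/privkey.pem;
        ssl_protocols       TLSv1.2 TLSv1.3;
    }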

Performance Optimization

SSL adds latency. Optimize it:

SSL session caching reuses negotiated parameters. Reduces handshake overhead for returning clients.

SSL session tickets enable stateless session resumption. They work across multiple NGINX instances when the instances share the same ticket key.

OCSP stapling improves certificate validation performance. Reduces client-side checks.
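
Typical settings, to be validated against your own security requirements:

    ssl_session_cache   shared:SSL:10m;    # roughly 40,000 sessions per 10 MB
    ssl_session_timeout 1h;
    ssl_session_tickets on;
    ssl_stapling        on;                # needs a resolver and a CA-signed certificate
    ssl_stapling_verify on;
    resolver            1.1.1.1 valid=300s;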

Backend Communication

Between NGINX and your AI service:

HTTP internally is often appropriate. Encryption adds latency with no security benefit on localhost.

HTTPS for remote backends. Encrypt traffic across networks.

Trust verification when connecting to external services.

Caching Strategies

Intelligent caching reduces inference costs dramatically.

Response Caching

Cache deterministic responses:

Cache by full request hash. Same prompt, same parameters, same response.

Short TTLs for time-sensitive content. Minutes rather than hours.

Cache validation via ETags or If-Modified-Since.

Cache Configuration

Set up NGINX caching:

proxy_cache_path defines cache storage. Size based on expected cache hit rate and storage available.

proxy_cache_key determines cache identity. Include all parameters that affect the response.

proxy_cache_valid sets TTLs per response code. Cache 200s longer than errors.
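
Here’s a sketch for caching a deterministic endpoint such as embeddings. Keying on $request_body only works when the body fits in the client body buffer, and the paths, sizes, and TTLs are illustrative:

    # in the http block
    proxy_cache_path /var/cache/nginx/ai levels=1:2 keys_zone=ai_cache:50m
                     max_size=2g inactive=10m use_temp_path=off;

    location /v1/embeddings {
        proxy_pass http://ai_backend;
        proxy_cache ai_cache;
        proxy_cache_methods POST;                                      # POST caching must be enabled explicitly
        proxy_cache_key "$request_method$request_uri$request_body";    # same prompt and parameters, same entry
        proxy_cache_valid 200 5m;                                      # short TTL for successful responses
        proxy_cache_valid any 30s;
    }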

Bypass Conditions

Skip cache when appropriate:

Fresh content requests via Cache-Control headers.

Authenticated requests that might have user-specific responses.

Debug requests during development.
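
One way to express the first two conditions, assuming the cached location above:

    # in the http block: bypass when the client sends Cache-Control: no-cache
    map $http_cache_control $skip_cache {
        default    0;
        ~no-cache  1;
    }

    # in the cached location: also skip for authenticated requests
    proxy_cache_bypass $skip_cache $http_authorization;
    proxy_no_cache     $skip_cache $http_authorization;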

Health Checks and Failover

Ensure traffic routes to healthy services.

Passive Health Checks

Open source NGINX supports passive checks:

max_fails sets the failure threshold. A server is marked down after this many failed requests within the fail_timeout window.

fail_timeout defines the check window and downtime. Server returns to pool after this period.
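
Both parameters attach to each server line in the upstream block:

    # mark a server down for 30 seconds after 3 failed requests in a 30-second window
    server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;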

Active Health Checks

NGINX Plus supports active probing:

health_check directive with configurable parameters.

Custom endpoints that verify model loading and inference capability.

Interval tuning based on your detection requirements.
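
In NGINX Plus the health_check directive sits in the proxied location; the /health URI is an assumption about your inference service:

    location /v1/ {
        proxy_pass http://ai_backend;
        health_check uri=/health interval=10s fails=3 passes=2;
    }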

Failover Patterns

Handle backend failures gracefully:

error_page directives for backup responses.

Backup servers that receive traffic only when primaries fail.

Graceful degradation returning cached or static responses.
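
A sketch combining all three ideas; the JSON fallback body is just one way to degrade gracefully:

    upstream ai_backend {
        server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;
        server 10.0.0.20:8000 backup;                  # only used when primaries are down
    }

    location /v1/ {
        proxy_pass http://ai_backend;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_intercept_errors on;                     # let error_page handle backend errors too
        error_page 502 503 504 = @degraded;
    }

    location @degraded {
        default_type application/json;
        return 503 '{"error": "service temporarily unavailable"}';
    }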

Logging and Monitoring

Visibility into NGINX is essential for AI operations.

Access Logging

Configure informative logs:

Include timing information. Request time, upstream response time, upstream connect time.

Request identifiers. Correlation IDs for tracing through your system.

Response details. Status codes, response sizes, cache status.
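
A JSON log format covering those fields; rename them to whatever your log pipeline expects:

    log_format ai_json escape=json
        '{"time":"$time_iso8601","request_id":"$request_id",'
        '"status":$status,"request_time":$request_time,'
        '"upstream_connect_time":"$upstream_connect_time",'
        '"upstream_response_time":"$upstream_response_time",'
        '"bytes_sent":$bytes_sent,"cache":"$upstream_cache_status"}';

    access_log /var/log/nginx/ai_access.log ai_json;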

Error Logging

Capture problems:

Appropriate log level. Warn or error for production.

Upstream errors. Connection failures, timeouts, protocol errors.

Client errors. Bad requests, rate limit hits.
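
In practice this is a single directive:

    error_log /var/log/nginx/error.log warn;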

Metrics Export

Expose metrics for monitoring:

stub_status for basic metrics. Connections, requests, waiting connections.

VTS module for detailed per-location metrics. Open source option.

NGINX Plus API for comprehensive metrics. Commercial feature.
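
Enabling stub_status takes a small location block; the path and allow list are up to you:

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;    # restrict to local scrapers
        deny  all;
    }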

Performance Tuning

Optimize NGINX for AI workloads.

Worker Configuration

Match workers to hardware:

worker_processes typically equals CPU cores. Let NGINX auto-detect.

worker_connections limits concurrent connections. Higher values for AI’s long connections.

multi_accept handles multiple connections per worker. Improves performance under load.
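
These live at the top level and in the events block of nginx.conf; the connection ceiling is an example value:

    worker_processes auto;            # one worker per CPU core

    events {
        worker_connections 8192;      # generous ceiling for long-lived streams
        multi_accept on;
    }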

Buffer Tuning

Appropriate buffer sizes:

proxy_buffer_size for response headers. Default is usually sufficient.

proxy_buffers for response body. Larger for big AI responses.

client_body_buffer_size for request bodies. Large prompts need larger buffers.
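
Example sizes only; measure your real prompts and responses before settling on values (client_max_body_size is an extra guard not discussed above):

    proxy_buffer_size       16k;     # response headers
    proxy_buffers           8 32k;   # response body, when buffering is enabled
    client_body_buffer_size 256k;    # large prompts stay in memory instead of temp files
    client_max_body_size    2m;      # reject oversized request bodies early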

Connection Optimization

Efficient connection handling:

tcp_nodelay reduces latency for small packets. Important for streaming.

tcp_nopush optimizes packet sending. Works with sendfile.

sendfile for serving static files. Not applicable to proxied content.
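
The usual trio in the http block:

    sendfile    on;     # only benefits static files; harmless for proxied traffic
    tcp_nopush  on;     # batch packet headers with sendfile
    tcp_nodelay on;     # push small streaming chunks out immediately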

What AI Engineers Need to Know

NGINX proficiency for AI serving means understanding:

  1. Streaming configuration for LLM responses
  2. Load balancing appropriate for variable-cost requests
  3. Rate limiting to protect expensive inference
  4. SSL termination without killing performance
  5. Caching strategies for cost reduction
  6. Health checks for reliable routing
  7. Performance tuning for AI workloads

The engineers who master these patterns build AI infrastructure that handles production traffic reliably and efficiently.

For more on AI infrastructure, check out my guides on building AI applications with FastAPI and AI infrastructure decisions. Understanding the infrastructure layer is essential for production AI systems.

Ready to configure production AI infrastructure? Watch the implementation on YouTube where I set up real NGINX configurations. And if you want to learn alongside other AI engineers, join our community where we share infrastructure patterns daily.

Zen van Riel

Senior AI Engineer at GitHub | Ex-Microsoft

I grew from intern to Senior Engineer at GitHub, previously working at Microsoft. Now I teach 22,000+ engineers on YouTube, reaching hundreds of thousands of developers with practical AI engineering tutorials. My blog posts are generated from my own video content, focusing on real-world implementation over theory.
