Ollama
Definition
Ollama is an open-source tool that simplifies running large language models locally, providing one-command model downloads, automatic optimization, and a simple API for integration.
Why It Matters
Ollama removes the friction from local LLM deployment. Before Ollama, running a local LLM required downloading model files manually, configuring quantization, managing memory, and handling hardware detection. Ollama wraps this complexity in a single command: ollama run llama3.
For AI engineers, Ollama provides a local development environment that mirrors cloud API patterns. The API is OpenAI-compatible, meaning code written for OpenAI can often switch to local Ollama with a base URL change. This enables rapid prototyping without API costs and easy migration between local and cloud deployment.
Ollama’s model library provides pre-quantized, optimized versions of popular open-source models. You don’t need to understand GGUF formats or quantization levels. Ollama selects appropriate versions for your hardware automatically.
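If you do want a specific size or quantization, the library exposes explicit tags. A sketch (the tag names follow the library’s scheme but vary by model, so treat them as illustrative):
# Default tag, chosen by Ollama for your hardware
ollama pull llama3
# Explicit size and quantization tags
ollama pull llama3:8b
ollama pull llama3:8b-instruct-q4_0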
Implementation Basics
Getting started:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Run a model (downloads automatically)
ollama run llama3
# List available models
ollama list
# Pull a specific model
ollama pull mistral
API usage:
Ollama exposes an HTTP API at localhost:11434:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain RAG in one sentence"
}'
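By default, /api/generate streams the response as newline-delimited JSON chunks; setting "stream": false returns a single JSON object instead:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain RAG in one sentence",
  "stream": false
}'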
The API is OpenAI-compatible at /v1/chat/completions, enabling drop-in replacement for OpenAI SDK usage.
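A minimal sketch of the same request against the OpenAI-compatible endpoint (Ollama ignores the API key, so clients that require one can send any placeholder value):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Explain RAG in one sentence"}]
  }'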
Key features:
- Automatic GPU detection: Uses CUDA/Metal when available
- Memory management: Handles model loading/unloading
- Concurrent requests: Serves multiple requests efficiently
- Custom models: Create Modelfiles to customize base models (sketch after this list)
- Embedding support: Generate embeddings locally (example after this list)
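A sketch of the Modelfile workflow (the model name and system prompt are illustrative):
# Modelfile: layer parameters and a system prompt on a base model
FROM llama3
PARAMETER temperature 0.3
SYSTEM """You are a concise technical assistant."""
Then build and run the customized model:
ollama create my-assistant -f Modelfile
ollama run my-assistant
And a sketch of local embeddings, assuming an embedding model such as nomic-embed-text has been pulled:
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The quick brown fox"
}'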
Model management:
- Models are stored in ~/.ollama/models
- Delete unused models with ollama rm <model>
- Check running models with ollama ps
- Stop a running model with ollama stop <model>
Integration patterns:
- Use with LangChain/LlamaIndex via their Ollama integrations
- Direct HTTP calls for simple applications
- OpenAI SDK with custom base URL for existing OpenAI code (sketch below)
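As a sketch of the last pattern: the official OpenAI SDKs read their base URL and key from the environment, so existing code can often be pointed at Ollama without edits (the script name is hypothetical; Ollama ignores the key, but the SDK requires a non-empty value):
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
python existing_openai_script.py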
Ollama is ideal for development and personal use. For production serving with high concurrency, consider vLLM or text-generation-inference, which offer better throughput optimization.
Source
Get up and running with large language models locally. Run Llama 3, Mistral, Gemma, and other models.
https://ollama.ai/