
Ollama

Definition

Ollama is an open-source tool that simplifies running large language models locally, providing one-command model downloads, automatic optimization, and a simple API for integration.

Why It Matters

Ollama removes the friction from local LLM deployment. Before Ollama, running a local LLM required downloading model files manually, configuring quantization, managing memory, and handling hardware detection. Ollama wraps this complexity in a single command: ollama run llama3.

For AI engineers, Ollama provides a local development environment that mirrors cloud API patterns. The API is OpenAI-compatible, so code written against the OpenAI API can often be pointed at a local Ollama server by changing only the base URL. This enables rapid prototyping without API costs and easy migration between local and cloud deployment.

Ollama’s model library provides pre-quantized, optimized builds of popular open-source models. You don’t need to understand GGUF formats or quantization levels: each model tag defaults to a sensibly quantized build, and Ollama handles hardware acceleration automatically.

Implementation Basics

Getting started:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model (downloads automatically)
ollama run llama3

# List available models
ollama list

# Pull a specific model
ollama pull mistral

API usage:

Ollama exposes an HTTP API at localhost:11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain RAG in one sentence"
}'
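
By default, /api/generate streams the response as newline-delimited JSON objects; pass "stream": false to receive a single reply instead. A minimal Python sketch of the streaming form (assuming the requests package and a pulled llama3 model):

import json
import requests

# Stream tokens from Ollama's native generate endpoint. Each line of
# the response body is a JSON object whose "response" field holds the
# next chunk of generated text; the final object has "done": true.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain RAG in one sentence"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()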

The API is OpenAI-compatible at /v1/chat/completions, enabling drop-in replacement for OpenAI SDK usage.
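
For example, existing OpenAI SDK code can be redirected to Ollama by changing only the base URL (a sketch assuming the openai Python package; Ollama ignores the API key, but the SDK requires some value):

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain RAG in one sentence"}],
)
print(response.choices[0].message.content)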

Key features:

  • Automatic GPU detection: Uses CUDA/Metal when available
  • Memory management: Handles model loading/unloading
  • Concurrent requests: Serves multiple requests efficiently
  • Custom models: Create Modelfiles to customize base models
  • Embedding support: Generate embeddings locally (see the sketch after this list)
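
As a sketch of local embeddings, the native /api/embeddings endpoint returns a vector for a prompt (assuming an embedding model such as nomic-embed-text has been pulled):

import requests

# Request an embedding from Ollama's native embeddings endpoint.
# Assumes the model was pulled first: ollama pull nomic-embed-text
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Explain RAG in one sentence"},
)
resp.raise_for_status()
embedding = resp.json()["embedding"]
print(len(embedding))  # dimensionality of the embedding vector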

Model management:

  • Models stored in ~/.ollama/models
  • Delete unused models with ollama rm <model>
  • Check running models with ollama ps
  • Stop a running model with ollama stop <model>

Integration patterns:

  • Use with LangChain/LlamaIndex via their Ollama integrations (see the sketch after this list)
  • Direct HTTP calls for simple applications
  • OpenAI SDK with custom base URL for existing OpenAI code
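
A minimal LangChain sketch, assuming the langchain-ollama package is installed and llama3 has been pulled:

from langchain_ollama import ChatOllama

# ChatOllama talks to the local Ollama server (localhost:11434 by default).
llm = ChatOllama(model="llama3")

message = llm.invoke("Explain RAG in one sentence")
print(message.content)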

Ollama is ideal for development and personal use. For production serving with high concurrency, consider vLLM or text-generation-inference, which offer better throughput optimization.

Source

Get up and running with large language models locally. Run Llama 3, Mistral, Gemma, and other models.

https://ollama.ai/