AI Engineering Glossary

Essential terminology for building production AI systems. 238 terms and counting.

LLM

Large Language Model concepts and terminology

Agent (AI Agent)

An AI agent is an autonomous system that uses an LLM as its reasoning engine to plan, execute multi-step tasks, and interact with external tools to accomplish complex goals.

Learn more →

AGI (Artificial General Intelligence)

AGI refers to a hypothetical AI system with human-level cognitive abilities across all domains, capable of learning and performing any intellectual task that humans can do.

Learn more →

AI Safety

AI safety encompasses practices, research, and engineering to ensure AI systems behave as intended, avoid harmful outputs, and remain under human control, spanning from immediate content filtering to long-term alignment.

Learn more →

Alignment

Alignment is the challenge of ensuring AI systems pursue intended goals and behave according to human values and preferences, encompassing training techniques like RLHF and Constitutional AI that shape model behavior.

Learn more →

BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric measuring text generation quality by comparing n-gram overlap between generated text and reference texts, originally designed for machine translation evaluation.
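
The n-gram overlap at the heart of BLEU can be sketched as a clipped ("modified") n-gram precision. This is only one ingredient of the metric: the full score also combines several n-gram orders and applies a brevity penalty, both omitted here. The whitespace tokenization and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count by how often it appears in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```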

Learn more →

Chain-of-Thought (CoT)

Chain-of-thought is a prompting technique where the LLM is instructed to show its reasoning step-by-step before answering, which often substantially improves performance on complex reasoning, math, and multi-step problems.

Learn more →

Constitutional AI

Constitutional AI is an alignment technique developed by Anthropic that trains models to follow a set of principles (a 'constitution') using self-critique and revision, reducing reliance on human feedback for safety training.

Learn more →

Context Length

Context length is the maximum number of tokens an LLM can process in a single request, including both the input prompt and the generated output. It determines how much information the model can consider at once.

Learn more →

Context Window

The context window is the maximum number of tokens an LLM can process in a single request, including both input (prompt, documents, history) and output, typically ranging from 4K to 1M+ tokens.

Learn more →

DeepSeek

DeepSeek is a Chinese AI company that produces open-weight large language models, including the DeepSeek V3 general-purpose model and DeepSeek R1 reasoning model. Both are known for achieving frontier-level performance at significantly lower training and inference costs through efficient architecture innovations.

Learn more →

DPO (Direct Preference Optimization)

DPO is a simpler alternative to RLHF that directly optimizes language models on human preference data without requiring a separate reward model or reinforcement learning, making alignment training more stable and accessible.

Learn more →

Embeddings

Embeddings are dense numerical vectors that represent text, images, or other data in a high-dimensional space where semantically similar items are positioned closer together, enabling similarity search and retrieval.

Learn more →

Extended Thinking

Extended Thinking is Claude's ability to engage in visible step-by-step reasoning before responding, using additional compute time to work through complex problems systematically.

Learn more →

Few-Shot Learning

Few-shot learning is providing an LLM with a small number of examples (typically 2-5) in the prompt to demonstrate the desired task, enabling the model to generalize the pattern to new inputs.

Learn more →

Fine-Tuning

Fine-tuning is the process of further training a pre-trained language model on a specific dataset to adapt it for particular tasks, domains, or behaviors while preserving its general capabilities.

Learn more →

Frontier Models

Frontier models are the most advanced AI systems at the leading edge of capabilities, typically requiring massive compute resources and pushing the boundaries of what AI can do.

Learn more →

Function Calling

Function calling is an LLM capability that generates structured JSON output matching predefined function schemas, enabling reliable tool invocation and data extraction with validated parameters.

Learn more →

Grounding

Grounding connects LLM outputs to verifiable information sources (documents, databases, APIs) so that responses are based on real data rather than solely on the model's training knowledge, reducing hallucination.

Learn more →

Hallucination

Hallucination is when an LLM generates false or fabricated information with high confidence, producing plausible-sounding but incorrect facts, citations, or details not grounded in reality or source data.

Learn more →

In-Context Learning

In-context learning is the ability of large language models to learn new tasks from examples provided in the prompt without any weight updates, enabling task adaptation through demonstration rather than fine-tuning.

Learn more →

Inference

Inference is the process of running a trained AI model to generate predictions or outputs from new inputs, as opposed to training which updates model weights. In LLMs, inference means generating text responses from prompts.

Learn more →

Jailbreaking

Jailbreaking is the practice of circumventing an LLM's built-in safety restrictions and content policies through specially crafted prompts, causing the model to generate normally prohibited outputs.

Learn more →

LLM (Large Language Model)

A Large Language Model is a neural network with billions of parameters trained on massive text datasets to predict the next token, enabling capabilities like text generation, reasoning, and instruction following.

Learn more →

Long Context

LLM capabilities for processing 100K+ token inputs, enabling document analysis, codebase understanding, and extended conversations without chunking or summarization.

Learn more →

LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning technique that trains small adapter matrices instead of the full model weights, reducing memory requirements by 10-100x while achieving comparable performance to full fine-tuning.
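
To make "small adapter matrices" concrete, here is a rough parameter count. It assumes a single weight matrix of shape d_out x d_in and a rank-r update factored as B (d_out x r) times A (r x d_in). The dimensions are illustrative and not taken from any particular model, and a reduction in trainable parameters is not the same thing as a reduction in total memory.

```python
def full_params(d_out, d_in):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8  # illustrative sizes (assumption)
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, full // lora)  # → 16777216 65536 256
```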

Learn more →

Model Distillation

Model distillation is a technique that trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model, creating more efficient models that retain much of the original's capability at a fraction of the size and cost.

Learn more →

Needle in a Haystack

A benchmark test that evaluates an LLM's ability to retrieve specific information from various positions within long context inputs, revealing attention limitations.

Learn more →

Open-weight Models

Open-weight models are LLMs where the model weights are publicly released for download, allowing anyone to run, fine-tune, and deploy them locally without API fees.

Learn more →

PEFT (Parameter-Efficient Fine-Tuning)

PEFT is an umbrella term for techniques that adapt large language models by training only a small subset of parameters, reducing compute and memory requirements by 10-1000x compared to full fine-tuning while achieving comparable performance.

Learn more →

Perplexity

Perplexity is a metric measuring how well a language model predicts text, calculated as the exponential of the average negative log-likelihood. Lower perplexity indicates the model is less 'surprised' by the text and better at prediction.
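
The calculation can be sketched directly from the definition, assuming you already have the natural-log probability the model assigned to each token in the text. The function name and the toy inputs are illustrative.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```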

Learn more →

Planning (AI Agent Context)

Planning in AI agents is the capability to decompose complex goals into ordered sequences of actions, anticipate outcomes, allocate resources, and adapt strategies based on execution results.

Learn more →

Prompt Engineering

Prompt engineering is the practice of crafting inputs to LLMs that reliably produce desired outputs, including techniques like few-shot examples, structured formatting, and explicit instructions.

Learn more →

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA fine-tuning, enabling fine-tuning of large language models on consumer GPUs by dramatically reducing memory requirements without significant quality loss.

Learn more →

Quantization

Quantization reduces model memory footprint and increases inference speed by representing weights in lower-precision formats (INT8, INT4) instead of 16-bit or 32-bit floating point, typically with minimal quality loss.
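
As a minimal sketch of the idea, the snippet below applies symmetric per-tensor INT8 quantization to a short list of weights. Production schemes typically use per-channel or per-group scales and calibration data; the single shared scale and the example weights here are simplifying assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

The rounding error per weight is bounded by half the scale, which is why quality loss is usually small when weights are not dominated by a few outliers.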

Learn more →

RAG (Retrieval-Augmented Generation)

RAG is a technique that enhances LLM responses by retrieving relevant documents from a knowledge base and including them as context, reducing hallucinations and enabling accurate answers about private or recent data.

Learn more →

Reasoning (AI Reasoning)

AI reasoning refers to an LLM's ability to logically analyze problems, draw conclusions, and solve multi-step tasks, enhanced through techniques like chain-of-thought prompting and specialized reasoning models like o1.

Learn more →

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training technique that aligns language model outputs with human preferences by training a reward model on human comparisons and using reinforcement learning to optimize the LLM toward higher-scoring responses.

Learn more →

ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics measuring text summarization quality by comparing overlap between generated summaries and reference summaries, focusing on recall of important content.

Learn more →

Small Language Models (SLMs)

Small language models are LLMs with fewer parameters (typically under 10B) optimized for efficiency, enabling deployment on edge devices, lower costs, and faster inference.

Learn more →

System Prompt

A system prompt is a set of instructions provided to an LLM that defines its behavior, persona, capabilities, and constraints. It is typically set by developers and persists across a conversation.

Learn more →

Temperature

Temperature is a parameter that controls LLM output randomness: lower values (0-0.3) produce focused, near-deterministic responses, while higher values (0.7-1.0) increase creativity and variation.
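
One common way temperature is applied is by dividing the model's logits by the temperature before the softmax. The sketch below assumes that formulation; the logits are made up, and a temperature of exactly 0 is usually handled as a plain argmax rather than a division.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 1.0)  # flatter distribution
print(low, high)
```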

Learn more →

Test-Time Compute

Test-time compute refers to allocating additional computational resources during inference to improve model outputs, through techniques like extended reasoning, multiple sampling, or search-based generation.

Learn more →

Tokens

Tokens are the basic units that LLMs process: subword pieces that are typically 3-4 characters long in English text. They determine context limits, API costs, and processing speed, with 1,000 tokens roughly equal to 750 English words.

Learn more →

Tool Use

Tool use enables LLMs to interact with external systems (APIs, databases, code interpreters) by generating structured requests that your application executes, extending the model's capabilities beyond text generation.

Learn more →

Top-P (Nucleus Sampling)

Top-P (nucleus sampling) is a decoding strategy that limits token selection to the smallest set whose cumulative probability exceeds P, providing dynamic vocabulary restriction based on confidence.
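
The filtering step can be sketched as follows, assuming the next-token distribution is available as a dictionary of probabilities. This version stops once the cumulative probability reaches or exceeds P and then renormalizes the kept tokens; implementations differ on that boundary detail. The vocabulary is a toy example.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
filtered = top_p_filter(probs, 0.9)
print(filtered)  # "zebra" is dropped; the rest are rescaled to sum to 1
```

Sampling then draws from the filtered distribution, so the candidate set shrinks when the model is confident and grows when it is not.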

Learn more →

Zero-Shot Learning

Zero-shot learning is asking an LLM to perform a task using only instructions and no examples, relying on the model's pre-trained knowledge to generalize to new tasks it wasn't explicitly trained on.

Learn more →

MLOps

Machine Learning Operations and deployment

A/B Testing

A/B testing in AI systems compares two variants (A and B) by randomly assigning users to each version and measuring which performs better on defined metrics like engagement, accuracy, or user satisfaction.

Learn more →

Ablation Study

An ablation study systematically removes or modifies components of an AI system to measure their individual contribution to overall performance, revealing which elements matter most.

Learn more →

API Gateway (AI Context)

An API gateway in AI systems is a centralized entry point that manages traffic to ML model endpoints, handling authentication, rate limiting, request routing, caching, and load balancing across multiple model versions or providers.

Learn more →

Automated Evaluation

Automated evaluation uses programmatic methods (metrics, test suites, and LLM-as-judge approaches) to assess AI system quality without human reviewers, enabling rapid iteration and continuous monitoring.

Learn more →

Autoscaling

Autoscaling automatically adjusts the number of AI inference servers based on demand, scaling up during traffic spikes and down during quiet periods to optimize cost and performance.

Learn more →

Batch Processing

Batch processing in AI systems runs inference or training on large datasets in scheduled jobs, prioritizing throughput and cost efficiency over real-time response.

Learn more →

Batching

Processing multiple LLM requests together to maximize GPU utilization and throughput, trading individual latency for overall system efficiency.

Learn more →

Benchmark

A benchmark is a standardized test suite with defined tasks, datasets, and metrics used to evaluate and compare AI model performance, enabling reproducible comparison across different systems.

Learn more →

CI/CD for ML

CI/CD for ML extends continuous integration and deployment practices to machine learning, automating testing, validation, and deployment of both code and model artifacts throughout the ML lifecycle.

Learn more →

Containerization

Containerization packages an AI application with all its dependencies into a portable, isolated unit that runs consistently across different computing environments.

Learn more →

CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API that enables AI frameworks to leverage GPU acceleration for training and inference.

Learn more →

Data Drift

Data drift occurs when the statistical properties of production input data change over time compared to the training data, causing model performance to degrade even though the model itself hasn't changed.

Learn more →

Docker

Docker is a platform for building, shipping, and running applications in containers, which are lightweight, portable environments that include everything needed to run AI models and applications consistently.

Learn more →

Evaluation Metrics

Evaluation metrics are quantitative measures used to assess how well AI models perform specific tasks, including accuracy, precision, recall, F1 score, and domain-specific metrics for NLP and generation quality.

Learn more →

Experiment Tracking

Experiment tracking is the practice of systematically logging and organizing ML experiments (including parameters, metrics, artifacts, and code versions) to enable comparison, reproducibility, and informed decision-making.

Learn more →

Feature Store

A feature store is a centralized repository for storing, managing, and serving ML features (the processed input variables used for model training and inference) ensuring consistency between training and production environments.

Learn more →

GPU (Graphics Processing Unit)

A GPU is a specialized processor designed for parallel computation, essential for training and running AI models due to its ability to perform thousands of matrix operations simultaneously.

Learn more →

Human Evaluation

Human evaluation is the process of having people assess AI system outputs for quality dimensions that automated metrics cannot capture, including helpfulness, coherence, accuracy, style, and safety.

Learn more →

Kubernetes

Kubernetes (K8s) is an open-source container orchestration platform that automates deploying, scaling, and managing containerized AI applications across clusters of machines.

Learn more →

Langfuse

Langfuse is an open-source LLM observability platform providing tracing, analytics, evaluation, and prompt management for production AI applications, with self-hosting options.

Learn more →

LangSmith

LangSmith is LangChain's observability and testing platform for LLM applications, providing tracing, debugging, evaluation, and monitoring capabilities for production AI systems.

Learn more →

Leaderboard

A leaderboard is a ranked listing of AI models based on their performance on standardized benchmarks, providing transparent comparison and tracking progress in the field.

Learn more →

Load Balancing

Load balancing distributes AI inference requests across multiple model servers to maximize throughput, minimize latency, and ensure high availability of AI applications.

Learn more →

MLOps

MLOps (Machine Learning Operations) is the practice of applying DevOps principles to ML systems, automating and standardizing the entire ML lifecycle from data preparation through model deployment, monitoring, and retraining.

Learn more →

Model Drift

Model drift (also called concept drift) occurs when the relationship between inputs and outputs changes over time, making previously learned patterns obsolete even if the input data distribution remains stable.

Learn more →

Model Optimization

Model optimization encompasses techniques to make AI models faster, smaller, and more efficient for deployment, including quantization, pruning, distillation, and compilation.

Learn more →

Model Registry

A model registry is a centralized repository for storing, versioning, and managing ML models throughout their lifecycle, tracking metadata like training parameters, performance metrics, and deployment status.

Learn more →

Model Serving

Model serving is the infrastructure and processes for hosting trained ML models in production to handle real-time prediction requests via APIs, managing concerns like latency, throughput, scaling, and availability.

Learn more →

Model Versioning

Model versioning is the practice of tracking and managing different versions of ML models and their associated artifacts, enabling reproducibility, comparison, rollback, and auditing throughout the model lifecycle.

Learn more →

Monitoring (ML Context)

ML monitoring is the continuous observation of model performance, data quality, and system health in production, enabling detection of degradation, drift, and anomalies before they impact business outcomes.

Learn more →

RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework specifically designed for RAG pipelines, providing metrics like faithfulness, answer relevancy, and context precision to measure retrieval and generation quality without requiring ground truth datasets.

Learn more →

Rate Limiting (AI Context)

Rate limiting in AI systems restricts the number of requests a user or application can make to model endpoints within a time window, protecting infrastructure from overload and managing costs for expensive LLM inference.
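
One common way to implement this is a token bucket, sketched below. The class name, the `cost` parameter (which could stand in for estimated LLM tokens per request), and the injectable clock are illustrative assumptions, not any particular gateway's API.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and deduct `cost` if enough budget is available."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_second=1.0)
print(bucket.allow(), bucket.allow(), bucket.allow())  # third call exceeds the burst
```

In practice you would keep one bucket per user or API key and reject or queue requests when `allow` returns False.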

Learn more →

Red Teaming

Red teaming in AI is the practice of adversarially testing systems to discover failures, vulnerabilities, and harmful behaviors before deployment, simulating how malicious users might exploit the system.

Learn more →

Serverless

Serverless computing runs AI inference functions on-demand without managing servers, automatically scaling from zero to handle variable workloads and charging only for actual compute used.

Learn more →

Shadow AI

Shadow AI refers to employees using unauthorized AI tools (like ChatGPT) for work tasks without IT approval, creating security, compliance, and data governance risks for organizations.

Learn more →

Text Generation Inference (TGI)

Hugging Face's production-ready inference server for LLMs, optimized for high throughput with features like continuous batching, tensor parallelism, and Flash Attention support.

Learn more →

vLLM

A high-throughput LLM serving library that uses PagedAttention for efficient memory management, enabling faster inference and higher GPU utilization in production.

Learn more →

Implementation

Practical implementation patterns and techniques

Agentic RAG

Agentic RAG is a retrieval-augmented generation pattern where an AI agent autonomously decides when, what, and how to retrieve information, dynamically adjusting retrieval strategy based on query complexity, context, and intermediate results.

Learn more →

AI Wrapper

An AI wrapper is an application that provides value primarily through a user interface or specialized workflow built on top of foundation model APIs like GPT-4 or Claude.

Learn more →

Approximate Nearest Neighbor (ANN)

ANN algorithms find vectors similar to a query vector without exhaustively comparing every vector, trading a small accuracy loss for dramatically faster search. This is essential for semantic search at scale.

Learn more →

AutoGen

AutoGen is Microsoft's framework for building multi-agent AI systems where multiple agents can converse, collaborate, and execute code together to solve complex tasks through structured conversations.

Learn more →

Bi-Encoder

A bi-encoder embeds queries and documents independently into dense vectors, enabling fast similarity comparison through pre-computed document embeddings. It is the architecture behind semantic search.

Learn more →

BM25

BM25 (Best Match 25) is a ranking algorithm that scores documents based on term frequency and inverse document frequency, with length normalization. It is the standard for keyword-based search.
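
A compact sketch of the scoring function follows. It assumes whitespace tokenization, the commonly used defaults k1=1.5 and b=0.75, and the IDF variant with +1 inside the logarithm (which keeps scores non-negative); search engines differ in these details. The documents are toy examples.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: long documents need more matches to score equally.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "vector search with embeddings",
]
scores = bm25_scores("cat mat", docs)
print(scores)
```

Note that the second document scores zero because "cats" is not an exact match for "cat", which is the kind of gap semantic search is meant to close.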

Learn more →

Caching

Caching in AI systems stores computed results to avoid redundant model inference, dramatically reducing latency and costs for repeated or similar queries.
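
A minimal sketch of the exact-match case: responses are stored under a hash of the model name and prompt, so a repeated request skips inference. The `generate` callable is a hypothetical stand-in for a real model call, and matching "similar" queries would need an embedding-based lookup that is not shown.

```python
import hashlib

class ResponseCache:
    """Exact-match cache keyed by a hash of (model, prompt)."""

    def __init__(self, generate):
        self.generate = generate  # hypothetical callable: (model, prompt) -> response
        self.store = {}
        self.hits = 0

    def get(self, model, prompt):
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            # Cache miss: run inference once and remember the result.
            self.store[key] = self.generate(model, prompt)
        return self.store[key]

def fake_generate(model, prompt):
    """Stand-in for an expensive model call."""
    return prompt.upper()

cache = ResponseCache(fake_generate)
first = cache.get("demo-model", "hello")
second = cache.get("demo-model", "hello")  # served from the cache
print(first, second, cache.hits)
```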

Learn more →

Chroma

Chroma is an open-source embedding database designed for simplicity, offering an in-memory mode for development and persistent storage for production, with a focus on being the easiest way to build LLM applications.

Learn more →

Chunking

Chunking is the process of splitting documents into smaller pieces for embedding and retrieval, balancing between preserving context and fitting within token limits while maintaining semantic coherence.
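
The simplest strategy is fixed-size chunks with overlap, sketched below. Sizes are counted in whitespace-separated words as a rough proxy for tokens, which is an assumption; real pipelines often count tokens with the embedding model's tokenizer or split on sentence and section boundaries instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of `chunk_size`, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(str(i) for i in range(10))
print(chunk_words(text, chunk_size=4, overlap=1))
# → ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap repeats a few words across neighboring chunks so that a sentence cut at a boundary still appears whole in at least one of them.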

Learn more →

Claude API

The Claude API provides programmatic access to Anthropic's Claude models, offering strong reasoning, long context windows up to 200K tokens, and a focus on safety and helpful responses for production AI applications.

Learn more →

Cohere Rerank

Cohere Rerank is a cross-encoder reranking model that scores query-document relevance with high accuracy, used to reorder retrieval results in RAG pipelines for improved answer quality.

Learn more →

ColBERT (Contextualized Late Interaction)

ColBERT is a retrieval model that creates token-level embeddings and uses late interaction matching, achieving better accuracy than single-vector embeddings while remaining efficient enough for large-scale search.

Learn more →

Computer Use

Computer Use refers to AI models that can interact with graphical user interfaces by viewing screenshots and performing mouse clicks and keyboard inputs, enabling automation of desktop and web applications.

Learn more →

Computer Use API

Computer Use APIs enable AI models to control computers by viewing screens, moving the mouse, clicking, and typing, allowing autonomous interaction with any software through its visual interface.

Learn more →

Context Compression

Techniques for reducing token count while preserving essential information, enabling more content to fit within context windows and reducing API costs.

Learn more →

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, returning a value from -1 to 1 that indicates how similar their directions are regardless of magnitude. It is the primary metric for comparing embeddings in semantic search.
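
In code, the value is the dot product of the two vectors divided by the product of their lengths. The sketch below uses plain Python lists; the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # magnitude is ignored
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```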

Learn more →

CrewAI

CrewAI is a multi-agent framework that organizes AI agents into role-based 'crews' where each agent has a specific persona, goal, and set of tools, enabling collaborative task execution through structured workflows.

Learn more →

Cross-Encoder

A cross-encoder is a neural model that takes a query and document together as input, processing them jointly to produce a relevance score. It is more accurate than bi-encoders but too slow for initial retrieval.

Learn more →

Dense Retrieval

Dense retrieval uses learned embeddings (dense vectors) to represent queries and documents, finding relevant results through vector similarity rather than keyword matching.

Learn more →

Edge Inference

Edge inference runs AI models directly on local devices (phones, IoT, edge servers) rather than in the cloud, enabling offline operation, lower latency, and improved privacy.

Learn more →

FastAPI

FastAPI is a modern Python web framework for building APIs, popular in AI applications for its async support, automatic documentation, and type-based validation that accelerates development of LLM and ML model endpoints.

Learn more →

Function Schema

A JSON Schema definition that describes function parameters for LLM tool use, enabling models to generate properly formatted function calls.
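
For illustration, a schema for a hypothetical `get_weather` function might look like the dictionary below. The outer envelope (`name`, `description`, `parameters`) is an assumption; providers wrap the JSON Schema in slightly different structures. The helper shows only a required-field check, not full schema validation.

```python
import json

# Hypothetical function schema; the "parameters" value is ordinary JSON Schema.
get_weather_schema = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Oslo"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def missing_required(arguments_json, schema):
    """List required parameters absent from a model-generated argument string."""
    args = json.loads(arguments_json)
    return [name for name in schema["parameters"]["required"] if name not in args]

print(missing_required('{"city": "Oslo"}', get_weather_schema))      # → []
print(missing_required('{"unit": "celsius"}', get_weather_schema))   # → ['city']
```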

Learn more →

Gemini API

Google's API for accessing the Gemini family of multimodal AI models, offering text, image, audio, and video understanding capabilities with native function calling support.

Learn more →

GraphRAG

GraphRAG enhances traditional RAG by using knowledge graphs to capture entity relationships, enabling more accurate answers for complex queries that require understanding connections between concepts rather than just finding similar text.

Learn more →

Guardrails

Guardrails are safety mechanisms that validate, filter, and constrain LLM inputs and outputs to prevent harmful content, enforce policies, ensure output quality, and protect against prompt injection attacks.

Learn more →

HNSW (Hierarchical Navigable Small World)

HNSW is the most widely used approximate nearest neighbor algorithm, building a multi-layer graph structure that enables logarithmic-complexity vector search and is the default in most vector databases.

Learn more →

Hybrid Search

Hybrid search combines semantic vector search with traditional keyword search (like BM25) to get the benefits of both approaches, including semantic understanding plus exact match precision.
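
One widely used way to merge the two result lists is reciprocal rank fusion (RRF), sketched below. RRF is a choice made for this example rather than the only way to build hybrid search; weighted score blending is another option. The constant k=60 is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["a", "b", "c"]  # e.g. from BM25
vector_results = ["b", "d", "a"]   # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # → ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it avoids having to put keyword scores and vector similarities on a common scale.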

Learn more →

HyDE (Hypothetical Document Embeddings)

HyDE is a retrieval technique that generates a hypothetical answer to a query using an LLM, then embeds that answer to find similar real documents, bridging the gap between question and document embeddings.

Learn more →

JSON Mode

An LLM API feature that guarantees responses are valid JSON, though without schema enforcement, making it useful for simple structured data extraction.

Learn more →

Keyword Search

Keyword search (also called lexical search) finds documents containing the exact words in a query, using algorithms like BM25 to rank results by term frequency and document relevance.

Learn more →

LangChain

LangChain is a Python and JavaScript framework for building applications with large language models, providing abstractions for chains, agents, memory, and integrations with various LLM providers and tools.

Learn more →

LangGraph

LangGraph is a framework for building stateful, multi-step agent applications using a graph-based architecture where nodes represent actions and edges control the flow between them.

Learn more →

Latency (AI Context)

Latency in AI systems is the time delay between sending a request and receiving the first response token (time-to-first-token) or complete response, directly impacting user experience and system design decisions.

Learn more →

LlamaIndex

LlamaIndex is a data framework for LLM applications that specializes in connecting, indexing, and querying custom data sources, with particular strength in building RAG systems and knowledge bases.

Learn more →

LM Studio

A desktop application for discovering, downloading, and running local LLMs with an OpenAI-compatible API server, enabling offline AI development.

Learn more →

Local LLM

A local LLM is a large language model running on your own hardware (laptop, workstation, or server) rather than accessed through cloud APIs, enabling privacy, offline use, and cost control.

Learn more →

MCP Server

An MCP (Model Context Protocol) server is a program that exposes data, tools, and context to AI applications through Anthropic's open standard, enabling LLMs to interact with external systems.

Learn more →

Model Context Protocol (MCP)

Model Context Protocol is an open standard created by Anthropic that enables AI applications to connect with external tools and data sources through a universal interface, often described as 'USB-C for AI.'

Learn more →

Ollama

Ollama is an open-source tool that simplifies running large language models locally, providing one-command model downloads, automatic optimization, and a simple API for integration.

Learn more →

OpenAI API

The OpenAI API provides programmatic access to GPT-5, o3, o4-mini, DALL-E, Whisper, and embedding models through a REST interface, enabling developers to integrate advanced AI capabilities into applications.

Learn more →

pgvector

pgvector is a PostgreSQL extension that adds vector similarity search to the database, enabling hybrid queries that combine traditional SQL with embedding-based retrieval.

Learn more →

Pinecone

Pinecone is a fully managed vector database designed for machine learning applications, offering fast similarity search at scale with automatic indexing, sharding, and high availability without infrastructure management.

Learn more →

Prefill

The initial phase of LLM inference where the model processes all input tokens to build the key-value cache, determining time-to-first-token latency.

Learn more →

Prompt Caching

Prompt caching stores the computed state (KV cache) from static prompt portions across requests, reducing latency and costs when multiple requests share common prefixes like system prompts or context documents.

Learn more →

Prompt Injection

Prompt injection is an attack where malicious input manipulates an LLM's behavior by overriding or bypassing system instructions, causing the model to ignore safety guidelines or perform unintended actions.

Learn more →

Reranking

Reranking is a second-stage retrieval process that takes initial search results and reorders them using a more powerful model to improve precision, typically using cross-encoders that compare query and document together.

Learn more →

Retrieval

Retrieval is the process of finding and ranking relevant documents from a knowledge base in response to a query, typically using embedding similarity, keyword matching, or hybrid approaches.

Learn more →

Semantic Kernel

Semantic Kernel is Microsoft's open-source SDK for building AI agents and copilots, designed for enterprise integration with strong typing, plugin architecture, and native support for Azure OpenAI.

Learn more →

Semantic Search

Semantic search finds results based on meaning and intent rather than exact keyword matches, using embeddings to understand the conceptual similarity between queries and documents.

Learn more →

Sparse Retrieval

Sparse retrieval represents documents and queries as high-dimensional vectors where most values are zero, with non-zero values corresponding to specific vocabulary terms, the basis for traditional keyword search.

Learn more →

Speculative Decoding

An inference acceleration technique that uses a smaller, faster draft model to predict multiple tokens, which are then verified in parallel by the larger target model, achieving significant speedups without changing output quality.

Learn more →

Streaming (LLM)

LLM streaming delivers model output token-by-token as it's generated rather than waiting for the complete response, significantly improving perceived latency and user experience in conversational AI applications.

Learn more →

Streaming Inference

Streaming inference delivers AI model outputs incrementally as they're generated, enabling responsive user experiences for LLMs where complete responses take several seconds to produce.

Learn more →

Structured Output

The ability to constrain LLM responses to follow specific formats like JSON schemas, ensuring reliable parsing and integration with downstream systems.
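
One simple way to rely on structured output downstream is to parse and validate the response before using it. The sketch below checks a response against a hypothetical two-field schema using only the standard library; it validates after generation and does not show constrained decoding, which enforces the format while tokens are produced.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "priority": int}

def parse_structured(raw: str) -> dict:
    """Parse a model response and reject anything that does not fit the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data

ticket = parse_structured('{"name": "Fix login bug", "priority": 2}')
print(ticket["priority"])
```

A caller would typically retry or fall back when `ValueError` is raised rather than pass malformed data on.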

Learn more →

Swarm (OpenAI)

Swarm is OpenAI's experimental multi-agent orchestration framework that enables lightweight coordination between specialized agents through handoffs and routines, designed for simplicity and educational purposes.

Learn more →

Token Streaming

Real-time delivery of LLM output tokens as they are generated, enabling responsive user interfaces without waiting for complete responses.

Learn more →

Vector Database

A vector database is a specialized database optimized for storing and searching high-dimensional embedding vectors, using algorithms like HNSW to find similar items in milliseconds across millions of vectors.

Learn more →

Weaviate

Weaviate is an open-source vector database with built-in modules for vectorization, hybrid search, and generative features, supporting both self-hosted deployments and a managed cloud service.

Learn more →

Webhooks (AI Context)

Webhooks in AI applications enable event-driven architectures where LLM completions, async processing results, or AI workflow outputs automatically trigger callbacks to your application endpoints.

Learn more →

Architecture

System design and architectural patterns

Agentic Workflows

Agentic workflows are AI system architectures where LLMs autonomously plan, execute multi-step tasks, use tools, and iterate toward goals, moving beyond simple request-response patterns to dynamic, goal-driven automation.

Learn more →

Attention Mechanism

A neural network mechanism that lets models weigh the relevance of different parts of the input when producing each output, enabling dynamic focus on contextually important information.

Learn more →

Autoregressive

A generation approach where models predict one token at a time, with each new token conditioned on all previously generated tokens. The standard method for LLM text generation.

Learn more →

Beam Search

Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) at each generation step, exploring the most promising paths to find higher-quality outputs than greedy decoding while remaining computationally tractable.
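
A minimal sketch of the idea, assuming a `step` function that returns next-token probabilities for a partial sequence. The three-entry `TOY_MODEL` table is invented for illustration and is chosen so that the greedy first choice leads to a worse overall sequence.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]

def beam_search(step: Callable[[Sequence], Dict[str, float]],
                beam_width: int, length: int) -> List[Tuple[Sequence, float]]:
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]  # (tokens, log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, p in step(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Hypothetical next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_step(seq: Sequence) -> Dict[str, float]:
    return TOY_MODEL[seq]

greedy = beam_search(toy_step, beam_width=1, length=2)[0]
best = beam_search(toy_step, beam_width=2, length=2)[0]
print(greedy[0], best[0])
```

With a beam width of 1 the search reduces to greedy decoding and commits to "a"; with a width of 2 it keeps "b" alive long enough to find the higher-probability sequence.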

Learn more →

BPE (Byte Pair Encoding)

BPE is a tokenization algorithm that iteratively merges the most frequent pairs of characters or tokens into new tokens, creating a vocabulary that balances between character-level and word-level representations for efficient text processing in LLMs.
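
The merge loop can be sketched in a few lines. The word frequencies below are invented for illustration, and the sketch omits details such as end-of-word markers and byte-level handling that real tokenizers include.

```python
from collections import Counter
from typing import Dict, Tuple

Word = Tuple[str, ...]

def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    merges.append(pair)
print(merges, list(words))
```

After two merges the frequent suffix shared by "newest" and "widest" has become a single symbol, which is how BPE arrives at subword units between characters and whole words.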

Learn more →

Diffusion Model

A generative model that learns to reverse a gradual noising process, generating data by iteratively denoising from pure noise. The dominant approach for AI image generation.

Learn more →

Encoder-Decoder

A neural network architecture with two components: an encoder that processes input into a representation, and a decoder that generates output from that representation. Used for sequence-to-sequence tasks.

Learn more →

Feed-Forward Network

A neural network component in Transformers that processes each position independently through two linear transformations with a non-linear activation, providing position-wise computation after attention.

Learn more →

Flash Attention

An IO-aware exact attention algorithm that reduces memory usage and speeds up transformer training and inference by minimizing GPU memory reads/writes through tiling and recomputation.

Learn more →

Foundation Model

A foundation model is a large AI model trained on broad data that can be adapted to many downstream tasks through fine-tuning or prompting, serving as the base for specialized applications.

Learn more →

KV Cache

A cache of the attention key and value tensors computed for earlier tokens during LLM generation, enabling efficient autoregressive decoding by avoiding redundant recomputation for previous tokens.

Learn more →

Layer Normalization

A technique that normalizes activations across features within each layer, stabilizing training and enabling deeper neural networks without batch dependencies.
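
For a single activation vector, the computation can be sketched as below. The optional scale (`gamma`) and shift (`beta`) parameters are learned in practice; the defaults and the sample input are assumptions made for this example.

```python
import math
from typing import List, Optional

def layer_norm(x: List[float],
               gamma: Optional[List[float]] = None,
               beta: Optional[List[float]] = None,
               eps: float = 1e-5) -> List[float]:
    """Normalize one vector across its features, then apply scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n  # identity scale when none is given
    beta = beta or [0.0] * n    # zero shift when none is given
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

The statistics come from the features of one example only, so the result does not depend on what else is in the batch.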

Learn more →

Memory (AI Context)

Memory in AI systems enables LLMs to retain and access information across interactions (including conversation history, user preferences, and learned facts) beyond the limitations of a single context window.

Learn more →

Mixture of Experts (MoE)

Mixture of Experts is a neural network architecture that uses multiple specialized sub-networks (experts) with a routing mechanism that activates only a subset for each input, enabling larger model capacity without proportionally increasing compute costs.
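
A toy sketch of top-k routing, assuming the router scores are already available. In a real layer the scores come from a learned gating network applied to the input and the experts are feed-forward sub-networks; the scalar experts and logits here are invented for illustration.

```python
import math
from typing import Callable, List, Tuple

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x: float,
                experts: List[Callable[[float], float]],
                router_logits: List[float],
                top_k: int = 2) -> Tuple[float, List[int]]:
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([router_logits[i] for i in chosen])  # renormalize over chosen experts
    output = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return output, chosen

# Hypothetical experts and router scores.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
y, used = moe_forward(3.0, experts, logits)
print(y, used)
```

Only two of the four experts are evaluated for this input, which is the source of the compute savings relative to a dense layer with the same total parameter count.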

Learn more →

Multi-Head Attention

An extension of attention that runs multiple attention operations in parallel with different learned projections, allowing the model to capture different types of relationships simultaneously.

Learn more →

Multimodal

AI systems that can process and generate multiple types of data (text, images, audio, video) within a single model, enabling cross-modal understanding and generation.

Learn more →

Orchestration

Orchestration in AI systems coordinates multiple LLM calls, tools, and data sources into cohesive workflows, managing the flow of information, error handling, and state across complex multi-step operations.

Learn more →

Positional Encoding

A technique for injecting sequence order information into Transformer models, which otherwise process tokens in parallel without inherent position awareness.
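
One well-known scheme is the fixed sinusoidal encoding, sketched below; learned and rotary encodings are common alternatives. The function name and the small `d_model` are assumptions made for this example.

```python
import math
from typing import List

def sinusoidal_position(pos: int, d_model: int) -> List[float]:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    vec: List[float] = []
    for i in range(0, d_model, 2):
        # Wavelength grows geometrically with the dimension index.
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

encodings = [sinusoidal_position(p, 4) for p in range(3)]
print(encodings[0])
```

Each position gets a distinct vector that is added to the token embedding, giving the model a signal for order.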

Learn more →

Self-Attention

An attention mechanism where a sequence attends to itself, allowing each position to gather information from all other positions in the same sequence. The core operation in Transformer models.
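
A minimal pure-Python sketch of scaled dot-product self-attention. To keep it short, the learned query, key, and value projections are omitted (the input vectors serve as all three), which is a simplification relative to a real Transformer layer; the input values are invented.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x: Matrix) -> Tuple[Matrix, Matrix]:
    """Each position attends to every position in the same sequence."""
    d = len(x[0])
    weights: Matrix = []
    for q in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all input rows.
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
           for row in weights]
    return out, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens)
print(weights[0])
```

Every row of `weights` sums to 1, so each output vector is a mixture of the whole sequence, weighted by similarity.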

Learn more →

Speech-to-Text

AI systems that convert spoken language into written text, also known as automatic speech recognition (ASR). Modern approaches use neural networks trained on massive audio-text datasets.

Learn more →

Text-to-Image

AI systems that generate images from natural language descriptions, using models like diffusion or transformers to translate text prompts into visual content.

Learn more →

Text-to-Speech

AI systems that convert written text into natural-sounding spoken audio, using neural networks to synthesize human-like voice with appropriate prosody, emotion, and intonation.

Learn more →

Transformer

A neural network architecture that uses self-attention to process input sequences in parallel, enabling models to capture long-range dependencies and scale efficiently. The foundation of nearly all modern LLMs.

Learn more →

Vision-Language Model

A multimodal AI model that processes both images and text together, enabling visual understanding, image-based reasoning, and text generation grounded in visual content.

Learn more →

Careers

AI engineering roles, skills, and career paths

AI Engineer

An AI Engineer is a professional who builds production AI applications by integrating large language models, designing RAG systems, implementing AI agents, and deploying AI-powered features into real products.

Learn more →

AI Engineer Interview

AI Engineer interviews typically include system design for AI applications, coding challenges involving LLM integration, prompt engineering exercises, and discussions of RAG architecture and agent design patterns.

Learn more →

AI Engineer Roadmap

An AI Engineer roadmap is a structured learning path that guides developers from foundational skills through advanced AI engineering topics like RAG, agents, and production deployment.

Learn more →

AI Engineering Portfolio

An AI Engineering portfolio showcases projects demonstrating practical skills in building AI applications, including RAG systems, AI agents, LLM integrations, and production-ready AI features.

Learn more →

AI Engineering Salary

AI Engineer salaries typically range from $150,000 to $400,000+ in the US market, varying based on experience level, location, company type, and specialization within the AI engineering field.

Learn more →

AI Engineering Skills

AI Engineering skills include Python programming, LLM API integration, RAG system design, prompt engineering, vector databases, AI agent development, evaluation methods, and production deployment practices.

Learn more →

AI Product Manager

An AI Product Manager is responsible for defining, prioritizing, and guiding the development of AI-powered products, bridging technical AI capabilities with user needs and business objectives.

Learn more →

LLMOps Engineer

An LLMOps Engineer specializes in the operational aspects of LLM applications: deployment, monitoring, evaluation, cost optimization, and maintaining reliability of AI systems in production.

Learn more →

ML Engineer vs AI Engineer

ML Engineers focus on training and optimizing machine learning models from scratch, while AI Engineers specialize in building applications using pre-trained foundation models like GPT-4 or Claude.

Learn more →

Prompt Engineer

A Prompt Engineer is a professional who specializes in crafting, testing, and optimizing prompts to get the best performance from large language models for specific tasks and applications.

Learn more →

Prompting

Prompt engineering techniques and patterns

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting encourages LLMs to show their reasoning step-by-step before giving a final answer, dramatically improving performance on complex reasoning and math problems.

Learn more →

DSPy

DSPy is a framework for programmatically optimizing LLM prompts and pipelines, replacing manual prompt engineering with automated optimization based on metrics and examples.

Learn more →

Few-shot Prompting

Few-shot prompting provides a small number of examples (typically 2-5) of desired input-output pairs before the actual task, helping the LLM understand the expected format, style, and behavior through demonstration.

Learn more →

Prompt Templates

Prompt templates are reusable prompt structures with placeholder variables that can be filled in dynamically, enabling consistent and maintainable prompt engineering across applications.
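
A minimal sketch using Python's standard `string.Template`. The prompt wording and the placeholder names (`doc_type`, `max_words`, `text`) are assumptions made for this example.

```python
from string import Template

# Hypothetical reusable template with three placeholders.
SUMMARY_PROMPT = Template(
    "You are a helpful assistant.\n"
    "Summarize the following $doc_type in at most $max_words words:\n\n"
    "$text"
)

prompt = SUMMARY_PROMPT.substitute(
    doc_type="email",
    max_words=50,
    text="Hi team, the release moves to Friday because QA found a blocker.",
)
print(prompt)
```

`substitute` raises `KeyError` when a placeholder is left unfilled, which catches incomplete prompts before they reach the model.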

Learn more →

ReAct Prompting

ReAct (Reasoning + Acting) prompting combines chain-of-thought reasoning with action execution, creating agents that think about what to do, take actions, observe results, and iterate.

Learn more →

Role Prompting

Role prompting assigns a specific persona or expertise to an LLM (e.g., 'You are a senior Python developer'), which influences its response style, vocabulary, and approach to problems.

Learn more →

Self-Consistency

Self-consistency is a prompting technique that generates multiple reasoning paths for the same question and selects the most frequent answer, improving accuracy on complex reasoning tasks.
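
The selection step is a majority vote over sampled answers. In the sketch below, `sample_answer` is a hypothetical callable standing in for one sampled model completion reduced to its final answer; the canned answers are invented so the example runs without a model.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(sample_answer: Callable[[str], str],
                           question: str, n: int = 5) -> str:
    """Sample n answers and return the most frequent one."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a model sampled at non-zero temperature.
canned = iter(["42", "41", "42", "42", "40"])
final = self_consistent_answer(lambda q: next(canned), "What is 6 * 7?")
print(final)
```

The vote is over final answers only, so different reasoning paths that reach the same result reinforce each other.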

Learn more →

System Prompt Best Practices

System prompt best practices are guidelines for writing effective system-level instructions that define an LLM's behavior, persona, constraints, and output format for consistent, high-quality responses.

Learn more →

Tree of Thoughts

Tree of Thoughts (ToT) extends chain-of-thought by exploring multiple reasoning paths in parallel, allowing the model to consider different approaches and backtrack from dead ends.

Learn more →

Zero-shot Prompting

Zero-shot prompting is asking an LLM to perform a task without providing any examples, relying entirely on the model's pre-trained knowledge and the clarity of your instructions.

Learn more →

Agents

AI agents, automation, and agentic systems

Agent Handoff

Agent handoff is the process of transferring control from one AI agent to another, including passing relevant context and state so the receiving agent can continue the task seamlessly.

Learn more →

Agent Memory

Agent memory refers to systems that allow AI agents to store, retrieve, and use information across interactions, enabling persistent context and learning from past experiences.

Learn more →

Agent Orchestration

Agent orchestration is the process of coordinating multiple AI agents, managing their workflows, handling communication between them, and ensuring they work together effectively toward a common goal.

Learn more →

Agentic AI

Agentic AI refers to AI systems that can autonomously plan, reason, and execute multi-step tasks with minimal human intervention, going beyond simple prompt-response interactions.

Learn more →

Agentic Workflow

An agentic workflow is an AI-driven process where agents autonomously execute multi-step tasks, make decisions, use tools, and adapt their approach based on intermediate results to achieve a goal.

Learn more →

AI Agent

An AI agent is an autonomous system that uses large language models to perceive its environment, make decisions, and take actions to accomplish goals without constant human guidance.

Learn more →

Browser Agents

Browser agents are AI systems that can autonomously navigate websites, fill forms, click buttons, and extract information from web pages to complete tasks on behalf of users.

Learn more →

Human-in-the-Loop (HITL)

Human-in-the-loop (HITL) is a design pattern where AI agents pause at critical decision points to request human approval or guidance before proceeding with potentially consequential actions.

Learn more →

Multi-Agent System

A multi-agent system (MAS) is an architecture where multiple specialized AI agents collaborate, communicate, and coordinate to solve complex problems that exceed the capabilities of a single agent.

Learn more →

ReAct Pattern

ReAct (Reasoning + Acting) is an agent design pattern where the LLM alternates between reasoning about what to do next and taking actions, creating an explicit thought-action-observation loop.

Learn more →

Supervisor Agent

A supervisor agent is a controller agent that manages and coordinates other agents in a multi-agent system, deciding which agent should handle each task and aggregating their results.

Learn more →

RAG

Retrieval-Augmented Generation systems

CRAG (Corrective RAG)

Corrective RAG (CRAG) adds a self-correction layer that evaluates retrieved documents and triggers additional retrieval strategies (like web search) when the initial retrieval is insufficient.

Learn more →

GraphRAG

GraphRAG enhances traditional RAG by building a knowledge graph from documents, enabling retrieval of relationships and multi-hop reasoning that vector similarity search alone cannot capture.

Learn more →

Hybrid Search

Hybrid search combines dense vector retrieval (semantic similarity) with sparse retrieval (keyword matching like BM25) to capture both conceptual meaning and exact term matches.
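
One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below; weighted score blending is another option. The document IDs and rankings are invented, and `k=60` is a conventional default rather than something specified here.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Score each document by the sum of 1 / (k + rank) across all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a dense retriever and a sparse (keyword) retriever.
dense = ["d2", "d1", "d3"]
sparse = ["d1", "d4", "d2"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

RRF uses only ranks, so it avoids having to put embedding similarities and keyword scores on a common scale.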

Learn more →

Multimodal RAG

Multimodal RAG extends retrieval-augmented generation to handle images, videos, audio, and documents alongside text, enabling AI to answer questions using visual and other non-text content.

Learn more →

Parent Document Retriever

Parent Document Retriever is a RAG pattern that embeds small chunks for precise retrieval but returns their larger parent documents for generation, balancing retrieval precision with context richness.
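
The pattern can be sketched as an index of small chunks that each remember their parent. Word overlap stands in for embedding similarity to keep the example dependency-free; the documents, chunk size, and function names are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

def split_into_chunks(text: str, size: int) -> List[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap(query: str, chunk: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

# Hypothetical parent documents.
parents: Dict[str, str] = {
    "handbook": "Employees accrue vacation monthly. Unused vacation days roll over once per year.",
    "security": "Rotate API keys every quarter. Store secrets in the vault, never in code.",
}

# Index small chunks, each tagged with the ID of its parent document.
index: List[Tuple[str, str]] = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split_into_chunks(text, 6)
]

def retrieve_parent(query: str) -> str:
    """Match against small chunks, but return the full parent document."""
    _, parent_id = max(index, key=lambda item: overlap(query, item[0]))
    return parents[parent_id]

result = retrieve_parent("rotate api keys")
print(result)
```

The match is made on a six-word chunk, yet the generator receives the whole document, including the sentence about storing secrets that the chunk alone would have missed.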

Learn more →

Query Expansion

Query expansion augments the original user query with synonyms, related terms, or LLM-generated variations to improve retrieval recall and find relevant documents the original query might miss.

Learn more →

RAG Evaluation

RAG evaluation measures the quality of RAG systems across multiple dimensions: retrieval accuracy, answer faithfulness to sources, relevance to the query, and overall response quality.

Learn more →

RAG Pipeline

A RAG pipeline is the complete system for Retrieval-Augmented Generation, including document ingestion, chunking, embedding, indexing, retrieval, and generation components working together.

Learn more →

RAG vs Fine-tuning

RAG retrieves external knowledge at inference time, while fine-tuning bakes knowledge into model weights during training. Each approach suits different use cases for adding custom knowledge to LLMs.

Learn more →

Self-RAG

Self-RAG is an advanced RAG pattern where the model decides whether retrieval is needed, retrieves if necessary, grades retrieved documents for relevance, and self-critiques its generated answer.

Learn more →

Coding Tools

AI coding assistants and developer tools

Agentic Coding

Agentic coding refers to AI-assisted development where the AI autonomously plans, executes, and iterates on multi-step coding tasks rather than just providing suggestions for the developer to accept.

Learn more →

AI Code Review

AI code review uses LLMs to automatically analyze code changes, identify bugs, suggest improvements, check for security issues, and provide feedback similar to human reviewers.

Learn more →

Aider

Aider is an open-source, terminal-based AI pair programming tool that can edit code in your local git repository using natural language commands and automatic commits.

Learn more →

Amazon Q Developer

Amazon Q Developer (formerly CodeWhisperer) is AWS's AI coding assistant offering code suggestions, security scanning, and deep integration with AWS services and enterprise features.

Learn more →

Cursor AI

Cursor is an AI-first code editor (VS Code fork) designed around AI-assisted development, offering deep codebase understanding, multi-file editing, and agentic coding capabilities.

Learn more →

GitHub Copilot

GitHub Copilot is an AI coding assistant from GitHub (a Microsoft subsidiary) that provides real-time code suggestions, chat-based help, and autonomous coding capabilities directly within VS Code and other IDEs.

Learn more →

Tabnine

Tabnine is a privacy-focused AI coding assistant that can run entirely on-premise or locally, making it popular with enterprises that cannot send code to external cloud services.

Learn more →

Vibe Coding

Vibe coding (coined by Andrej Karpathy) describes a development style where you describe what you want in natural language and let AI write the code, embracing imperfection and rapid iteration.

Learn more →

Windsurf (Codeium)

Windsurf is Codeium's AI-native IDE featuring Cascade, an agentic workflow system that can execute multi-step coding tasks autonomously while keeping developers in control.

Learn more →

Models

LLM models, providers, and comparisons

Grok

Grok is xAI's large language model known for its real-time access to X (Twitter) data, witty personality, and willingness to engage with controversial topics that other models refuse to discuss.

Learn more →

Mistral

Mistral AI is a French company known for efficient open-weight models like Mistral 7B and Mixtral, which deliver strong performance relative to their size and helped popularize the Mixture of Experts architecture among open-weight models.

Learn more →

Qwen

Qwen is Alibaba's family of open-weight language models offering competitive performance across sizes from 0.5B to 110B+ parameters, with strong multilingual capabilities and coding-specific variants.

Learn more →

Safety

AI safety, security, and responsible AI

AI Bias

AI bias refers to systematic errors in AI systems that produce unfair outcomes, typically arising from biased training data, flawed model design, or problematic deployment contexts.

Learn more →

AI Guardrails Implementation

AI guardrails are protective systems that constrain LLM behavior by filtering inputs, validating outputs, and enforcing safety policies to prevent harmful or unintended AI responses.

Learn more →

AI Hallucination Causes

AI hallucinations occur when LLMs generate confident but incorrect information, caused by training data gaps, pattern overgeneralization, and the model's inability to distinguish known from unknown facts.

Learn more →

AI Safety

AI safety is the field focused on ensuring AI systems behave as intended, avoid harmful outputs, resist manipulation, and remain under human control as they become more capable.

Learn more →

Content Filtering

Content filtering in AI systems automatically screens LLM inputs and outputs to block or flag harmful, inappropriate, or policy-violating content before it reaches users.

Learn more →

Llama Guard

Llama Guard is Meta's open-source safety classifier model designed to detect unsafe content in LLM inputs and outputs, providing a moderation layer for AI applications.

Learn more →

Prompt Injection Defense

Prompt injection defense refers to techniques for preventing attackers from manipulating LLM applications by inserting malicious instructions that override intended system behavior.

Learn more →

Red Teaming AI

AI red teaming is the practice of systematically testing AI systems for vulnerabilities, biases, and failure modes by simulating adversarial attacks and edge cases before deployment.

Learn more →

Responsible AI

Responsible AI is a framework for developing AI systems that are fair, transparent, accountable, and aligned with human values, addressing ethical considerations throughout the AI lifecycle.

Learn more →

Multimodal

Vision, audio, video, and multi-modal AI