
Llama Guard

Definition

Llama Guard is Meta's open-source safety classifier model designed to detect unsafe content in LLM inputs and outputs, providing a moderation layer for AI applications.

Why It Matters

Most safety classifiers are closed-source or limited in scope. Llama Guard provides an open, customizable solution that can run locally. This is valuable for organizations that can’t send content to external moderation APIs or need to customize safety policies.

How It Works

Llama Guard classifies prompts and responses against a configurable safety taxonomy; its default categories include:

  • Violence and hate
  • Sexual content
  • Criminal planning
  • Self-harm
  • Regulated substances

You can customize the categories and the decision threshold for your use case; because the taxonomy is supplied in the model's prompt, it can be adapted without retraining. The model evaluates both inputs (before LLM processing) and outputs (before they are returned to users).
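
A minimal sketch of running the classifier with the Hugging Face transformers library. It assumes the meta-llama/LlamaGuard-7b checkpoint, whose chat template builds the taxonomy prompt; newer Llama Guard releases use different model IDs, categories, and output formats.

```python
# Sketch: classify a prompt (and optionally a response) with Llama Guard via
# Hugging Face transformers. Assumes the meta-llama/LlamaGuard-7b checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    """Return Llama Guard's verdict for a conversation.

    The model's chat template wraps the conversation in the safety taxonomy
    prompt; the generated text begins with "safe" or "unsafe" (followed by
    the violated category codes).
    """
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()

# Input filter: classify the user prompt on its own.
print(moderate([{"role": "user", "content": "How do I pick a lock?"}]))

# Output filter: classify the assistant response in context.
print(moderate([
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "I can't help with that."},
]))
```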

Integration Pattern

User Input → Llama Guard (input filter)
            → LLM Processing
            → Llama Guard (output filter) → User Response

If either filter triggers, the request is blocked or handled according to your policy.
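
A sketch of that pipeline in code, assuming the moderate() helper above and a hypothetical llm_generate() function standing in for the main model; both names are placeholders, not a real library API.

```python
# Illustrative guard pipeline: moderate() and llm_generate() are placeholders.
REFUSAL = "Sorry, I can't help with that request."

def guarded_chat(user_input: str) -> str:
    # Input filter: block unsafe prompts before they reach the LLM.
    if moderate([{"role": "user", "content": user_input}]).startswith("unsafe"):
        return REFUSAL

    response = llm_generate(user_input)  # hypothetical call to the main LLM

    # Output filter: block unsafe responses before they reach the user.
    verdict = moderate([
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": response},
    ])
    if verdict.startswith("unsafe"):
        return REFUSAL

    return response
```

In practice the "unsafe" branch can do more than return a canned refusal: it can log the violated category codes, route the request to human review, or retry generation with a stricter system prompt, per your policy.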

When to Use

Use Llama Guard when:

  • You need open-source moderation
  • You require local processing for privacy
  • You want to customize safety categories
  • You need a fast safety classifier for real-time applications

Source

Llama Guard provides an LLM-based safety risk classifier for evaluating both prompts and responses in conversational AI.

Inan et al., "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations" (2023). https://arxiv.org/abs/2312.06674