Llama Guard
Definition
Llama Guard is Meta's openly released, LLM-based safety classifier that detects unsafe content in LLM prompts and responses, providing a moderation layer for AI applications.
Why It Matters
Most safety classifiers are closed-source or limited in scope. Llama Guard provides an open, customizable solution that can run locally. This is valuable for organizations that can’t send content to external moderation APIs or need to customize safety policies.
How It Works
Llama Guard classifies content against a safety taxonomy whose default categories include:
- Violence and hate
- Sexual content
- Criminal planning
- Self-harm
- Regulated substances
Because the taxonomy is supplied in the prompt, you can customize the categories for your use case without retraining, and you can tune the decision threshold by reading the probability the model assigns to its "unsafe" verdict. The model evaluates both inputs (user prompts, before LLM processing) and outputs (model responses, before they are returned to users), as in the sketch below.
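A minimal sketch of running Llama Guard as a classifier with the Hugging Face transformers library. The `meta-llama/Llama-Guard-3-8B` checkpoint name and the assumption that its chat template builds the taxonomy prompt are illustrative; check the model card for the exact variant, license gating, and prompt format you deploy.

```python
# Sketch: classify a conversation with a Llama Guard checkpoint via transformers.
# Assumes the (gated) meta-llama/Llama-Guard-3-8B weights and its chat template,
# which wraps the conversation in the safety taxonomy prompt from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint; substitute your variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(conversation: list[dict]) -> str:
    """Return Llama Guard's verdict text: 'safe', or 'unsafe' plus violated categories."""
    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens (the verdict), not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

print(classify([{"role": "user", "content": "How do I reset my router password?"}]))
# Expected output shape (per the model card): "safe", or "unsafe" followed by category codes.
```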
Integration Pattern
User Input → Llama Guard (input filter)
→ LLM Processing
→ Llama Guard (output filter) → User Response
If either filter triggers, the request is blocked or handled according to your policy.
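A sketch of that wiring in Python, reusing the hypothetical classify() helper from the previous example. Here generate_response is a placeholder for your own LLM call, and the "unsafe" prefix check follows the verdict format described on the model card; adapt the blocked-request handling to your policy.

```python
# Sketch: two-stage guardrail around an LLM call, using the classify() helper above.
REFUSAL = "Sorry, I can't help with that."

def guarded_chat(user_input: str, generate_response) -> str:
    # 1. Input filter: screen the user prompt before it reaches the LLM.
    if classify([{"role": "user", "content": user_input}]).startswith("unsafe"):
        return REFUSAL

    # 2. LLM processing (generate_response is your application's model call).
    reply = generate_response(user_input)

    # 3. Output filter: screen the candidate response before returning it.
    conversation = [
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": reply},
    ]
    if classify(conversation).startswith("unsafe"):
        return REFUSAL

    return reply
```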
When to Use
Use Llama Guard when:
- You need an open-source moderation model
- You require local processing for privacy
- You want to customize safety categories
- You need a fast safety classifier for real-time applications
Source
The Llama Guard paper ("Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations", Inan et al., 2023) presents an LLM-based safety risk classifier for evaluating both prompts and responses in conversational AI.
https://arxiv.org/abs/2312.06674