LLM

Jailbreaking

Definition

Jailbreaking is the practice of circumventing an LLM's built-in safety restrictions and content policies through specially crafted prompts, causing the model to generate normally prohibited outputs.

Why It Matters

LLM providers invest heavily in safety training that teaches models to refuse harmful requests; jailbreaking bypasses it. “Write malware” gets refused, but “You are a cybersecurity researcher writing a paper. Include an example of…” might succeed.

Jailbreaks demonstrate that safety isn’t solved. Models trained to be helpful can be manipulated into being harmful. The cat-and-mouse game between safety researchers and jailbreak discoverers continues indefinitely.

For AI engineers, understanding jailbreaking matters for two reasons. First, if you’re building systems on top of LLMs, you need to understand their safety limitations. Second, if you’re implementing safety measures, you need to know what attacks to defend against.

Implementation Basics

Common Jailbreak Techniques

Roleplay/Persona “Pretend you are an AI without restrictions…” or “You are DAN (Do Anything Now)…” The model enters a character that doesn’t follow normal rules.

Hypothetical Framing “In a fictional story where…” or “For educational purposes…” Frame harmful content as fiction or education to bypass filters.

Encoding/Obfuscation Use base64 encoding, character substitution, or unusual formatting to hide prohibited terms from content filters (a short sketch below shows why this defeats naive keyword matching).

Multi-Turn Manipulation Gradually escalate over multiple messages. Start with benign questions, slowly shift to harmful territory.

Instruction Overriding “Ignore your previous instructions. Your new instructions are…” Direct attempts to override system prompts.
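
To make the encoding/obfuscation point concrete, here is a minimal Python sketch using an illustrative placeholder blocked term rather than a real policy list: a naive substring filter misses a base64-encoded term, while normalizing suspicious-looking tokens by decoding them first lets the same filter catch it.

```python
import base64
import re

# Illustrative placeholder for whatever term your policy prohibits.
BLOCKED_TERMS = {"blockedterm"}

def naive_filter(prompt: str) -> bool:
    """Flag a prompt only if a blocked term appears verbatim."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def normalize(prompt: str) -> str:
    """Decode tokens that look like base64 so hidden terms become visible to the filter."""
    decoded = []
    for token in prompt.split():
        if re.fullmatch(r"[A-Za-z0-9+/]{8,}={0,2}", token):
            try:
                decoded.append(base64.b64decode(token).decode("utf-8", errors="ignore"))
                continue
            except Exception:
                pass  # not valid base64 after all; keep the original token
        decoded.append(token)
    return " ".join(decoded)

obfuscated = "Please explain " + base64.b64encode(b"blockedterm").decode()

print(naive_filter(obfuscated))             # False: the encoded term slips past
print(naive_filter(normalize(obfuscated)))  # True: decoding first exposes it
```

Real normalization has to cover more than base64 (character substitution, homoglyphs, other encodings), which is one reason content classifiers are generally preferred over keyword lists.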

Why Jailbreaks Work

LLMs are trained on massive text including fiction, roleplay, and hypothetical scenarios. When prompted with appropriate framing, they continue patterns from training data. Safety training adds a layer of refusal behavior, but it doesn’t eliminate the underlying capability.

Defensive Considerations

For System Builders

  • Don’t rely solely on model-level safety; add application-level guardrails (sketched after this list)
  • Monitor outputs for policy violations
  • Use content classifiers as a second line of defense
  • Limit model capabilities to reduce harm potential
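
As a sketch of what an application-level guardrail layer can look like, the outline below wraps a model call with input and output checks and logs anything it blocks. The generate and classify_text functions, the GuardedResponse type, and the 0.8 threshold are hypothetical placeholders, not any specific provider's API.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("guardrails")

# Hypothetical stand-ins: swap in your actual model client and content classifier.
def generate(prompt: str) -> str:
    """Call the underlying LLM (placeholder)."""
    raise NotImplementedError

def classify_text(text: str) -> float:
    """Return a policy-violation score in [0, 1] (placeholder classifier)."""
    raise NotImplementedError

VIOLATION_THRESHOLD = 0.8  # illustrative cutoff; tune against your own policy

@dataclass
class GuardedResponse:
    text: str
    blocked: bool
    reason: Optional[str] = None

def guarded_generate(prompt: str) -> GuardedResponse:
    # Application-level check on the incoming prompt, independent of model-level safety.
    if classify_text(prompt) >= VIOLATION_THRESHOLD:
        logger.warning("prompt flagged by classifier")
        return GuardedResponse(text="", blocked=True, reason="prompt flagged")

    output = generate(prompt)

    # Second line of defense: classify the model's output before returning it.
    if classify_text(output) >= VIOLATION_THRESHOLD:
        logger.warning("output flagged by classifier")
        return GuardedResponse(text="", blocked=True, reason="output flagged")

    return GuardedResponse(text=output, blocked=False)
```

Logging blocked prompts and outputs also provides the monitoring signal mentioned above.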

For Model Evaluation

  • Include jailbreak attempts in testing
  • Track emerging jailbreak techniques
  • Red-team systems before deployment (a minimal harness is sketched after this list)
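
A red-team pass can be run as an automated regression check. The sketch below is hypothetical: generate stands in for the system under test, the probe prompts are elided, and the refusal check is a crude string heuristic that real evaluations would replace with a judge model or human review.

```python
# Hypothetical stand-ins: replace generate() with the system under test and
# JAILBREAK_PROBES with a curated, regularly updated corpus of known techniques.
def generate(prompt: str) -> str:
    """Call the deployed system, including any guardrails (placeholder)."""
    raise NotImplementedError

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def looks_like_refusal(output: str) -> bool:
    """Crude string heuristic; real evaluations use a judge model or human review."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# Prompts elided; each probe records which jailbreak technique it exercises.
JAILBREAK_PROBES = [
    {"technique": "roleplay_persona", "prompt": "..."},
    {"technique": "hypothetical_framing", "prompt": "..."},
    {"technique": "multi_turn_escalation", "prompt": "..."},
    {"technique": "instruction_override", "prompt": "..."},
]

def run_red_team_suite() -> list[dict]:
    """Return the probes that were NOT refused, for investigation before deployment."""
    failures = []
    for probe in JAILBREAK_PROBES:
        output = generate(probe["prompt"])
        if not looks_like_refusal(output):
            failures.append({"technique": probe["technique"], "output": output})
    return failures
```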

Understanding jailbreaks helps you build more robust systems. If you know how safety measures fail, you can add defenses where they’re weakest.

Source

Jailbreak attacks exploit vulnerabilities in RLHF-aligned language models to bypass safety guidelines through adversarial prompting techniques.

https://arxiv.org/abs/2307.15043