Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions”
Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208.
1. Problem Statement
Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions .
2. The Instruction Hierarchy Framework
The authors propose an instruction hierarchy to define priority levels for different input types:
- Highest priority: System messages (developer-provided instructions)
- Medium priority: User messages
- Lowest priority: Third-party content (e.g., tool outputs, web results) .
When instructions conflict, LLMs should prioritize higher-level instructions and ignore or refuse lower-privileged, misaligned commands. For example, a system message instructing an LLM to act as an email assistant should override a user’s attempt to inject a command to forward all emails .
3. Training Approach
- Aligned Instructions: Use synthetic data generation to decompose complex requests (e.g., “write a 20-line poem in Spanish”) into sub-instructions at different hierarchy levels, training models to compose responses.
- Misaligned Instructions: Employ “context ignorance” to train models to ignore lower-privileged instructions that conflict with higher ones. For example, if a tool output contains a prompt injection, the model should respond as if the injection did not exist .
4. Experimental Results
- Robustness Improvements: The method significantly enhances defense against various attacks:
- System prompt extraction: +63% improvement .
- Jailbreak resistance: +30% improvement .
- Indirect prompt injections via browsing: from 32.8% to 89.6% robustness .
- Generalization: The model shows strong performance against unseen attacks, such as password extraction and tool-based injections, without compromising core capabilities (e.g., TriviaQA accuracy remains comparable) .
- Trade-offs: Some “over-refusal” of benign queries occurs but is manageable with further data collection .
5. Future Directions
- Refine conflict handling for tool outputs and multi-modal inputs (e.g., images, audio) .
- Explore architectural changes (e.g., specialized embeddings for different priority levels) .
- Enhance adversarial training to address remaining vulnerabilities .
6. Key Contributions
- A formalized instruction hierarchy to prioritize security-critical instructions.
- Automated data generation methods to teach hierarchical behavior.
- Empirical evidence that the approach boosts LLM robustness without sacrificing general capabilities .