Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions”

Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208.

1. Problem Statement

Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions .

2. The Instruction Hierarchy Framework

The authors propose an instruction hierarchy to define priority levels for different input types:

Highest priority: System messages (developer-provided instructions)
Medium priority: User messages
Lowest priority: Third-party content (e.g., tool outputs, web results) .

When instructions conflict, LLMs should prioritize higher-level instructions and ignore or refuse lower-privileged, misaligned commands. For example, a system message instructing an LLM to act as an email assistant should override a user’s attempt to inject a command to forward all emails .

3. Training Approach

Aligned Instructions: Use synthetic data generation to decompose complex requests (e.g., “write a 20-line poem in Spanish”) into sub-instructions at different hierarchy levels, training models to compose responses.
Misaligned Instructions: Employ “context ignorance” to train models to ignore lower-privileged instructions that conflict with higher ones. For example, if a tool output contains a prompt injection, the model should respond as if the injection did not exist .

4. Experimental Results

Robustness Improvements: The method significantly enhances defense against various attacks:
- System prompt extraction: +63% improvement .
- Jailbreak resistance: +30% improvement .
- Indirect prompt injections via browsing: from 32.8% to 89.6% robustness .
Generalization: The model shows strong performance against unseen attacks, such as password extraction and tool-based injections, without compromising core capabilities (e.g., TriviaQA accuracy remains comparable) .
Trade-offs: Some “over-refusal” of benign queries occurs but is manageable with further data collection .

5. Future Directions

Refine conflict handling for tool outputs and multi-modal inputs (e.g., images, audio) .
Explore architectural changes (e.g., specialized embeddings for different priority levels) .
Enhance adversarial training to address remaining vulnerabilities .

6. Key Contributions

A formalized instruction hierarchy to prioritize security-critical instructions.
Automated data generation methods to teach hierarchical behavior.
Empirical evidence that the approach boosts LLM robustness without sacrificing general capabilities .

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions”#

1. Problem Statement#

2. The Instruction Hierarchy Framework#

3. Training Approach#

4. Experimental Results#

5. Future Directions#

6. Key Contributions#