Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions
Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions” Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208. 1. Problem Statement Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions ....