Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision. https://arxiv.org/abs/2312.09390
Research Context and Objectives
The paper addresses a critical challenge in aligning superhuman AI models: once humans can no longer reliably evaluate a model’s complex behaviors, can weak supervision (e.g., labels from weaker models) still elicit the full capabilities of a stronger model? The authors, from OpenAI, study this question empirically, aiming to bridge the gap between current alignment techniques (such as RLHF) and what aligning superhuman models will require.
Key Methodology
- Setup: Fine-tune strong “student” models on labels generated by weak “supervisor” models, and compare against the same strong models fine-tuned on ground-truth labels (the strong ceiling).
- Tasks:
  - Natural Language Processing (NLP) benchmarks.
  - Chess puzzle prediction.
  - ChatGPT reward modeling.
- Metrics: Performance Gap Recovered (PGR), the fraction of the gap between the weak supervisor’s performance and the strong ceiling that weak-to-strong training recovers.
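For concreteness, PGR is (student − weak) / (ceiling − weak). A minimal Python sketch; the function name and accuracy numbers below are illustrative placeholders, not values from the paper:

```python
def performance_gap_recovered(weak_perf: float,
                              student_perf: float,
                              ceiling_perf: float) -> float:
    """Fraction of the weak-supervisor-to-strong-ceiling gap recovered by the
    weak-to-strong student: 0 means no better than the weak supervisor,
    1 means the student matches the strong ceiling."""
    return (student_perf - weak_perf) / (ceiling_perf - weak_perf)

# Made-up numbers: weak supervisor 60% accuracy, strong ceiling 90%,
# weak-to-strong student 75%  ->  PGR = 0.5
print(performance_gap_recovered(weak_perf=0.60, student_perf=0.75, ceiling_perf=0.90))
```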
Core Findings
Naive Weak-to-Strong Generalization:
- Strong students consistently outperform weak supervisors. For example, fine-tuning GPT-4 with GPT-2-level supervision recovers ~50% of the performance gap on NLP tasks.
- However, significant gaps remain compared to strong ceilings, especially in reward modeling.
Improving Generalization:
- Bootstrapping: Iteratively training through a chain of intermediate model sizes (each supervising the next larger model) improves PGR on chess puzzles, where large supervisor-student gaps otherwise hurt generalization.
- Auxiliary Confidence Loss: Adding a loss term that encourages the strong student to stay confident in its own predictions, even when they disagree with the weak labels, boosts NLP PGR to ~80% (see the sketch after this list).
- Generative Finetuning: An additional unsupervised (generative) finetuning stage on task-relevant data improves reward modeling PGR by roughly 10-20%.
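A minimal PyTorch-style sketch of the auxiliary confidence loss for binary classification, assuming soft weak labels in [0, 1]. The function name, fixed threshold, and mixing weight `alpha` are illustrative choices; the paper tunes and schedules these details.

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(student_logits: torch.Tensor,
                        weak_soft_labels: torch.Tensor,
                        alpha: float = 0.75,
                        threshold: float = 0.5) -> torch.Tensor:
    """Mix cross-entropy against the weak supervisor's soft labels with
    cross-entropy against the student's own hardened predictions, rewarding
    the student for staying confident where it disagrees with the weak labels."""
    # Hardened self-labels: the student's current predictions thresholded to
    # 0/1, detached so no gradient flows through the targets.
    self_labels = (torch.sigmoid(student_logits).detach() > threshold).float()

    weak_term = F.binary_cross_entropy_with_logits(student_logits, weak_soft_labels)
    self_term = F.binary_cross_entropy_with_logits(student_logits, self_labels)
    return (1.0 - alpha) * weak_term + alpha * self_term

# Illustrative usage with random tensors.
logits = torch.randn(8, requires_grad=True)   # strong student's logits
weak_labels = torch.rand(8)                   # weak supervisor's soft labels
aux_confidence_loss(logits, weak_labels).backward()
```

Setting `alpha` to 0 recovers plain finetuning on the weak labels; larger values lean more heavily on the student's own beliefs.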
Underlying Mechanisms:
- Imitation of Errors: Strong students generalize beyond their supervisors early in training but can overfit to the weak supervisor’s mistakes as training continues; the auxiliary confidence loss mitigates this.
- Task Salience: Pretrained strong models already hold latent knowledge of many target tasks; making the task more salient, e.g., via prompting or generative finetuning, makes that knowledge easier to elicit.
Implications and Limitations
- Feasibility: Weak-to-strong generalization is tractable with simple methods, offering a path to aligning superhuman models.
- Challenges: Disanalogies to the future superalignment problem remain; for example, superhuman models may imitate human errors more faithfully than today’s strong models imitate weak-model errors, and the evaluation tasks may already appear in pretraining data (pretraining leakage).
- Future Work: Develop more analogous setups, scalable techniques, and scientific understanding of generalization mechanisms.
Conclusion
The study demonstrates that weak supervision can elicit non-trivial capabilities from strong models, but advanced methods are needed for full alignment. This paves the way for empirical progress on superalignment, a critical step toward safe superhuman AI.
Keywords
Weak-to-strong generalization, model alignment, superhuman AI, RLHF, confidence loss, bootstrapping