Paper Reading - Let’s Verify Step by Step
TLDR In order to train more dependable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which provides feedback on each intermediate reasoning step. This papers provides two finding: The use of process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset. The efficacy of process supervision is significantly improved by active learning. The exclusive focus of this paper is to provide insights on training the most reliable reward model. ...