Adam Weight Decay in BERT
While reading the optimizer implementation in the BERT (Devlin et al., 2019) source code, I came across the following comment:

    # Just adding the square of the weights to the loss function is *not*
    # the correct way of using L2 regularization/weight decay with Adam,
    # since that will interact with the m and v parameters in strange ways.
    #
    # Instead we want to decay the weights in a manner that doesn't interact
    # with the m/v parameters. This is equivalent to adding the square
    # of the weights to the loss with plain (non-momentum) SGD....
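To make the comment concrete, here is a minimal scalar sketch of the two ways to apply weight decay with Adam. It is not BERT's actual optimizer code: the function name `adam_step`, the `decoupled` flag, and the default hyperparameters are all hypothetical, and Adam's bias correction is left out to keep the example short. With L2 in the loss, the decay term `weight_decay * w` enters the gradient and therefore the `m` and `v` statistics. With decoupled decay, it is added only after the adaptive rescaling.

```python
import math


def adam_step(w, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-6,
              weight_decay=0.01, decoupled=True):
    """One scalar Adam step without bias correction (illustrative only).

    decoupled=False: L2 regularization folded into the gradient.
    decoupled=True:  weight decay applied outside the m/v statistics.
    """
    if not decoupled:
        # L2 in the loss: the decay term becomes part of the gradient, so it
        # is accumulated into m and v and later divided by sqrt(v).
        grad = grad + weight_decay * w

    # Standard Adam moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    update = m / (math.sqrt(v) + eps)

    if decoupled:
        # Decoupled decay: added after the adaptive rescaling, so m and v
        # only ever see the gradient of the loss itself.
        update = update + weight_decay * w

    return w - lr * update, m, v


if __name__ == "__main__":
    # Zero loss gradient isolates the effect of the decay term.
    for decay in (0.01, 0.1):
        w_dec, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, weight_decay=decay, decoupled=True)
        w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, weight_decay=decay, decoupled=False)
        print(f"decay={decay}: decoupled w={w_dec:.6f}, L2-in-loss w={w_l2:.6f}")
```

With a zero loss gradient, the decoupled step shrinks the weight by exactly the factor `1 - lr * weight_decay`, as it would under plain SGD. The L2-in-loss step divides the decay term by `sqrt(v)`, so its size is almost independent of `weight_decay`. That is the "strange" interaction with `m` and `v` the comment warns about.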