Adam Weight Decay in BERT

While reading the optimizer implementation in the source code of BERT (Devlin et al., 2019), I came across the following comment:

# Just adding the square of the weights to the loss function is *not*
# the correct way of using L2 regularization/weight decay with Adam,
# since that will interact with the m and v parameters in strange ways.
#
# Instead we want ot decay the weights in a manner that doesn't interact
# with the m/v parameters. This is equivalent to adding the square
# of the weights to the loss with plain (non-momentum) SGD.

It points out explicitly that some conventional implementations of Adam weight decay are incorrect. ...
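To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not the actual BERT/TensorFlow code; the function name `adam_step` and the default hyperparameters are assumptions) contrasting the two places the weight-decay term can sit:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=True):
    """One Adam update for a parameter vector w.

    decoupled=False: fold the L2 term into the gradient, so the decay is
    rescaled by the adaptive m/v statistics (what the BERT comment warns
    against).
    decoupled=True: keep the decay out of m and v and apply it directly
    to the weights (AdamW-style decoupled weight decay).
    """
    if not decoupled:
        grad = grad + weight_decay * w           # decay leaks into m and v

    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)

    update = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        update = update + weight_decay * w       # decay never touches m or v
    return w - lr * update, m, v
```

BERT's actual optimizer differs in some details (for instance, which parameters receive the decay), so treat this purely as an illustration of where the decay term sits relative to the m/v statistics.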

2019-03-03 · 4 min · Cong Chan

Machine Learning Note - CS229 - Stanford

Reference: CS229: Machine Learning, Stanford. What is machine learning? There are currently two definitions. Arthur Samuel described it as giving computers the ability to learn without being explicitly programmed; this is the older, informal definition. Tom Mitchell offers a more modern one. With E the experience (i.e., historical data), T some class of tasks, and P a performance measure for the task: a computer program is said to learn from experience E with respect to task T and performance measure P if its performance at T, as measured by P, improves with experience E. For example, playing checkers: E = the experience of playing many games of checkers, T = the task of playing checkers, P = the probability that the program wins the next game.

Supervised Learning, Linear Regression. Weights (parameters) θ: parameterize the space of linear functions mapping from X to Y. Intercept term: to simplify notation, introduce the convention of letting x0 = 1. Cost function J(θ): a function that measures, for each value of θ, how close the h(x(i))'s are to the corresponding y(i)'s. Purpose: choose θ so as to minimize J(θ). Implementation: use a search algorithm that starts with some "initial guess" for θ and repeatedly changes θ to make J(θ) smaller, until we hopefully converge to a value of θ that minimizes J(θ). LMS (least mean squares) algorithm: gradient descent, learning rate, error term. Batch gradient descent looks at every example in the entire training set on every step. Stochastic gradient descent (incremental gradient descent) repeatedly runs through the training set and, each time it encounters a training example, updates the parameters according to the gradient of the error with respect to that single training example only; particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent. The normal equations perform the minimization explicitly, without resorting to an iterative algorithm: we minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero. To do this without writing reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices ...
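A compact restatement of the key formulas behind this excerpt, in the standard CS229 notation (this summary block is my own addition, not text from the post):

```latex
% Hypothesis and least-squares cost for linear regression
h_\theta(x) = \theta^{T} x, \qquad
J(\theta) = \tfrac{1}{2} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^{2}

% Batch gradient descent: sums over all m training examples per step
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr)\, x_j^{(i)}

% Stochastic gradient descent: updates on a single example i at a time
\theta_j := \theta_j + \alpha \bigl( y^{(i)} - h_\theta(x^{(i)}) \bigr)\, x_j^{(i)}

% Normal equations: closed-form minimizer of J(\theta)
\theta = (X^{T} X)^{-1} X^{T} \vec{y}
```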

2017-12-05 · 18 min · Cong Chan

Machine Learning with Scikit-learn (Sklearn): Hands-on Practice

Scikit-learn provides a set of practical tools for solving real machine learning problems, together with appropriate methods for building solutions. Topics covered include an introduction to data and models, decision trees, the role of error, minimizing error, regression fitting, logistic regression, neural networks, perceptrons, support vector machines, naive Bayes, dimensionality reduction, K-means, simple Gaussian mixture models, hierarchical clustering, and model evaluation. Experiments and code are on GitHub; answers to the exercises can also be found on GitHub.
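As a minimal illustration of the fit/predict workflow these exercises rely on (my own sketch, not taken from the linked repositories):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The same fit/predict pattern applies to the other estimators mentioned
# above (SVMs, naive Bayes, K-means, etc.).
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```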

2017-12-01 · 1 min · Cong Chan