Codex - Evaluating Large Language Models Trained on Code

Codex: M. Chen et al., ‘Evaluating Large Language Models Trained on Code’. arXiv, Jul. 14, 2021. Available: http://arxiv.org/abs/2107.03374. Intro: Codex is a GPT language model fine-tuned on publicly available code from GitHub. Task: docstring-conditional code generation. Method: Codex is obtained by fine-tuning GPT-3 models containing up to 12B parameters on code; Codex-S is obtained by further fine-tuning Codex on standalone, correctly implemented functions. Inference: each HumanEval problem is assembled into a prompt consisting of a header, a signature, and a docstring; nucleus sampling (Holtzman et al., 2020) with top p = 0.95 is used for all sampling evaluation in this work ...
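A minimal sketch of the prompt-plus-nucleus-sampling setup described above, assuming the Hugging Face transformers API and using an open GPT-2 checkpoint as a stand-in for Codex (which is not publicly released); the header/signature/docstring prompt below is illustrative rather than an actual benchmark problem.

```python
# Sketch: docstring-conditional generation with nucleus (top-p) sampling, p = 0.95,
# mirroring the paper's sampling evaluation setup. GPT-2 is a stand-in model;
# the temperature value is an assumed illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt = function signature + docstring; the model is asked to complete the body.
prompt = (
    "def incr_list(l: list):\n"
    '    """Return a list with all elements incremented by 1."""\n'
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,          # nucleus sampling threshold from the paper
    temperature=0.8,     # assumed value for illustration
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```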

2021-12-20 · 2 min · Cong Chan

Scaling Laws for Neural Language Models

Kaplan, Jared, et al. ‘Scaling Laws for Neural Language Models’. arXiv:2001.08361 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/2001.08361. TL;DR: the key findings for Transformer language models are as follows. Performance depends strongly on scale, weakly on model shape: model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3) Smooth power laws: performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3) ...
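For intuition, each single-factor trend has the form L(X) = (X_c / X)^{α_X}; the sketch below plugs in the paper's approximate fitted constants for N and D. The constants are reported in the paper, but the rounding and the usage here are illustrative only.

```python
# Sketch of the single-factor power laws L(X) = (X_c / X)**alpha_X, valid when the
# other two scale factors are not bottlenecks. Constants are the paper's
# approximate fits; treat the printed losses as illustrative, not exact.

def loss_vs_params(n: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Predicted test loss as a function of non-embedding parameter count N."""
    return (n_c / n) ** alpha_n


def loss_vs_data(d: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Predicted test loss as a function of dataset size D, in tokens."""
    return (d_c / d) ** alpha_d


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N = {n:.0e}  ->  L(N) ~ {loss_vs_params(n):.2f}")
    for d in (1e9, 1e10, 1e11):
        print(f"D = {d:.0e}  ->  L(D) ~ {loss_vs_data(d):.2f}")
```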

2021-12-19 · 3 min · Cong Chan

Survey - Pre-Trained Models - Past, Present and Future

Links: https://arxiv.org/abs/2106.07139. A quick look at the latest survey of Pre-Trained Models. First, the definitions of some terms used in the survey. Transfer learning: a supervised approach for tackling the data-hungry problem in machine learning. Self-Supervised Learning: also targets the data-hungry problem, especially for completely unlabeled data, by learning with the data itself as the label in some way (e.g., language modeling), so it shares some spirit with unsupervised learning. Unsupervised learning is usually said to focus on pattern-recognition problems such as clustering, community discovery, and anomaly detection, whereas self-supervised learning still falls within the supervised-learning paradigm and focuses on problems such as classification and generation. Pre-trained models (PTMs): pre-training is a concrete training scheme that can adopt either transfer learning or self-supervised learning. 2 Background: pre-training can be divided into two broad categories. 2.1 Transfer Learning and Supervised Pre-Training, which can be further divided into feature transfer and parameter transfer. 2.2 Self-Supervised Learning and Self-Supervised Pre-Training. Transfer learning can be divided into four subcategories: inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al., 2007; Dai et al., 2008), and unsupervised transfer learning (Wang et al., 2008). Research progress on inductive and transductive transfer learning has concentrated mainly on the vision domain, with ImageNet as the labeled source data. ...
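To make the "data as its own label" idea concrete, here is a minimal sketch of the causal language-modeling objective commonly used for self-supervised pre-training; the whitespace tokenization and toy sentence are assumptions for illustration only.

```python
# Toy illustration of self-supervised learning via language modeling: the
# unlabeled text supplies both inputs and targets, so no human annotation is
# needed. Whitespace tokenization and the example sentence are purely
# illustrative; real PTMs use subword tokenizers over large corpora.
raw_text = "pre-trained models transfer knowledge to downstream tasks"
tokens = raw_text.split()

# Causal LM turns one sequence into many (context -> next token) training pairs.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"input: {' '.join(context)!r} -> target: {target!r}")
```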

2021-06-19 · 10 min · Cong Chan