2020 | Cong's Log

The Curious Case of Neural Text Degeneration

Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv:1904.09751, arXiv, 14 Feb. 2020. arXiv.org, http://arxiv.org/abs/1904.09751. Introduction 从语言模型生成文本（例如生成故事）的最佳解码策略是什么仍然是一个悬而未决的问题。违反直觉的经验观察是，即使使用似然作为训练目标可以为广泛的语言理解任务生成高质量的模型，但基于maximization-based decoding的解码方法（例如beam search）会导致退化（degeneration）——输出文本平淡无奇，不连贯，或陷入重复循环。文本生成中的decoding strategy主要可以分为两大类： Argmax Decoding: 主要包括beam search, class-factored softmax等 Stochastic Decoding: 主要包括temperature sampling, top-k sampling等为了解决这个问题，提出了 Nucleus Sampling（Top-p Sampling），这是一种简单但有效的方法，可以从神经语言模型中提取比以前的解码策略质量更高的文本。The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from. Method 通过截断概率分布的不可靠尾部分布、从包含绝大多数概率质量的标记的dynamic nucleus中采样来避免文本退化。效果/Analysis/Findings 为了正确检查当前基于最大化和随机的解码方法，我们将这些方法中的每一种的生成与人类文本从几个方向（如可能性、多样性和重复）的分布进行了比较。 ...

Codex - Evaluating Large Language Models Trained on Code

Codex：M. Chen et al., ‘Evaluating Large Language Models Trained on Code’. arXiv, Jul. 14, 2021. Available: http://arxiv.org/abs/2107.03374 Intro Codex, a GPT language model finetuned on publicly available code from GitHub Task: docstring-conditional code generation Method Codex: fine-tune GPT3 models containing up to 12B parameters on code to produce Codex. Codex-S: fine-tune Codex on standalone, correctly implemented functions. Inference: assemble each HumanEval problem into a prompt consisting of a header, a signature, and a docstring. We use nucleus sampling (Holtzman et al., 2020) with top p = 0.95 for all sampling evaluation in this work ...

Scaling Laws for Neural Language Models

Kaplan, Jared, et al. ‘Scaling Laws for Neural Language Models’. arXiv:2001.08361 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/2001.08361. TL:DR key findings for Transformer language models are are as follows: Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3) Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3) ...

CorefQA - Coreference resolution as query-based span prediction

2020, ACL data: CoNLL-2012, GAP task: Coreference Resolution 通过QA方式处理coreference问题，A query is generated for each candidate mention using its surrounding con- text, and a span prediction module is em- ployed to extract the text spans of the corefer- ences within the document using the generated query. 近期的方法有consider all text spans in a document as potential mentions and learn to find an antecedent for each possible mention. There。这种仅依靠mention的做对比的方法的缺点： At the task formalization level：因为当前数据集有很多遗漏的mention， mentions left out at the mention proposal stage can never be recov- ered since the downstream module only operates on the proposed mentions. At the algorithm level：Semantic matching operations be- tween two mentions (and their contexts) are per- formed only at the output layer and are relatively superficial 方法 Speaker information： directly concatenates the speaker’s name with the corresponding utterance. ...

A Frustratingly Easy Approach for Joint Entity and Relation Extraction

2020, NAACL data: ACE 04, ACE 05, SciERC links: https://github.com/princeton-nlp/PURE task: Entity and Relation Extraction 提出了一种简单但是有效的pipeline方法:builds on two independent pre-trained encoders and merely uses the entity model to provide input features for the relation model. 实验说明: validate the importance of learning distinct contextual representations for entities and relations, fusing entity information at the input layer of the relation model, and incorporating global context. 从效果上看, 似乎是因为cross sentence的context加成更大方法 Input: a sentence X consisting of n tokens x1, . . . , xn. Let S = {s1, . . . , sm} be all the possible spans in X of up to length L and START(i) and END(i) denote start and end indices of si. ...

Two are Better than One - Joint Entity and Relation Extraction with Table-Sequence Encoders

2020, EMNLP data: ACE 04, ACE 05, ADE, CoNLL04 links: https://github.com/LorrinWWW/two-are-better-than-one. task: Entity and Relation Extraction In this work, we propose the novel table-sequence encoders where two different encoders – a table encoder and a sequence encoder are designed to help each other in the representation learning process. 这篇ACL 2020文章认为, 之前的Joint learning方法侧重于learning a single encoder (usually learning representation in the form of a table) to capture information required for both tasks within the same space. We argue that it can be beneficial to design two distinct encoders to capture such two different types of information in the learning process. ...

Improving Event Detection via Open-domain Trigger Knowledge

2020, ACL data: ACE 05 task: Event Detection Propose a novel Enrichment Knowledge Distillation (EKD) model to efficiently distill external open-domain trigger knowledge to reduce the in-built biases to frequent trigger words in annotations. leverage the wealth of the open-domain trigger knowledge to improve ED propose a novel teacher-student model (EKD) that can learn from both labeled and unlabeled data 缺点只能对付普遍情况, 即一般性的触发词; 但触发词不是在任何语境下都是触发词. 方法 empower the model with external knowledge called Open-Domain Trigger Knowledge, defined as a prior that specifies which words can trigger events without subject to pre-defined event types and the domain of texts. ...

Cross-media Structured Common Space for Multimedia Event Extraction

2020, ACL Task: MultiMedia Event Extraction Introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. Construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles Propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. which takes advantage of annotated unimodal corpora to separately learn visual and textual event extraction, and uses an image-caption dataset to align the modalities ...