Paper Reading - [NAACL 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This is my reading note for [NAACL 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Preface

BERT lets us train a single model on a large dataset and then apply it to many different NLP tasks, instead of training a separate model for each task.

Abstract

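Difference from GPT: GPT uses only unidirectional context, while BERT uses bidirectional context (I didn't quite understand what this means; will update later).
ELMo uses an RNN-based architecture, while BERT uses the Transformer.
BERT is conceptually simple and works well.
All sorts of strong results, blah blah.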

Introduction

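Pre-training is useful (word embeddings / GPT / …).
Example task: recognizing the sentiment of a sentence.
There are two common strategies for applying pre-trained representations to downstream tasks:
- Feature-based (ELMo): build a task-specific neural network (an RNN) that takes the pre-trained representations as additional features.
- Fine-tuning (GPT): TBA
Using a unidirectional language model is limiting (in practice, one can read the whole sentence before doing anything else with it).
MLM (inspired by the Cloze task): each time, pick some tokens and mask them. The objective is to predict the masked tokens. This makes it possible to train a bidirectional model.
Next sentence prediction: decide whether two sentences are adjacent.
Other work (on bidirectional models) only concatenates a left-to-right and a right-to-left model.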

Conclusion

Unsupervised pre-training is very powerful.
ELMo + GPT -> bidirectional + Transformer

Related Work

Unsupervised Feature-based Approaches
Unsupervised Fine-tuning Approaches
Transfer Learning from Supervised Data
A model trained on a large amount of unlabeled data may outperform one trained on a small amount of labeled data.

BERT

2 steps in our framework:

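Pre-training: train the model on an unlabeled dataset.
Fine-tuning: initialize the weights with those obtained from the pre-training step, then fine-tune on labeled data.
All downstream tasks can start from the same pre-trained weights, but each one is then fine-tuned on its own labeled data: see the figure below, from left to right.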

Model Architecture

A multi-layer bidirectional Transformer encoder.

Three hyperparameters:

L: the number of Transformer blocks
H: the hidden size
A: the number of self-attention heads

Two model sizes:

\(BERT_{BASE}\): L = 12, H = 768, A = 12, total parameters = 110M
\(BERT_{LARGE}\): L = 24, H = 1024, A = 16, total parameters = 340M

For how the parameter counts are actually computed, I found two blog posts: https://blog.csdn.net/weixin_43922901/article/details/102602557 and https://blog.csdn.net/qq874455953/article/details/120840276

Alternatively, mli's paper-reading also has a detailed calculation.
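
As a rough sanity check on the 110M / 340M figures, here is a minimal back-of-the-envelope sketch. It makes several assumptions that are not stated in this note: a vocabulary of about 30,000 WordPiece tokens and a feed-forward inner size of 4H (both taken from the paper), and it ignores biases, LayerNorm, and the position and segment embeddings. The function name `approx_bert_params` is made up for illustration.

```python
def approx_bert_params(num_layers, hidden, vocab_size=30_000):
    """Rough parameter count for a BERT-style encoder.

    Assumptions (not exact): counts only the token embedding table and the
    big weight matrices; biases, LayerNorm, position and segment embeddings
    are ignored.
    """
    embedding = vocab_size * hidden              # token embedding table
    attention = 4 * hidden * hidden              # Q, K, V and output projections
    feed_forward = 2 * hidden * (4 * hidden)     # H -> 4H and 4H -> H
    return embedding + num_layers * (attention + feed_forward)


for name, (L, H) in {"BASE": (12, 768), "LARGE": (24, 1024)}.items():
    print(name, f"{approx_bert_params(L, H) / 1e6:.0f}M")
```

This gives roughly 108M and 333M, close to the reported totals. Note that A does not appear: the number of heads only changes how the H dimensions are split, not how many weights there are.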

I/O Representations

Input

single sentence
pairs of sentences (in one token sequence)

In the original Transformer, the input is a pair of sequences (one is fed to the encoder and the other to the decoder). BERT, however, has only an encoder, so a sentence pair has to be packed into a single token sequence.
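
A minimal sketch of that packing, assuming the [CLS] / [SEP] layout used in the paper's examples and a per-token segment id (0 for the first sentence, 1 for the second). The helper name `pack_pair` is hypothetical.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Pack one sentence, or a sentence pair, into a single token sequence.

    Returns the tokens plus a segment id per token (0 = sentence A, 1 = sentence B).
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # +1 for the trailing [SEP]
    return tokens, segment_ids


tokens, segment_ids = pack_pair(["the", "man"], ["he", "left"])
print(tokens)
print(segment_ids)
```

The single-sentence case is the same call without `tokens_b`, which is why one input format can serve both kinds of task.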

Tokenization uses WordPiece.

What is WordPiece? I'll just point to a blog post; it explains the motivation and the idea reasonably clearly.
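
To make the idea concrete, here is a small sketch of how a word can be split into pieces at tokenization time, using greedy longest-match-first lookup against a vocabulary. The toy vocabulary is invented for illustration, and the sketch does not cover how the vocabulary itself is learned.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece-style pieces by greedy longest match.

    Pieces that do not start the word carry a '##' prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"flight", "##less", "penguin", "##s"}  # hypothetical vocabulary
print(wordpiece_tokenize("flightless", toy_vocab))
```

With this toy vocabulary, "flightless" splits into "flight" and "##less", the same form that shows up in the NSP example later in this note.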

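The embedding layer looks like this:

For what the embedding layer does and how it is implemented, see this.

How can the concept of an embedding be understood intuitively?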
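
One way to picture the embedding layer: in the paper, each input vector is the sum of a token embedding, a segment embedding, and a position embedding. The sketch below shows only that sum, in plain Python with tiny random tables. The table sizes, the ids, and the helper names are made up, and everything else a real implementation does is left out.

```python
import random


def make_table(rows, dim, rng):
    """A lookup table of `rows` random vectors of size `dim`."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Input vector at each position = token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[position])
        ])
    return vectors


rng = random.Random(0)
H = 4                                # toy hidden size
tok_table = make_table(10, H, rng)   # toy vocabulary of 10 tokens
seg_table = make_table(2, H, rng)    # sentence A / sentence B
pos_table = make_table(8, H, rng)    # up to 8 positions

token_ids = [0, 3, 4, 1, 5, 1]       # hypothetical ids for a packed pair
segment_ids = [0, 0, 0, 0, 1, 1]
vectors = embed(token_ids, segment_ids, tok_table, seg_table, pos_table)
print(len(vectors), len(vectors[0]))
```

All three tables are learned during pre-training; the lookup is just indexing, so an embedding is simply a row of a trainable matrix.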

Pre-training BERT

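First, the paper argues that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. (This thing really works!)

Second, it introduces two unsupervised tasks: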

Masked LM (masked language model)

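You can think of it as a cloze test. In BERT, to train a so-called deep bidirectional model, we do the following two things:

Randomly select 15% of the tokens for masking (replacing them with [MASK]), then feed the model's final hidden vectors through a softmax layer to predict what each masked token was; the loss is computed only over the masked tokens. For each of these 15% of tokens:
- 80% of the time: replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
- 10% of the time: replace the word with a random word, e.g., my dog is hairy → my dog is apple
- 10% of the time: keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose is to bias the representation toward the actually observed word.
Only the masked tokens are predicted, rather than reconstructing the entire sentence.

The first step above mainly addresses a problem with [MASK]: the [MASK] token appears only during pre-training and never during fine-tuning (where the [MASK]-prediction objective is not used), so there is a mismatch between the two stages. Not always replacing the chosen tokens with [MASK] mitigates it.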
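
The masking procedure can be sketched as follows. This is a simplified illustration, not the reference implementation: each token is selected independently with probability 0.15, so the 15% holds only in expectation, and the function name, toy vocabulary, and the choice to skip [CLS] / [SEP] are assumptions.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Apply the 80/10/10 masking scheme to a token list.

    Returns (masked_tokens, labels); labels[i] is the original token at the
    selected positions and None elsewhere, so the loss covers only those.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue                        # never select special tokens
        if rng.random() >= mask_prob:
            continue                        # not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the original word
    return masked, labels


toy_vocab = ["my", "dog", "is", "hairy", "apple"]  # hypothetical vocabulary
sentence = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(sentence, toy_vocab, random.Random(0)))
```

Positions whose label is None contribute nothing to the loss, which is exactly the "only predict the masked tokens" point above.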

Next Sentence Prediction (NSP)

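To capture the relationship between two sentences, we pre-train on a next sentence prediction task.

Specifically, when choosing sentences A and B for a pre-training example, 50% of the time B is the actual sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example: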

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
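
A minimal sketch of how such A/B pairs could be sampled. The toy corpus and the function name are made up, and drawing the negative sentence from a different document is an assumption of this sketch rather than something the note specifies.

```python
import random


def make_nsp_example(corpus, doc_idx, sent_idx, rng):
    """Build one (sentence A, sentence B, label) example.

    corpus is a list of documents, each a list of sentences. The caller must
    pick sent_idx so that a following sentence exists in the document.
    """
    doc = corpus[doc_idx]
    sent_a = doc[sent_idx]
    if rng.random() < 0.5:
        return sent_a, doc[sent_idx + 1], "IsNext"        # true next sentence
    # Otherwise take a random sentence from some other document.
    other_idx = rng.choice([i for i in range(len(corpus)) if i != doc_idx])
    return sent_a, rng.choice(corpus[other_idx]), "NotNext"


corpus = [  # hypothetical two-document corpus
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the south"],
]
print(make_nsp_example(corpus, 0, 0, random.Random(0)))
```

The resulting pair would then be tokenized, packed into one sequence with [CLS] and [SEP], and masked, which gives inputs like the two examples above.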

Pre-training data

BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).

For Wikipedia we extract only the text passages and ignore lists, tables, and headers.

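It is better to use a document-level corpus than a corpus made of isolated short snippets of a few sentences, so that the data contains more long contiguous sequences (the text has stronger continuity).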

Fine-tuning BERT

This part is fairly intuitive:

Experiments

There are too many experiments… 11 NLP tasks. If you are interested, going straight to the original paper should be enough.

Reference

https://zhuanlan.zhihu.com/p/260523086

https://zhuanlan.zhihu.com/p/145470341

BERT 论文逐段精读【论文精读】 (a paragraph-by-paragraph close reading of the BERT paper)
