
Pre-Trained Models: Past, Present and Future

Xu Han1∗, Zhengyan Zhang1∗, Ning Ding1∗, Yuxian Gu1∗, Xiao Liu1∗, Yuqi Huo2∗, Jiezhong Qiu1, Yuan Yao1, Ao Zhang1, Liang Zhang2, Wentao Han1†, Minlie Huang1†, Qin Jin2†, Yanyan Lan4†, Yang Liu1,4†, Zhiyuan Liu1†, Zhiwu Lu3†, Xipeng Qiu5†, Ruihua Song3†, Jie Tang1†, Ji-Rong Wen3†, Jinhui Yuan6†, Wayne Xin Zhao3†, Jun Zhu1†

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 School of Information, Renmin University of China, Beijing, China
3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
4 Institute for AI Industry Research, Tsinghua University, Beijing, China
5 School of Computer Science, Fudan University, Shanghai, China
6 OneFlow Inc., Beijing, China

{hanxu17,zy-z19,dingn18,gu-yx17,liuxiao17,qiujz16,yuan-yao18}@mails.tsinghua.edu.cn,

{hanwentao,aihuang,lanyanyan,liuyang2011,liuzy,jietang,dcszj}@tsinghua.edu.cn,

{bnhony,zhangliang00,qjin,luzhiwu,jrwen,batmanfly}@ruc.edu.cn,

[email protected], [email protected], [email protected]

Abstract

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as the backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, towards four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions for PTMs, and hope our view can inspire and advance the future study of PTMs.

∗ The first six authors contributed equally to organizing this paper. The order is determined by dice rolling.

† All faculty authors are alphabetically sorted.

1 Introduction

Deep neural networks, such as convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Kim, 2014; Kalchbrenner et al., 2014; He et al., 2016), recurrent neural networks (RNNs) (Sutskever et al., 2014; Donahue et al., 2015; Liu et al., 2016; Wu et al., 2016), graph neural networks (GNNs) (Kipf and Welling, 2016; Velickovic et al., 2018; Schlichtkrull et al., 2018), and attention neural networks (Jaderberg et al., 2015; Wang et al., 2017), have been widely applied to various artificial intelligence (AI) tasks in recent years. Different from previous non-neural models that largely relied on hand-crafted features and statistical methods, neural models can automatically learn low-dimensional continuous vectors (a.k.a., distributed representations) from data as task-specific features, thereby getting rid of complex feature engineering. Despite the success of deep neural networks, a number of studies have found that one of their critical challenges is data hunger: since deep neural networks usually have a large number of parameters, they easily overfit and have poor generalization ability (Belkin et al., 2019; Xu et al., 2021) without sufficient training data.

Considering this issue, over the same period of developing deep neural networks, massive efforts have been devoted to manually constructing high-quality datasets for AI tasks (Deng et al., 2009; Lin et al., 2014; Bojar et al., 2014), making it possible to learn effective neural models for specific tasks that are superior to conventional non-neural models. However, it is expensive and time-consuming to manually annotate large-scale data.


[Figure 1: The two figures show the significant improvement in performance on both language understanding and language generation after using large-scale PTMs. (a) Evaluation on the language understanding benchmark GLUE (score, %): GPT (2018) 75.1, BERT (2018) 82.1, RoBERTa (2019) 88.4, ELECTRA (2020) 89.4, T5 (2020) 90.3, DeBERTa (2021) 90.8, versus a human score of 87.1. (b) Manual evaluation on dialogue systems (interactive SSA, %): XiaoIce (2018) 31.0, DialoGPT (2018) 48.0, Cleverbot (2019) 56.0, Mitsuku (2020) 56.0, Meena (base) (2020) 72.0, Meena (2021) 79.0, versus a human score of 86.0.]

For example, utilizing crowdsourcing to segment images costs about $6.4 per image (Liu et al., 2020b). Some complex tasks that require expert annotations may charge much more to build their datasets. Several tasks such as visual recognition (Deng et al., 2009) and machine translation (Bojar et al., 2014) have datasets containing millions of samples, yet it is impossible to build such large-scale datasets for all AI tasks. More generally, the dataset of a specific AI task usually has a limited size. Hence, for a long time until now, it has been a key research issue: how to train effective deep neural models for specific tasks with limited human-annotated data.

One milestone for this issue is the introduction of transfer learning (Thrun and Pratt, 1998; Pan and Yang, 2009). Instead of training a model from scratch with large amounts of data, human beings can learn to solve new problems with very few samples. This amazing learning process is motivated by the fact that human beings can use previously learned knowledge to handle new problems. Inspired by this, transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning phase to transfer the captured knowledge to target tasks. Owing to the wealth of knowledge obtained in the pre-training phase, the fine-tuning phase can enable models to handle target tasks well with limited samples.

Transfer learning provides a feasible method for alleviating the challenge of data hunger, and it was soon widely applied in the field of computer vision (CV). A series of CNNs (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016) are pre-trained on the human-annotated visual recognition dataset ImageNet (Deng et al., 2009). Benefiting from the strong visual knowledge distributed in ImageNet, fine-tuning these pre-trained CNNs with a small amount of task-specific data can perform well on downstream tasks. This triggered the first wave of exploring pre-trained models (PTMs) in the era of deep learning. In this wave, PTMs are used for almost all CV tasks such as image classification (He et al., 2016), object detection (Sermanet et al., 2014; Ren et al., 2016), image segmentation (Long et al., 2015), and image captioning (Vinyals et al., 2015).

The natural language processing (NLP) community was also aware of the potential of PTMs and started to develop PTMs for NLP tasks (Qiu et al., 2020). To take full advantage of large-scale unlabeled corpora to provide versatile linguistic knowledge for NLP tasks, the NLP community adopts self-supervised learning (Liu et al., 2020b) to develop PTMs. The motivation of self-supervised learning is to leverage intrinsic correlations in the text as supervision signals instead of human supervision. For example, given the sentence "Beijing is the capital of China", we mask the last word in the sentence, and then require models to predict the masked position with the word "China". Through self-supervised learning, tremendous amounts of unlabeled textual data can be utilized to capture versatile linguistic knowledge without a labor-intensive annotation workload. This self-supervised setting in essence follows the well-known language model learning (Bengio et al., 2003).
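To make the masking example above concrete, the following minimal Python sketch builds one self-supervised training pair from raw text: the token to hide is removed from the input and becomes the label, so the supervision signal comes from the text itself. The whitespace tokenization and the [MASK] placeholder are simplifications for illustration, not the exact preprocessing of any particular PTM.

```python
# A toy illustration of self-supervised masking: the label is taken from the
# input text itself, so no human annotation is needed. Tokenization here is a
# simple whitespace split, an assumption made only for illustration.
MASK = "[MASK]"

def mask_position(sentence, position):
    """Hide the token at `position`; the hidden token becomes the label."""
    tokens = sentence.split()
    label = tokens[position]
    tokens[position] = MASK
    return tokens, label

inputs, label = mask_position("Beijing is the capital of China", position=-1)
print(inputs)  # ['Beijing', 'is', 'the', 'capital', 'of', '[MASK]']
print(label)   # China
```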


[Figure 2: (a) The number of publications on "language models" and their citations in recent years (1990-2020). (b) The model size (M) and pre-training data size (GB) of recent NLP PTMs (GPT, BERT, GPT-2, RoBERTa, T5, GPT-3, Switch Transformer), plotted on a base-10 log scale. Figure 2(a) shows the number of publications with the keyword "language model" as well as their citations in different years. Figure 2(b) shows that the parameter size of large-scale PTMs for NLP tasks and the pre-training data size are increasing by 10 times per year. From these figures, we can find that, after 2018, when large-scale NLP PTMs began to be explored, more and more efforts have been devoted to this field, and the model size and data size used by PTMs have also been getting larger.]

For a long time, the problem of vanishing or exploding gradients (Bengio et al., 1994) has been the pain point of using deep neural networks for NLP tasks. Therefore, while the CV community advanced the research of deep PTMs, the early exploration of the NLP community focused on pre-training shallow networks to capture the semantic meanings of words, like Word2Vec (Mikolov et al., 2013b,a,c) and GloVe (Pennington et al., 2014). Although these pre-trained word embeddings play an important role in various NLP tasks, they still face a major limitation in representing polysemous words in different contexts, as each word is represented by only one dense vector. A famous example in NLP is that the word "bank" has entirely different meanings in the sentences "open a bank account" and "on a bank of the river". This motivates pre-training RNNs to provide contextualized word embeddings (Melamud et al., 2016; Peters et al., 2018; Howard and Ruder, 2018), yet the performance of these models is still limited by their model size and depth.

With the development of deep neural networks in the NLP community, the introduction of Transformers (Vaswani et al., 2017) made it feasible to train very deep neural models for NLP tasks. With Transformers as architectures and language model learning as objectives, the deep PTMs GPT (Radford and Narasimhan, 2018) and BERT (Devlin et al., 2019) were proposed for NLP tasks in 2018. From GPT and BERT, we can find that when the size of PTMs becomes larger, large-scale PTMs with hundreds of millions of parameters can capture polysemous disambiguation, lexical and syntactic structures, as well as factual knowledge from the text. By fine-tuning large-scale PTMs with just a few samples, the rich linguistic knowledge of PTMs brings impressive performance on downstream NLP tasks. As shown in Figure 1(a) and Figure 1(b), large-scale PTMs have performed well on both language understanding and language generation tasks in the past several years, and have even achieved better results than human performance. As shown in Figure 2(a), all these efforts and achievements in the NLP community have made large-scale PTMs the focus of AI research, after the last wave in which PTMs allowed for huge advances in the CV community.

Up to now, various efforts have been devoted to exploring large-scale PTMs, either for NLP (Radford et al., 2019; Liu et al., 2020d; Raffel et al., 2020; Lewis et al., 2020a) or for CV (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019). Fine-tuning large-scale PTMs for specific AI tasks instead of learning models from scratch has also become a consensus (Qiu et al., 2020). As shown in Figure 2(b), with the increasing computational power boosted by the wide use of distributed computing devices and strategies, we can further advance the parameter scale of PTMs from million-level to billion-level (Brown et al., 2020; Lepikhin et al., 2021; Zeng et al., 2021; Zhang et al., 2020c, 2021a) and even trillion-level (Fedus et al., 2021). The emergence of GPT-3 (Brown et al., 2020), which has hundreds of billions of parameters, enables us to take a glimpse of the latent power distributed in massive model parameters, especially the great ability of few-shot learning like human beings (shown in Figure 3).


[Figure 3: GPT-3, with 175 billion parameters, uses 560 GB of data and 10,000 GPUs for its training. It has shown the abilities of learning world knowledge, common sense, and logical reasoning. The figure illustrates these abilities with question-answering examples in three panels: world knowledge (e.g., "Q: Who was president of the United States in 1801? A: Thomas Jefferson was president of the United States in 1801."), common sense (e.g., "Q: How many eyes does a giraffe have? A: A giraffe has two eyes."), and logical reasoning (e.g., "Q: If I put a pencil in a box, then put another pencil in the box, what is in the box? A: Two pencils.").]

The existing large-scale PTMs have improved model performance on various AI tasks and even subverted our current perception of the performance of deep learning models. However, several fundamental issues about PTMs still remain: the nature of the knowledge hidden in huge amounts of model parameters is still unclear, and the huge computational cost of training these behemoths also prevents us from further exploration. At this moment, these PTMs have pushed AI researchers to a crossroads, with a number of open directions to go.

As the saying goes, "Rome wasn't built in a day"; PTMs likewise experienced a long development before achieving their latest success. To this end, we try to trace the development history of PTMs and draw their positions in the AI spectrum, which can give us a clear understanding of the core research issues of PTMs. Then, we introduce the details of various latest PTMs, following four important lines that are currently being advanced, including designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. By integrating the current development of PTMs into the context of this historical spectrum, we discuss several open problems and conclude with promising future directions for PTMs. We hope our efforts in this paper can advance the further development of PTMs. In what follows, we will introduce the background of pre-training in Section 2 and Section 3, the model architectures of PTMs in Section 4, the use of multi-source heterogeneous data for PTMs in Section 5, the computational efficiency optimization of PTMs in Section 6, and the theoretical analysis of PTMs in Section 7. Finally, we will briefly discuss a series of open problems and promising directions towards better PTMs in the future.

2 Background

Although effective PTMs have recently gained the attention of researchers, pre-training is not a novel machine learning tool. In fact, pre-training has been developed for decades as a typical machine learning paradigm. In this section, we introduce the development of pre-training in the AI spectrum, from early supervised pre-training to current self-supervised pre-training, which can lead to a brief understanding of the background of PTMs.

2.1 Transfer Learning and Supervised Pre-Training

The early efforts of pre-training were mainly involved in transfer learning (Thrun and Pratt, 1998). The study of transfer learning is heavily motivated by the fact that people can rely on previously learned knowledge to solve new problems and even achieve better results. More formally, transfer learning aims to capture important knowledge from multiple source tasks and then apply the knowledge to a target task.

In transfer learning, source tasks and target tasks may have completely different data domains and task settings, yet the knowledge required to handle these tasks is consistent (Pan and Yang, 2009). It is thus important to select a feasible method to transfer knowledge from source tasks to target tasks. To this end, various pre-training methods have been proposed to work as the bridge between source and target tasks. Specifically, these methods first pre-train models on the data of multiple source tasks to pre-encode knowledge and then transfer the pre-encoded knowledge to train models for target tasks.

Generally, two pre-training approaches are widely explored in transfer learning: feature transfer and parameter transfer. Feature transfer methods pre-train effective feature representations to pre-encode knowledge across domains and tasks (Johnson and Zhang, 2005; Evgeniou and Pontil, 2007; Dai et al., 2007; Raina et al., 2007). By injecting these pre-trained representations into target tasks, the model performance on target tasks can be significantly improved. Parameter transfer methods follow an intuitive assumption that source tasks and target tasks can share model parameters or prior distributions of hyper-parameters. Therefore, these methods pre-encode knowledge into shared model parameters (Lawrence and Platt, 2004; Evgeniou and Pontil, 2004; Williams et al., 2007; Gao et al., 2008), and then transfer the knowledge by fine-tuning the pre-trained parameters with the data of target tasks.

To some extent, both feature transfer and parameter transfer lay the foundation of PTMs. Word embeddings, widely used as the input of NLP tasks, are built on the framework of feature transfer. Inspired by parameter transfer, pre-trained CNNs are applied as the backbone of most state-of-the-art CV models. Some recent well-known PTMs are also based on feature transfer and parameter transfer: for example, ELMo (Peters et al., 2018) applies feature transfer while BERT applies parameter transfer.

Since AlexNet (Krizhevsky et al., 2012), a series of deep neural networks have been developed for AI tasks. Compared with conventional machine learning models, deep neural models have more parameters and show better capabilities of fitting complex data. Therefore, from AlexNet to the later VGG (Simonyan and Zisserman, 2015) and GoogleNet (Szegedy et al., 2015), the architectures of these neural networks have become deeper and deeper, and their performance has accordingly become better and better. Although network depth is important, training a deep network is not easy, as stacking more network layers inevitably brings the problem of vanishing or exploding gradients (Bengio et al., 1994). Besides the gradient issues, model performance may soon meet a ceiling and then degrade rapidly as the network depth keeps increasing.

By adding normalization to parameter initialization (LeCun et al., 2012; Saxe et al., 2013) and hidden states (Ioffe and Szegedy, 2015), and by introducing shortcut connections with residual layers, ResNet (He et al., 2016) effectively tackles these problems. As we mentioned before, deep neural networks require large amounts of data for training. To provide sufficient data to train deep models, some large-scale supervised datasets have also been built (Russakovsky et al., 2015; Lin et al., 2014; Krishna et al., 2017; Chen et al., 2015; Cordts et al., 2016), the most representative one being ImageNet. ImageNet contains millions of images divided into thousands of categories, representing a wide variety of everyday objects. Based on the combination of the effective model ResNet, the informative dataset ImageNet, and mature knowledge transfer methods, a wave of pre-training models on labeled data emerged.

The CV community has benefited a lot from this wave. By applying ResNet pre-trained on ImageNet as the backbone, various CV tasks have been quickly advanced, like image classification (He et al., 2016; Lee et al., 2015), object detection (Ren et al., 2016; Sermanet et al., 2014; Gidaris and Komodakis, 2015), image segmentation (Long et al., 2015; Zheng et al., 2015), image captioning (Vinyals et al., 2015; Johnson et al., 2016), visual question answering (Antol et al., 2015; Gao et al., 2015; Xiong et al., 2016), etc. Utilizing PTMs like ResNet50 1 has proven to be a crucial step to obtain highly accurate results on most CV tasks. Inspired by the success of PTMs for CV tasks, some NLP researchers also explored supervised pre-training, and the most representative work is CoVE (McCann et al., 2017). CoVE adopts machine translation as its pre-training objective.

1 ResNet50 is a PTM with 50 layers.

[Figure 4: The spectrum of pre-training methods, from transfer learning and self-supervised learning to the latest pre-trained neural models. Transfer learning covers transductive transfer learning, inductive transfer learning, self-taught learning, and unsupervised transfer learning. Supervised pre-training on labeled source data (e.g., CoVE, VGG11, ResNet50, ResNet101) works via parameter transfer, while self-supervised pre-training on unlabeled source data works via feature transfer (e.g., GloVe, Word2Vec, ELMo) or parameter transfer (e.g., BERT, RoBERTa, GPT, XLNet, BART, T5).]

After pre-training, the encoder of source languages can work as a powerful backbone for downstream NLP tasks.

2.2 Self-Supervised Learning and Self-Supervised Pre-Training

As shown in Figure 4, transfer learning can be categorized into four sub-settings: inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al., 2007; Dai et al., 2008) 2, and unsupervised transfer learning (Wang et al., 2008).

Among these four settings, the inductive and transductive settings are the core of research, as these two settings aim to transfer knowledge from supervised source tasks to target tasks. Although supervised learning is always one of the core issues of machine learning research, the scale of unlabeled data is much larger than that of manually labeled data. Recently, more and more researchers have noticed the importance of large-scale unlabeled data and are committed to extracting information from unlabeled data. Self-supervised learning has been proposed to extract knowledge from large-scale unlabeled data by leveraging the input data itself as supervision.

Self-supervised learning and unsupervised learning have many similarities in their settings. To a certain extent, self-supervised learning can be regarded as a branch of unsupervised learning because they both apply unlabeled data. However, unsupervised learning mainly focuses on detecting data patterns (e.g., clustering, community discovery, and anomaly detection), while self-supervised learning still follows the paradigm of supervised settings (e.g., classification and generation) (Liu et al., 2020b).

The development of self-supervised learning makes it possible to perform pre-training on large-scale unsupervised data. Compared to supervised pre-training, which works as the cornerstone of CV in the deep learning era, self-supervised pre-training has allowed for huge advances in the field of NLP. Although some supervised pre-training methods like CoVE have achieved promising results on NLP tasks, it is nearly impossible to annotate a textual dataset as large as ImageNet, considering that annotating textual data is far more complex than annotating images. Hence, applying self-supervised learning to utilize unlabeled data becomes the best choice to pre-train models for NLP tasks. The recent stunning breakthroughs in PTMs are mainly towards NLP tasks, more specifically pre-trained language models.

2 Self-taught learning can be viewed as a variant of inductive transfer learning without available labeled data.

The early PTMs for NLP tasks exist in the form of well-known word embeddings (Collobert and Weston, 2008; Mikolov et al., 2013b; Pennington et al., 2014), which apply self-supervised methods to transform words into distributed representations. As these pre-trained word representations capture syntactic and semantic information in the text, they are often used as input embeddings and initialization parameters for NLP models and offer significant improvements over randomly initialized parameters (Turian et al., 2010). Since these word-level models often suffer from word polysemy, Peters et al. (2018) further adopt a sequence-level neural model to capture complex word features across different linguistic contexts and generate context-aware word embeddings. Using word embeddings as the input of neural models has almost become the common mode for NLP tasks.

After Vaswani et al. (2017) proposed Transformers to deal with sequential data, PTMs for NLP tasks entered a new stage, because it became possible to train much deeper language models compared to conventional CNNs and RNNs. Different from those word-level PTMs used as input features, Transformer-based PTMs such as GPT and BERT can be used as the model backbone of various specific tasks. After pre-training these Transformer-based PTMs on large-scale textual corpora, both the architecture and the parameters of PTMs can serve as a starting point for specific NLP tasks, i.e., just fine-tuning the parameters of PTMs for specific NLP tasks can achieve competitive performance. So far, these Transformer-based PTMs have achieved state-of-the-art results on almost all NLP tasks. Inspired by GPT and BERT, many more effective PTMs for NLP tasks have also been proposed, like XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2020d), BART (Lewis et al., 2020a), and T5 (Raffel et al., 2020).

With the recent advance of PTMs for NLP tasks, applying Transformer-based PTMs as the backbone of NLP tasks has become a standard procedure. Motivated by the success of self-supervised learning and Transformers in NLP, some researchers explore self-supervised learning (Wu et al., 2018; Chen et al., 2020d; Chen and He, 2020; He et al., 2020) and Transformers (Carion et al., 2020; Liu et al., 2021c) for CV tasks. These preliminary efforts have shown that self-supervised learning and Transformers can outperform conventional supervised CNNs. Furthermore, Transformer-based multimodal PTMs (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019) have also been proposed and have shown promising results. After the last wave of supervised pre-training, self-supervised pre-training has become the focus of current AI research.

Looking back at pre-training in the AI spectrum, it is not difficult to find that pre-training has been developed for decades, focusing on how to acquire versatile knowledge for various downstream tasks. Next, we will comprehensively introduce the latest breakthroughs of PTMs in this wave of self-supervised pre-training. Considering that almost all the latest PTMs are related to pre-trained language models, "PTMs" in the following sections refers to pre-trained language models or multimodal models. For those conventional PTMs based on supervised pre-training, we refer readers to the papers of He et al. (2019) and Zoph et al. (2020).

3 Transformer and Representative PTMs

As we mentioned before, the key to the success of recent PTMs is an integration of self-supervised learning and Transformer. Hence, this section begins with the dominant basic neural architecture, Transformer. Then, we will introduce two landmark Transformer-based PTMs, GPT and BERT. These two PTMs respectively use autoregressive language modeling and autoencoding language modeling as their pre-training objectives. All subsequent PTMs are variants of these two models. The final part of this section gives a brief review of typical variants after GPT and BERT to reveal the recent development of PTMs.

3.1 Transformer

Before Transformer, RNNs had long been a typical tool for processing sequential data, especially natural languages. As RNNs are equipped with a sequential nature, they read one word at each time step, in order. For each word, RNNs refer to all hidden states of its previous words to process it. Such a mechanism makes it difficult to take advantage of the parallel capabilities of high-performance computing devices such as GPUs and TPUs.


[Figure 5: The architecture of Transformer, GPT, and BERT. The Transformer encoder-decoder stacks multi-head attention (MH-ATT), add-and-norm (A&N), and position-wise feed-forward (FFN) layers; GPT uses a decoder-style stack with masked multi-head attention and a language modeling (LM) head; BERT uses an encoder-style stack with a masked language modeling (MLM) head. The attention patterns differ accordingly: full attention over the input for the encoder and BERT, and left-to-right masked attention for the decoder and GPT.]

As shown in Figure 5, Transformer is a non-recurrent sequence-to-sequence (seq2seq) architecture consisting of an encoder and a decoder. The encoder and decoder of a Transformer are both stacks of several identical blocks. Each encoder block is composed of a multi-head self-attention layer and a position-wise feed-forward layer. Compared with the encoder block, each decoder block has an additional cross-attention layer, since the decoder needs to consider the output of the encoder as context for generation. Between neural layers, residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are employed, making it possible to train a deep Transformer.

Attention Layer. Self-attention layers are the key to the success of Transformer. Formally, given a query set Q = {q_1, ..., q_n}, a key set K = {k_1, ..., k_m}, and a value set V = {v_1, ..., v_m}, where each query vector q_i ∈ R^{d_k}, each key vector k_i ∈ R^{d_k}, and each value vector v_i ∈ R^{d_v}, the scaled dot-product attention is defined as

\{\mathbf{h}_1, \ldots, \mathbf{h}_n\} = \mathrm{ATT}(\mathcal{Q}, \mathcal{K}, \mathcal{V}), \qquad \mathbf{h}_i = \sum_{j=1}^{m} a_{ij}\,\mathbf{v}_j, \qquad a_{ij} = \frac{\exp\big(\mathrm{ATT\text{-}Mask}(\mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d_k})\big)}{\sum_{l=1}^{m} \exp\big(\mathrm{ATT\text{-}Mask}(\mathbf{q}_i \cdot \mathbf{k}_l / \sqrt{d_k})\big)}.    (1)

Intuitively, Q is the set of vectors to compute attention for, and K is the set of vectors to compute attention against. The dot product between q_i and k_j yields the weight a_{ij}, which indicates how strongly the query vector q_i attends to the key vector k_j. Finally, the weighted mean of the value vectors is returned as the output of the attention layer. Note that the masking function ATT-Mask(·) is used to restrict which key-value pairs each query vector can attend to: if we do not want q_i to attend to k_j, ATT-Mask(x) = −∞; otherwise, ATT-Mask(x) = x.

By respectively packing Q, K, V into matrix representations Q ∈ R^{n×d_k}, K ∈ R^{m×d_k}, and V ∈ R^{m×d_v}, the attention can be simplified to

\mathbf{H} = \mathrm{ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{A}\mathbf{V}, \qquad \mathbf{A} = \mathrm{Softmax}\Big(\mathrm{ATT\text{-}Mask}\Big(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\Big)\Big),    (2)

where Softmax(·) is applied in a row-wise manner, A ∈ R^{n×m} is the attention matrix, and H ∈ R^{n×d_v} is the result.

Instead of using the vanilla scaled dot-product attention, Transformer applies a multi-head attention layer defined as

\mathbf{H} = \mathrm{MH\text{-}ATT}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\mathbf{W}^{O}, \qquad \mathbf{H}_i = \mathrm{ATT}(\mathbf{Q}\mathbf{W}^{Q}_{i}, \mathbf{K}\mathbf{W}^{K}_{i}, \mathbf{V}\mathbf{W}^{V}_{i}),    (3)

where h is the number of heads, and W^Q_i, W^K_i, W^V_i are respectively used to project the input Q, K, V into the feature space of the i-th attention head. After concatenating all head outputs with Concat(·), the multi-head attention layer applies W^O to project the concatenation into the final output space.

Position-Wise Feed-Forward Layer. Besides attention layers, each block of Transformer also contains a position-wise feed-forward layer. Given the packed input matrix X ∈ R^{n×d_i} representing a set of input vectors, where d_i is the vector dimension, a position-wise feed-forward layer is defined as

\mathbf{H} = \mathrm{FFN}(\mathbf{X}) = \sigma(\mathbf{X}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2,    (4)

where σ(·) is the activation function (usually ReLU), and W_1 ∈ R^{d_i×d_f}, b_1 ∈ R^{d_f}, W_2 ∈ R^{d_f×d_o}, b_2 ∈ R^{d_o} are all learnable projection parameters. H ∈ R^{n×d_o} is the output of the feed-forward layer. Empirically, d_i is set equal to d_o, and d_f is set to be much larger than d_i and d_o.

Residual Connection and Normalization. As we mentioned before, Transformer applies residual connections and layer normalization between neural layers, which makes it possible to build a deep Transformer architecture. Formally, given a neural layer f(·), the residual connection and normalization layer is defined as

\mathbf{H} = \mathrm{A\&N}(\mathbf{X}) = \mathrm{LayerNorm}(f(\mathbf{X}) + \mathbf{X}),    (5)

where LayerNorm(·) denotes the layer normalization operation.
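The feed-forward layer in Eq. (4) and the add-and-norm wrapper in Eq. (5) can likewise be sketched in a few lines of PyTorch. The hidden size d_f = 2048 for d_model = 512 follows the convention mentioned above (d_f much larger than d_i = d_o), but the concrete sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    # Eq. (4): H = sigma(X W_1 + b_1) W_2 + b_2, with sigma = ReLU
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_ff)   # W_1, b_1
        self.W2 = nn.Linear(d_ff, d_model)   # W_2, b_2

    def forward(self, X):
        return self.W2(torch.relu(self.W1(X)))

def add_and_norm(f, X, norm):
    # Eq. (5): A&N(X) = LayerNorm(f(X) + X)
    return norm(f(X) + X)

X = torch.randn(6, 512)
ffn, norm = PositionWiseFFN(), nn.LayerNorm(512)
print(add_and_norm(ffn, X, norm).shape)   # torch.Size([6, 512])
```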

As shown in Figure 5, there are three variants of the multi-head attention in Transformer:

(1) Self-attention is used in the encoder, which uses the output of the previous layer as Q, K, and V. In the encoding phase, given a word, the self-attention computes its attention scores by comparing it with all words in the input sequence, and these attention scores indicate how much each of the other words should contribute to the next representation of the given word. We give an example in Figure 6, where the self-attention accurately captures the referential relationship between "Jack" and "he", generating the highest attention score.

(2) Masked self-attention is used in the decoder, whose attention matrix satisfies A_{ij} = 0 for i < j, i.e., each position may only attend to itself and to earlier positions. This attention is beneficial to autoregressive language modeling. In the decoding phase, the self-attention is similar to that of the encoding phase, except that it only decodes one representation from left to right at a time.

[Figure 6: An illustration of the self-attention mechanism of Transformer. The figure shows the self-attention scores when encoding the word "he" in the sentence "because Jack is now asleep, he is tired.", where the darker the color of a square, the larger the corresponding attention score.]

Since each step of the decoding phase only consults the previously decoded results, we thus need to add the masking function into the self-attention.
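The masking pattern described above can be built as a simple lower-triangular boolean matrix, where entry (i, j) being true means query position i may attend to key position j. A minimal sketch, compatible with the attention code given earlier (true marks allowed positions), is shown below; the concrete representation of the mask is an assumption for illustration.

```python
import torch

def causal_mask(seq_len):
    # mask[i, j] is True only when j <= i, so each position attends to
    # itself and to earlier positions, never to future ones.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```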

(3) Cross-attention is also used in the decoder, which uses the output of the previous decoder block as Q, and the output of the encoder as K and V. Such a procedure is essentially an aggregation of the information of the whole input sequence, and it is applied to all the words to be generated in the decoding phase. Taking advantage of the input context is of great significance for some seq2seq tasks such as machine translation and text summarization.

For more details of Transformer, please refer to its original paper (Vaswani et al., 2017) and the survey paper (Lin et al., 2021). Due to its prominent nature, Transformer has gradually become a standard neural structure for natural language understanding and generation. Moreover, it also serves as the backbone neural structure for the subsequently derived PTMs. Next, we will introduce two landmarks that completely opened the door towards the era of large-scale self-supervised PTMs: GPT and BERT. In general, GPT is good at natural language generation, while BERT focuses more on natural language understanding.


[Figure 7: The difference between GPT and BERT in their self-attention mechanisms and pre-training objectives, illustrated on the sentence "the sky is blue .": BERT predicts the masked token "blue" from bidirectional context, while GPT predicts each next token from its left context only.]

3.2 GPT

As introduced in Section 2, PTMs typically consist of two phases, the pre-training phase and the fine-tuning phase. Equipped with the Transformer decoder as the backbone 3, GPT applies generative pre-training and discriminative fine-tuning. Theoretically, compared to precedents of PTMs, GPT is the first model that combines the modern Transformer architecture and the self-supervised pre-training objective. Empirically, GPT achieves significant success on almost all NLP tasks, including natural language inference, question answering, commonsense reasoning, semantic similarity, and classification.

Given large-scale corpora without labels, GPT optimizes a standard autoregressive language modeling objective, that is, maximizing the conditional probabilities of all the words given their previous words as contexts. In the pre-training phase of GPT, the conditional probability of each word is modeled by Transformer. As shown in Figure 5 and Figure 7, for each word, GPT computes its probability distribution by applying masked multi-head self-attention operations over its previous words. Formally, given a corpus consisting of tokens X = {x_0, x_1, ..., x_n, x_{n+1}}, GPT applies a standard language modeling objective by maximizing the following log-likelihood:

\mathcal{L}(\mathcal{X}) = \sum_{i=1}^{n+1} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \Theta),    (6)

where k is the window size, the probability P is modeled by the Transformer decoder with parameters Θ, x_0 is the special token [CLS], and x_{n+1} is the special token [SEP].
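The objective in Eq. (6) amounts to a next-token cross-entropy loss: the hidden state at position i is used to predict token x_{i+1}. The sketch below makes this explicit; the tiny embedding model standing in for the Transformer decoder, the vocabulary size, and the toy sequence are all illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
token_ids = torch.randint(0, vocab_size, (1, 12))    # a toy token sequence

embed = nn.Embedding(vocab_size, d_model)            # stand-in for the Transformer decoder
lm_head = nn.Linear(d_model, vocab_size)             # maps hidden states to token logits

logits = lm_head(embed(token_ids))                   # (1, 12, vocab_size)

# Shift by one: the prediction at position i is scored against token i+1,
# i.e. log P(x_{i+1} | x_{<=i}), averaged over the sequence.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),          # predictions from each prefix
    token_ids[:, 1:].reshape(-1))                    # the corresponding next tokens
print(loss.item())                                   # average negative log-likelihood
```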

3 Since GPT uses autoregressive language modeling for the pre-training objective, the cross-attention in the original Transformer decoder is removed.

The adaptation of GPT to specific tasks is done by fine-tuning, using the pre-trained parameters of GPT as a starting point for downstream tasks. In the fine-tuning phase, by passing the input sequence through GPT, we can obtain the representations of the final layer of the GPT Transformer. Using these final-layer representations and task-specific labels, GPT optimizes the standard objectives of downstream tasks with simple extra output layers. As GPT has hundreds of millions of parameters and was trained for one month on 8 GPUs, it is arguably the first "large-scale" PTM in the history of NLP. Undoubtedly, the success of GPT paved the way for the subsequent rise of a series of large-scale PTMs. In the next part, we will introduce another representative model, BERT.

3.3 BERT

The emergence of BERT has also greatly promoted the development of the PTM field. Theoretically, compared with GPT, BERT uses a bidirectional deep Transformer as its main structure. There are also two separate stages to adapt BERT for specific tasks, pre-training and fine-tuning (see Figure 5 and Figure 8).

In the pre-training phase, BERT applies autoencoding language modeling rather than the autoregressive language modeling used in GPT. More specifically, inspired by the cloze task (Taylor, 1953), the masked language modeling (MLM) objective is designed. As shown in Figure 7, in the procedure of MLM, tokens are randomly masked with a special token [MASK], and the objective is to predict the words at the masked positions from their contexts. Compared with standard unidirectional autoregressive language modeling, MLM can lead to a deep bidirectional representation of all tokens.


[Figure 8: The pre-training and fine-tuning phases for BERT. In pre-training, BERT takes masked sentence pairs from unlabeled data and is trained with the masked LM and NSP objectives; in fine-tuning, the same architecture is initialized with the pre-trained parameters and adapted to downstream tasks such as MNLI, NER, and SQuAD (predicting answer start/end spans for question-paragraph pairs).]

Formally, given a corpus consisting of tokens X = {x_0, x_1, ..., x_n, x_{n+1}}, BERT randomly masks m tokens in X and then maximizes the following log-likelihood:

\mathcal{L}(\mathcal{X}) = \sum_{i=1}^{m} \log P([\mathrm{Mask}]_i = y_i \mid \tilde{\mathcal{X}}; \Theta),    (7)

where the probability P is modeled by the Transformer encoder with parameters Θ, \tilde{X} is the result of masking some tokens in X, [Mask]_i is the i-th masked position, and y_i is the original token at this position.
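The MLM objective in Eq. (7) differs from the autoregressive loss sketched earlier in that only the masked positions contribute to the loss, with the original tokens as labels. The sketch below illustrates this bookkeeping; the embedding stand-in for BERT's encoder, the mask token id, and the chosen positions are assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, d_model, mask_id, ignore = 1000, 64, 0, -100
token_ids = torch.randint(1, vocab_size, (1, 10))    # the original sequence X
labels = torch.full_like(token_ids, ignore)          # -100 = position ignored by the loss

masked_positions = torch.tensor([2, 7])              # the m masked positions
labels[0, masked_positions] = token_ids[0, masked_positions]   # y_i = original tokens
inputs = token_ids.clone()
inputs[0, masked_positions] = mask_id                # corrupted input, i.e. X with [MASK]s

encoder = nn.Embedding(vocab_size, d_model)          # stand-in for the Transformer encoder
mlm_head = nn.Linear(d_model, vocab_size)
logits = mlm_head(encoder(inputs))                   # (1, 10, vocab_size)

# Only the masked positions have labels != -100, so only they contribute to Eq. (7).
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=ignore)
print(loss.item())
```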

Besides MLM, the next sentence prediction (NSP) objective is also adopted to capture discourse relationships between sentences for downstream tasks with multiple sentences, such as natural language inference and question answering. For this task, a binary classifier is used to predict whether two sentences are coherent. In the pre-training phase, MLM and NSP work together to optimize the parameters of BERT.

After pre-training, BERT can obtain robust parameters for downstream tasks. By modifying the inputs and outputs with the data of downstream tasks, BERT can be fine-tuned for any NLP task. As shown in Figure 8, BERT can effectively handle applications whose input is a single sentence or a sentence pair. For the input, its schema is two sentences concatenated with the special token [SEP], which can represent: (1) sentence pairs in paraphrase, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a single sentence for text classification or sequence tagging. For the output, BERT produces a token-level representation for each token, which can be used to handle sequence tagging or question answering, and the representation of the special token [CLS] can be fed into an extra layer for classification. After GPT, BERT further achieved significant improvements on 17 different NLP tasks, including SQuAD (better than human performance), GLUE (7.7 points of absolute improvement), and MNLI (4.6 points of absolute improvement).
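As one common way to produce the input schema described above, the hedged sketch below uses the Hugging Face transformers library (one possible toolkit, not the one used in the original BERT paper) to concatenate a sentence pair with [CLS] and [SEP] tokens. The checkpoint name and the example sentences are assumptions, and the call needs the pre-trained tokenizer files to be available (e.g., downloaded from the model hub).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair, e.g. a hypothesis-premise pair for entailment.
encoded = tokenizer("The sky is blue.", "The sky is not blue.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# roughly: ['[CLS]', 'the', 'sky', 'is', 'blue', '.', '[SEP]',
#           'the', 'sky', 'is', 'not', 'blue', '.', '[SEP]']
```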

3.4 After GPT and BERT

After GPT and BERT, a number of their improved variants have been proposed, such as RoBERTa and ALBERT. RoBERTa (Liu et al., 2020d) is one of the most successful variants of BERT, which mainly makes four simple and effective changes: (1) removing the NSP task; (2) using more training steps, with bigger batch sizes and more data; (3) training on longer sentences; (4) dynamically changing the [MASK] pattern. RoBERTa achieves impressive empirical results on the basis of BERT. Moreover, RoBERTa has pointed out that the NSP task is relatively useless for the training of BERT. ALBERT (Lan et al., 2019) is another important variant of BERT, which provides several interesting observations on reducing parameters. First, it factorizes the input word embedding matrix into two smaller ones. Second, it enforces parameter sharing across all Transformer layers to significantly reduce the number of parameters. Third, it proposes the sentence order prediction (SOP) task to substitute for BERT's NSP task. As a sacrifice for its space efficiency, ALBERT has slower fine-tuning and inference speed.

As shown in Figure 9, besides RoBERTa and ALBERT, various PTMs have been proposed in recent years towards better capturing knowledge from unlabeled data. Some work improves the model architectures and explores novel pre-training tasks, such as XLNet (Yang et al., 2019), UniLM (Dong et al., 2019), MASS (Song et al., 2019), SpanBERT (Joshi et al., 2020), and ELECTRA (Clark et al., 2020).


[Figure 9: The family of recent typical PTMs, including both pre-trained language models and multimodal models. The family tree traces how models such as ELMo, ULMFiT, GPT, and BERT branch into variants like GPT-2, GPT-3, Grover, XLNet, RoBERTa, SpanBERT, MT-DNN, knowledge-enhanced models (ERNIE from Tsinghua and Baidu, KnowBERT, KEPLER), cross-lingual models (XLM, UDify, MultiFiT), cross-modal models (VideoBERT, CBT, ViLBERT, VisualBERT, B2T2, Unicoder-VL, LXMERT, VL-BERT, UNITER), and encoder-decoder models (MASS, UniLM, BART, PEGASUS, T5, Switch Transformer).]

Besides, incorporating rich data sources is also an important direction, such as utilizing multilingual corpora, knowledge graphs, and images. Since the model scale is a crucial success factor of PTMs, researchers also explore building larger models that reach over hundreds of billions of parameters, such as the GPT series (Radford et al., 2019; Brown et al., 2020) and Switch Transformer (Fedus et al., 2021), and meanwhile conduct computational efficiency optimization for training PTMs (Shoeybi et al., 2019; Rajbhandari et al., 2020; Ren et al., 2021). In the following sections, we will further introduce all these efforts for PTMs in detail.

4 Designing Effective Architectures

In this section, we take a deeper look into the after-BERT PTMs. The success of Transformer-based PTMs has stimulated a stream of novel architectures for modeling sequences for natural language and beyond. Generally, all the after-BERT Transformer architectures for language pre-training can be categorized according to two motivations: toward unified sequence modeling and toward cognitive-inspired architectures. Besides, we also take a glimpse at other important BERT variants in the third subsection, which mostly focus on improving natural language understanding.

4.1 Unified Sequence Modeling

Why is NLP so challenging? One of the fundamental reasons is that it has versatile downstream tasks and applications, which can be generally categorized into three genres:

• Natural language understanding: includes grammatical analysis, syntactic analysis, word/sentence/paragraph classification, question answering, factual/commonsense knowledge inference, etc.

• Open-ended language generation: includes dialog generation, story generation, data-to-text generation, etc.

• Non-open-ended language generation: includes machine translation, abstractive summarization, blank filling, etc.

Nevertheless, the differences between these genres are not so significant. As Feynman's saying goes, "What I cannot create, I do not understand." On the one hand, a model that cannot understand cannot fluently generate; on the other hand, we can easily turn understanding tasks into generation tasks (Schick and Schütze, 2020).


Recent studies also show that GPTs can achieve similar and even better performance on understanding benchmarks than BERTs (Liu et al., 2021b). The boundary between understanding and generation is vague.

Based on this observation, a bunch of novel architectures have sought to unify different types of language tasks with one PTM. We will take a look at their development and discuss the inspirations they bring towards a unified foundation of natural language processing.

Combining Autoregressive and Autoencoding Modeling. The pioneering work to unify GPT-style unidirectional generation and BERT-style bidirectional understanding is XLNet (Yang et al., 2019), which proposes permutated language modeling. The masked-recover strategy in BERT naturally contradicts its downstream applications, where there is no [MASK] in the input sentences. XLNet solves this problem by permutating the tokens' order in pre-training and then applying the autoregressive prediction paradigm, which endows XLNet with the ability for both understanding and generation. An important follower of permutation language modeling is MPNet (Song et al., 2020), which amends XLNet's discrepancy that in pre-training XLNet does not know the sentence's length while in downstream tasks it does.

Besides permutated language modeling, another stream is multi-task training. UniLM (Dong et al., 2019) proposes to jointly train different language modeling objectives together, including unidirectional, bidirectional, and seq2seq objectives. This can be achieved by changing the attention masks in Transformers, as sketched below. UniLM performs quite well in generative question answering and abstractive summarization.
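The sketch below illustrates how the three objectives can be expressed purely through attention masks over one packed sequence, in the spirit of UniLM: a bidirectional mask for understanding, a left-to-right mask for unidirectional language modeling, and a mixed mask for seq2seq where the source part is fully visible and the target part is causal. The boolean convention (true = may attend) and the construction itself are illustrative assumptions.

```python
import torch

def bidirectional_mask(n):
    return torch.ones(n, n, dtype=torch.bool)                # every token sees every token

def unidirectional_mask(n):
    return torch.tril(torch.ones(n, n)).bool()               # left-to-right only

def seq2seq_mask(src_len, tgt_len):
    n = src_len + tgt_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :src_len] = True                                 # all positions see the source
    mask[src_len:, src_len:] = unidirectional_mask(tgt_len)  # target side is causal
    return mask

print(seq2seq_mask(2, 3).int())
# tensor([[1, 1, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```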

Recently, GLM (Du et al., 2021) proposes a more elegant approach for combining autoregressive and autoencoding modeling. Given a variable-length masked span, instead of providing the number of [MASK] tokens to the model as BERT and SpanBERT (Joshi et al., 2020) do, GLM asks Transformer blocks to autoregressively generate the masked tokens. To preserve the information about the number of [MASK] tokens, GLM proposes a 2D positional encoding strategy. GLM is the first model to achieve the best performance on all types of tasks, including natural language understanding, conditional generation, and unconditional generation, at the same time.

Applying Generalized Encoder-Decoder. Before GLM, neither encoder structures (e.g., BERT) nor decoder structures (e.g., GPT) could solve an important problem: filling in blanks with variable lengths (Du et al., 2021; Shen et al., 2020b).

Table 1: Three fundamental types of frameworks and their suitable downstream tasks. "NLU" refers to natural language understanding. "Cond. Gen." and "Uncond. Gen." refer to conditional and unconditional text generation, respectively. "✓" means "is good at", "—" means "could be adapted to", and "×" means "cannot be directly applied to". We define unconditional generation as the task of generating text without further training as in a standard language model, while conditional generation refers to seq2seq tasks such as text summarization. Taken from (Du et al., 2021).

Framework         NLU   Cond. Gen.   Uncond. Gen.
Autoregressive     —        —             ✓
Autoencoding       ✓        ×             ×
Encoder-Decoder    —        ✓             —

The decoder-based models cannot do it, because they can only generate at the end of the sequence; nor can the encoder-based models, because the number of [MASK] tokens would leak information. A natural idea is to turn to encoder-decoder architectures originally designed for machine translation, which produce target sequences of variable length conditioned on the source.

The pioneer of this genre is MASS (Song et al., 2019), which introduces the masked-prediction strategy into the encoder-decoder structure. However, MASS does not touch the problem of filling variable-length blanks. T5 (Raffel et al., 2020) solves the problem by masking a variable-length span of text with a single mask token and asking the decoder to recover the whole masked sequence. BART (Lewis et al., 2020a) introduces the interesting idea of corrupting the source sequence with multiple operations such as truncation, deletion, replacement, shuffling, and masking, instead of mere masking. Follow-up works specialize in typical seq2seq tasks, such as PEGASUS (Zhang et al., 2020a) and PALM (Bi et al., 2020).

However, several challenges lie in front of encoder-decoder architectures. First, the encoder-decoder structure introduces many more parameters compared to a single encoder or decoder. Although this problem can be alleviated by parameter sharing between the encoder and decoder, its parameter efficiency is still doubtful. Second, encoder-decoder structures generally do not perform very well on natural language understanding: despite reported improvements over similar-sized vanilla BERT, a well-trained RoBERTa or GLM encoder performs much better than them.

4.2 Cognitive-Inspired Architectures

Is the current Transformer a good enough imple-mentation of human beings’ cognitive system? Ofcourse not. Attention mechanism, the core modulein the Transformer architecture, is inspired by themicro and atom operation of the human’s cogni-tive system and only responsible for the perceptivefunction. However, human-level intelligence is farmore complex than the mere understanding of theassociation between different things.

In pursuit for human-level intelligence, under-standing the macro architecture of our cogni-tive functions including decision making, logicalreasoning, counterfactual reasoning and workingmemory (Baddeley, 1992) is crucial. In this subsec-tion, we will take a look over the novel attempts in-spired by advances of cognitive science, especiallyon maintainable working memory and sustainablelong-term memory.

Maintainable Working Memory. A natural prob-lem of Transformer is its fixed window size andquadratic space complexity, which significantlyhinders its applications in long document under-standing and generation.

Despite the bunch of modifications on approx-imate computing of the quadratic growing point-wise attention (Tay et al., 2020), a question is thatwe humans do not present such a long-range at-tention mechanism. As an alternative, cognitivescientists have revealed that humans could main-tain a working memory (Baddeley, 1992; Brown,1958; Barrouillet et al., 2004; Wharton et al., 1994),which not only memorizes and organizes but alsoforgets. The conventional long-short term memory(LSTM) network is an exemplar practice for sucha philosophy.

For Transformer-based architectures, Transformer-XL (Dai et al., 2019) is the first to introduce segment-level recurrence and relative positional encoding to fulfill this goal. However, the recurrence only models the working memory implicitly. As a more explicit solution, CogQA (Ding et al., 2019) proposes to maintain a cognitive graph during multi-hop reading. It is composed of two systems: System 1, based on PTMs, and System 2, based on GNNs, which models the cognitive graph for multi-hop understanding.

A limitation of CogQA is that its System 1 is still based on a fixed window size. To endow working memory with the ability to understand long documents, CogLTX (Ding et al., 2020) leverages a MemRecall language model to select the sentences that should be maintained in the working memory, together with task-specific modules for answering or classification.

Sustainable Long-Term Memory. The success of GPT-3 and recent studies on language models’ ability to recall factual knowledge (Petroni et al., 2019; Wang et al., 2020a; Liu et al., 2021b) have revealed that Transformers can memorize. How do Transformers achieve this?

In Lample et al. (2019), the authors provide some inspiring evidence on how Transformers memorize. They replace the feed-forward networks in a Transformer layer with large key-value memory networks and find that this works quite well, which suggests that the feed-forward networks in Transformers behave similarly to memory networks.

Nevertheless, the memory capacity in Transformers is quite limited. For human intelligence, besides the working memory used for deciding and reasoning, long-term memory also plays a key role in recalling facts and experiences. REALM (Guu et al., 2020) is a pioneer in exploring how to construct a sustainable external memory for Transformers. The authors tensorize the whole of Wikipedia sentence by sentence and retrieve relevant sentences as context for masked pre-training. The tensorized Wikipedia is updated asynchronously every given number of training steps. RAG (Lewis et al., 2020b) extends the masked pre-training to autoregressive generation, which could work better than extractive question answering.

Besides tensorizing textual corpora, Verga et al. (2020) and Févry et al. (2020) propose to tensorize entities and triples in existing knowledge bases. When entities appear in contexts, they replace the entity tokens’ embeddings in an intermediate Transformer layer with embeddings from the outer memory networks. Dhingra et al. (2020) and Sun et al. (2021) maintain a virtual knowledge base from scratch and propose a differentiable reasoning training objective over it. All of these methods achieve promising improvements on many open-domain question answering benchmarks.

4.3 More Variants of Existing PTMs

Besides the practice of unifying sequence modeling and constructing cognitive-inspired architectures, most current studies focus on optimizing BERT’s architecture to boost language models’ performance on natural language understanding.

A stream of work aims at improving the masking strategy, which could be regarded as a certain kind of data augmentation (Gu et al., 2020). SpanBERT (Joshi et al., 2020) shows that masking a contiguous random-length span of tokens with a span boundary objective (SBO) can improve BERT’s performance. Similar ideas have also been explored in ERNIE (Sun et al., 2019c,d) (where a whole entity is masked), NEZHA (Wei et al., 2019), and Whole Word Masking (Cui et al., 2019).

Another interesting practice is to change the masked-prediction objective to a harder one. ELECTRA (Clark et al., 2020) transforms MLM into a replaced token detection (RTD) objective, in which a generator replaces tokens in the original sequence and a discriminator predicts whether each token has been replaced.
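The following toy sketch illustrates the RTD objective. The generator and discriminator outputs are stand-ins (random tensors) for the logits that small and large Transformer encoders would produce in practice; only the construction of the corrupted sequence and the binary detection loss follow the idea described above.

```python
# A toy sketch of ELECTRA-style replaced token detection (RTD).
# `generator_logits` and `disc_scores` are random placeholders for the outputs
# of a small generator and a larger discriminator encoder.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 100
input_ids = torch.randint(0, vocab, (batch, seq_len))
mlm_mask = torch.rand(batch, seq_len) < 0.15           # positions the generator fills in

generator_logits = torch.randn(batch, seq_len, vocab)  # generator's MLM predictions
sampled = torch.distributions.Categorical(logits=generator_logits).sample()
corrupted = torch.where(mlm_mask, sampled, input_ids)  # replace only the masked positions

# Discriminator labels: 1 if a token was actually replaced, 0 otherwise.
is_replaced = (corrupted != input_ids).float()
disc_scores = torch.randn(batch, seq_len)              # discriminator's per-token logits
rtd_loss = F.binary_cross_entropy_with_logits(disc_scores, is_replaced)
print(rtd_loss.item())
```

Note that the detection loss is computed over every position, not only the masked ones, which is the source of ELECTRA's sample efficiency discussed later in Section 6.2.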

5 Utilizing Multi-Source Data

In this section, we introduce some typical PTMs that take advantage of multi-source heterogeneous data, including multilingual PTMs, multimodal PTMs, and knowledge-enhanced PTMs.

5.1 Multilingual Pre-Training

Language models trained on large-scale English corpora have achieved great success on many benchmarks. However, we live in a multilingual world, and training a large language model for each language is not an elegant solution because of the cost and the amount of data required. In fact, although people from all over the world use different languages, they can express the same meaning, which may indicate that semantics is independent of symbol systems. Additionally, some researchers have found that training one model on several languages can yield even better benchmark performance than training several monolingual models (Lample and Conneau, 2019; Huang et al., 2020b). Hence, training one model to learn multilingual rather than monolingual representations may be a better way.

Before BERT, researchers had already explored multilingual representations. There are mainly two ways to learn them. One way is to learn through parameter sharing, for example, by training multilingual LSTMs with several language pairs together for multilingual translation. Another way is to learn language-agnostic constraints, such as decoupling language representations into language-specific and language-agnostic parts using the WGAN (Arjovsky et al., 2017) framework. Both ways enable models to be applied in multilingual scenarios, but only for specific tasks: each model is trained with one specific task from beginning to end, and the cross-lingual knowledge cannot be generalized to other tasks. Hence, for any other multilingual task, training a new model from scratch is still required, which demands a large volume of task-specific data.

The appearance of BERT shows that the framework of pre-training with general self-supervised tasks and then fine-tuning on specific downstream tasks is feasible. This motivates researchers to design tasks to pre-train versatile multilingual models. Multilingual tasks can be divided into understanding tasks and generation tasks according to their objectives. Understanding tasks focus on sentence-level or word-level classification and help downstream classification tasks such as natural language inference (Conneau et al., 2018b). Generation tasks focus on sentence generation and are crucial for downstream generation tasks such as machine translation.

Some understanding tasks were first used to pre-train multilingual PTMs on non-parallel multilingual corpora. For example, multilingual BERT (mBERT) released by Devlin et al. (2019) is pre-trained with the multilingual masked language modeling (MMLM) task on non-parallel multilingual Wikipedia corpora in 104 languages. The research conducted by Pires et al. (2019) shows that mBERT can generalize cross-lingual knowledge in zero-shot scenarios, which indicates that even with the same structure as BERT, using multilingual data enables the model to learn cross-lingual representations. XLM-R (Conneau et al., 2020) builds a non-parallel multilingual dataset called CC-100, which covers 100 languages. The scale of CC-100 is much larger than that of the Wikipedia corpora used by mBERT, especially for low-resource languages. XLM-R is pre-trained with MMLM as the only task on CC-100 and outperforms mBERT on several benchmarks, which indicates that a larger scale of multilingual corpora can bring better performance.

However, the MMLM task cannot make full use of parallel corpora. In fact, parallel corpora are quite important for some NLP tasks such as machine translation. Intuitively, parallel corpora are very helpful for directly learning cross-lingual representations of sentences in different languages with the same meaning. From this point, XLM (Lample and Conneau, 2019) leverages bilingual sentence pairs to perform the translation language modeling (TLM) task. Similar to MLM in BERT, TLM concatenates two semantically matched sentences into one sequence and randomly masks tokens in both parts. Compared with MLM, TLM requires models to predict the masked tokens based on the bilingual context, which encourages models to align the representations of the two languages.
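The following is a minimal sketch of how a TLM training example can be constructed from a bilingual sentence pair. The whitespace tokenization, the "[/s]" separator token and the 15% masking rate are illustrative assumptions, not XLM's exact preprocessing.

```python
# A minimal sketch of constructing a TLM (translation language modeling) example:
# a bilingual sentence pair is concatenated and tokens are masked on both sides,
# so the model can attend to either language when recovering a masked token.
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, mask_token="[MASK]"):
    pair = src_tokens + ["[/s]"] + tgt_tokens
    inputs, labels = [], []
    for tok in pair:
        if tok != "[/s]" and random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)          # predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)         # no loss at unmasked positions
    return inputs, labels

en = "the cat sits on the mat".split()
fr = "le chat est assis sur le tapis".split()
print(make_tlm_example(en, fr))
```

Because a masked English token can often be recovered by looking at the aligned French words (and vice versa), training on such examples pushes the two languages toward a shared representation space.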

Besides TLM, there are some other effective methods to learn multilingual representations from parallel corpora. Unicoder (Huang et al., 2019a) provides two novel pre-training tasks based on parallel corpora: cross-lingual word recovery (CLWR) and cross-lingual paraphrase classification (CLPC). CLWR uses target language embeddings to represent source language embeddings through attention mechanisms, with the objective of recovering the source language embeddings; this task enables models to learn word-level alignments between different languages. CLPC treats aligned sentences as positive pairs and samples misaligned sentences as negative pairs for sentence-level classification, letting models predict whether an input pair is aligned or not; with CLPC, models can learn sentence-level alignments between different languages. ALM (Yang et al., 2020) automatically generates code-switched sequences from parallel sentences and performs MLM on them, which forces models to make predictions based only on contexts in other languages. InfoXLM (Chi et al., 2020b) analyzes MMLM and TLM from the perspective of information theory and encourages models to distinguish aligned sentence pairs from misaligned negative examples under the framework of contrastive learning. HICTL (Wei et al., 2021) extends the idea of contrastive learning to learn both sentence-level and word-level cross-lingual representations. ERNIE-M (Ouyang et al., 2020) proposes back-translation masked language modeling (BTMLM) and expands the scale of parallel corpora through back-translation. These works show that leveraging parallel corpora can greatly help in learning cross-lingual representations.

Researchers have also widely explored generative models for multilingual PTMs. Normally, a generative model consists of a Transformer encoder and a Transformer decoder. For example, MASS (Song et al., 2019) extends MLM to language generation: it randomly masks a span of tokens in the input sentence and predicts the masked tokens in an autoregressive manner. Denoising autoencoding (DAE) is a typical generation task, which applies noise functions to the input sentence and then restores the original sentence with the decoder. The noise functions of DAE usually contain two operations: replacing a span of tokens with a mask token and permuting the order of tokens. mBART (Liu et al., 2020c) extends DAE to support multiple languages by adding special symbols: it adds a language symbol to the end of the encoder input and to the beginning of the decoder input, which enables the model to know which languages are being encoded and generated.
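As a rough sketch of the DAE input construction just described, the snippet below applies the two noise operations (span masking and token permutation) and attaches an illustrative language symbol "<en>"; it is a simplified stand-in for mBART's actual noising pipeline.

```python
# A rough sketch of mBART-style denoising input construction, under simplified
# assumptions: a single masked span and token-level shuffling stand in for the
# full noise functions, and "<en>" is an illustrative language symbol.
import random

def noise(tokens, mask_token="[MASK]"):
    tokens = tokens[:]                         # copy before in-place edits
    span = random.randint(1, max(1, len(tokens) // 3))
    start = random.randrange(0, len(tokens) - span + 1)
    tokens[start:start + span] = [mask_token]  # replace a span with one mask token
    random.shuffle(tokens)                     # permute the order of tokens
    return tokens

sentence = "pre training denoises corrupted multilingual text".split()
encoder_input = noise(sentence) + ["<en>"]     # language symbol at the end of the encoder input
decoder_input = ["<en>"] + sentence            # and at the beginning of the decoder input
print(encoder_input, decoder_input)
```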

Although DAE in mBART (Liu et al., 2020c) is trained with multiple languages, the encoder input and the decoder output are always in the same language. This leads models to capture spurious correlations between language symbols and generated sentences; in other words, models may ignore the given language symbols and directly generate sentences in the same language as the input. To address this issue, XNLG (Chi et al., 2020a) proposes the cross-lingual autoencoding (XAE) task. Different from DAE, the encoder input and the decoder output of XAE are in different languages, similar to machine translation. In addition, XNLG optimizes parameters in a two-stage manner: it trains the encoder with the MLM and TLM tasks in the first stage, and then fixes the encoder and trains the decoder with the DAE and XAE tasks in the second stage. In this way, all parameters are well pre-trained, and the gap between pre-training with MLM and fine-tuning with autoregressive decoding is also bridged.

5.2 Multimodal Pre-Training

Large-scale pre-training and its downstream applications have driven impactful research and development across diverse real-world modalities. As human beings, we are exposed to different modalities: we see objects, hear sounds and speak languages. Modalities, such as audio, video, image and text, refer to how something happens or is experienced. Recent years have witnessed a surging interest in cross-modal tasks that involve multiple modalities. More recently, large-scale PTMs have further stimulated research at the intersection of multiple modalities, such as image and text, or video and text. Most of these cross-modal works can be classified as vision and language (V&L), considering that images and videos belong to vision while text and speech (audio) belong to language. Specifically, V&L tasks can be further divided into image-text-based, video-text-based, and video-audio-based tasks according to the modalities used. In this section, we present an overview of existing work on pre-training over V&L modalities. Existing cross-modal PTMs mainly focus on (1) improving the model architecture, (2) utilizing more data, and (3) designing better pre-training tasks.

For image-text-based PTMs, most current works are based on the architecture of visual-linguistic BERT. The main challenge lies in aligning visual and textual content in a unified semantic space (i.e., V&L grounding). To this end, there are mainly two kinds of model architecture designs: two-stream and single-stream. As a representative two-stream model, ViLBERT (Lu et al., 2019) processes image regions and text tokens with two separate streams and fuses them with specially designed co-attention Transformer blocks. In comparison, LXMERT (Tan and Bansal, 2019) first processes the two modalities separately and then conducts a late fusion with a cross-modality encoder. In single-stream models, such as VisualBERT (Li et al., 2019), Unicoder-VL (Li et al., 2020a) and B2T2 (Alberti et al., 2019), the image region features and word embeddings are usually concatenated and fed into a single Transformer. Researchers have not reached a consensus on which design provides better V&L grounding ability (Lu et al., 2019; Su et al., 2020). Considering model simplicity and parameter efficiency, current works mainly adopt the single-stream design.

In cross-modal pre-training, data resources are also of vital significance. The most widely used corpora are image-text pairs collected from the web, such as Conceptual Captions (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011), or existing V&L datasets designed for specific tasks, including COCO (Lin et al., 2014), Flickr30K (Plummer et al., 2015), GQA (Hudson and Manning, 2019), VQA (Antol et al., 2015) and Visual Genome (Krishna et al., 2017). Directly increasing the scale of image-text data is useful for better V&L grounding. UNITER (Chen et al., 2020f) combines several of the above-mentioned datasets, resulting in 5.6 million image-text pairs for training; this sufficient training data helps UNITER achieve impressive results on downstream tasks. Similar to UNITER in architecture and pre-training tasks, ImageBERT (Qi et al., 2020) further constructs a dataset containing 10 million web image-text pairs and uses it for pre-training, leading to better performance than UNITER on image-text retrieval tasks. In addition to parallel image-text data, VL-BERT (Su et al., 2020) finds that incorporating extra text-only corpora such as BooksCorpus (Zhu et al., 2015) and Wikipedia helps text understanding, especially for tasks with long and complex sentences like visual commonsense reasoning. Different from works that use only easily collected data such as image-text pairs or textual corpora, Lu et al. (2020) identify the contribution of dedicated datasets by conducting joint multi-task training on nearly all kinds of V&L tasks.

Given data resources, it is also important to design corresponding pre-training tasks or strategies to utilize the information efficiently. For V&L understanding tasks, the most widely used pre-training tasks are MLM, sentence-image alignment (SIA), masked region classification (MRC), masked region feature regression (MRFR), and directly incorporating downstream tasks. Similar to MLM for NLP, MLM for V&L aims to recover masked tokens in captions with the help of visual and textual context. SIA is designed to judge whether image-text pairs are matched. MRC can be considered the visual counterpart of MLM, requiring V&L models to predict the categories of masked objects. MRFR further requires V&L models to recover the visual features of masked object regions. There are also models that directly conduct downstream V&L understanding tasks in the pre-training stage: for example, LXMERT employs VQA as a pre-training task, and Lu et al. (2020) train all downstream tasks jointly. To learn the fine-grained alignment between image regions and words, UNITER further proposes a word-region alignment task based on Optimal Transport (Chen et al., 2020c), which first finds a sparse matching between image regions and words and then minimizes the alignment distance. However, most of these works ignore the role of object tags as explicit bridges between image regions and text tokens. Therefore, Oscar (Li et al., 2020e) proposes to concatenate object tags with the original image-text pairs as anchors to learn the alignment between the V&L modalities, and designs a new pre-training task that judges the alignment between the image-tag sequence and the caption. In this way, Oscar achieves SOTA results on most V&L understanding and generation tasks compared with the aforementioned models.

Besides pre-training tasks designed for V&L understanding, there are also some pre-training tasks targeting V&L generation. For example, VLP (Zhou et al., 2020a) and X-GPT (Xia et al., 2020) employ seq2seq MLM as their pre-training tasks.

Instead of designing delicate pre-training tasks, recent works such as CLIP (Radford et al., 2021) and WenLan (Huo et al., 2021) choose to grasp the V&L grounding ability in a simple and holistic regime. They encode images and captions into holistic visual and textual representations rather than separate region features and word embeddings, and then only conduct an image-text retrieval task. The success of this kind of holistic alignment can largely be attributed to the enlarged scale of web data, which is 400 million image-text pairs for CLIP and 30 million for WenLan.
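The snippet below sketches the symmetric image-text contrastive objective used by CLIP-style models. The encoders are replaced by random feature vectors and the temperature value 0.07 is illustrative; only the shape of the objective (matched pairs on the diagonal, cross-entropy in both directions) reflects the approach described above.

```python
# A minimal sketch of the holistic image-text contrastive objective used by
# CLIP-style models; the encoders are replaced by random features here.
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # holistic image representations
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # holistic caption representations

logits = image_emb @ text_emb.t() / 0.07                   # pairwise similarities / temperature
targets = torch.arange(batch)                              # matched pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```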

The works mentioned above are specialized for V&L understanding or image captioning, but are not capable of image generation. Recently, a bigger step towards conditional image generation has been taken by DALLE (Ramesh et al., 2021) and CogView (Ding et al., 2021a). DALLE is the first Transformer-based text-to-image PTM, with around 10 billion parameters. It shows the potential of multimodal PTMs in bridging the gap between text descriptions and image generation, especially its excellent ability to combine different objects, such as “an armchair in the shape of an avocado”. CogView further improves numerical precision and training stability by introducing a sandwich transformer and a sparse attention mechanism, and surpasses DALLE in Fréchet Inception Distance (FID) (Heusel et al., 2017) on blurred COCO.

In addition to image-text PTMs, there are also PTMs for other modalities, such as video and audio. VideoBERT (Sun et al., 2019a) conducts pre-training on the Cooking312K video dataset (Sun et al., 2019a) and validates the model on zero-shot action classification and video captioning. SpeechBERT (Chuang et al., 2019) first encodes the continuous audio signal into phonetic-semantic word embeddings and then uses MLM over both the text and audio modalities as pre-training tasks; after pre-training, the spoken question answering (SQA) task is used for evaluation.

5.3 Knowledge-Enhanced Pre-Training

PTMs can extract plenty of statistical information from large amounts of data. In addition, external knowledge, such as knowledge graphs, domain-specific data and extra annotations of pre-training data, is the product of human wisdom and can serve as a good prior for modeling such statistics. In this subsection, we classify external knowledge according to its format and introduce several methods that attempt to combine knowledge with PTMs.

The typical form of structured knowledge is the knowledge graph. Many works try to enhance PTMs by integrating entity and relation embeddings (Zhang et al., 2019b; Liu et al., 2020a; Peters et al., 2019; Sun et al., 2020; Rosset et al., 2020; Qin et al., 2021) or their alignments with the text (Xiong et al., 2019; Sun et al., 2019c). However, real-world knowledge graphs like Wikidata contain more information than entities and relations. Wang et al. (2021b) pre-train models on the descriptions of Wikidata entities, combining a language model loss and a knowledge embedding loss to obtain knowledge-enhanced representations. Some works regard paths and even sub-graphs in knowledge graphs as a whole, and directly model them together with the aligned text to retain more structural information. Since aligning entities and relations to raw text is often troublesome and can introduce noise during data pre-processing, another line of work (Bosselut et al., 2019; Guan et al., 2020; Chen et al., 2020e) directly converts structural knowledge into serialized text and lets models learn the knowledge-text alignments by themselves. An interesting attempt is OAG-BERT (Liu et al., 2021a), which integrates heterogeneous structural knowledge from the open academic graph (OAG) (Zhang et al., 2019a), covering 0.7 billion heterogeneous entities and 2 billion relations.

Compared to structured knowledge, unstructured knowledge is more intact but also noisier. How to effectively model this kind of knowledge from data is also worth exploring. The data of a specific domain or task can be considered a kind of unstructured knowledge. Many works (Beltagy et al., 2019; Lee et al., 2020) further pre-train general PTMs on such data to obtain better domain-specific or task-specific models. Since some domain-specific and task-specific human annotations exist, Ke et al. (2020) incorporate these extra annotations to obtain better domain-specific and task-specific language representations. For all the above-mentioned works, knowledge is implicitly stored in model parameters. To model external knowledge in a more interpretable way, some works (Lewis et al., 2020b; Guu et al., 2020) design retrieval-based methods to use structured knowledge on downstream tasks. Another kind of work (Wang et al., 2020b) uses adapters trained on different knowledge sources with extra annotations to distinguish where the knowledge comes from.

6 Improving Computational Efficiency

As introduced in Section 1, a major trend of PTMs is that the number of parameters is getting larger and larger. Increasing the size of a neural network typically improves accuracy, but it also increases the memory and computational requirements for training the model. In this section, we will introduce how to improve computational efficiency from the following three aspects: system-level optimization, efficient learning algorithms, and model compression strategies.

6.1 System-Level Optimization

An effective and practical way to reduce computational requirements is system-level optimization towards computational efficiency and memory usage. System-level optimization methods are often model-agnostic and do not change the underlying learning algorithms. Therefore, they are widely used in training large-scale PTMs. Generally, these methods can be divided into single-device optimization methods and multi-device optimization ones.

Single-Device Optimization. Current large-scale PTMs usually cost a lot of memory during pre-training. This is mainly due to the redundant representation of floating-point numbers. Modern deep learning systems are mainly based on the single-precision floating-point format (FP32). However, the weights of models usually fall into a limited range, and using the half-precision floating-point format (FP16) can accomplish most of the computation with little precision loss (Gupta et al., 2015).

However, in some cases, training models in FP16 may fail because of floating-point truncation and overflow. To tackle this problem, mixed-precision training methods (Micikevicius et al., 2018) have been proposed, which preserve some critical weights in FP32 to avoid floating-point overflow and use dynamic loss scaling to get rid of floating-point truncation. Sufficient experiments have shown that mixed-precision training is more stable than directly training models in FP16. Although mixed-precision training can significantly reduce training time and memory usage, it still faces some challenges: when model parameters are not well initialized, mixed-precision methods may still lead to unstable training. These challenges still need to be further explored.
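The sketch below shows how a mixed-precision training loop typically looks with PyTorch's automatic mixed precision utilities; it assumes a CUDA GPU is available, and the linear model and squared-output loss are dummies standing in for a real PTM and objective.

```python
# A minimal mixed-precision training sketch with PyTorch AMP: the forward and
# backward passes run mostly in FP16 while a GradScaler applies dynamic loss
# scaling to avoid FP16 gradient underflow. Model and data are dummies.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.cuda.amp.autocast():          # ops run in FP16 where it is safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()            # scale the loss to keep FP16 gradients representable
    scaler.step(optimizer)                   # unscale and skip the step if inf/nan appears
    scaler.update()                          # adjust the loss scale dynamically
    optimizer.zero_grad()
```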


Figure 11: An illustration of the data parallelism and model parallelism with 16 nodes.

Besides the redundant representation of floating-point numbers, the activation states saved for computing gradients are also redundant. For example, in Transformer-based models, apart from the weights of attention layers and linear layers, computational devices also store the hidden states of each layer to make the chain rule in gradient back-propagation efficient. These hidden states can consume even more memory than the model parameters. To handle redundant activation states, gradient checkpointing methods (Rasley et al., 2020) have been used to save memory by storing only part of the activation states after the forward pass; the discarded activation states are recomputed during the backward pass if necessary.
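A small sketch of this idea with PyTorch's checkpointing utility is given below. The stack of linear blocks is a stand-in for Transformer layers, and the use_reentrant=False flag assumes a reasonably recent PyTorch version.

```python
# A small sketch of gradient checkpointing with torch.utils.checkpoint: the
# hidden states of the wrapped blocks are not stored during the forward pass
# and are recomputed during back-propagation, trading compute for memory.
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()) for _ in range(12)]
)

def forward(x):
    for block in blocks:
        # use_reentrant=False is the recommended mode in recent PyTorch versions
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward(x).sum().backward()
```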

When pre-training recent large-scale PTMs, the memory consumption can be too large to fit in a single GPU. Therefore, some works (Huang et al., 2020a) attempt to store model parameters and activation states in CPU memory rather than GPU memory, since CPU memory is usually much larger. As shown in Figure 10, works such as ZeRO-Offload (Ren et al., 2021) design delicate strategies to schedule the swap between CPU memory and GPU memory so that memory swapping and device computation overlap as much as possible.

Figure 10: An illustration of ZeRO-Offload and ZeRO-Offload with delayed parameter update.

Multi-Device Optimization. Recently, distributed training has been commonly used in pre-training, where multiple GPUs distributed across many computational nodes are used together to train a single model. Data parallelism (Li et al., 2020d) is a simple and effective approach to accelerate training.

As shown in Figure 11, when we use data parallelism, a large batch is partitioned across different nodes, so the forward pass can be parallelized. In the backward pass, the gradients on different nodes are aggregated with all-reduce operations to keep parameter optimization consistent, which may introduce additional communication overhead.
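The condensed sketch below shows what this looks like with PyTorch's DistributedDataParallel: each process holds a full replica, consumes its own shard of the batch, and gradient all-reduce happens during the backward pass. It assumes the script is launched with torchrun so that the usual environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are set; the model and loss are dummies.

```python
# A condensed sketch of data parallelism with PyTorch DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")      # each rank sees its own shard of the batch
loss = model(x).pow(2).mean()
loss.backward()                               # gradient all-reduce happens here
optimizer.step()
```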

When pre-training models with billions to trillions of parameters, traditional data parallelism struggles to fit the whole set of model parameters into a single GPU, even with half-precision or mixed-precision training. Although this problem can be solved by using GPUs with larger memory, the expense can be hard to afford, limiting the use of PTMs by ordinary researchers. Model parallelism is an effective way to tackle this problem (Shazeer et al., 2018). As shown in Figure 11, when conducting model parallelism, model parameters are distributed across multiple nodes, and communication operations between these nodes such as reduce-scatter and all-gather guarantee the correctness of the forward and backward passes. Megatron-LM (Shoeybi et al., 2019) applies model parallelism to Transformer-based PTMs: it splits self-attention heads as well as feed-forward layers across different GPUs, reducing the memory burden on a single GPU. Mesh-TensorFlow (Shazeer et al., 2018) further enables users to split tensors along any tensor dimension, offering more customized options for model parallelism.
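The toy example below illustrates the core idea behind such tensor-level model parallelism on a single machine: the weight matrix of a linear layer is split along its output dimension, each "device" computes a slice of the output, and the slices are combined (an all-gather in a real multi-GPU setup). It is a conceptual sketch, not Megatron-LM's implementation.

```python
# A toy illustration of tensor (model) parallelism: a linear layer's weight is
# split along the output dimension across two hypothetical devices.
import torch

d_in, d_out = 1024, 4096
weight = torch.randn(d_out, d_in)
w0, w1 = weight.chunk(2, dim=0)               # each "device" owns half of the output features

x = torch.randn(8, d_in)
y0 = x @ w0.t()                               # computed on device 0
y1 = x @ w1.t()                               # computed on device 1
y = torch.cat([y0, y1], dim=-1)               # gather the partial outputs

assert torch.allclose(y, x @ weight.t(), atol=1e-5)  # identical to the unsplit layer
```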

Although model parallelism enables different computational nodes to store different parts of the model parameters, it has to insert collective communication primitives during both the forward and backward passes, and these cannot be overlapped with device computation. In contrast, the all-reduce collective communication in data parallelism can usually be overlapped with the backward computation. As a result, data parallelism is preferred as long as it can satisfy the memory requirements. In the standard implementation of data parallelism, optimizer states are replicated across nodes to guarantee synchronized optimization across data-parallel units. This redundancy leads to additional GPU memory overhead, especially when models are trained in a mixed-precision manner, because the optimizer needs to store 32-bit master copies of the parameters to ensure accuracy. To eliminate the redundancy of optimizer states and parameters, the ZeRO optimizer (Rajbhandari et al., 2020) equally partitions and distributes the optimizer states across the data-parallel nodes, such that each node only updates the optimizer states corresponding to its partition. At the end of a training step, all optimizer states are gathered across the data-parallel nodes.

Figure 12: An illustration of the pipeline parallelism with 4 nodes and 4 micro batches.

The above-mentioned model parallelism techniques mainly focus on partitioning and parallelizing matrix operations across different nodes. As shown in Figure 12, another effective method for model parallelism is pipeline parallelism, which partitions a deep neural network into multiple layers and then places different layers on different nodes. After the computation on each node, the output is sent to the next node, where the next layer's computation takes place. Since pipeline parallelism only needs to communicate the intermediate activation states between nodes performing adjacent stages of the pipeline, the communication cost is relatively small. Existing pipeline methods include GPipe (Huang et al., 2019b), which sends smaller parts of samples within a mini-batch to different nodes, and TeraPipe (Li et al., 2021), which applies token-level pipelining to Transformer-based models so that each token in a sequence can be processed by different nodes. Both of these pipeline methods speed up the training of large-scale PTMs. However, they have to stall at the end of each batch until gradient back-propagation is complete, which leads to pipeline bubbles.

6.2 Efficient Pre-Training

Besides system-level optimization methods, various efforts have been devoted to exploring more efficient pre-training methods, so that large-scale PTMs can be pre-trained at a lower cost.

Efficient Training Methods. Conventional pre-training tasks can be sample-inefficient. For example, in MLM, which is widely used to pre-train recent PTMs, models are required to predict masked tokens according to their contexts. The masked tokens are usually only a subset (typically 15%) of the input tokens, i.e., models can only learn from a small fraction of the input. To tackle this problem, ELECTRA (Clark et al., 2020) applies the replaced token detection task, which forces models to distinguish whether each input token has been replaced by a generator. This task can leverage more supervision from each sample since all input tokens need to be classified, and ELECTRA takes far fewer pre-training steps to reach performance similar to MLM models. Furthermore, traditional MLM randomly chooses tokens in a document to mask. Since the difficulty of predicting different tokens varies a lot, the random masking strategy makes training aimless and inefficient. Therefore, some works selectively mask tokens based on their importance (Gu et al., 2020) or on gradients computed in back-propagation (Chen et al., 2020b) to speed up training.


Apart from the pre-training tasks, the current pre-training dynamics are also sub-optimal. Recent large-scale PTMs usually require a large batch size, but an early work (Goyal et al., 2017) found that naively increasing the batch size may cause difficulties in optimization. The authors therefore propose a warmup strategy that linearly increases the learning rate at the beginning of training, which is commonly used in recent large-scale PTMs. Another feature of recent PTMs is that they are usually composed of multiple stacked copies of a base structure such as the Transformer layer. The conventional training paradigm optimizes all layers simultaneously using the same hyper-parameters. However, some recent works study Transformer-based models and observe that different layers can share similar self-attention patterns. Therefore, a shallow model can first be trained and then duplicated to construct a deep model (Gong et al., 2019). Some layers can also be dropped during training to reduce the complexity of back-propagation and weight updates (Zhang and He, 2020). In addition, You et al. (2017) and You et al. (2020) find that adaptively using different learning rates at different layers can also speed up convergence when the batch size is large.
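The following sketch shows one common way to implement the linear warmup schedule described above; the 1000-step warmup and the inverse square-root decay afterwards are illustrative choices, not the exact schedule of any particular PTM.

```python
# A minimal sketch of a linear warmup learning-rate schedule: the learning rate
# grows linearly for the first `warmup` steps and then decays.
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup = 1000
def lr_lambda(step):
    if step < warmup:
        return (step + 1) / warmup                 # linear warmup
    return (warmup / (step + 1)) ** 0.5            # inverse square-root decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for step in range(3000):
    optimizer.step()                               # dummy step; real training omitted
    scheduler.step()
```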

Efficient Model Architectures. Besides efficient pre-training methods, variants of the model architecture can also reduce computational complexity and improve the efficiency of training PTMs. For most Transformer-based PTMs, as the input sequence grows longer, efficiency is limited by the computation of attention weights due to its quadratic time and space complexity in the sequence length. Therefore, many works attempt to reduce the complexity of Transformers. Some works (Peng et al., 2021; Choromanski et al., 2021; Wang et al., 2020c; Katharopoulos et al., 2020) design low-rank kernels to theoretically approximate the original attention weights, resulting in linear complexity. Some works (Child et al., 2019) introduce sparsity into attention mechanisms by limiting the view of each token to a fixed size and separating tokens into several chunks, so that the computation of attention weights takes place within each chunk rather than over the complete sequence. Compared with predefined chunks, some works (Roy et al., 2021; Kitaev et al., 2020) find that using learnable parameters to assign tokens to chunks leads to better performance. Another kind of method (Guo et al., 2019; Lee et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020) combines global and local attention mechanisms and uses global nodes to gather tokens in a sequence; in this way, the long sequence is compressed into a small number of elements to reduce the complexity.
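The following is a simple sketch of the chunk-based sparse attention just described: tokens attend only to other tokens within the same fixed-size chunk, so the attention cost scales with the chunk size rather than with the full sequence length. The shapes and chunk size are illustrative, and the single-head, unbatched-head layout is a simplification.

```python
# A simple sketch of chunked (block-local) self-attention with fixed-size chunks.
import torch
import torch.nn.functional as F

batch, seq_len, dim, chunk = 2, 1024, 64, 128
q = torch.randn(batch, seq_len, dim)
k = torch.randn(batch, seq_len, dim)
v = torch.randn(batch, seq_len, dim)

# Reshape to (batch, num_chunks, chunk, dim) so attention stays inside a chunk.
qc, kc, vc = (t.view(batch, seq_len // chunk, chunk, dim) for t in (q, k, v))
scores = qc @ kc.transpose(-1, -2) / dim ** 0.5     # (batch, num_chunks, chunk, chunk)
out = F.softmax(scores, dim=-1) @ vc
out = out.view(batch, seq_len, dim)
print(out.shape)
```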

While keeping the same theoretical computational complexity as the original Transformer, other variants of the model structure can also accelerate model convergence. Mixture-of-experts (MoE) was shown early on (Shazeer et al., 2017) to increase the number of parameters of deep neural models while keeping the computational overhead nearly unchanged. Recently, Switch Transformers (Fedus et al., 2021) employ this technique in pre-training: they add multiple experts to each Transformer layer and, during each forward and backward step, select only one expert for computation, so training and inference time remain similar to ordinary Transformers without experts. Experimental results show that MoE-based models converge faster than ordinary ones due to the significantly larger model capacity brought by multiple experts. Efficient open-source toolkits (He et al., 2021) have also been developed to train large-scale MoE-based models.
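A toy MoE layer with top-1 routing in the spirit of this idea is sketched below. The expert width, the router, and the omission of load-balancing losses and capacity limits are all simplifications; this is not the Switch Transformers implementation.

```python
# A toy mixture-of-experts (MoE) layer with top-1 routing: a router picks one
# expert per token, so parameters grow with the number of experts while each
# token is still processed by a single feed-forward network.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, dim)
        gate = self.router(x).softmax(dim=-1)  # routing probabilities
        prob, index = gate.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = index == i
            if mask.any():
                out[mask] = prob[mask, None] * expert(x[mask])
        return out

print(Top1MoE(dim=64, num_experts=4)(torch.randn(10, 64)).shape)
```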

6.3 Model Compression

Another important approach to improving the efficiency of PTMs is model compression. In this setting, large models are compressed into small ones to meet the demand for faster inference and deployment on resource-constrained devices.

Parameter Sharing. PTMs can be compressed by sharing parameters across similar units. ALBERT (Lan et al., 2019) uses factorized embedding parameterization and cross-layer parameter sharing to reduce the parameters of PTMs. Using the same weights across all Transformer layers, ALBERT achieves a significant parameter reduction compared with BERT while obtaining the same or even better performance, which indicates that PTMs can be extremely over-parameterized.
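The essence of cross-layer parameter sharing can be sketched in a few lines: a single Transformer layer is instantiated once and applied repeatedly, so a depth-12 stack has roughly the parameter count of one layer. The layer dimensions below are illustrative, and the sketch omits ALBERT's factorized embeddings.

```python
# A minimal sketch of ALBERT-style cross-layer parameter sharing: one encoder
# layer is reused at every depth.
import torch
import torch.nn as nn

shared_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

def shared_encoder(x, num_layers=12):
    for _ in range(num_layers):     # reuse the same weights at every depth
        x = shared_layer(x)
    return x

x = torch.randn(2, 16, 256)
print(shared_encoder(x).shape)
```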

Model Pruning. To further exploit the over-parameterization of current PTMs, another way to reduce model parameters is model pruning, which cuts off useless parts of PTMs to achieve acceleration while maintaining performance. In Fan et al. (2019), Transformer layers are selectively dropped during training, resulting in a shallower model at inference time. In Michel et al. (2019), Voita et al. (2019) and Zhang et al. (2021b), researchers study the redundancy of attention heads in Transformers and find that only a small fraction of them is enough for good performance; most of these heads can be removed with little impact on accuracy. Other trials such as CompressingBERT (Gordon et al., 2020) prune the weights of attention layers and linear layers to reduce the number of parameters in PTMs while maintaining performance comparable to the original model.

Knowledge Distillation. Although ALBERT saves memory, its inference time is not significantly reduced since features still need to pass through the same number of layers as in the original model. Knowledge distillation aims at training a small student model to reproduce the behavior of a large teacher model; both memory usage and time overhead are reduced when the small distilled model is used for inference. Typical works applying knowledge distillation to PTMs include DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019), BERT-PKD (Sun et al., 2019b) and MiniLM (Wang et al., 2020d). In these works, a small student model is trained to mimic the output probabilities, the hidden states, and the attention matrices of a large teacher model during both the pre-training and fine-tuning stages. With knowledge distillation, the knowledge in the teacher model is transferred into the student model, which can improve performance compared with training the student model alone. However, the knowledge distillation methods mentioned above require the data used for pre-training the teacher model, which is usually not released due to data copyright and privacy concerns. Moreover, the teacher model needs to run a forward pass over the entire pre-training corpus to produce logits or intermediate representations for distillation, causing an even longer training time.
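The sketch below shows the standard soft-label distillation loss: the student matches the teacher's temperature-softened output distribution in addition to the usual hard-label cross-entropy. The random logits, the temperature 2.0 and the mixing weight 0.5 are placeholders; distilling hidden states and attention matrices, as in the works above, would add further terms.

```python
# A small sketch of the soft-label knowledge distillation loss.
import torch
import torch.nn.functional as F

batch, num_classes, T, alpha = 4, 10, 2.0, 0.5
teacher_logits = torch.randn(batch, num_classes)
student_logits = torch.randn(batch, num_classes, requires_grad=True)
labels = torch.randint(0, num_classes, (batch,))

soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)                                    # standard temperature scaling of the soft term
hard_loss = F.cross_entropy(student_logits, labels)
loss = alpha * soft_loss + (1 - alpha) * hard_loss
loss.backward()
```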

Model Quantization. To obtain a more compact model, model quantization is also a useful technique, which has been widely explored for CNN-based models (Stock et al., 2020; Polino et al., 2018). Model quantization refers to compressing higher-precision floating-point parameters into lower-precision ones. Conventional PTMs are usually represented in 32 or 16 bits, whereas quantized models can use 8 bits or even 1 or 2 bits. For recent Transformer-based models, 8-bit quantization has been proved effective for model compression in Q8BERT (Zafrir et al., 2019), with little impact on model performance. Nevertheless, training 1-bit or 2-bit models remains challenging due to the significant decrease in model capacity. To alleviate the performance degradation, other methods that preserve accuracy can be employed. Q-BERT (Shen et al., 2020a) uses mixed-bit quantization, in which parameters with a higher Hessian spectrum require higher precision while those with a lower Hessian spectrum need only lower precision. TernaryBERT (Zhang et al., 2020b) applies knowledge distillation during quantization, forcing low-bit models to imitate the full-precision model. Both Q-BERT and TernaryBERT achieve ultra-low-bit models. However, low-bit representation is a highly hardware-dependent technique, which means quantization often requires specific hardware and cannot generalize to other devices.
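As a simple illustration of 8-bit quantization, the sketch below applies PyTorch's post-training dynamic quantization to the linear layers of a dummy feed-forward model; this is only one of several quantization regimes and is not the procedure used by Q8BERT, Q-BERT, or TernaryBERT.

```python
# A short sketch of post-training dynamic quantization: the Linear layers of a
# dummy model are converted to 8-bit weights for smaller, faster CPU inference.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072), torch.nn.ReLU(), torch.nn.Linear(3072, 768)
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # quantize only Linear modules to int8
)
print(quantized(torch.randn(1, 768)).shape)
```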

7 Interpretation and Theoretical Analysis

Beyond the superior performance of PTMs on various NLP tasks, researchers also explore how to interpret the behaviors of PTMs, including understanding how PTMs work and uncovering the patterns that PTMs capture. These works cover several important properties of PTMs: knowledge, robustness, and structural sparsity/modularity. Moreover, there are some pioneering works on building a theoretical analysis of PTMs.

7.1 Knowledge of PTMs

The implicit knowledge captured by PTMs can be roughly divided into two categories: linguistic knowledge and world knowledge.

Linguistic Knowledge. The linguistic knowledge of PTMs attracts the most attention among all topics of PTM interpretation. Compared with conventional neural models such as CNNs and RNNs, which have fewer layers and parameters, large-scale PTMs can learn rich linguistic knowledge from massive pre-training data. To study PTMs’ linguistic knowledge, researchers have designed several approaches: (1) Representation probing: fix the parameters of a PTM and train a new linear layer on its hidden representations for a specific probing task; this is the most popular approach because it can easily be adapted to any probing task without particular design (a minimal sketch follows this list). (2) Representation analysis: use the hidden representations of PTMs to compute statistics such as distances or similarities, and use these statistics to construct relations between different words, phrases, or sentences. (3) Attention analysis: similar to representation analysis, but compute statistics over attention matrices, which is better suited to discovering the hierarchical structure of texts. (4) Generation analysis: use language models to directly estimate the probabilities of different sequences or words, where the target texts may be correct or incorrect with respect to certain linguistic phenomena.
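The sketch below illustrates approach (1), representation probing. A random frozen encoder stands in for a real PTM, the 5-way probing task and mean pooling are illustrative, and only the linear probe receives gradient updates.

```python
# A minimal sketch of representation probing: the (stand-in) PTM is frozen and
# only a new linear classifier is trained on top of its hidden representations.
import torch
import torch.nn as nn

frozen_encoder = nn.Sequential(nn.Embedding(1000, 128), nn.Linear(128, 128))
for p in frozen_encoder.parameters():
    p.requires_grad = False                      # fix the PTM parameters

probe = nn.Linear(128, 5)                        # linear probe for a 5-way probing task
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

tokens = torch.randint(0, 1000, (32, 16))
labels = torch.randint(0, 5, (32,))
with torch.no_grad():
    reps = frozen_encoder(tokens).mean(dim=1)    # sentence representation (mean pooling)
loss = nn.functional.cross_entropy(probe(reps), labels)
loss.backward()
optimizer.step()
```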

Representation probing has been widely applied to analyze NLP neural models from word embeddings to PTMs (Köhn, 2015; Ettinger et al., 2016; Shi et al., 2016; Adi et al., 2017; Conneau et al., 2018a; Hewitt and Manning, 2019; Glavaš and Vulic, 2021). Liu et al. (2019) conduct comprehensive probing experiments on 11 linguistic tasks and find that the representations given by large-scale PTMs are competitive with previous task-specific models, which indicates that these models have already learned knowledge about tokens, chunks, and pairwise relations. To further investigate how PTMs represent sentence structures in terms of syntactic, semantic, local, and long-range information, Tenney et al. (2019b) design a new edge probing task, examine PTMs on a broad suite of sub-sentence tasks, and show that PTMs encode syntactic information strongly while bringing little improvement on semantic tasks. Similarly, several other works also reveal the strong syntax encoding of PTMs (Vilares et al., 2020; Warstadt and Bowman, 2020; Hewitt and Manning, 2019). To analyze the function of different layers, Jawahar et al. (2019a) and Tenney et al. (2019a) show that PTMs encode linguistic information with phrase features at the bottom, syntactic features in the middle and semantic features at the top. Compared with non-contextual representations (e.g., word2vec), PTMs’ representations are better at encoding sentence-level properties (Miaschi and Dell’Orletta, 2020). Furthermore, Manning et al. (2020) reconstruct the sentence tree structures given by linguists using a linear transformation of PTMs’ embeddings and achieve promising results.

Besides representation probing, researchers try to uncover the structure of and relations among different representations. Kim et al. (2020) propose to leverage the concept of syntactic distance to construct the constituency trees of sentences from word representations. Rosa and Marecek (2019) analyze how deleting one word in a sentence changes the representations of other words, to reveal the influence of one word on others.

There are also several works on interpreting PTMs via attention matrices. Lin et al. (2019) quantitatively evaluate attention matrices for subject-verb agreement and anaphor-antecedent dependencies, and show that PTMs tend to encode positional information in lower layers and capture hierarchical information in higher layers. To better characterize the behaviors of PTMs’ attention matrices, Htut et al. (2019) propose two statistics: taking the maximum attention weight and computing the maximum spanning tree. Based on the experimental results, they find that fine-tuning has little impact on the self-attention patterns.

Since PTMs can be directly used to generate tokens or estimate the probabilities of different sentences, it is intuitive to construct analysis tasks based on generation (Goldberg, 2019). Perturbed Masking (Wu et al., 2020) recovers syntactic trees from PTMs without any extra parameters, and the structures given by PTMs are competitive with a human-designed dependency schema on some downstream tasks. To analyze the gain of pre-training for estimating the probabilities of ungrammatical words, Schijndel et al. (2019) show that expanding the training corpora yields diminishing returns and that the corpora would need to be unrealistically large for PTMs to match human performance.

World Knowledge. In addition to linguistic knowledge, PTMs also learn rich world knowledge from pre-training, mainly including commonsense knowledge and factual knowledge (Zhou et al., 2020b; Bouraoui et al., 2020).

For commonsense knowledge, Ettinger (2020) first evaluates PTMs’ knowledge from the perspective of psycholinguistics and finds that the models perform well in situations of shared category or role reversal but fail on challenging inferences and role-based events. To extract commonsense from PTMs, Davison et al. (2019) propose to first transform relational triples into masked sentences and then rank these sentences according to the mutual information given by PTMs. In their experiments, this PTM-based extraction method, without further training, even generalizes better than current supervised approaches. Similarly, Da and Kasai (2019) also find that PTMs have learned various commonsense features in their representation space, based on a series of probing tasks. Beyond commonsense features and attributes, the implicit relations between different attributes are also important, and Forbes et al. (2019) show that current PTMs’ representations cannot model these implicit relations well, which requires further exploration.

For factual knowledge, Petroni et al. (2019) propose to formulate relational knowledge generation as the completion of fill-in-the-blank statements. According to the experimental results, they find that PTMs significantly outperform previous supervised baselines on this task without any fine-tuning. However, constructing these fill-in-the-blank statements is non-trivial. To extract more factual knowledge from PTMs, LPAQA (Jiang et al., 2020b) has been proposed to automatically search for better statements/prompts through mining-based and paraphrasing-based methods. AutoPrompt (Shin et al., 2020) proposes to train discrete prompts for knowledge probing. In P-tuning (Liu et al., 2021b), the authors discover that better prompts lie in a continuous embedding space rather than a discrete one; P-tuning boosts the P@1 performance on LAMA to 64%, which is 20% higher than AutoPrompt. Moreover, Roberts et al. (2020) fine-tune PTMs for open-domain question answering and find that fine-tuning can further benefit the knowledge generation of PTMs. However, Pörner et al. (2020) find that the success of knowledge generation may rely on learning stereotypical associations, e.g., a person with an Italian-sounding name will be predicted to be Italian by PTMs. For understanding numbers in text, Wallace et al. (2019c) find that ELMo, a character-based model, captures numeracy best among the pre-trained methods, while BERT, which uses sub-word units, is less exact. Wang et al. (2020a) investigate the knowledge stored in the Transformer’s feed-forward attention matrices and propose a framework to construct open knowledge graphs using PTMs.
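A minimal LAMA-style probe of this kind can be run in a few lines, for instance with the Hugging Face `fill-mask` pipeline; the example below assumes the `transformers` library and the `bert-base-uncased` checkpoint are available, and the prompt is a toy fill-in-the-blank statement rather than one from the LAMA benchmark.

```python
# A minimal sketch of LAMA-style factual probing with a fill-in-the-blank
# statement, using a masked language model via the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```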

7.2 Robustness of PTMs

Recent works have identified severe robustness problems in PTMs using adversarial examples. Adversarial attacks aim to generate new samples that models misclassify by applying small perturbations to the original inputs. For example, PTMs can be easily fooled by synonym replacement (Jin et al., 2020; Zang et al., 2020; Wang et al., 2021a).

Meanwhile, irrelevant artifacts such as form words can mislead PTMs into making wrong predictions (Niven and Kao, 2019; Wallace et al., 2019a). Current works mainly use model predictions, prediction probabilities, and model gradients to search for adversarial examples, but it is difficult to maintain the quality of adversarial examples generated by machines. Recently, human-in-the-loop methods (Wallace et al., 2019b; Nie et al., 2020) have been applied to generate more natural, valid, and diverse adversarial examples, which pose larger challenges and expose more properties and problems of PTMs. In conclusion, the robustness of PTMs has become a serious security concern when PTMs are deployed in real-world applications.

7.3 Structural Sparsity of PTMs

Following BERT, most PTMs adopt the Transformer as their architecture backbone. Although one can easily train a deep Transformer and achieve significant improvements over previous CNN- and RNN-based models, the Transformer suffers from over-parameterization. Researchers have shown that the multi-head attention structures are redundant in machine translation (Michel et al., 2019), abstractive summarization (Baan et al., 2019), and language understanding (Kovaleva et al., 2019), i.e., removing part of the attention heads can even yield better performance. This phenomenon is consistent with the observation in Clark et al. (2019) that most heads in the same layer share similar self-attention patterns. Furthermore, Kovaleva et al. (2019) conduct a qualitative and quantitative analysis of the information encoded by PTMs’ heads; their findings suggest that the attention behaviors of different heads can be categorized into a limited set of patterns. Besides multi-head attention, several other works explore the sparsity of parameters. Gordon et al. (2020) show that low levels of pruning (30-40%) do not affect the pre-training loss or the performance on downstream tasks at all. Targeting sparsity during fine-tuning, Prasanna et al. (2020) validate the lottery ticket hypothesis on PTMs and find that it is possible to find sub-networks whose performance is comparable with that of the full model. Surprisingly, Kao et al. (2020) show that performance can be improved by simply duplicating some hidden layers to increase the model capacity, which suggests that redundant parameters may benefit fine-tuning.

7.4 Theoretical Analysis of PTMs

Since pre-training has achieved great success in deep learning, researchers have tried to investigate how pre-training works, especially unsupervised pre-training. In the early days of deep learning, it was found effective to train a deep belief network by greedy layer-wise unsupervised pre-training followed by supervised fine-tuning (Hinton et al., 2006). Recently, pre-training based on contrastive learning, including language modeling, has become the mainstream approach. In this section, we introduce some theoretical explanatory hypotheses and frameworks for pre-training.

Erhan et al. (2010) propose two hypotheses to explain the effect of pre-training: (1) better optimization and (2) better regularization. Under the better-optimization view, a network with pre-training is closer to the global minimum than a randomly initialized one. Under the better-regularization view, the training error of PTMs is not necessarily lower than that of random models, but their test error is lower, which means better generalization. Their experimental results lean towards the second hypothesis: PTMs do not achieve lower training error, and, compared with other regularization approaches such as L1/L2, the unsupervised pre-training regularization works much better.

Regarding recent developments in pre-training objectives, Saunshi et al. (2019) conduct a theoretical analysis of contrastive unsupervised representation learning. Contrastive learning treats pairs of texts/images appearing in the same context as semantically similar pairs and randomly sampled pairs as semantically dissimilar pairs; the distance between a similar pair should be small and the distance between a dissimilar pair should be large. In the prediction process of language modeling, the context and the target word form a similar pair and the other words serve as negative samples (Kong et al., 2020). Saunshi et al. (2019) first provide a new conceptual framework to bridge the gap between pre-training and fine-tuning. Specifically, they introduce the concept of latent classes, and semantically similar pairs are drawn from the same latent class. For example, a latent class can be “happy”, covering all texts expressing happy sentiments. The latent classes cover all possible classes, and the classes defined by downstream tasks are drawn from this set of latent classes. They then prove that the contrastive learning loss is an upper bound of the downstream loss. Hence, when optimizing the pre-training loss, we can expect a lower loss on downstream tasks.
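For concreteness, one commonly used form of the contrastive objective sketched above, written with a single negative sample, is

\[
\mathcal{L}_{\mathrm{con}} \;=\; \mathbb{E}_{(x,\,x^{+},\,x^{-})}\!\left[-\log\frac{\exp\!\big(f(x)^{\top} f(x^{+})\big)}{\exp\!\big(f(x)^{\top} f(x^{+})\big)+\exp\!\big(f(x)^{\top} f(x^{-})\big)}\right],
\]

where \((x, x^{+})\) is a semantically similar pair drawn from the same latent class, \(x^{-}\) is a randomly sampled negative, and \(f\) is the representation function being pre-trained; the exact loss analyzed by Saunshi et al. (2019) may differ in its details.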

8 Future Directions

So far, we have comprehensively reviewed the past and present of PTMs. In the future, on the basis of existing works, PTMs can be further developed in the following directions: architectures and pre-training methods (Section 8.1), multilingual and multimodal pre-training (Section 8.2), computational efficiency (Section 8.3), theoretical foundation (Section 8.4), modeledge learning (Section 8.5), cognitive learning (Section 8.6), and novel applications (Section 8.7). In fact, researchers have already made many efforts in these directions, and we have introduced the latest breakthroughs in the previous sections. However, some open problems in these directions still need to be addressed, and we focus on discussing them in this section.

8.1 Architectures and Pre-Training Methods

From the perspective of architectures and pre-training methods, we believe the following problems are worth further exploration in the future:

New Architectures. Transformers have been proved to be an effective architecture for pre-training. However, the main limitation of Transformers is their computational complexity. Limited by GPU memory, most current PTMs cannot handle sequences longer than 512 tokens. Therefore, it is important to search for more efficient model architectures that capture longer-range contextual information. However, designing deep architectures is challenging, and we may seek help from automatic methods such as neural architecture search (NAS). Besides, although larger PTMs usually lead to better performance, a practical problem is how to leverage these huge PTMs in special scenarios, such as low-capacity devices and low-latency applications, where efficiency is a key factor. Moreover, different downstream tasks prefer different architectures: for example, the Transformer encoder is suitable for natural language understanding tasks while the Transformer decoder is suitable for natural language generation tasks. Therefore, we may need to carefully design task-specific architectures according to the type of downstream task.

New Pre-Training Tasks. General-purpose PTMs that learn the intrinsic universal knowledge of languages (and even world knowledge) remain our long-term pursuit. However, such PTMs usually require deeper architectures, larger corpora, and more challenging pre-training tasks, all of which lead to higher training costs. Moreover, training huge models is itself a challenging problem that needs sophisticated and efficient training techniques such as distributed training and mixed-precision training. Therefore, a more practical direction is to design more efficient self-supervised pre-training tasks and training methods according to the capabilities of existing hardware and software. ELECTRA (Clark et al., 2020), whose replaced token detection objective provides a learning signal for every input token, is a good attempt in this direction.
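
As a rough illustration of why replaced token detection is sample-efficient, the sketch below (our simplification, not the official ELECTRA implementation) trains a discriminator to label every position of a corrupted sequence as original or replaced, so each token, not only the masked positions, contributes to the loss. In the real method, replacements come from a small generator rather than random sampling.

# Simplified replaced-token-detection objective in the spirit of ELECTRA.
import torch
import torch.nn as nn

vocab_size, hidden, seq_len, batch = 1000, 64, 16, 4

class TinyDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(hidden, 1)   # per-token original/replaced logit

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens))).squeeze(-1)

tokens = torch.randint(0, vocab_size, (batch, seq_len))
corrupt_mask = torch.rand(batch, seq_len) < 0.15            # positions to corrupt
random_tokens = torch.randint(0, vocab_size, (batch, seq_len))
corrupted = torch.where(corrupt_mask, random_tokens, tokens)
labels = (corrupted != tokens).float()                      # 1 = replaced

model = TinyDiscriminator()
loss = nn.functional.binary_cross_entropy_with_logits(model(corrupted), labels)
loss.backward()   # every token position produces a learning signal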

Beyond Fine-Tuning. Currently, fine-tuning is the dominant method for transferring the knowledge of PTMs to downstream tasks, but one deficiency is its parameter inefficiency: every downstream task keeps its own copy of fine-tuned parameters. An improved solution is to fix the original parameters of the PTM and add small fine-tunable adaptation modules for specific tasks, so that a shared PTM can serve multiple downstream tasks. Recently, with the emergence of GPT-3, a new genre of model tuning, namely prompt tuning, has been attracting more and more attention. By designing, generating, and searching for discrete (Petroni et al., 2019; Gao et al., 2021; Hu et al., 2021) or continuous (Liu et al., 2021b; Han et al., 2021; Lester et al., 2021) prompts and using MLM-style prediction for specific downstream tasks, these methods can (1) bridge the gap between pre-training and fine-tuning, thereby performing better on downstream tasks, and (2) reduce the computational cost of fine-tuning the tremendous number of parameters. In summary, prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PTMs.
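
A minimal sketch of the adapter idea, assuming a generic frozen layer standing in for a PTM block (the module names are illustrative, not from any specific library): the PTM's parameters stay fixed, and only a small bottleneck module per layer is trained for each task, so one shared backbone can serve many tasks.

# Adapter-style tuning sketch: freeze the backbone, train small bottlenecks.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter applied to a frozen layer's output."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # residual connection

class AdaptedLayer(nn.Module):
    """Wrap a pre-trained layer (frozen) with a trainable adapter."""
    def __init__(self, pretrained_layer, hidden=768):
        super().__init__()
        self.layer = pretrained_layer
        for p in self.layer.parameters():
            p.requires_grad = False        # keep PTM weights fixed
        self.adapter = Adapter(hidden)

    def forward(self, h):
        return self.adapter(self.layer(h))

# Toy layer standing in for a PTM block; only adapter params receive gradients.
layer = AdaptedLayer(nn.Linear(768, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable adapter parameters")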

Reliability. The reliability of PTMs is also becoming an issue of great concern with their extensive use in production systems. Studies of adversarial attacks against PTMs (Li et al., 2020b,c; Zhang et al., 2021c) help us understand their capabilities by fully exposing their vulnerabilities. Adversarial defenses for PTMs (Si et al., 2020; Yao et al., 2021; Li and Qiu, 2021) are also promising, as they can improve the robustness of PTMs and make them more resistant to adversarial attacks. Overall, as a key component in many NLP applications, the interpretability and reliability of PTMs remain to be further explored; progress here will help us understand how PTMs work and provide guidance for better use and further improvement of PTMs.
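
As a toy illustration of the kind of vulnerability these studies probe, the sketch below (our simplification, not any cited attack or defense) crafts a fast-gradient-style perturbation in the embedding space of a small classifier; virtual-adversarial-style defenses train the model to remain stable under exactly such perturbations.

# Embedding-space adversarial perturbation (FGSM-style) on a toy classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(1000, 32)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8, 2))

tokens = torch.randint(0, 1000, (4, 8))
labels = torch.randint(0, 2, (4,))

emb = embed(tokens).detach().requires_grad_(True)    # attack the embeddings
loss = F.cross_entropy(classifier(emb), labels)
loss.backward()

epsilon = 0.01
adv_emb = emb + epsilon * emb.grad.sign()             # worst-case gradient step
adv_loss = F.cross_entropy(classifier(adv_emb), labels)
print(f"clean loss {loss.item():.3f} -> adversarial loss {adv_loss.item():.3f}")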

8.2 Multilingual and Multimodal Pre-Training

Although multimodal and multilingual PTMs have witnessed numerous advances in the last two years, the following research lines remain open:

More Modalities. In addition to images and text, video and audio can also be exploited for multimodal pre-training. The main challenge then lies in modeling the temporal contexts involved in these two modalities. In particular, for large-scale pre-training over video-text pairs, conventional self-supervised learning methods are unsuitable due to their high computational costs. To handle this problem, it is important to develop more effective and efficient self-supervised learning methods for these more complex modalities.

More Insightful Interpretation. It is still unknown why bridging vision and language works. For example, regardless of the advantages brought by multimodal pre-training, does it harm either single modality (image or text)? If so, can we overcome this drawback during multimodal pre-training? Along this research line, the latest visualization tools for deep learning can be exploited to interpret multimodal pre-training.

More Downstream Applications. It is well known that multimodal pre-training can be applied to image-text retrieval, image-to-text generation, text-to-image generation, and other downstream tasks. However, it is still challenging to find a "true" real-world application scenario for multimodal pre-training, since many effective engineering tricks can be leveraged instead (often at lower cost). Closer collaboration with industry is thus needed.

Transfer Learning. Currently, to make multimodal multilingual models handle different languages, data for each language is required during pre-training, which makes it inflexible to add unseen languages afterwards. Therefore, a new pre-training framework that can easily adapt to unseen languages should be explored. Besides, current multimodal multilingual models cannot process audio data. For example, to translate English audio to Chinese audio, we first need to convert English audio to English text with an extra speech recognition system; after translation with a cross-lingual model, we further need to convert Chinese text to Chinese audio with an extra text-to-speech tool. How to directly map source-language audio to target-language text or audio with multimodal multilingual PTMs is also worth exploring.

8.3 Computational Efficiency

Deep learning models have become increasingly complicated and large in recent years (Devlin et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Fedus et al., 2021). The new requirements of large-scale deep learning models bring severe challenges to existing deep learning frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), which were designed in the early days without foreseeing emerging requirements such as the model/pipeline parallelism of large models (Brown et al., 2020; Huang et al., 2019b; Wang et al., 2019). To develop more efficient frameworks, the following directions are helpful.

Data Movement. Developing an efficient distributed deep learning framework faces various challenges. One has to carefully manage data movement between devices, which may otherwise become the performance bottleneck (Narayanan et al., 2019; Jiang et al., 2020a). A well-defined parallelism strategy is needed to place and schedule computational tasks on inter-connected devices by minimizing communication cost, maximizing the utilization of computational and memory resources, and optimizing the computation-communication overlap. In the best case, such an efficient parallelism strategy can be generated automatically.

Parallelism Strategies. Regarding the choice of parallelism strategy, data parallelism, model parallelism, pipeline parallelism, and various hybrid approaches each find their best usage depending on the structure of the neural network and the hardware configuration (Ben-Nun and Hoefler, 2019). Data parallelism is especially suitable for deep learning models with a relatively small set of parameters (usually fewer than tens of millions), where near-linear speed-up can be achieved when back-propagation maximally overlaps with gradient/parameter communication (Hashemi et al., 2019; Peng et al., 2019; Jiang et al., 2020a). Model parallelism and pipeline parallelism target models with a much larger number of parameters, which probably cannot fit into a single device. In current practice, a user must thoroughly consider the network structure of a given deep learning model and the inter-device communication bandwidth to decide on the most appropriate parallelism strategy, or to switch between different strategies (Shazeer et al., 2018).
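
For instance, a typical data-parallel setup in PyTorch looks roughly like the following sketch (a sketch only: the launcher, backend, and environment variables vary by cluster, and here we assume a torchrun-style launcher that sets LOCAL_RANK). Each worker holds a full model replica, and the framework overlaps gradient all-reduce with back-propagation.

# Minimal data-parallel training skeleton with PyTorch DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])   # one full replica per worker
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)   # each rank sees its own shard
        loss = model(x).pow(2).mean()
        loss.backward()           # gradient all-reduce overlaps with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()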

Large-Scale Training. Given the poor support for model parallelism and pipeline parallelism in existing deep learning frameworks, some emerging open-source projects develop dedicated frameworks for large-scale training. For example, HugeCTR (Oldridge et al., 2020) is used for large-scale click-through rate estimation, Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) and DeepSpeed (Rajbhandari et al., 2021, 2020) target training large-scale NLP PTMs, and InsightFace (ins, 2021) trains large-scale face recognition models. However, these frameworks are restricted to limited application cases and cannot serve as a general solution. Further, they cannot work together to constitute a complete solution due to compatibility issues.

Wrappers and Plugins. Without a mechanism supporting model parallelism and pipeline parallelism, one has to develop libraries dedicated to particular algorithms by inserting data-routing operations by hand between computing operations on top of existing frameworks. Further, communication and computation need to be manually overlapped to maximize system throughput. Manually programming communication operations is prohibitively complicated and only solves problems case by case, which is a significant obstacle to applying parallelism strategies to new deep learning models. If communication operations could be managed automatically and transparently by deep learning frameworks, more models and applications would benefit from distributed training.

To support more complicated parallelism strategies, many schemes serve as wrappers or plugins on top of mainstream deep learning frameworks such as TensorFlow and PyTorch. Mesh-TensorFlow (Shazeer et al., 2018), FlexFlow (Jia et al., 2019), OneFlow (one, 2021), MindSpore (min, 2021), and GShard (Lepikhin et al., 2021) provide APIs for developers to express a wide range of parallel computation patterns for different components of deep neural models. The SBP configuration in OneFlow can still be too complex for users to set, although directly programming with communication primitives for different kinds of parallelism is even more complicated; OneFlow turns such manual programming into simply setting SBP signatures. Moreover, in OneFlow the user can set SBP signatures for only a subset of operations and leave the rest to be inferred with heuristic approaches such as GShard (Lepikhin et al., 2021), in which users provide some initial annotations (or use default annotations) as seeds and the algorithm then propagates the sharding information to the un-annotated tensors. The approach in FlexFlow (Jia et al., 2019) can also be used here. Automatic scheduling of parallelism strategies is the future trend of distributed training.
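
The following pure-Python sketch (a conceptual illustration only, not the OneFlow SBP or GShard API) mimics this seed-and-propagate idea: a few tensors carry user-provided sharding annotations, and the remaining tensors inherit annotations by walking the operator graph until a fixed point is reached.

# Conceptual sketch of sharding-annotation propagation.
# op graph as: op name -> (input tensor names, output tensor name)
graph = {
    "matmul1": (["x", "w1"], "h1"),
    "relu":    (["h1"], "h2"),
    "matmul2": (["h2", "w2"], "y"),
}

# Seed annotations: "split(d)" = sharded along dim d, "broadcast" = replicated.
annotations = {"x": "split(0)", "w1": "broadcast", "w2": "broadcast"}

def propagate(graph, annotations):
    changed = True
    while changed:                       # iterate until a fixed point
        changed = False
        for op, (inputs, output) in graph.items():
            if output in annotations:
                continue
            known = [annotations[t] for t in inputs if t in annotations]
            if len(known) == len(inputs):
                # Simple heuristic: keep a split if any input is split,
                # otherwise replicate the output.
                splits = [a for a in known if a.startswith("split")]
                annotations[output] = splits[0] if splits else "broadcast"
                changed = True
    return annotations

print(propagate(graph, annotations))
# {'x': 'split(0)', 'w1': 'broadcast', 'w2': 'broadcast',
#  'h1': 'split(0)', 'h2': 'split(0)', 'y': 'split(0)'}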

8.4 Theoretical Foundation

In this subsection, we analyze future directions from a more fundamental perspective. In the aspect of theoretical foundations, we discuss the following research problems.

Uncertainty. One under-addressed issue with PTMs (as well as other deep neural networks) is that they are often over-confident in their predictions, i.e., these models do not know what they do not know. For instance, GPT-3 can answer questions with promising performance on benchmark datasets. However, if you ask a simple question like "How many eyes does my foot have?", GPT-3 will confidently produce an answer like "Your foot has two eyes", which is counter-intuitive.4 Of course, such a question is rarely asked by humans, and dealing with such out-of-distribution (OOD) data is generally a challenging task in machine learning.

4 More examples of the Turing test of GPT-3 can be found at https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html

To address the above challenge, one promising direction is to adopt Bayesian methods, which use probabilistic tools to capture the uncertainty of both data and model (also known as aleatoric uncertainty and epistemic uncertainty, respectively) (Der Kiureghian and Ditlevsen, 2009) or to derive testing statistics. Such uncertainty estimates or statistics are helpful for detecting outliers (Wang et al., 2020f). Recently, much work has been done on the theory, algorithms, and programming libraries of Bayesian deep learning, which conjoins Bayesian methods and deep networks (see (Shi et al., 2017) for more details). Such progress can be extended to large-scale PTMs to properly characterize uncertainty and avoid over-confident outputs. Of course, improving the computational efficiency of Bayesian deep learning is a key factor in addressing the above challenge.
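
As one lightweight example of this Bayesian flavor (our illustration, not a method from the cited work), Monte Carlo dropout keeps dropout active at test time and treats the spread of repeated stochastic predictions as an uncertainty estimate, which can then be thresholded to flag possibly out-of-distribution inputs.

# Monte Carlo dropout as a cheap uncertainty estimate for a classifier.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(256, 2),
)

def mc_dropout_predict(model, x, n_samples=32):
    model.train()                       # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    mean = probs.mean(0)                # predictive distribution
    std = probs.std(0)                  # spread across stochastic passes
    return mean, std

x = torch.randn(4, 128)
mean, std = mc_dropout_predict(model, x)
print("predicted class:", mean.argmax(-1), "uncertainty:", std.max(-1).values)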

Generalization and Robustness. Another important issue with PTMs is generalization. As an important advancement of deep learning, PTMs inherit both the advantages and the challenges of deep neural networks. It has been observed that classical learning theory is not sufficient to understand the behavior of deep networks (Zhang et al., 2017), calling for new tools in learning theory. For PTMs, besides a theoretical understanding of the neural models themselves (e.g., Transformer and BERT), new questions arise. For example, it is important to theoretically understand the role of pre-training in improving the generalization of downstream tasks. The recent work of Saunshi et al. (2019) provides a fruitful attempt at understanding contrastive learning under particular assumptions, but analyzing PTMs under more realistic settings remains largely open.

As mentioned before, adversarial robustness also raises new questions. Previous work has shown that a higher sample complexity is needed to achieve adversarial robustness for neural networks (Schmidt et al., 2018), and such analysis has inspired further improvements (e.g., (Pang et al., 2020)). However, it is generally unknown how large-scale PTMs can help in this aspect. Are there effective ways to exploit PTMs as extra data resources to improve the robustness of downstream tasks? Also, the robustness of PTMs themselves is an unsolved issue, as mentioned before.

8.5 Modeledge Learning

As introduced in section 7, PTMs achieve a surge of improvements on a wide range of NLP tasks because they learn versatile knowledge from large unlabeled corpora. As opposed to the knowledge represented by discrete symbols, which is interpretable to human beings, the knowledge stored in PTMs is represented as real-valued vectors. For example, given a triple 〈h, r, t〉 in a knowledge graph, it is easy to know that the head entity h has relation r to the tail entity t. In contrast, it is hard to tell what a representation produced by a PTM means. Therefore, we refer to the knowledge stored in PTMs as "modeledge", distinguished from the discrete symbolic knowledge formalized by human beings.

Knowledge-Aware Tasks. While the use of symbolic knowledge is effective, manually organizing this discrete knowledge, such as building various knowledge bases, is time-consuming and labor-intensive. With the rapid advance of research on PTMs, various PTMs such as GPT, BERT, and BART have emerged. More and more researchers have probed what knowledge PTMs learn from data and why they perform so well on downstream tasks (Jawahar et al., 2019b; Ethayarajh, 2019). Petroni et al. (2019) state that PTMs can be seen as knowledge bases and study how to apply PTMs to the knowledge completion task. Ethayarajh (2019) also claims that PTMs could serve as open knowledge graphs and proposes an unsupervised method to build knowledge graphs from PTMs. From all these knowledge-aware tasks, we can see that a wealth of human knowledge is captured by PTMs and stored in the form of modeledge. How to stimulate the modeledge of PTMs is worth further exploration in the future.

Modeledge Storage and Management. As existing PTMs are built on varying architectures and may be trained on different corpora, they contain diverse modeledge. As a result, how to store and manage the various continuous modeledge in PTMs becomes a new challenge. There are two straightforward ideas. The first is to pre-train a huge model on extra-large-scale data, so that the resulting PTM covers almost all the modeledge in existing PTMs. This method is simple and effective, but it requires extremely high computational power and storage resources; for example, GPT-3 uses about 175 billion parameters. The second is to combine multiple models into one large model based on a mixture of experts (MoE) (Jacobs et al., 1991); for example, Fedus et al. (2021) improve MoE to propose Switch Transformers. This method easily accommodates new models, but its memory requirement grows as the number of models increases.

Considering that there are both similarities and differences among existing PTMs, an important question needs to be answered: is it possible to build a universal continuous knowledge base (UCKB) that stores modeledge from various PTMs? A UCKB could not only store the continuous modeledge imported from existing PTMs but also blend different modeledge and export the fused modeledge back into a model to make it more powerful. Chen et al. (2020a) first propose the concept of UCKB and make some preliminary explorations. They regard neural networks as parameterized functions and use knowledge distillation (Hinton et al., 2014) to import and export modeledge. UCKB overcomes the redundancy of model storage and stores the modeledge of various models in a common continuous knowledge base. However, designing more effective architectures for the storage and interface of UCKB remains challenging.
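
A minimal sketch of the distillation step behind such import/export (our simplification of the general technique of Hinton et al. (2014), not the UCKB implementation): a student matches the temperature-softened output distribution of a teacher on shared inputs, so the teacher's modeledge is transferred through its predictions rather than its raw parameters.

# Knowledge distillation: student mimics the teacher's softened outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(128, 10)   # stands in for a large PTM "exporting" modeledge
student = nn.Linear(128, 10)   # smaller model "importing" it
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0

for step in range(100):
    x = torch.randn(32, 128)                      # shared (unlabeled) inputs
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    # KL divergence between softened distributions, scaled as in Hinton et al.
    loss = F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()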

8.6 Cognitive and Knowledgeable Learning

Making PTMs more knowledgeable is an important topic for the future of PTMs. We divide the future development of knowledgeable PTMs into the following three approaches:

Knowledge Augmentation. For an input text, there is rich related external knowledge that can be used to augment the input. Considering that the formats of knowledge and plain text are very different, it is important to bridge the gap between text representations and knowledge representations (whether symbols or vectors) and to use their information uniformly as input. Solving this problem requires both unified model architectures and knowledge-guided pre-training objectives.

Knowledge Support. Current model architectures are manually designed and usually very regular. With prior knowledge about the input, we can train different sub-modules to process different kinds of input, which may accelerate training and inference and benefit model efficiency. This is similar to human behavior, where different brain regions are responsible for different functions.

Knowledge Supervision. Knowledge bases store large amounts of structured data, which can be used as a complementary source during pre-training. By learning from both knowledge bases and large-scale corpora, PTMs can acquire better language understanding and generation abilities than by using plain text alone. Through these three directions, we hope future PTMs can easily understand meanings beyond the words themselves and achieve better performance on various downstream tasks.

In terms of cognitive PTMs, we believe the following approaches would be helpful:

Cognitive Architecture. Since neural networks are inspired by the micro-structure of the human nervous system, it is worth exploring how the macro-level function and organization of the human cognitive system, such as the Global Workspace Theory (GWT), can inform the design of the next generation of intelligent systems. The success of CogQA and CogLTX may provide some thoughts on this challenge.

Explicit and Controllable Reasoning. While deep learning has achieved success in many perceptual tasks, how to conduct complex decision-making and efficient multi-step reasoning is still unsolved. This may require machines to automatically plan the decision-making process into a cognitive graph and perform explicit reasoning over the factors in the graph as humans do. Methods such as Inverse Prompting (Zou et al., 2021), which shows strong ability in controlling theme-related text generation, may provide some insights.

Interactions of Knowledge. Although PTMs are getting bigger and more general, what knowledge they have learned from pre-training is still largely unexplored. Moreover, since our brains work through the collaboration of different functional regions, it is important to examine whether PTMs have formed different internal functional modules and how these modules interact with each other.

8.7 Applications

PTMs have been successfully applied to a wide variety of domains and tasks. In this section, we highlight some of these applications.

Natural Language Generation. Many natural language generation tasks have been dominated by PTMs such as GPT-2, BART, T5, UniLM, and many more. These tasks include machine translation, summarization, dialog generation, story generation, poetry generation, and other long-text generation. With the prevailing trend of PTMs, the backbone models have shifted from CNNs/RNNs to Transformers or Transformer-based PTMs. PTMs have also been successfully applied to multimodal generation: trained on text-image parallel data, these models have shown strong performance in applications such as visual question answering, image-to-text generation, and text-to-image generation. Since large-scale PTMs are trained on such large-scale data, they have innate advantages for natural language generation, particularly low-resource natural language generation.

Dialog Systems. Many recent open-domain dialog systems are built upon large-scale Transformer structures. Examples include Meena (Adiwardana et al., 2020), Blender (Roller et al., 2021), CDial-GPT (Wang et al., 2020e), PLATO (Bao et al., 2020), and PLATO-2 (Bao et al., 2021), which are trained on large-scale conversation data, commonly with the seq2seq framework. These models have shown the ability to deliver natural and engaging conversations, some reportedly close to human-level performance (Adiwardana et al., 2020). However, compared with pre-training tasks for other applications, dialog-specific pre-training tasks are yet to be explored.

Domain-Specific PTMs. When large-scale domain-specific corpora are cheaply available, we can train domain-specific PTMs on such data. Notable works include BioBERT (Lee et al., 2020) and SciBERT (Beltagy et al., 2019), which are trained on biomedical and scientific literature, respectively. These models have been shown to learn more domain-specific knowledge and language use than those trained on general text, and such domain expertise is usually regarded as important for solving many domain-specific problems.

Domain Adaptation and Task Adaptation. Large-scale PTMs learn general knowledge from large-scale general text, providing a good initialization for further learning domain-specific knowledge by fine-tuning or other techniques. Although PTMs are becoming larger and larger, domain-specific data are always limited, so domain adaptation is becoming crucial for domain-specific applications. It is evident that simple fine-tuning of large-scale PTMs is not sufficient for domain-specific applications (Gururangan et al., 2020; Ke et al., 2020). The most essential reason is distribution shift: the data distribution in a specific domain may differ substantially from that of the general pre-training text. Another important issue for the success of domain-specific applications is task adaptation. Most often, domain applications have a small set of labeled data, which can enable supervised learning to acquire domain expertise more efficiently. However, for super-large PTMs, simply fine-tuning on labeled data is neither computationally efficient nor effective in performance. Thus, bridging the gap between pre-training and task-specific fine-tuning becomes crucial. Moreover, efficient and effective task-specific fine-tuning is also an important research direction for the future application of PTMs (Soares et al., 2019; Ding et al., 2021b).
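
As a concrete instance of domain-adaptive pre-training in the style of Gururangan et al. (2020), the sketch below (hedged: the corpus, model name, and hyper-parameters are placeholders, and details of the Trainer setup vary across library versions) continues masked language modeling on unlabeled in-domain text with the Hugging Face Transformers API before any task-specific fine-tuning.

# Domain-adaptive pre-training sketch: continue MLM on in-domain text.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tiny stand-in for a large unlabeled in-domain corpus.
domain_texts = ["the patient presented with acute myocardial infarction",
                "dosage was adjusted according to renal clearance"]
encodings = tokenizer(domain_texts, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt-checkpoint",
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
# The adapted checkpoint is then fine-tuned on the small labeled task data.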

9 Conclusion

In this paper, we look into the history of pre-training to indicate the core issues of PTMs and, meanwhile, reveal the crucial position of PTMs in the AI development spectrum. Furthermore, we comprehensively review the latest efforts towards better PTMs, including designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. All these works contribute to the recent wave of PTM development. Although existing PTMs have achieved promising results, especially large-scale PTMs showing amazing abilities in zero/few-shot learning scenarios, how to develop PTMs next is still an open question. The knowledge stored in PTMs is represented as real-valued vectors, which is quite different from the discrete symbolic knowledge formalized by human beings. We name this continuous and machine-friendly knowledge "modeledge" and believe it is promising to capture modeledge in a more effective and efficient way and to stimulate the modeledge for specific tasks. We hope our view can inspire more efforts in this field and advance the development of PTMs.

Note and Contribution

This paper originates from a 3-day closed-door workshop initiated by Jie Tang, Ji-Rong Wen, and Minlie Huang, held in Beijing WTown from January 1 to January 3, 2021, and supported by the China Computer Federation (CCF). All authors of this paper organized or participated in this workshop, and this paper can be regarded as a summary and extension of the discussion at the workshop.

The contributions of all authors are listed as follows: Zhiyuan Liu and Xu Han designed the structure of this paper; Xu Han drafted the abstract, Section 1, and Section 2; Ning Ding and Xu Han drafted Section 3; Xiao Liu and Jiezhong Qiu drafted Section 4; Yuqi Huo, Yuan Yao, Ao Zhang, and Liang Zhang drafted Section 5; Yuxian Gu drafted Section 6; Zhengyan Zhang drafted Section 7. All faculty authors drafted various topics in Section 8, including Xipeng Qiu for Section 8.1; Ji-Rong Wen, Ruihua Song, and Yang Liu for Section 8.2; Jinhui Yuan and Wentao Han for Section 8.3; Jun Zhu and Yanyan Lan for Section 8.4; Yang Liu for Section 8.5; Jie Tang and Zhiyuan Liu for Section 8.6; and Minlie Huang and Jie Tang for Section 8.7. Wayne Xin Zhao and Xipeng Qiu provided comments on the manuscript, and Xu Han, Ning Ding, and Zhengyan Zhang proofread the whole paper.

References

2021. Insightface project. https://github.com/deepinsight/insightface.

2021. MindSpore Deep Learning Framework. https://github.com/mindspore-ai/mindspore.

2021. OneFlow Deep Learning Framework. https://github.com/Oneflow-Inc/oneflow.

Martín Abadi, Paul Barham, Jianmin Chen, ZhifengChen, Andy Davis, Jeffrey Dean, Matthieu Devin,Sanjay Ghemawat, Geoffrey Irving, Michael Isard,Manjunath Kudlur, Josh Levenberg, Rajat Monga,Sherry Moore, Derek G. Murray, Benoit Steiner,Paul Tucker, Vijay Vasudevan, Pete Warden, MartinWicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Ten-sorflow: A system for large-scale machine learning.In Proceedings of OSDI, pages 265–283.

Yossi Adi, Einat Kermany, Yonatan Belinkov, OferLavi, and Yoav Goldberg. 2017. Fine-grained anal-ysis of sentence embeddings using auxiliary predic-tion tasks. In Proceedings of ICLR.

Daniel Adiwardana, Minh-Thang Luong, David R So,Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang,Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,et al. 2020. Towards a human-like open-domainchatbot. arXiv preprint arXiv:2001.09977.

Joshua Ainslie, Santiago Ontanon, Chris Alberti, PhilipPham, Anirudh Ravula, and Sumit Sanghai. 2020.ETC: Encoding long and structured inputs in trans-formers. In Proceedings of EMNLP, pages 268–284.

Chris Alberti, Jeffrey Ling, Michael Collins, and DavidReitter. 2019. Fusion of detected objects in textfor visual question answering. In Proceedings ofEMNLP-IJCNLP, pages 2131–2140.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-garet Mitchell, Dhruv Batra, C Lawrence Zitnick,and Devi Parikh. 2015. Vqa: Visual question an-swering. In Proceedings of ICCV, pages 2425–2433.


Martin Arjovsky, Soumith Chintala, and Léon Bottou.2017. Wasserstein generative adversarial networks.In Proceedings of ICML, pages 214–223.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-ton. 2016. Layer normalization. In Proceedings ofNeurIPS.

Joris Baan, Maartje ter Hoeve, Marlies van der Wees,Anne Schuth, and Maarten de Rijke. 2019. Under-standing multi-head attention in abstractive summa-rization. arXiv preprint arXiv:1911.03898.

Alan Baddeley. 1992. Working memory. Science,255(5044):556–559.

Siqi Bao, Huang He, Fan Wang, Hua Wu, and HaifengWang. 2020. PLATO: Pre-trained dialogue genera-tion model with discrete latent variable. In Proceed-ings of ACL.

Siqi Bao, Huang He, Fan Wang, Hua Wu, HaifengWang, Wenquan Wu, Zhen Guo, Zhibin Liu, andXinchao Xu. 2021. Plato-2: Towards building anopen-domain chatbot via curriculum learning. InProceedings of ACL.

Pierre Barrouillet, Sophie Bernardin, and ValérieCamos. 2004. Time constraints and resource shar-ing in adults’ working memory spans. Journal ofExperimental Psychology: General, 133(1):83–100.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and SoumikMandal. 2019. Reconciling modern machine-learning practice and the classical bias–variancetrade-off. PNAS, 116(32):15849–15854.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert:A pretrained language model for scientific text. InProceedings of EMNLP-IJCNLP, pages 3615–3620.

Iz Beltagy, Matthew E Peters, and Arman Cohan.2020. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150.

Tal Ben-Nun and Torsten Hoefler. 2019. Demystify-ing parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Sur-veys (CSUR), 52(4):1–43.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, andChristian Janvin. 2003. A neural probabilistic lan-guage model. JMLR, 3:1137–1155.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi.1994. Learning long-term dependencies with gradi-ent descent is difficult. IEEE TNNLS, 5(2):157–166.

Bin Bi, Chenliang Li, Chen Wu, Ming Yan, andWei Wang. 2020. Palm: Pre-training an autoen-coding&autoregressive language model for context-conditioned generation. In Proceedings of EMNLP,pages 8681–8691.

Ondrej Bojar, Christian Buck, Christian Federmann,Barry Haddow, Philipp Koehn, Johannes Leveling,Christof Monz, Pavel Pecina, Matt Post, HerveSaint-Amand, et al. 2014. Findings of the 2014workshop on statistical machine translation. In Pro-ceedings of WMT, pages 12–58.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.2019. Comet: Commonsense transformers for au-tomatic knowledge graph construction. In Proceed-ings of ACL, pages 4762–4779.

Zied Bouraoui, José Camacho-Collados, and StevenSchockaert. 2020. Inducing relational knowledgefrom BERT. In Proceedings of AAAI, pages 7456–7463.

John Brown. 1958. Some tests of the decay theoryof immediate memory. Quarterly journal of experi-mental psychology, 10(1):12–21.

Tom Brown, Benjamin Mann, Nick Ryder, MelanieSubbiah, Jared D Kaplan, Prafulla Dhariwal,Arvind Neelakantan, Pranav Shyam, Girish Sastry,Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, RewonChild, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,Clemens Winter, Chris Hesse, Mark Chen, EricSigler, Mateusz Litwin, Scott Gray, Benjamin Chess,Jack Clark, Christopher Berner, Sam McCandlish,Alec Radford, Ilya Sutskever, and Dario Amodei.2020. Language models are few-shot learners. InProceedings of NeurIPS, pages 1877–1901.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve,Nicolas Usunier, Alexander Kirillov, and SergeyZagoruyko. 2020. End-to-end object detection withtransformers. In Proceedings of ECCV, pages 213–229.

Gang Chen, Maosong Sun, and Yang Liu. 2020a. To-wards a universal continuous knowledge base. arXivpreprint arXiv:2012.13568.

Liang Chen, Tianyuan Zhang, Di He, Guolin Ke, LiweiWang, and Tie-Yan Liu. 2020b. Variance-reducedlanguage pretraining via a mask proposal network.arXiv preprint arXiv:2008.05333.

Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, LawrenceCarin, and Jingjing Liu. 2020c. Graph optimal trans-port for cross-domain alignment. In Proceedings ofICML, pages 1542–1553. PMLR.

Ting Chen, Simon Kornblith, Mohammad Norouzi,and Geoffrey Hinton. 2020d. A simple frameworkfor contrastive learning of visual representations. InProceedings of ICML, pages 1597–1607.

Wenhu Chen, Yu Su, Xifeng Yan, and William YangWang. 2020e. Kgpt: Knowledge-grounded pre-training for data-to-text generation. In Proceedingsof EMNLP, pages 8635–8648.


Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr-ishna Vedantam, Saurabh Gupta, Piotr Dollár, andC Lawrence Zitnick. 2015. Microsoft coco captions:Data collection and evaluation server. arXiv preprintarXiv:1504.00325.

Xinlei Chen and Kaiming He. 2020. Exploring sim-ple siamese representation learning. arXiv preprintarXiv:2011.10566.

Yen-Chun Chen, Linjie Li, Licheng Yu, AhmedEl Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, andJingjing Liu. 2020f. Uniter: Universal image-textrepresentation learning. In Proceedings of ECCV,pages 104–120.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020a. Cross-lingualnatural language generation via pre-training. In Pro-ceedings of AAAI, pages 7570–7577.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Sak-sham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020b.Infoxlm: An information-theoretic framework forcross-lingual language model pre-training. arXivpreprint arXiv:2007.07834.

Rewon Child, Scott Gray, Alec Radford, andIlya Sutskever. 2019. Generating long se-quences with sparse transformers. arXiv preprintarXiv:1904.10509.

Krzysztof Choromanski, Valerii Likhosherstov, DavidDohan, Xingyou Song, Andreea Gane, Tamas Sar-los, Peter Hawkins, Jared Davis, Afroz Mohiuddin,Lukasz Kaiser, et al. 2021. Rethinking attentionwith performers. In Proceedings of ICLR.

Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee,and Lin-shan Lee. 2019. Speechbert: An audio-and-text jointly learned language model for end-to-end spoken question answering. arXiv preprintarXiv:1910.11559.

Kevin Clark, Urvashi Khandelwal, Omer Levy, andChristopher D Manning. 2019. What does bert lookat? an analysis of bert’s attention. In Proceedings ofBlackboxNLP, pages 276–286.

Kevin Clark, Minh-Thang Luong, Quoc V Le, andChristopher D Manning. 2020. Electra: Pre-trainingtext encoders as discriminators rather than genera-tors. In Proceedings of ICLR.

Ronan Collobert and Jason Weston. 2008. A unifiedarchitecture for natural language processing: Deepneural networks with multitask learning. In Proceed-ings of ICML, pages 160–167.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal,Vishrav Chaudhary, Guillaume Wenzek, FranciscoGuzmán, Édouard Grave, Myle Ott, Luke Zettle-moyer, and Veselin Stoyanov. 2020. Unsupervisedcross-lingual representation learning at scale. InProceedings of ACL, pages 8440–8451.

Alexis Conneau, Germán Kruszewski, Guillaume Lam-ple, Loïc Barrault, and Marco Baroni. 2018a. Whatyou can cram into a single \$&!#* vector: Probingsentence embeddings for linguistic properties. InProceedings of ACL, pages 2126–2136.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad-ina Williams, Samuel Bowman, Holger Schwenk,and Veselin Stoyanov. 2018b. Xnli: Evaluatingcross-lingual sentence representations. In Proceed-ings of EMNLP, pages 2475–2485.

Marius Cordts, Mohamed Omran, Sebastian Ramos,Timo Rehfeld, Markus Enzweiler, Rodrigo Benen-son, Uwe Franke, Stefan Roth, and Bernt Schiele.2016. The cityscapes dataset for semantic urbanscene understanding. In Proceedings of CVPR,pages 3213–3223.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin,Ziqing Yang, Shijin Wang, and Guoping Hu. 2019.Pre-training with whole word masking for chinesebert. arXiv preprint arXiv:1906.08101.

Jeff Da and Jungo Kasai. 2019. Cracking the contex-tual commonsense code: Understanding common-sense reasoning aptitude of deep contextual repre-sentations. In Proceedings of EMNLP Workshop.

Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and YongYu. 2007. Co-clustering based classification forout-of-domain documents. In Proceedings of KDD,pages 210–219.

Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and YongYu. 2008. Self-taught clustering. In Proceedings ofICML, pages 200–207.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Car-bonell, Quoc V Le, and Ruslan Salakhutdinov. 2019.Transformer-xl: Attentive language models beyonda fixed-length context. In Proceedings of ACL,pages 2978–2988.

Hal Daume III and Daniel Marcu. 2006. Domain adap-tation for statistical classifiers. JAIR, 26:101–126.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Proceedings of EMNLP-IJCNLP, pages 1173–1178.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,and Li Fei-Fei. 2009. Imagenet: A large-scale hier-archical image database. In Proceedings of CVPR,pages 248–255.

Armen Der Kiureghian and Ove Ditlevsen. 2009.Aleatory or epistemic? does it matter? Structuralsafety, 31(2):105–112.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of NAACL-HLT, pages4171–4186.


Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachan-dran, Graham Neubig, Ruslan Salakhutdinov, andWilliam W Cohen. 2020. Differentiable reasoningover a virtual knowledge base. In Proceedings ofICLR.

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng,Chang Zhou, Da Yin, Junyang Lin, Xu Zou, ZhouShao, Hongxia Yang, et al. 2021a. Cogview: Master-ing text-to-image generation via transformers. arXivpreprint arXiv:2105.13290.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang,and Jie Tang. 2019. Cognitive graph for multi-hopreading comprehension at scale. In Proceedings ofACL, pages 2694–2703.

Ming Ding, Chang Zhou, Hongxia Yang, and Jie Tang.2020. Cogltx: Applying bert to long texts. InProceedings of NeurIPS, volume 33, pages 12792–12804.

Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, RuiWang, Pengjun Xie, Ying Shen, Fei Huang, Hai-TaoZheng, and Rui Zhang. 2021b. Prototypical repre-sentation learning for relation extraction. In Pro-ceedings of ICLR.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadar-rama, Marcus Rohrbach, Subhashini Venugopalan,Kate Saenko, and Trevor Darrell. 2015. Long-termrecurrent convolutional networks for visual recogni-tion and description. In Proceedings of CVPR, pages2625–2634.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi-aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou,and Hsiao-Wuen Hon. 2019. Unified languagemodel pre-training for natural language understand-ing and generation. In Proceedings of NeurIPS.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding,Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. Allnlp tasks are generation tasks: A general pretrainingframework. arXiv preprint arXiv:2103.10360.

Dumitru Erhan, Aaron Courville, Yoshua Bengio, andPascal Vincent. 2010. Why does unsupervised pre-training help deep learning? In Proceedings of AIS-TATS, pages 201–208.

Kawin Ethayarajh. 2019. How contextual are contextu-alized word representations? comparing the geome-try of bert, elmo, and gpt-2 embeddings. In Proceed-ings of EMNLP-IJCNLP, pages 55–65.

Allyson Ettinger. 2020. What BERT is not: Lessonsfrom a new suite of psycholinguistic diagnostics forlanguage models. TACL, 8:34–48.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik.2016. Probing for semantic evidence of compositionby means of simple classification tasks. In Proceed-ings of RepEval, pages 134–139.

An Evgeniou and Massimiliano Pontil. 2007. Multi-task feature learning. In Proceedings of NeurIPS.

Theodoros Evgeniou and Massimiliano Pontil. 2004.Regularized multi–task learning. In Proceedings ofKDD, pages 109–117.

Angela Fan, Edouard Grave, and Armand Joulin. 2019.Reducing transformer depth on demand with struc-tured dropout. In Proceedings of ICLR.

William Fedus, Barret Zoph, and Noam Shazeer. 2021.Switch transformers: Scaling to trillion parametermodels with simple and efficient sparsity. arXivpreprint arXiv:2101.03961.

Thibault Févry, Livio Baldini Soares, Nicholas FitzGer-ald, Eunsol Choi, and Tom Kwiatkowski. 2020. En-tities as experts: Sparse memory access with en-tity supervision. In Proceedings of EMNLP, pages4937–4951.

Maxwell Forbes, Ari Holtzman, and Yejin Choi. 2019.Do neural language representations learn physicalcommonsense? In Proceedings of CogSci, pages1753–1759.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang,Lei Wang, and Wei Xu. 2015. Are you talking to amachine? dataset and methods for multilingual im-age question answering. In Proceedings of NeurIPS,pages 2296–2304.

Jing Gao, Wei Fan, Jing Jiang, and Jiawei Han. 2008.Knowledge transfer via multiple model local struc-ture mapping. In Proceedings of KDD, pages 283–291.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021.Making pre-trained language models better few-shotlearners. In Proceedings of ACL.

Spyros Gidaris and Nikos Komodakis. 2015. Ob-ject detection via a multi-region and semanticsegmentation-aware cnn model. In Proceedings ofICCV, pages 1134–1142.

Goran Glavaš and Ivan Vulic. 2021. Is supervised syn-tactic parsing beneficial for language understandingtasks? an empirical investigation. In Proceedings ofEACL, pages 3090–3104.

Yoav Goldberg. 2019. Assessing bert’s syntactic abili-ties. arXiv preprint arXiv:1901.05287.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, LiweiWang, and Tieyan Liu. 2019. Efficient training ofBERT by progressively stacking. In Proceedings ofICML, pages 2337–2346.

Mitchell A. Gordon, Kevin Duh, and Nicholas An-drews. 2020. Compressing BERT: studying the ef-fects of weight pruning on transfer learning. In Pro-ceedings of RepL4NLP, pages 143–155.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter No-ordhuis, Lukasz Wesolowski, Aapo Kyrola, AndrewTulloch, Yangqing Jia, and Kaiming He. 2017. Ac-curate, large minibatch sgd: Training imagenet in 1hour. arXiv preprint arXiv:1706.02677.


Yuxian Gu, Zhengyan Zhang, Xiaozhi Wang, ZhiyuanLiu, and Maosong Sun. 2020. Train no evil: Selec-tive masking for task-guided pre-training. In Pro-ceedings of EMNLP, pages 6966–6974.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, andMinlie Huang. 2020. A knowledge-enhanced pre-training model for commonsense story generation.TACL, 8:93–108.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao,Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. In Proceedings of HLT-NAACL, pages1315–1325.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrish-nan, and Pritish Narayanan. 2015. Deep learningwith limited numerical precision. In Proceedings ofICML, pages 1737–1746.

Suchin Gururangan, Ana Marasovic, SwabhaSwayamdipta, Kyle Lo, Iz Beltagy, Doug Downey,and Noah A. Smith. 2020. Don’t stop pretraining:Adapt language models to domains and tasks. InProceedings of ACL.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu-pat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training. arXivpreprint arXiv:2002.08909.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu,and Maosong Sun. 2021. Ptr: Prompt tuningwith rules for text classification. arXiv preprintarXiv:2105.11259.

Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, andRoy H Campbell. 2019. Tictac: Accelerating dis-tributed deep learning with communication schedul-ing. In Proceedings of MLSys.

Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Ji-dong Zhai, and Jie Tang. 2021. Fastmoe: A fastmixture-of-expert training system. arXiv preprintarXiv:2103.13262.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, andRoss Girshick. 2020. Momentum contrast for unsu-pervised visual representation learning. In Proceed-ings of CVPR, pages 9729–9738.

Kaiming He, Ross Girshick, and Piotr Dollár. 2019.Rethinking imagenet pre-training. In Proceedingsof ICCV, pages 4918–4927.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. 2016. Deep residual learning for image recog-nition. In Proceedings of CVPR, pages 770–778.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,Bernhard Nessler, and Sepp Hochreiter. 2017. Ganstrained by a two time-scale update rule converge toa local nash equilibrium. Advances in neural infor-mation processing systems, 30.

John Hewitt and Christopher D. Manning. 2019. Astructural probe for finding syntax in word repre-sentations. In Proceedings of NAACL-HLT, pages4129–4138.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014.Distilling the knowledge in a neural network. In Pro-ceedings of NeurIPS.

Geoffrey E Hinton, Simon Osindero, and Yee-WhyeTeh. 2006. A fast learning algorithm for deep beliefnets. Neural Computation, 18(7):1527–1554.

Jeremy Howard and Sebastian Ruder. 2018. Universallanguage model fine-tuning for text classification. InProceedings of ACL, pages 328–339.

Phu Mon Htut, Jason Phang, Shikha Bordia, andSamuel R Bowman. 2019. Do attention heads inbert track syntactic dependencies? arXiv preprintarXiv:1911.12246.

Shengding Hu, Ning Ding, Huadong Wang, ZhiyuanLiu, Juanzi Li, and Maosong Sun. 2021. Knowl-edgeable prompt-tuning: Incorporating knowledgeinto prompt verbalizer for text classification. arXivpreprint arXiv:2108.02035.

Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020a.Swapadvisor: Pushing deep learning beyond the gpumemory limit via smart swapping. In Proceedingsof ASPLOS, page 1341–1355.

Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong,Linjun Shou, Daxin Jiang, and Ming Zhou. 2019a.Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Pro-ceedings of EMNLP-IJCNLP, pages 2485–2494.

Haoyang Huang, Lin Su, Di Qi, Nan Duan, EdwardCui, Taroon Bharti, Lei Zhang, Lijuan Wang, Jian-feng Gao, Bei Liu, et al. 2020b. M3p: Learn-ing universal representations via multitask multi-lingual multimodal pre-training. arXiv preprintarXiv:2006.02635.

Yanping Huang, Youlong Cheng, Ankur Bapna, OrhanFirat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee,Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019b.Gpipe: Efficient training of giant neural networks us-ing pipeline parallelism. In Proceedings of NeurIPS,pages 103–112.

Drew A Hudson and Christopher D Manning. 2019.Gqa: A new dataset for real-world visual reasoningand compositional question answering. In Proceed-ings of CVPR, pages 6700–6709.

Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu,Yizhao Gao, Guoxing Yang, Jingyuan Wen, HengZhang, Baogui Xu, Weihao Zheng, et al. 2021.Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprintarXiv:2103.06561.


Sergey Ioffe and Christian Szegedy. 2015. Batch nor-malization: Accelerating deep network training byreducing internal covariate shift. In Proceedings ofICML, pages 448–456.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan,and Geoffrey E Hinton. 1991. Adaptive mixtures oflocal experts. Neural Computation, 3:79–87.

Max Jaderberg, Karen Simonyan, Andrew Zisserman,and Koray Kavukcuoglu. 2015. Spatial transformernetworks. In Proceedings of NeurIPS, pages 2017–2025.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah.2019a. What does BERT learn about the structureof language? In Proceedings of ACL, pages 3651–3657.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah.2019b. What does bert learn about the structureof language? In Proceedings of ACL, pages 3651–3657.

Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Be-yond data and model parallelism for deep neural net-works. In Proceedings of MLSys.

Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, YongCui, and Chuanxiong Guo. 2020a. A unified archi-tecture for accelerating distributed DNN training inheterogeneous gpu/cpu clusters. In Proceedings ofOSDI, pages 463–479.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and GrahamNeubig. 2020b. How can we know what languagemodels know. TACL, 8:423–438.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang,Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.2019. Tinybert: Distilling bert for natural languageunderstanding. In Proceedings of EMNLP, pages4163–4174.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and PeterSzolovits. 2020. Is bert really robust? a strong base-line for natural language attack on text classifica-tion and entailment. In Proceedings of AAAI, pages8018–8025.

Justin Johnson, Andrej Karpathy, and Li Fei-Fei.2016. Densecap: Fully convolutional localizationnetworks for dense captioning. In Proceedings ofCVPR, pages 4565–4574.

Rie Johnson and Tong Zhang. 2005. A high-performance semi-supervised learning method fortext chunking. In Proceedings of ACL, pages 1–9.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld,Luke Zettlemoyer, and Omer Levy. 2020. Spanbert:Improving pre-training by representing and predict-ing spans. TACL, 8:64–77.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blun-som. 2014. A convolutional neural network for mod-elling sentences. In Proceedings of ACL, pages 655–665.

Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, and Hung-Yi Lee. 2020. Furtherboosting bert-based models by duplicating exist-ing layers: Some intriguing phenomena inside bert.arXiv preprint arXiv:2001.09309.

Jared Kaplan, Sam McCandlish, Tom Henighan,Tom B. Brown, Benjamin Chess, Rewon Child,Scott Gray, Alec Radford, Jeffrey Wu, and DarioAmodei. 2020. Scaling laws for neural languagemodels. arXiv preprint arXiv:2001.08361.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap-pas, and François Fleuret. 2020. Transformers arernns: Fast autoregressive transformers with linear at-tention. In Proceedings of ICML, pages 5156–5165.

Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, andMinlie Huang. 2020. Sentilare: Linguistic knowl-edge enhanced language representation for senti-ment analysis. In Proceedings of EMNLP, pages6975–6988.

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language modelsaware of phrases? simple but strong baselines forgrammar induction. In Proceedings of ICLR.

Yoon Kim. 2014. Convolutional neural networks forsentence classification. In Proceedings of EMNLP,pages 1746–1751.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutionalnetworks. In Proceedings of ICLR.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya.2020. Reformer: The efficient transformer. In Pro-ceedings of ICLR.

Arne Köhn. 2015. What’s in an embedding? analyz-ing word embeddings through multilingual evalua-tion. In Proceedings of EMNLP, pages 2067–2073.

Lingpeng Kong, Cyprien de Masson d’Autume, Lei Yu,Wang Ling, Zihang Dai, and Dani Yogatama. 2020.A mutual information maximization perspective oflanguage representation learning. In Proceedings ofICLR.

Olga Kovaleva, Alexey Romanov, Anna Rogers, andAnna Rumshisky. 2019. Revealing the dark se-crets of BERT. In Proceedings of EMNLP-IJCNLP,pages 4364–4373.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John-son, Kenji Hata, Joshua Kravitz, Stephanie Chen,Yannis Kalantidis, Li-Jia Li, David A Shamma, et al.2017. Visual genome: Connecting language and vi-sion using crowdsourced dense image annotations.IJCV, 123:32–73.


Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-ton. 2012. ImageNet classification with deep con-volutional neural networks. In Proceedings ofNeurIPS, pages 1097–1105.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Proceedings ofNeurIPS.

Guillaume Lample, Alexandre Sablayrolles,Marc’Aurelio Ranzato, Ludovic Denoyer, andHervé Jégou. 2019. Large memory layers withproduct keys. In Proceedings of NeurIPS, pages8546–8557.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman,Kevin Gimpel, Piyush Sharma, and Radu Soricut.2019. Albert: A lite bert for self-supervised learn-ing of language representations. In Proceedings ofICLR.

Neil D Lawrence and John C Platt. 2004. Learning tolearn with the informative vector machine. In Pro-ceedings of ICML.

Yann A LeCun, Léon Bottou, Genevieve B Orr, andKlaus-Robert Müller. 2012. Efficient backprop. InNeural networks: Tricks of the trade, pages 9–48.Springer.

Chen-Yu Lee, Saining Xie, Patrick Gallagher,Zhengyou Zhang, and Zhuowen Tu. 2015. Deeply-supervised nets. In Proceedings of AISTATS, pages562–570.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim,Donghyeon Kim, Sunkyu Kim, Chan Ho So, andJaewoo Kang. 2020. Biobert: a pre-trained biomed-ical language representation model for biomedicaltext mining. Bioinformatics, 36(4):1234–1240.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Ko-siorek, Seungjin Choi, and Yee Whye Teh. 2019.Set transformer: A framework for attention-basedpermutation-invariant neural networks. In Proceed-ings of ICML, pages 3744–3753.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu,Dehao Chen, Orhan Firat, Yanping Huang, MaximKrikun, Noam Shazeer, and Zhifeng Chen. 2021.Gshard: Scaling giant models with conditional com-putation and automatic sharding. In Proceedings ofICLR.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021.The power of scale for parameter-efficient prompttuning. arXiv preprint arXiv:2104.08691.

Mike Lewis, Yinhan Liu, Naman Goyal, Mar-jan Ghazvininejad, Abdelrahman Mohamed, OmerLevy, Veselin Stoyanov, and Luke Zettlemoyer.2020a. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation,and comprehension. In Proceedings of ACL, pages7871–7880.

Patrick Lewis, Ethan Perez, Aleksandara Piktus, FabioPetroni, Vladimir Karpukhin, Naman Goyal, Hein-rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-täschel, et al. 2020b. Retrieval-augmented genera-tion for knowledge-intensive nlp tasks. In Proceed-ings of NeurIPS, pages 9459–9474.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, andDaxin Jiang. 2020a. Unicoder-vl: A universal en-coder for vision and language by cross-modal pre-training. In Proceedings of AAAI, pages 11336–11344.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue,and Xipeng Qiu. 2020b. BERT-ATTACK: Adversar-ial attack against bert using bert. In Proceedings ofEMNLP, pages 6193–6202.

Linyang Li and Xipeng Qiu. 2021. Token-aware virtualadversarial training in natural language understand-ing. In Proceedings of AAAI, pages 8410–8418.

Linyang Li, Yunfan Shao, Demin Song, Xipeng Qiu,and Xuanjing Huang. 2020c. Generating adversar-ial examples in chinese texts using sentence-pieces.arXiv preprint arXiv:2012.14769.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-JuiHsieh, and Kai-Wei Chang. 2019. VisualBERT: Asimple and performant baseline for vision and lan-guage. arXiv preprint arXiv:1908.03557.

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar,Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith,Brian Vaughan, Pritam Damania, and Soumith Chin-tala. 2020d. Pytorch distributed: Experiences on ac-celerating data parallel training. In Proceedings ofPVLDB, page 3005–3018.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xi-aowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu,Li Dong, Furu Wei, et al. 2020e. Oscar: Object-semantics aligned pre-training for vision-languagetasks. In Proceedings of ECCV, pages 121–137.

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, DanyangZhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021.Terapipe: Token-level pipeline parallelism for train-ing large-scale language models. arXiv preprintarXiv:2102.07988.

Tianyang Lin, Yuxin Wang, Xiangyang Liu, andXipeng Qiu. 2021. A survey of transformers. arXivpreprint arXiv:2106.04554.

Tsung-Yi Lin, Michael Maire, Serge Belongie, JamesHays, Pietro Perona, Deva Ramanan, Piotr Dollár,and C Lawrence Zitnick. 2014. Microsoft coco:Common objects in context. In Proceedings ofECCV, pages 740–755.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019.Open sesame: Getting inside bert’s linguistic knowl-edge. In Proceedings of BlackboxNLP, pages 241–253.


Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT, pages 1073–1094.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of IJCAI, pages 2873–2879.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020a. K-bert: Enabling language representation with knowledge graph. In Proceedings of AAAI, pages 2901–2908.

Xiao Liu, Da Yin, Xingjian Zhang, Kai Su, Kan Wu, Hongxia Yang, and Jie Tang. 2021a. Oag-bert: Pre-train heterogeneous entity-augmented academic language model. arXiv preprint arXiv:2103.02410.

Xiao Liu, Fanjin Zhang, Zhenyu Hou, Zhaoyu Wang, Li Mian, Jing Zhang, and Jie Tang. 2020b. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. Gpt understands, too. arXiv preprint arXiv:2103.10385.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020c. Multilingual Denoising Pre-training for Neural Machine Translation. TACL, 8:726–742.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020d. Roberta: A robustly optimized bert pretraining approach. In Proceedings of ICLR.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021c. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of CVPR, pages 3431–3440.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of NeurIPS Reproducibility Challenge.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In Proceedings of CVPR, pages 10437–10446.

Christopher D Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. PNAS, 117(48):30046–30054.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of NeurIPS, pages 6294–6305.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of CoNLL, pages 51–61.

Alessio Miaschi and Felice Dell’Orletta. 2020. Contextual and non-contextual word embeddings: an in-depth linguistic investigation. In Proceedings of RepL4NLP, pages 110–119.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Proceedings of NeurIPS, pages 14014–14024.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2018. Mixed precision training. In Proceedings of ICLR.

Lilyana Mihalkova, Tuyen Huynh, and Raymond J Mooney. 2007. Mapping and revising markov logic networks for transfer learning. In Proceedings of AAAI, pages 608–614.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of ICLR Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS.

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pages 746–751.

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of SOSP.

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on gpu clusters. arXiv preprint arXiv:2104.04473.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial nli: A new benchmark for natural language understanding. In Proceedings of ACL, pages 4885–4901.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of ACL, pages 4658–4664.

Even Oldridge, J. Perez, Ben Frederickson, Nicolas Koumchatzky, M. Lee, Z.-H. Wang, Lei Wu, F. Yu, Rick Zamora, O. Yılmaz, Alec M. Gunny, Vinh Phu Nguyen, and S. Lee. 2020. Merlin: A gpu accelerated recommendation framework. In Proceedings of IRS.

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Proceedings of NeurIPS, pages 1143–1151.

Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. arXiv preprint arXiv:2012.15674.

Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE TKDE, 22(10):1345–1359.

Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. 2020. Rethinking softmax cross-entropy loss for adversarial robustness. In Proceedings of ICLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of NeurIPS.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. 2021. Random feature attention. In Proceedings of ICLR.

Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A generic communication scheduler for distributed dnn training acceleration. In Proceedings of SOSP, pages 16–29.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of EMNLP-IJCNLP, pages 43–54.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP, pages 2463–2473.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of ACL, pages 4996–5001.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of ICCV, pages 2641–2649.

Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. In Proceedings of ICLR.

Nina Pörner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: efficient-yet-effective entity embeddings for BERT. In Proceedings of EMNLP, pages 803–818.

Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning. In Proceedings of EMNLP, pages 3208–3229.

Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966.

Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, and Jie Zhou. 2021. Erica: Improving entity and relation understanding for pre-trained language models via contrastive learning. In Proceedings of ACL.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63:1872–1897.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. OpenAI Blog.

Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training. OpenAI Blog.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21:1–67.

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. 2007. Self-taught learning: transfer learning from unlabeled data. In Proceedings of ICML, pages 759–766.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of SC.

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD, pages 3505–3506.

Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. Zero-offload: Democratizing billion-scale model training. arXiv preprint arXiv:2101.06840.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE PAMI, 39(6):1137–1149.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of EMNLP, pages 5418–5426.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of EACL.

Rudolf Rosa and David Marecek. 2019. Inducing syntactic trees from bert representations. arXiv preprint arXiv:1906.11511.

Corby Rosset, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. TACL, 9:53–68.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In Proceedings of NeurIPS.

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. 2019. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of ICML, pages 5628–5637.

Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.

Marten Van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn’t buy quality syntax with neural language models. In Proceedings of EMNLP-IJCNLP, pages 5830–5836.

Michael Sejr Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of ESWC, pages 593–607.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. 2018. Adversarially robust generalization requires more data. In Proceedings of NeurIPS.

Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. 2014. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of ICLR.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, pages 2556–2565.

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Proceedings of NeurIPS.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of ICLR.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020a. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of AAAI, pages 8815–8821.

Tianxiao Shen, Victor Quach, Regina Barzilay, and Tommi Jaakkola. 2020b. Blank language models. In Proceedings of EMNLP, pages 5186–5198.

Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. 2017. ZhuSuan: A library for Bayesian deep learning. arXiv preprint arXiv:1709.05870.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of EMNLP, pages 1526–1534.

Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of EMNLP, pages 4222–4235.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

Chenglei Si, Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Qun Liu, and Maosong Sun. 2020. Better robustness by more coverage: Adversarial training with mixup augmentation for robust fine-tuning. arXiv preprint arXiv:2012.15699.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML, pages 5926–5936.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. In Proceedings of NeurIPS, pages 16857–16867.

Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. 2020. And the bit goes down: Revisiting the quantization of neural networks. In Proceedings of ICLR.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. Vl-bert: Pre-training of generic visual-linguistic representations. In Proceedings of ICLR.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019a. Videobert: A joint model for video and language representation learning. In Proceedings of ICCV, pages 7464–7473.

Haitian Sun, Pat Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, and William W Cohen. 2021. Reasoning over virtual knowledge bases with open predicate relations. arXiv preprint arXiv:2102.07043.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019b. Patient knowledge distillation for bert model compression. In Proceedings of EMNLP-IJCNLP, pages 4323–4332.

Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, and Zheng Zhang. 2020. Colake: Contextualized language and knowledge embedding. In Proceedings of COLING, pages 3660–3670.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019c. Ernie: Enhanced representation through knowledge integration. In Proceedings of ACL, pages 1441–1451.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019d. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NeurIPS, pages 3104–3112.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of CVPR, pages 1–9.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP, pages 5103–5114.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? probing for sentence structure in contextualized word representations. In Proceedings of ICLR.

Sebastian Thrun and Lorien Pratt. 1998. Learning to learn: Introduction and overview. Springer Science & Business Media.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR.

Pat Verga, Haitian Sun, Livio Baldini Soares, and William W Cohen. 2020. Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge. arXiv preprint arXiv:2007.00849.

David Vilares, Michalina Strzyz, Anders Søgaard, and Carlos Gómez-Rodríguez. 2020. Parsing as pretraining. In Proceedings of AAAI, pages 9114–9121.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of CVPR, pages 3156–3164.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of ACL, pages 5797–5808.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of EMNLP-IJCNLP, pages 2153–2162.

Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019b. Trick me if you can: Human-in-the-loop generation of adversarial examples for question answering. TACL, 7:387–401.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019c. Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of EMNLP-IJCNLP, pages 5306–5314.

Chenguang Wang, Xiao Liu, and Dawn Song. 2020a. Language models are open knowledge graphs. arXiv preprint arXiv:2010.11967.

Dong Wang, Ning Ding, Piji Li, and Hai-Tao Zheng. 2021a. Cline: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of ACL.

Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual attention network for image classification. In Proceedings of CVPR, pages 3156–3164.

Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of EuroSys.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020b. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020c. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020d. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of NeurIPS.

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021b. Kepler: A unified model for knowledge embedding and pre-trained language representation. TACL, 9:176–194.

Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020e. A large-scale chinese short-text conversation dataset. In Proceedings of NLPCC.

Zheng Wang, Yangqiu Song, and Changshui Zhang. 2008. Transferred dimensionality reduction. In Proceedings of ECML-PKDD, pages 550–565.

Ziyu Wang, Bin Dai, David Wipf, and Jun Zhu. 2020f. Further analysis of outlier detection with deep generative models. In Proceedings of NeurIPS.

Alex Warstadt and Samuel R. Bowman. 2020. Can neural networks acquire a structural bias from raw linguistic data? In Proceedings of CogSci.

Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. 2019. Nezha: Neural contextualized representation for chinese language understanding. arXiv preprint arXiv:1909.00204.

Xiangpeng Wei, Yue Hu, Rongxiang Weng, Luxi Xing, Heng Yu, and Weihua Luo. 2021. On learning universal representations across languages. In Proceedings of ICLR.

Charles M Wharton, Keith J Holyoak, Paul E Downing, Trent E Lange, Thomas D Wickens, and Eric R Melz. 1994. Below the surface: Analogical similarity and retrieval competition in reminding. Cognitive Psychology, 26:64–101.

Chris Williams, Edwin V Bonilla, and Kian M Chai. 2007. Multi-task gaussian process prediction. In Proceedings of NeurIPS, pages 153–160.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of CVPR, pages 3733–3742.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of ACL, pages 4166–4176.

Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, and Ming Zhou. 2020. Xgpt: Cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of ICML, pages 2397–2406.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of ICLR.

Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2021. How neural networks extrapolate: From feedforward to graph neural networks. In Proceedings of ICLR.

Jian Yang, Shuming Ma, D. Zhang, Shuangzhi Wu, Zhoujun Li, and M. Zhou. 2020. Alternating language modeling for cross-lingual pre-training. In Proceedings of AAAI, pages 9386–9393.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS.

Yuan Yao, Haoxi Zhong, Zhengyan Zhang, Xu Han, Xiaozhi Wang, Chaojun Xiao, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. 2021. Adversarial language games for advanced natural language intelligence. In Proceedings of AAAI.

Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training bert in 76 minutes. In Proceedings of ICLR.

Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of ICML.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8bert: Quantized 8bit bert. In Proceedings of NeurIPS.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. In Proceedings of NeurIPS, pages 17283–17297.

Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. 2020. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of ACL, pages 6066–6080.

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, et al. 2021. Pangu-alpha: Large-scale autoregressive pretrained chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.

Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, et al. 2019a. Oag: Toward linking large-scale heterogeneous entity graphs. In Proceedings of KDD, pages 2585–2595.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020a. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of ICML, pages 11328–11339.

Minjia Zhang and Yuxiong He. 2020. Accelerating training of transformer-based language models with progressive layer dropping. In Proceedings of NeurIPS, pages 14011–14023.

Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu. 2020b. Ternarybert: Distillation-aware ultra-low bit bert. In Proceedings of EMNLP, pages 509–521.

Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng, Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao Han, Yang Liu, Xiaoyan Zhu, and Maosong Sun. 2021a. Cpm-2: Large-scale cost-efficient pre-trained language models.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019b. Ernie: Enhanced language representation with informative entities. In Proceedings of ACL, pages 1441–1451.

Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, et al. 2020c. Cpm: A large-scale generative chinese pre-trained language model. arXiv preprint arXiv:2012.00413.

Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, and Maosong Sun. 2021b. Know what you don’t need: Single-Shot Meta-Pruning for attention heads. AI Open, 2:36–42.

Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, and Maosong Sun. 2021c. Red alarm for pre-trained models: Universal vulnerabilities by neuron-level backdoor attacks. arXiv preprint arXiv:2101.06969.

Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of ICCV, pages 1529–1537.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020a. Unified vision-language pre-training for image captioning and vqa. In Proceedings of AAAI, pages 13041–13049.

Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. 2020b. Evaluating commonsense in pre-trained language models. In Proceedings of AAAI, pages 9733–9740.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of ICCV, pages 19–27.

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. 2020. Rethinking pre-training and self-training. In Proceedings of NeurIPS.

Xu Zou, Da Yin, Qingyang Zhong, Hongxia Yang, Zhilin Yang, and Jie Tang. 2021. Controllable generation from pre-trained language models via inverse prompting. arXiv preprint arXiv:2103.10685.
