ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al.)
BART (Lewis et al.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al.)
DeBERTa: Decoding-enhanced BERT with Disentangled Attention (He et al.)
DeBERTa V2 (He et al.)
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (Sanh et al.)
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al.)
FNet: Mixing Tokens with Fourier Transforms (Lee-Thorp et al.)
GPT-2 (Radford et al.)
GPT-J
GPT-Neo
Longformer: The Long-Document Transformer (Beltagy et al.)
LongT5 (Efficient Text-To-Text Transformer for Long Sequences)
M2M-100 (Fan et al.)
Marian
MBart (Liu et al.)
MobileBERT (A Compact Task-agnostic BERT for Resource-Limited Devices)
GPT (Radford et al.)
Pegasus (Zhang et al.)
ProphetNet (Predicting Future N-gram for Sequence-to-Sequence Pre-training)
Reformer: The Efficient Transformer (Kitaev et al.)
RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al.)
T5 (Text-To-Text Transfer Transformer)
XLNet (Generalized Autoregressive Pretraining for Language Understanding)
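
All of the architectures above are available as pretrained checkpoints in the Hugging Face `transformers` library. As a minimal sketch, and assuming this list describes models consumed through that library's Auto classes (an assumption; the hosting library is not named here), any of them can be loaded from a checkpoint name in the same way. The checkpoint `bert-base-uncased` below is only an illustrative example.

```python
# Minimal sketch: loading one of the listed architectures via the
# Hugging Face `transformers` Auto classes (assumption: this list refers
# to models available through that library).
from transformers import AutoTokenizer, AutoModel

checkpoint = "bert-base-uncased"  # example checkpoint; any listed architecture works analogously
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Encode a sentence and run a forward pass to get contextual embeddings.
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```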