The Deep Natural Language Papers You Need to Read — BERT, GPT-2 and looking forward

  • “Left to right” language modeling — i.e. predict the next word in a sequence
  • Bidirectional or “masked” language modeling — predict a word or sequence of words, with context on either side
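To make the difference concrete, here is a minimal sketch of how training examples could be built for each objective. The function names, the `[MASK]` token and the 15% masking rate are my assumptions for illustration; real BERT-style masking has more rules than this.

```python
import random

def left_to_right_examples(tokens):
    """Build (context, target) pairs: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; the model sees context on both sides.

    Returns the corrupted sequence and a {position: original_token} dict.
    The 15% rate and the mask token are illustrative assumptions.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets
```

The left-to-right version only ever conditions on the prefix, while the masked version keeps the tokens on both sides of each gap visible.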

Do I have your attention?

Both the GPT-2-type and the BERT-type models are based on word-piece token encoding and a multi-layer Transformer architecture. The Transformer was introduced in Google’s “Attention Is All You Need” paper, and can be summarized as:

  • learned vocabulary embeddings for ~30,000–50,000 word-piece tokens [capable of encoding any English text, and some emoji]
  • fixed or learned positional embedding [encoding the order of the tokens]
  • residual self-attention layers, consisting of multiple attention heads
  • masking, to prevent look-ahead — in the case of left to right models
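The attention and masking bullets above can be sketched in a few lines. This is a toy, single-head version under stated assumptions: queries, keys and values are the raw input vectors (no learned projections, no multiple heads, no layer norm), and `self_attention` is a name I made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over a list of vectors.

    Simplification: queries, keys and values are the inputs themselves.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # With causal masking, position i may only attend to positions <= i.
        limit = i + 1 if causal else len(x)
        keys = x[:limit]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, keys)) for j in range(d)]
        # Residual connection: add the attended vector back onto the input.
        out.append([a + b for a, b in zip(q, attended)])
    return out
```

With `causal=True`, changing a later token cannot change the output at an earlier position, which is exactly the "no look-ahead" property left-to-right models need.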

Most comprehensive hyper-parameter search award

“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, i.e. T5, from Google AI

  • Encoder-decoder setup, instead of BERT’s encoder-only one. This means the model handles translation, summarization, and multi-token answers for cloze tasks much more gracefully than BERT.
  • Comprehensive study of, and support for, multi-word masking [imo this is very important, as BERT can’t directly produce multi-word cloze answers without hacks and modification; yes, most questions have one-word answers in a large vocabulary, but many do not, including the most significant answers in your niche downstream problem]
  • Multi-task finetuning by defining all downstream tasks as Q&A. It’s not better, but also not worse (and you don’t need a new architecture add-on for every new task!)
  • Definitely shows that encoder-decoder is better than encoder-only or decoder-only across a basket of downstream tasks. I am very convinced.
  • Lots of experiments around finetuning methodology, so you don’t have to.
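To illustrate the multi-word masking point above, here is a rough sketch of turning a sentence into a text-to-text (input, target) pair by replacing whole spans with sentinel tokens. The `corrupt_spans` helper and the `<extra_id_N>` sentinel names are assumptions for illustration (the naming follows common open-source implementations, not necessarily the paper's notation), and the spans are chosen by hand rather than sampled.

```python
def corrupt_spans(tokens, spans):
    """Replace each half-open (start, end) span with a sentinel token.

    `spans` must be sorted and non-overlapping. The target lists each
    sentinel followed by the text it replaced, then a closing sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
source, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
# source: "Thank you <extra_id_0> me to your party <extra_id_1> week ."
# target: "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```

Because the decoder writes out the target as free text, a two-word gap like "for inviting" costs nothing extra, which is the part an encoder-only model struggles with.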

We’re gonna need a bigger batch award

“RoBERTa: A Robustly Optimized BERT Pretraining Approach” from Facebook Research (FAIR)

  • training on cleaner data
  • training for longer
  • using bigger batches (thus more stable gradients?)
  • removing BERT’s auxiliary non-LM sentence-comparison objective
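On the "more stable gradients?" question in the batch-size bullet, a toy simulation gives some intuition. This is not RoBERTa's setup: I assume each per-example gradient is just a noisy scalar around a true value of 1.0, and `gradient_noise` is a hypothetical helper that measures how much the batch average wobbles.

```python
import random
import statistics

def gradient_noise(batch_size, trials=200, seed=0):
    """Standard deviation of the mini-batch mean gradient across many batches.

    Toy assumption: per-example gradients are Gaussian with mean 1.0, std 1.0.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean([rng.gauss(1.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

small = gradient_noise(16)
large = gradient_noise(256)
```

Under this assumption the noise shrinks roughly with the square root of the batch size, so `large` should come out around a quarter of `small`. Whether that explains RoBERTa's gains is a separate question.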

Robustness without changing the loss award

“BPE-Dropout: Simple and Effective Subword Regularization” from Yandex Research 🇷🇺
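As I understand the idea, the model and loss stay the same and only the tokenizer gets noisy: while segmenting a word, each candidate merge is randomly skipped with some probability, so the same word can come out as different sub-word sequences. A minimal sketch, where the merge table, the `bpe_segment` name and the leftmost-first merge order are my own simplifications:

```python
import random

def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` with a ranked merge table.

    Each candidate merge is skipped with probability `dropout`;
    dropout=0.0 gives plain deterministic BPE.
    """
    rng = rng or random.Random(0)
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Merges available in the current segmentation that survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge table
plain = bpe_segment("lower", merges)             # ['low', 'er']
noisy = bpe_segment("lower", merges, dropout=0.5, rng=random.Random(1))
```

Whatever the dropout rate, the pieces always concatenate back to the original word; only the split points vary.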

Best human imitation award

“Defending Against Neural Fake News” aka GROVER, from Allen Institute/University of Washington

GROVER conditional generation language model
  • vocabulary, sentence length, internal consistency

Most compelling use of Reddit award

“CTRL: A Conditional Transformer Language Model for Controllable Generation” from Salesforce Research

“Good to know” award

“Language Models as Knowledge Bases?” from FAIR and University College London

T-REx test questions generated from Wikipedia links.

Best parameter washing machine award

“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations” from Google Research.

Best 80/20 Pareto optimization award

“Parameter-Efficient Transfer Learning for NLP” from Google Research.

Best original abstraction award

“XLNet: Generalized Autoregressive Pretraining for Language Understanding” from Carnegie Mellon and Google Research

  • relative positional embeddings (instead of absolute position for each token)
  • extending attention queries across previous (not trainable) token sequences
  • using permutations of the prediction order, so that language modeling can look ahead to future tokens
XLNet points out the biggest flaw in BERT — conditional generation of multiple tokens
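The permutation bullet is easier to see as an attention mask. Here is a simplified sketch: sample a random prediction order, then let each position attend only to positions that come earlier in that order, wherever they sit in the sentence. The `permutation_mask` name is hypothetical, and XLNet's actual two-stream attention is more involved than this.

```python
import random

def permutation_mask(n, seed=0):
    """Return (order, mask) for a sequence of length n.

    `order` is a random prediction order over positions. mask[i][j] is True
    when position i may attend to position j, i.e. j is predicted before i.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    step = {pos: t for t, pos in enumerate(order)}
    mask = [[step[j] < step[i] for j in range(n)] for i in range(n)]
    return order, mask

order, mask = permutation_mask(5)
```

The last position in `order` sees every other token, including ones to its right, yet the model is still autoregressive with respect to the sampled order.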

Most creative use of backpropagation award

“Universal Adversarial Triggers for Attacking and Analyzing NLP” from Allen Institute

You need me on that wall! award

“Single Headed Attention RNN: Stop Thinking With Your Head” by S. Merity

Predictions for 2020

I would be remiss not to make a few predictions for what will happen next. By the time you read this, some of this may have already happened, or even been published. Don’t @ me.

  • insert, delete, replace
  • move a sentence

More papers.

Here are a few other NLP papers that I’ve read and liked. Perhaps they deserve a longer writeup, but that would mean an overly long story, or bumping XLNet, Smerity, etc. Who wants that? Noting them below, with a link and a key insight. All of these papers are worth reading, especially if the insight seems relevant to your problem.

  • 64 layer Transformer — still probably the deepest we’ve seen (character level not word level, so a bit different)
  • Key insight: loss is applied after every layer, not just the final output. Those intermediate losses are then decayed over training. Good way to initialize the network and avoid instability.
  • First work to train models up to 8 billion parameters (since superseded by T5’s 11 billion). Support for BERT and GPT-2, in PyTorch.
  • Key insight: uses model parallelism to split each attention layer matrix into several sub-matrix operations, across many GPUs. Really effective (and pretty simple once you know it works) way to train huge models efficiently on a cluster.
  • SOTA Transformers use many attention heads (usually 4–8 but sometimes more). This makes the models memory-bound on modern hardware, especially during decoding [inference]. What if we could compute a single attention head and apply a different transformation to it N times, instead of using N attention heads?
  • Key insight: GPU/TPU memory is the bottleneck for large Transformer models. Anyone who trains these models knows that. Here’s another idea to reduce memory by a lot, without giving back much in test accuracy. Stay tuned.
  • A deep look at text summarization — a task that’s difficult, important, but also somewhat subjective [more so than translation, Q&A, or perhaps even style rewriting]. The authors use RL and “humans in the loop” to teach their model summarization styles that users prefer.
  • Key insights: mostly that this is possible, but also that it’s hard. Especially when “good summarization” is ambiguous even to the human labelers.
  • The work comes off a bit as half-baked and less scientific than most of the papers here. However, these fuzzy problems may be more relevant to real world applications that you may want to optimize than something more objective.
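Going back to the model-parallelism insight in the list above (splitting each layer's matrix into several sub-matrix operations across GPUs), the core trick can be shown without any GPUs at all. This is a toy sketch under my own assumptions: the "devices" are just a Python loop, the weight matrix is split by columns, and the column count must divide evenly by the number of parts.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split weight matrix `w` into `parts` equal column blocks, one per device."""
    size = len(w[0]) // parts
    return [[row[p * size:(p + 1) * size] for row in w] for p in range(parts)]

def parallel_matmul(x, w, parts):
    # Each shard would live on its own GPU; here we simply loop over them.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the per-device outputs along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
result = parallel_matmul(x, w, parts=2)  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each device only ever holds its own slice of the weights, and the sharded result matches the single-device product exactly, which is why the approach is so appealing once you know it works.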



AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami

Nikolai Yakovenko
