The Deep Natural Language Papers You Need to Read — BERT, GPT-2 and looking forward

Nikolai Yakovenko
28 min read · Dec 3, 2019


Last week I left NVIDIA’s applied deep learning research group, after nearly three years in Santa Clara. As well as starting new projects, thought I’d get back to writing on topics people keep asking me about.

One question I get — these Transformer-type language models are great, they seem to be setting records on language understanding benchmarks every week. But there are so many versions, so many papers. What do I need to know?

Good news! You can keep up with this field, very effectively, reading a paper every couple of weeks, or a few papers a month if you want to be thorough. You just have to know which ones. More on how I choose papers later.

As of December 2019, this is all you need to know — what can be done, and where it might be heading.

First, a quick summary of language modeling, and why any of this matters. The language modeling objective is predicting missing words, given context. There are two main flavors:

  • “Left to right” language modeling — i.e. predict the next word in a sequence
  • Bidirectional or “masked” language modeling — predict a word or sequence of words, with context on either side

The main benefit of left to right LM is that it’s tailor made for generating sequences — like news stories, tweets, answers to interview questions, or Cards Against Humanity. OpenAI’s GPT-2 model is most closely associated with left to right LM, and it is probably the most inspiring to people interested in AGI, or anthropological computing.

Google’s BERT, the first and best-known “masked” language model, is however the architecture beating every NLP benchmark, and the one being used to revamp both Google Search and Microsoft’s Bing over the past year.

The reason is obvious: two directions are better than one. You won’t do nearly as well on problems like finding answers in text, synonym matching, text editing, etc., if your model is hard-wired to encode information at any point of the sequence only from the words that came before. And even where a left to right model can do the job, it would be less efficient: using more trainable parameters for the same level of downstream accuracy.

Do I have your attention?

Both the GPT2-type and the BERT-type models are based on word-piece token encoding and a multi-layer Transformer architecture. The Transformer was introduced in Google’s Attention Is All You Need paper, and can be summarized as:

  • learned vocabulary embeddings for ~30,000–50,000 word-piece tokens [capable of encoding any English text, and some emoji]
  • fixed or learned positional embedding [encoding the order of the tokens]
  • residual self-attention layers, consisting of multiple attention heads
  • masking, to prevent look-ahead — in the case of left to right models

Attention — computing a weighted average over outputs of the previous layer — had been used for recurrent language models (usually LSTMs), as well as over convolutional language models (such as QRNN or quasi-RNN). This seminal paper showed that those extra components were not necessary for good results, and that “attention is all you need.” This approach is memory-intensive (more on that below) but fully parallelizable, scalable, and relatively simple.
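As a rough illustration of "a weighted average over outputs of the previous layer", here is a single-head sketch in plain Python. It assumes the standard scaled dot-product form and skips the learned query/key/value projections, multiple heads and residual connections; the `causal` flag stands in for the look-ahead masking used by left to right models.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal masking, position i may only look at positions 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(x, x, x, causal=True))
```

Every position attends to every visible position, which is where the memory cost comes from: the score table grows with the square of the sequence length.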

Since the original paper came out in mid-2017, variants of the Transformer architecture have been scaled to models with billions of parameters.

It’s amazing to think that, not long ago (less than two years!) a model with 100 million parameters was considered big. Image recognition models based on the ResNet architecture operated in that range of tens of millions of parameters. Now models like “BERT large” at ~340 million parameters seem small compared to the biggest Transformers.

I thought we’d hit diminishing returns by now, but if you want a model that understands all aspects of language (just English), has concepts of style, passes basic popular knowledge tests, and is numerate as well as literate, it’s not clear that just a few hundred million parameters is enough. So far, from what I’ve seen and what the papers below report, bigger is better, although nobody thinks that the biggest models are particularly parameter-efficient. More on that, as well.

Edit: For a deeper (but engaging and very understandable) Transformer overview, take a look at this one from Rémi Louf of Hugging Face. It also explains the difference between encoder, decoder and encoder-decoder models (i.e. BERT, GPT-2, and T5…)

Enough background. Say you’ve read the BERT and GPT2 papers — or not. What else is worth knowing, to keep up to date with the field?

Most comprehensive hyper-parameter search award

“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, i.e. T5, from Google AI

If you had to read just one paper, this would be it. However it may be as long as the other papers listed below, combined. Not joking: it’s 30 pages, not counting references and samples. In our team’s weekly reading group, we spent two weekly sessions on the T5 paper, and even so did not really finish it.

The hyper-parameter search is excellent: with the encoder and decoder each restricted to about the size of “BERT base” (~220 million parameters in total), which choices matter? Beyond that, the paper also innovates on the BERT objective, in ways that seem small but can be very significant to your niche downstream problem:

  • Encoder-decoder setup, instead of BERT’s encoder only. Meaning that the model handles translation, summarization, and multi-token answers for cloze tasks much more gracefully than BERT.
  • Comprehensive study and support for multi-word masking [imo this is very important, as BERT can’t directly produce multi-word cloze answers without hacks and modification; yes most questions have one word answers in a large vocabulary, but many do not, including the most significant answers in your niche downstream problem]
  • Multi-task finetuning by defining all downstream tasks as Q&A. It’s not better, but also not worse (and you don’t need a new architecture add-on for every new task!)
  • Definitely shows that encoder-decoder is better than encoder-only or decoder-only across a basket of downstream tasks. I am very convinced.
  • Lots of experiments around finetuning methodology, so you don’t have to.
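The multi-word masking in the list above can be sketched as a text-to-text pair: masked spans are swapped for sentinel tokens in the input, and the target spells out what each sentinel stood for. The `<extra_id_N>` sentinel names follow the convention of the released T5 vocabulary as far as I know; the helper itself and the hand-picked spans are illustrative assumptions, not the authors' code.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target) strings.

    `spans` are non-overlapping, sorted, half-open index ranges chosen by the caller.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)            # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])  # the target carries the hidden words
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
source, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(source)
print(target)
```

Because the target is just another token sequence, a two-word answer costs nothing extra, which is exactly what an encoder-only model struggles with.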

I’ll be honest — I like GPT2-type generative models much more than BERT-type models (masking 15% of subword tokens). In part, this paper made me understand why I felt that way, and really does a lot to turn BERT-type models into something more capable of answering questions and testing knowledge, not just a hacky Mad Libs trick that happens to work really well in practice.

The authors also include excellent sections on dataset generation and scaling. Despite a more complicated and comprehensive objective, their model achieves SOTA performance compared to previous BERTs on most tasks, even at the same parameter scale. They also take their model to 11 (billion parameters) and show it keeps getting better.

It’s useful to keep in mind game AI legend Rich Sutton’s “Bitter Lesson”: cute and clever methods in AI are eventually beaten by simple, general methods that scale well with computational resources.

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation.

T5 not only scales the biggest Transformer language model to date, by an order of magnitude over most of the other works mentioned here. They also made most downstream tasks into versions of a general method. I wish I had more space to explain what a big deal that is — obvious as it may be.

Having never run it yet (the biggest T5 will only run with Mesh Tensorflow on the Google Cloud — see their GitHub), this would be my model of choice given all options.

The one thing that T5 does not do is auto-regressive text generation. That said, it does do summarization and translation. Auto-regressive writing could be trained as yet another downstream objective. AGI to some, just another option in an API to others.

OK, that was entirely too long, for a long paper that was really great but could also have used an editor. Let’s keep the next reviews to a page limit.

It’s all about tradeoffs. Few architectures are better at everything, with no cost. These two papers come closest — significant improvements with little downside.

We’re gonna need a bigger batch award

“RoBERTa: A Robustly Optimized BERT Pretraining Approach” from Facebook Research (FAIR)

The folks at FAIR give us a better pre-trained BERT by

  • training on cleaner data
  • training for longer
  • using bigger batches (thus more stable gradients?)
  • removing BERT’s auxiliary non-LM sentence-comparison objective

Best of all, their best model is available in a few lines of Python code from the PyTorch Hub. You can even try it in Google Colab in your browser.

The downside of course is that this is still BERT — the “RoBERTa large” model is 355 million parameters — trained mostly on web articles. For most users I’d recommend starting with pre-trained RoBERTa, and finetune it if you have domain-specific data. It can’t do everything that T5 can do (multi-token answers are still a problem) but it makes it easy to get started with a very good model.

Remember, you are reading about Transformers taking over the world of NLP. But your colleagues are not. RoBERTa is a great way to wow them, in a few lines of code.

What a world we live in!

Robustness without changing the loss award

“BPE-Dropout: Simple and Effective Subword Regularization” from Yandex Research 🇷🇺

It’s long bothered me, and many others, that lexically similar words get treated as distinct tokens in all of these models’ vocabularies. This gets at a deeper issue: all word embedding systems suffer from long-tail problems because of Zipf’s law (the notion that a word’s frequency is roughly inversely proportional to its frequency rank), unless the long tail is explicitly accounted for, as in Word2Vec, Swivel, etc.

Language models are trained by averaging loss across batches, meaning that rare embeddings are not updated frequently, and thus end up in a common “rare token” area of the embedding space, as opposed to near their synonyms or lexically similar tokens. One could account for this explicitly, but that would require an additional penalty or loss terms, making it tricky to balance that loss with the overall objective.

The team from Yandex solves this elegantly, simply applying “dropout” on the tree merging letters and subwords into longer tokens. During every iteration of training, the “text to tokens” encoding comes out slightly differently. Such dropout is computationally cheap and can be composed with any loss function, like other forms of dropout. It can be beneficial both to downstream tasks like translation and to the resulting token embeddings, especially for rare and misspelled words.
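A minimal sketch of the idea, assuming a toy merge table and a character-level starting point: at each step, every candidate merge is skipped with probability `dropout`, so the same word can be segmented differently from one pass to the next. The function name and the tiny `merges` list are made up for illustration; this is not the authors' implementation.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Greedy BPE: repeatedly apply the highest-priority merge.

    Each candidate merge is skipped with probability `dropout` at every step.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank = earliest learned merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))                # deterministic segmentation
rng = random.Random(1)
for _ in range(3):
    print(bpe_encode("lower", merges, dropout=0.3, rng=rng))  # varies per call
```

With `dropout=0.0` this reduces to ordinary BPE, and with `dropout=1.0` it falls back to single characters, so one knob covers the whole range.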

An option I’d love to have for all of my models! And something that could potentially make a bigger than average difference if you are actually interested in your model’s accuracy on that specialized, niche data problem.

Let’s take a break from architecture, and take a look at some of my favorite papers exploring specific downstream tasks. The most captivating task, of course, is conditional text generation — writing articles, essays etc — on a topic or in the style of one’s choosing.

These are the models that AGI speculators are the most afraid of: how far are we from a model that can download your web presence, and learn to speak just like you? If you’re a Silicon Valley character or a New York Times opinion writer, perhaps not very long…

Best human imitation award

“Defending Against Neural Fake News” aka GROVER, from Allen Institute/University of Washington

This is an excellent paper on conditional text generation. In this case, the text is internet news articles — which happen to be our largest source of LM pre-training data. Unlike an unconditional Transformer, GROVER writes articles conditioned on date, title, source and author. Or it can give you a title based on an article’s content.

The problem is that article generation across hundreds of auto-regressive steps is not differentiable. Hence you can’t just train a discriminator to detect human and machine-written pieces longer than a few words. The authors don’t solve this problem directly, but they do show ways to tweak the amount of perplexity that a model exhibits, to be more human-like.
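For readers who have not worked with the metric, perplexity is just the exponential of the average negative log-probability the model assigned to each token. The helper below is a generic sketch of that definition, not anything specific to GROVER.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is always 25% sure of the right token is as "confused"
# as a uniform choice among four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Text sampled too greedily tends to score lower perplexity than human writing, and text sampled too freely scores higher, which is why it is a useful dial to tweak.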

Their Krugman is not half-bad.

GROVER conditional generation language model

Having spent a lot of time looking at outputs of conditionally generated left to right models, I think these giant Transformers are much further ahead on style, than they are on content.

By style I mean:

  • vocabulary, sentence length, internal consistency

These models are not writing stories — they are playing a game of improv. Trying to “yes-and” themselves, based on what’s already been written.

Think about it from the model’s perspective —which was trained to optimize perplexity — the easiest gains, relatively speaking, are from keeping internally consistent with what’s already been said. I don’t think we have great tools yet, to also ask the model to “speak” compellingly on a specific topic, to persuade or to convey specific information. But given that information, perhaps it could learn to say it better than some of us humans. I’m bullish on Transformer-based text rewriting tools. Probably BERT-based though — sadly.

Most compelling use of Reddit award

“CTRL: A Conditional Transformer Language Model for Controllable Generation” from Salesforce Research

Most pre-training throws out the source of the text, but what if we supply those priors to the model? Not surprisingly, this helps disambiguate many cases.

Not only does this 1.5 billion parameter left to right model generate better answers, conditional on data, source, subreddit, etc. It also comes with a module (flawed but still cool) to attempt “source attribution” given a text. Just Reddit alone contains hundreds of sources. Gives some idea of the possibilities…

“Good to know” award

“Language Models as Knowledge Bases?” from FAIR and University College London

These models are big. Optimizing them is fun. But do they actually know anything? Clearly they do. But is there a good metric — how do they compare to other methods, and do we need to finetune?

In this brief study that cuts a few corners (multi-token answers not considered), the scientists at FAIR show that BERT-type models are very competitive with SOTA knowledge extraction methods, tested on knowledge datasets derived from Wikipedia. They do well, even zero-shot, out of the box, not trained specifically for knowledge recall (multi-token aside).

T-REx test questions generated from Wikipedia links.

Nothing about the paper is terribly surprising. But it gives you good confidence to try BERT on your domain-specific fact-based dataset. I would suggest the pre-trained RoBERTa-large.

What about people trying to change the underlying architecture? Are any of those directions promising? Here’s a few.

Best parameter washing machine award

“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations” from Google Research.

Kicking myself for not trying this months ago — I’m sure others are doing the same. Transformer-type models consist of several expensive self-attention layers. Why not just re-use the same parameters, maybe after the first layer of self-attention? This should add some value, as nobody thinks the current parameters are used efficiently (vast majority are rarely touched) and yet we all know empirically that deeper is better. Perhaps there is juice to be squeezed from adding more compute, but without increasing the number of randomly-initialized layers.

And indeed there is. The ALBERT model is not faster to run than BERT-large, but you do benefit from not having to re-load huge layers into GPU memory one at a time [the bigger Transformer models are trained with layer checkpointing, paying a small perf hit to not have to load the entire trainable model into GPU memory at once]. As a result, they end up implementing very wide BERT models, with hidden size of 4096 but only 235 million trainable parameters. That’s pretty wide, T5 aside.
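Some back-of-the-envelope arithmetic shows why sharing makes such a wide model affordable. The sketch below assumes a standard layer layout (four attention projections plus a feed-forward block four times the hidden size) and a 12-layer stack, and it ignores embeddings, biases and layer norms, so the totals are rough and will not match the paper's reported count exactly.

```python
def layer_params(hidden, ffn_mult=4):
    """Approximate weights in one Transformer layer, ignoring biases and layer norms."""
    attention = 4 * hidden * hidden                  # query, key, value, output projections
    feed_forward = 2 * hidden * (ffn_mult * hidden)  # two dense layers
    return attention + feed_forward

def encoder_params(hidden, num_layers, shared):
    """Total layer weights, counted once if all layers share parameters."""
    per_layer = layer_params(hidden)
    return per_layer if shared else per_layer * num_layers

wide_shared = encoder_params(4096, 12, shared=True)
wide_unshared = encoder_params(4096, 12, shared=False)
print(f"shared: {wide_shared / 1e6:.0f}M, unshared: {wide_unshared / 1e6:.0f}M")
```

The compute per forward pass is the same either way, since every layer still runs; only the number of distinct weights to store and update shrinks.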

One thing I’m confused about — why not explore sharing some but not all of the layer parameters? Why not blend the parameters between a “first” and “final” layer? It’s impressive that literally just re-using one big layer in a residual model worked, without any tweaking to allow the layers to operate differently.

I’m still mad I didn’t run this basic experiment.

They say in Roman times, when an unexceptional emperor died, the marble busts of his face would be “re-carved” to resemble the new emperor. Why waste the good stone?

What first got me into Transformer-based language models, away from previously popular recurrent (LSTM) models, was their adaptability to fine-tuning on downstream tasks.

Sometimes LSTMs would respond well to finetuning, sometimes they wouldn’t. It seemed that Transformer models always responded. Even for nuanced and noisy labels like multi-dimensional sentiment (fear, anger, trust, etc). Hey, gotta cite yourself at some point.

The problem is these Transformer models are huge. Finetuning all of the layers is expensive (and may not fit in memory). However, finetuning only the last layer misses out on the full benefits of finetuning.

Best 80/20 Pareto optimization award

“Parameter-Efficient Transfer Learning for NLP” from Google Research.

The paper suggests taking a pre-trained model and adding zero-initialized “adapter modules” between layers of the network. These start out at zero, and only these modules are trainable in finetuning — not the original model parameters.

Although T5 shows that full model finetuning is better in a comprehensive study across lots of tasks, the adapters get solid results, with an order of magnitude fewer trainable parameters in the finetuning process.
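Here is a toy sketch of what such an adapter could look like: a small bottleneck with a skip connection, where the up-projection starts at zero so the module is an exact pass-through before finetuning. The shapes, the ReLU and the pure-Python matrix helper are illustrative assumptions rather than the paper's exact recipe.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter with a skip connection: h + W_up * relu(W_down * h)."""
    bottleneck = [max(0.0, z) for z in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

hidden_size, bottleneck_size = 4, 2
# Down-projection gets small arbitrary weights; up-projection starts at zero.
w_down = [[0.1 * (i + j) for j in range(hidden_size)] for i in range(bottleneck_size)]
w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

h = [1.0, -2.0, 0.5, 3.0]
print(adapter(h, w_down, w_up))  # identical to h before any finetuning

# Trainable weights: 2 * hidden * bottleneck, versus roughly 12 * hidden^2 per layer.
print(2 * hidden_size * bottleneck_size, "vs", 12 * hidden_size ** 2)
```

Only `w_down` and `w_up` would receive gradients during finetuning, while the pre-trained weights around them stay frozen.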

Better yet when you’re done, take the shims out, and you get your original emperor back.

At this point we’ve covered many papers. When will this end? And more importantly, why have you not cited my paper??

I tried to include not just the most famous Transformer papers of the year, but also some helpful but obscure ones. Let’s move on to the honorable mentions. Good, interesting papers that make you think. Perhaps better papers than some of the ones above. But perhaps, papers that you can’t really act on, as an applied NLP mercenary. Think more of a fun, esoteric read and a useful exploration. Something you’d want to be aware of, but probably won’t use very often.

Best original abstraction award

“XLNet: Generalized Autoregressive Pretraining for Language Understanding” from Carnegie Mellon and Google Research

Paired with its predecessor “Transformer-XL” professor Ruslan Salakhutdinov and his student are on a mission to “make the world safe for language modeling” (as opposed to BERT-style masking). They claim great results compared to BERT — but not really compared to improved BERT models like RoBERTa or T5 — and along the way they demonstrate some very cool techniques:

  • relative positional embeddings (instead of absolute position for each token)
  • extending attention queries across previous (not trainable) token sequences
  • using permutations of the token order, so that language modeling can look ahead to future tokens
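The permutation trick in the last bullet is easier to see as an attention mask. In this simplified sketch, a "factorization order" says which position gets predicted at each step, and a position may attend only to positions predicted before it. It deliberately ignores the paper's two-stream attention and any sampling heuristics; the function name is made up for illustration.

```python
def permutation_mask(order):
    """mask[i][j] is True when position i may attend to position j.

    `order` is a factorization order: order[k] is the position predicted at step k.
    """
    step = {pos: k for k, pos in enumerate(order)}
    n = len(order)
    return [[step[j] < step[i] for j in range(n)] for i in range(n)]

# Standard left to right order: each token sees only earlier positions.
left_to_right = permutation_mask([0, 1, 2, 3])
# A shuffled order: position 0 is predicted last, so it sees every other position,
# including the ones to its right.
shuffled = permutation_mask([2, 3, 1, 0])

for row in shuffled:
    print(row)
```

The tokens themselves never move; only the mask changes, which is how a left to right objective gets to use context from both sides.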

The implementation for each is pretty challenging, which is probably why, while everyone cites XLNet, I don’t think many outside of the authoring group actually use it. Relative positional embeddings are a good idea, and clearly better than absolute embeddings. However, actually using these means decomposing the attention matrix operations, rather than just adding your positional embeddings to the token embedding at the input layer. Simplicity of implementation matters.

The token order permutations are so complex that I’m guessing most readers don’t really understand them. They also require a second set of hidden states. Worse yet, the permutations are so general, so broad, that the authors don’t actually train the “pure” form of their own model. Instead they use heuristics to limit the set of possible permutations to a manageable subset.

Lastly, while the problems with BERT pointed out in the XLNet paper are 100% real and serious problems, some of these are in practice addressed in T5, which came out after XLNet.

XLNet points out the biggest flaw in BERT — conditional generation of multiple tokens

Overall, XLNet is an important paper, one of my favorite reads of the year. But the architecture itself is too general to be useful, not fully-baked and under-trained in its current state. Sad as it makes me to say this, it probably makes more sense to improve the BERT objective (RoBERTa, T5) than to try to generalize the left to right LM objective.

I do hope the XLNet team keeps working on it.

PS If you do want to try XLNet, it is available on Github, and accessible through the wonderful libraries at Hugging Face.

Here’s an obscure and flawed paper, that I really liked.

Most creative use of backpropagation award

“Universal Adversarial Triggers for Attacking and Analyzing NLP” from Allen Institute

The authors manage to pull off something pretty cool. We all know that these Transformer models have holes in their coverage, so to speak. Some texts will be mis-classified, and adding some “trigger phrases” will make the model behave badly. Of course it’s much cooler if a method can use backprop to find you such phrases. A hack, to be sure, but a hack you can then further hack to find you phrases that are short, that sound like real English, or which avoid a list of obviously triggering words, or other constraints.

All of this is very cool. It makes you think, in its fullest expression, about finding ideal conditional phrases for any topic. Maybe even a key to exploring the model’s internal state, or using it to robustify model training.

However, the demo posted along with the project, does not appear to match the examples cited in the paper. Kudos to the Allen Institute for providing demos of their projects. I don’t wish to knock them. But one has to be a bit skeptical of the result, if copy-pasting their own examples from the paper, does not trigger the model.

No neural NLP overview would be complete without a contribution from Stephen Merity. Thankfully, he happened to drop a good one as I was putting this together.

You need me on that wall! Award

“Single Headed Attention RNN: Stop Thinking With Your Head” by S. Merity

The only non-Transformer paper on this list, and it’s a good one. Besides a witty and fun read, with several useful tangents (more on that below), Stephen demonstrates that an LSTM with a single attention head gets very close in language model (left to right) perplexity to a 64-layer character-level Transformer. Moreover, Stephen’s model was trained on a single GPU, in under 24 hours: hundreds of times less compute than some of the models mentioned above.

It’s an excellent read, not least because Stephen summarizes and links back to pre-Transformer language modeling research, having been an influential part of it as part of Salesforce Research. He’s also a witty and informative writer, citing several of his tweet storms in the paper, on various NLP topics.

He brings up good points about entropy — decision points in the language modeling problem — which are by no means equally distributed across a sequence (hence sensitive to vocabulary choices).

While I won’t be switching back to LSTM from my Transformers — even if the perplexity was as good, I don’t have any reason to think the finetune-ability has improved — I will keep reading Smerity to benefit from his insights. And you should too. Few have thought as long about the deep learning language modeling problem.

Well, that was longer than expected. I’m sure I missed some good papers. Might append to the list here, feel free to let me know.

People have asked: how do I find good papers? Truth is, it’s not hard. See what bubbles up on Twitter. See what the top young researchers in the field are doing, or what work by other people they are sharing. I’ve embedded some of those people’s tweets above. And if you want to find papers beating SOTA, with code and published results, you can do that too.

You may also notice that half the papers above came from Google Research, or one of its affiliates like DeepMind. Sadly, perhaps, more than half of the researchers in our field, and in any deep learning research sub-speciality, work for Google. Google also uses a rigorous internal review process, so most papers coming out of Google were edited and thoroughly checked, before posting to ArXiv. I’d like to promote more papers from less well known institutions, but it’s hard to make major contributions these days, and get them out ahead of AI powerhouses like Google, FAIR, or smaller focused groups like OpenAI and the Allen Institute. There are thousands of papers published every year in top AI journals — it’s hard to get in — and yet most of those papers aren’t read by more than a few people. I guess it’s just the way it is. Pareto principle is un-defeated.

The good news is that anyone can use these public checkpoints, and benefit from these discoveries. Not only Google and Microsoft can use the latest BERT models in production — you can too. In fact, on your niche text extraction/classification problem, you can probably afford to run RoBERTa-large, while they can’t. You can use T5, and not even have to train it. Only months ago, nobody had that power, now everyone does if they know where to look. Yes, you’ll probably have to get an account on Google Cloud or throw some coin at AWS, NVIDIA, etc.

You can also use intermediaries like the Paris/Brooklyn startup Hugging Face, who have made a name for themselves over the past year getting new NLP SOTA models available to use in their library. You might be able to compare BERT to RoBERTa to ALBERT in a single script, with XLNet to boot. Think of it as the Gensim of cutting edge deep language modeling. I’ve seen new models go up on Hugging Face in weeks.

Bookmark this page, read a few of these blogs and papers, download models and try them on your dataset. You’ll be a deep NLP expert in no time. And who knows — maybe you’ll find the next wrinkle. Happy Transforming.

Predictions for 2020

I would be remiss not to make a few predictions for what will happen next. By the time you read this, some of this may have already happened, or even been published. Don’t @ me.

More scaling. As our group at NVIDIA showed with Megatron and Google showed with T5, there are still gains to be made, even on existing NLP metrics, from training bigger models for longer, on more fresh data. The gains may seem to be diminishing, but that will change as GLUE, SuperGLUE, etc add harder, more niche tasks. As Amit Singhal used to tell us on the Google Search team — as we make search better, the users keep asking harder questions.

Data augmentation, selection, efficiency. As T5 points out, big Transformers are really good at overfitting (since memorization is a part of text understanding), thus you should never train them on the same text twice. As finding more good data on the internet gets harder and harder, we’ll see more focus on augmenting the data we have (BPE-dropout is a great example, and Quoc Le’s group at Google had other good suggestions). Right now, data augmentation for text is not where it is for deep computer vision. And even in CV, Prof Le’s same group is still making breakthroughs.

Furthermore, as a FB executive friend liked to point out — the data you really care about, is always scarce, even if less important data is abundant.

For example, take resume screening, a project I worked on at NVIDIA early in my tenure. The number of outstanding resumes can be tiny, especially for a specific role. Overfitting is a problem, and data augmentation is often our best solution.

Another problem is that text datasets may contain 90% news and clickbait, and 0.01% of articles on your topic of interest. In practice, that’s enough for a huge model to learn good embeddings for your niche task, be it baseball, science or financial topics. But wouldn’t you like to do better? Or to do as well, more efficiently?

Given that the naive approach to this works (I’ve done it, I’m sure others have too), I think we’ll see methods emerge (think of them as RL for the downstream task) that select documents for you, to help with niche domain pre-training. Instead of over-fitting to your task, why not just over-sample relevant documents in pre-training? The main reason we’ve not seen this, I think, is because the benefits of scale have swamped everything else so far. But at some point the lines between more training and better selection will start to cross.

For now, filtering by subreddit is a fine way to build your niche-specific pre-training dataset, for some topics.

Rewriting — for style, and otherwise. I’ve found the Transformer models are sneaky good at style detection and generation. Much better than at reasoning, specific knowledge, etc. Style is mostly a local feature — which word or phrase to use, can we keep a consistent vocabulary with the previous sentence (including implied vocabulary that was not actually used). Using the very big BERT-style models (T5 seems tailor made), I expect big breakthroughs in text rewriting, document editing tools, etc. Right now Google (and to a lesser extent Apple) will help you write short emails and finish your sentences. Imagine a tool instead that trims the fat, and suggests good rewrites. Doing a full email re-write in one shot seems hard, but all text edits more or less break down to individual operations:

  • insert, delete, replace
  • move a sentence

There’s no reason to think that a model could not make such suggestions, or that you could not agree/disagree with them one at a time, with the model recomputing its suggestions after each. Shoot, an RL agent could even make those choices for you. And it probably should.
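To show how a rewrite decomposes into individual operations that can be accepted or rejected one at a time, here is a sketch using the standard library's `difflib`. The two helper functions are hypothetical, and the "suggested rewrite" is hard-coded where a model's output would go. A moved sentence shows up here as a delete plus an insert.

```python
import difflib

def edit_operations(before, after):
    """List the insert/delete/replace steps that turn `before` into `after` (word level)."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def apply_accepted(before, after, accept):
    """Rebuild the text, applying only the edits for which `accept(op)` is true."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        op = (tag, before[i1:i2], after[j1:j2])
        keep_new = tag != "equal" and accept(op)
        result.extend(after[j1:j2] if keep_new else before[i1:i2])
    return result

draft = "I just wanted to quickly reach out and say thanks".split()
rewrite = "I wanted to say thank you".split()  # stand-in for a model's suggestion

for op in edit_operations(draft, rewrite):
    print(op)
# Accept only deletions, keep everything else as the author wrote it.
print(" ".join(apply_accepted(draft, rewrite, lambda op: op[0] == "delete")))
```

The `accept` callback is where a human in the loop, or an RL agent, would make the call on each suggestion.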

Non-local gradients and backprop — through reinforcement learning. I’ve touched on this several times, but the best and worst feature of language modeling is that we get very far by optimizing for local losses, perhaps with a large context window, but writing the outputs — in permanent marker — one token at a time. However, what we care about in good writing, can only be measured across multiple words. To make this more concrete, conditioned image generation works on the whole image, because you can backprop into the individual pixels. You can’t really do that with large Transformers. Or can’t you? I expect to see some progress in this space. Maybe on a small scale, but it doesn’t seem impossible — to pass some signal, between multiple trainable text tokens.

Think of this as complementary to data selection and re-writing.

Given the interest in Turing test and generative models, I expect some serious resources and brain power are being devoted to this already.

You can’t manage, what you can’t m̶e̶a̶s̶u̶r̶e̶ backprop.

Transformers beyond text. It’s obvious that Transformer modules will be useful on problems other than text. We’ve already seen large Transformers make a big improvement on protein modeling and I’ll have a paper out soon on our own genomics work, that also includes a Transformer module. Transformers have been useful for some computer vision tasks — mainly because they easily support a larger receptive field than convolutions. It can be useful to add a Transformer module after a few initial convolutional layers (and that’s what we’ve done on our genomics problem).

The question is: will the mammals go back to the ocean? Can we learn anything from non-text Transformers that will be helpful back to text?

Which of these will pay off most for my practical problems? Honestly, in the medium term, I think the re-writing. That’s a strange thing to say given that Transformers are not doing this at all right now. But we all need an editor. Summarizing and re-writing content with a particular use case in mind, perhaps paired with a human in the loop to choose “which is better”, will be huge. Not just for jokes and games, although those will probably be the first impressive use cases.

More papers.

Here’s a few other NLP papers that I’ve read and liked. Perhaps they deserve a longer writeup, but that would mean an overly long story, or bumping XLNet, Smerity, etc. Who wants that? Noting them below, with a link and a key insight. All of these papers are worth reading, especially if the insight seems relevant to your problem.

“Character-Level Language Modeling with Deeper Self-Attention” from Google Research (2018)

  • 64 layer Transformer — still probably the deepest we’ve seen (character level not word level, so a bit different)
  • Key insight: loss is applied after every layer, not just the final output. Those intermediate losses are then decayed over training. Good way to initialize the network and avoid instability.
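A rough sketch of what "loss after every layer, decayed over training" could look like as a weighting schedule. The linear fade and the `l / (2 * num_layers)` cutoff are my assumptions for illustration and may not match the paper's exact schedule; the function names are hypothetical.

```python
def layer_loss_weights(step, total_steps, num_layers):
    """Weight for each layer's loss at a given training step.

    Assumed schedule: the loss on intermediate layer l (1-indexed) fades linearly
    to zero by the fraction l / (2 * num_layers) of training, so shallow layers
    drop out first. The final layer always keeps weight 1.
    """
    progress = step / total_steps
    weights = []
    for layer in range(1, num_layers + 1):
        if layer == num_layers:
            weights.append(1.0)
            continue
        cutoff = layer / (2 * num_layers)
        weights.append(max(0.0, 1.0 - progress / cutoff))
    return weights

def total_loss(per_layer_losses, step, total_steps):
    """Combine per-layer prediction losses using the schedule above."""
    weights = layer_loss_weights(step, total_steps, len(per_layer_losses))
    return sum(w * loss for w, loss in zip(weights, per_layer_losses))
```

At step 0 every layer is trained to predict the target directly; by the halfway point only the final output loss remains.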

“MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism” from my colleagues at NVIDIA

  • First work to train models up to 8 billion parameters (since superseded by T5’s 11 billion). Support for BERT and GPT-2, in PyTorch.
  • Key insight: uses model parallelism to split each attention layer matrix into several sub-matrix operations, across many GPUs. Really effective (and pretty simple once you know it works) way to train huge models efficiently on a cluster.

We showcase this approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer based language model ever trained at 24x the size of BERT and 5.6x the size of GPT-2. We have published the code that implements this approach at our GitHub repository.
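The core trick is easy to see on a toy matrix product. The sketch below splits a weight matrix into column blocks, computes each block separately (as if on its own GPU), and concatenates the results. It only illustrates the sub-matrix idea in plain Python with hypothetical helper names; the real system runs on GPUs and also has to handle communication between them.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split weight matrix w into `parts` equal-width column blocks."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def parallel_matmul(x, w, parts):
    """Each shard computes x @ w_shard; concatenating the outputs recovers x @ w."""
    shard_outputs = [matmul(x, shard) for shard in split_columns(w, parts)]  # one per device
    return [sum((out[r] for out in shard_outputs), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
same = parallel_matmul(x, w, parts=2) == matmul(x, w)
```

No shard ever needs the full weight matrix, which is what lets a model too big for one GPU fit across several.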

“Fast Transformer Decoding: One Write-Head is All You Need” from the OG Noam Shazeer (Google Research)

  • SOTA Transformers use many attention heads (usually 4–8 but sometimes more). This makes the models memory-bound on modern hardware, especially for decoding [inference]. What if all N heads share a single set of keys and values, so that only the queries differ per head, instead of computing N full attention heads?
  • Key insight: GPU/TPU memory is the bottleneck for large Transformer models. Anyone who trains these models knows that. Here’s another idea to reduce memory by a lot, without giving back much in test accuracy. Stay tuned.
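Some back-of-envelope arithmetic on why this matters for decoding. As I read the paper, the saving comes from sharing keys and values across heads, so the cache you keep around while generating shrinks by the number of heads. The model shape below is hypothetical, chosen only to make the numbers concrete.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Memory for cached keys and values during incremental decoding."""
    # Factor of 2: one tensor for keys, one for values, in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape, for illustration only (fp16 values).
shape = dict(layers=24, head_dim=64, seq_len=1024, batch=8)
multi_head = kv_cache_bytes(kv_heads=16, **shape)  # separate keys/values per head
shared_kv = kv_cache_bytes(kv_heads=1, **shape)    # one set shared by all heads

print(multi_head // 2**20, "MiB vs", shared_kv // 2**20, "MiB")  # 768 MiB vs 48 MiB
```

Less memory to read at every decoding step is the whole point, since that read is what the hardware is waiting on.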

Noam Shazeer is one of the true greats of this field. I’ve known about his work since I was in college, as his undergrad project at Duke (solving the NYTimes crossword puzzle) was a big deal at the time in early post-Google AI.

One of my first projects at Google was tweaking the original “Did you mean” spelling correction system — another early-Google applied NLP project. A project that I heard Noam coded up in his spare time, and then didn’t touch for years, moving on to bigger and better things. That was almost 15 years ago.

He’s also a co-author of T5, and probably mentioned on a third of the papers above — that’s in 2019. Pareto principle is real folks. If you’re a student and get a chance to intern with someone like that, do it. Don’t ask how much they are paying — you should be paying them, frankly.

“Fine-Tuning GPT-2 from Human Preferences” from OpenAI

  • A deep look at text summarization — a task that’s difficult, important, but also somewhat subjective [more so than translation, Q&A, or perhaps even style rewriting]. The authors use RL and “humans in the loop” to teach their model summarization styles that users prefer.
  • Key insights: mostly that this is possible, but also that it’s hard. Especially when “good summarization” is ambiguous even to the human labelers.
  • The work comes off a bit as half-baked and less scientific than most of the papers here. However, these fuzzy problems may be more relevant to real world applications that you may want to optimize than something more objective.

We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.

We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as debate and amplification.
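Reward learning from preferences boils down to a small loss function. Here is a minimal sketch, assuming the labeler picks the best of several candidate outputs and the reward model is trained with a softmax over its scores (that is my recollection of the setup, so treat it as an assumption). The function name is hypothetical.

```python
import math

def preference_loss(rewards, chosen):
    """Negative log-probability that the human-chosen sample wins a softmax over rewards."""
    m = max(rewards)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_norm - rewards[chosen]

# Reward model scores for four candidate summaries; the labeler picked index 0.
undecided = preference_loss([0.0, 0.0, 0.0, 0.0], chosen=0)  # equals log(4)
agrees = preference_loss([5.0, 0.0, 0.0, 0.0], chosen=0)     # lower loss
disagrees = preference_loss([0.0, 0.0, 0.0, 5.0], chosen=0)  # higher loss
```

The loss falls when the reward model scores the human's pick above the alternatives. The trained reward model then becomes the signal for the RL step.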

Maybe “smart copiers” is what you need. Say you’re given a stack of reports, and need to extract the main takeaways. You’re given GPUs, a few interns and a couple of months. But also perhaps, the outputs of your model plug into a system where people are used to running topic modeling with LDA or other Gensim tools. Can you do better than take the first paragraph from each story? OpenAI shows a really nice study of how this can be done. Not the most glorious use of AI, but maybe this is what you need, and what your organization will accept (since it improves an existing work-flow).

Before scaling and SOTA takeover defined Transformer models, one of the earlier applications from Google Research was using Transformers to take in a top-30 Google search (on a named entity), and produce a Wikipedia stub. Too bad nobody outside of Google could really reproduce this.

Don’t sleep on summarization; ain’t nothing new under the sun.



Nikolai Yakovenko

AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami