Deep NLP: Predictions for 2020

More scaling.

As our group at NVIDIA showed with Megatron and Google showed with T5, there are still gains to be made, even on existing NLP metrics, from training bigger models for longer on more and fresher data. The gains may seem to be diminishing, but that will change as GLUE, SuperGLUE, etc. add harder, more niche tasks. As Amit Singhal used to tell us on the Google Search team: as we make search better, the users keep asking harder questions.

Data augmentation, selection, efficiency.

As T5 points out, big Transformers are really good at overfitting (since memorization is a part of text understanding), so you should never train them on the same text twice. As finding more good data on the internet gets harder and harder, we’ll see more focus on augmenting the data we have (BPE-dropout is a great example, and Quoc Le’s group at Google had other good suggestions). Right now, data augmentation for text is not where it is for deep computer vision. And even in CV, Prof. Le’s same group is still making breakthroughs.
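To make the BPE-dropout idea concrete, here is a minimal sketch rather than a reference implementation. It assumes a tiny hand-written merge table (`MERGES` is hypothetical, not taken from any real tokenizer) and skips each candidate merge with probability `dropout`, so the same word can be segmented differently on each pass over the data.

```python
import random

# Hypothetical toy merge table, in priority order (earlier = applied first).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_segment(word, merges, dropout=0.0, rng=random):
    """Segment `word` with BPE, dropping each candidate merge with probability `dropout`."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

print(bpe_segment("lower", MERGES))               # ['lower']
print(bpe_segment("lower", MERGES, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
```

With `dropout=0.0` this reduces to ordinary deterministic BPE. With a small value in between, the model sees several segmentations of the same text, which is the augmentation effect.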

Rewriting — for style, and otherwise.

I’ve found the Transformer models are sneaky good at style detection and generation, much better than at reasoning, specific knowledge, etc. Style is mostly a local feature: which word or phrase to use, and whether we can keep a consistent vocabulary with the previous sentence (including implied vocabulary that was not actually used). Using the very big BERT-style models (T5 seems tailor-made), I expect big breakthroughs in text rewriting, document editing tools, etc. Right now Google (and to a lesser extent Apple) will help you write short emails and finish your sentences. Imagine a tool instead that trims the fat and suggests good rewrites. Doing a full email rewrite in one shot seems hard, but all text edits more or less break down to individual operations:

  • move a sentence
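One way to picture an edit as an individual operation is a small data structure that a model could predict and a tool could apply. The sketch below is only an illustration: the `MoveSentence` class and its fields are hypothetical, and it assumes the text has already been split into sentences.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MoveSentence:
    """Hypothetical edit operation: move the sentence at `src` so it ends up at index `dst`."""
    src: int
    dst: int

    def apply(self, sentences: List[str]) -> List[str]:
        result = list(sentences)         # copy, so the original draft is untouched
        sentence = result.pop(self.src)  # take the sentence out...
        result.insert(self.dst, sentence)  # ...and put it back at its new position
        return result

draft = [
    "Thanks for the update.",
    "I can meet Tuesday.",
    "Let me know if that works.",
]
edited = MoveSentence(src=0, dst=2).apply(draft)
print(edited)
# ['I can meet Tuesday.', 'Let me know if that works.', 'Thanks for the update.']
```

Keeping each operation small and explicit also makes it easy to show a person the before and after and let them accept or reject a single change.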

Non-local gradients and backprop — through reinforcement learning.

I’ve touched on this several times, but the best and worst feature of language modeling is that we get very far by optimizing for local losses, perhaps with a large context window, while writing the outputs in permanent marker, one token at a time. However, what we care about in good writing can only be measured across multiple words. To make this more concrete, conditioned image generation works on the whole image, because you can backprop into the individual pixels. You can’t really do that with large Transformers, because sampled tokens are discrete. Or can you? I expect to see some progress in this space. Maybe on a small scale, but it doesn’t seem impossible to pass some signal between multiple trainable text tokens.
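As a rough illustration of how a reward measured over a whole sequence can still reach individual token choices, here is a toy REINFORCE (score-function) sketch. Everything in it is an assumption made for the example: the "policy" is just a table of per-position logits instead of a Transformer, and `sequence_reward` is a made-up reward that penalizes repeated tokens.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_reward(tokens):
    """Toy sequence-level reward: fraction of distinct tokens (repetition is penalized)."""
    return len(set(tokens)) / len(tokens)

def reinforce_step(logits, rng, lr=0.5, baseline=0.0):
    """Sample one sequence, score it as a whole, and nudge every position's logits."""
    probs = [softmax(row) for row in logits]
    tokens = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
    reward = sequence_reward(tokens)
    advantage = reward - baseline
    for t, tok in enumerate(tokens):
        for v in range(len(logits[t])):
            # d log p(tok) / d logit_v = 1[v == tok] - p_v
            grad_logp = (1.0 if v == tok else 0.0) - probs[t][v]
            logits[t][v] += lr * advantage * grad_logp
    return tokens, reward

def train(steps=2000, seq_len=3, vocab=3, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * vocab for _ in range(seq_len)]
    baseline = 0.0
    for _ in range(steps):
        _, reward = reinforce_step(logits, rng, baseline=baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # running average reduces variance
    return logits
```

The reward is only defined over the full sequence, yet each position receives an update scaled by it. The catch is that this estimator is noisy, which is part of why doing it at the scale of large Transformers is hard.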

Transformers beyond text.

It’s obvious that Transformer modules will be useful on problems other than text. We’ve already seen large Transformers make a big improvement on protein modeling, and I’ll have a paper out soon on our own genomics work that also includes a Transformer module. Transformers have been useful for some computer vision tasks, mainly because they easily support a larger receptive field than convolutions. It can be useful to add a Transformer module after a few initial convolutional layers (and that’s what we’ve done on our genomics problem).

Which of these will pay off most for my practical problems?

Honestly, in the medium term, I think the rewriting. That’s a strange thing to say given that Transformers are not doing this at all right now. But we all need an editor. Summarizing and rewriting content with a particular use case in mind, perhaps paired with a human in the loop to choose “which is better”, will be huge. Not just for jokes and games, although those will probably be the first impressive use cases.



Nikolai Yakovenko

AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami