Mo’ data, mo’ problems: deep learning NLP into 2021

Nikolai Yakovenko
Jan 4, 2021

Overdue for a blog post, and it started with a tweet. A VC-Twitter thread about the value of Twitter, and moving to Miami — how late 2020. I promise not to mention the price of bitcoin 📈. Maybe.

See you all at Joe’s Stone Crab in a couple of weeks.

The best text content is all on Twitter.

It’s a threaded conversation, quote-tweeting another thread, which in turn mentions other threads. In a GPT-3 world, it’s easy to imagine a deep NLP model making logical inferences based on this conversation — yet almost impossible that it would actually do so. Much less be able to respond to the thread in a coherent way.

At best it could compose a “reply-guy type object,” to misquote Janelle Shane, curator of AI Weirdness and featured speaker at the NeurIPS Machine Learning Creativity Workshop. AI-written songs and poems are terrible, at best terrible in an amusing way.

Twitter would make the best internet text-based dataset. And also perhaps a measure of actual “human performance,” not just what we call human performance on artificial tasks. Useful as those artificial baselines might be.

Quick refresher on how deep natural language (NLP) models work:

  • Collect (or stream) a huge volume of (coherent) text from the internet.
  • Tokenize the text with a word/character scheme — details not important.
  • Mask part of the text, and train a model to predict the masked text from the remaining context.
  • [Generative models like GPT-3 are a special case of masking the last token.]
  • Once your model “converges” on the masked language modeling task, you can fine-tune it on tasks you actually care about, such as summarization, sentiment prediction, entity extraction, or something custom and secret, which probably can’t be shared without an NDA. My favorite pipe dream is a personalized email editor. God knows we ramble on for too long.
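The masking step above can be sketched in a few lines of plain Python. This is only an illustration: the whitespace tokenizer, the `[MASK]` string and the function names are assumptions, and real pipelines use subword tokenizers and a model to do the predicting.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where inputs[i] was masked,
    and None elsewhere (positions the loss should ignore).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def last_token_example(tokens):
    """Generative special case: hide only the last token."""
    inputs = tokens[:-1] + [MASK]
    labels = [None] * (len(tokens) - 1) + [tokens[-1]]
    return inputs, labels


# Toy "tokenizer": split on whitespace.
tokens = "the best text content is all on twitter".split()
masked_inputs, masked_labels = mask_tokens(tokens, mask_prob=0.3)
gen_inputs, gen_labels = last_token_example(tokens)
```

The model's job is then to fill in every position where the label is not `None`, given everything else in `inputs`.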

We have been doing this in deep NLP since 2018, without changing the basic formula. Of course the details matter, the models are getting stronger, fast, and much, much bigger. Transformer-based deep NLP models are killin’ it, without a major rethink or redesign. Nor am I here to suggest one. Apologies.

Just because the internal combustion engine of deep NLP hasn’t changed since BERT and GPT-2, that doesn’t mean nothing has happened. A lot has happened.

In practice, nobody trains their own language models. That task has been taken up by Google, Facebook, my former team at NVIDIA, Microsoft, and the OpenAI/Microsoft partnership around GPT-3, which I don’t quite understand (even after Microsoft CTO Kevin Scott’s interview with the head of OpenAI; Kevin was also my first boss at Google).

Deep NLP model number of parameters

The reasons for using someone else’s pretrained model cascade as follows:

  1. You don’t want to spend the GPU credits.
  2. You don’t want to write multi-GPU training scripts, much less allocate N>8 GPUs, i.e. more than a single NVIDIA DGX box.
  3. You don’t have access to a huge dataset, and lack the skills/interest/access to build one, much less to clean and de-duplicate it.
  4. You don’t know how to optimize memory on GPU, and all the Transformer models are GPU-memory bound. That means gradient checkpointing and FP16, much less model parallelism, where your model is so big that a single layer doesn’t fit on the biggest GPU, so the matrix math has to be split between several machines.
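To put rough numbers on the memory point, here is a back-of-the-envelope sketch. It counts the weights only (training also needs gradients, optimizer state and activations, so the real footprint is larger), and the 40 GB GPU is an assumed figure for illustration.

```python
BYTES_FP32 = 4
BYTES_FP16 = 2


def weights_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits_on_gpu(n_params, bytes_per_param, gpu_gb):
    """True if the weights alone fit in the given GPU memory."""
    return weights_gb(n_params, bytes_per_param) <= gpu_gb


GPT3_PARAMS = 175e9       # GPT-3's published parameter count
BERT_BASE_PARAMS = 110e6  # approximate size of BERT-base
GPU_GB = 40               # assumed single-GPU memory, for illustration

gpt3_fp32 = weights_gb(GPT3_PARAMS, BYTES_FP32)
gpt3_fp16 = weights_gb(GPT3_PARAMS, BYTES_FP16)
bert_fp32 = weights_gb(BERT_BASE_PARAMS, BYTES_FP32)
```

FP16 halves the weight footprint, but for the largest models that still leaves hundreds of GB, which is where model parallelism comes in.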

You run into a smaller version of these problems in fine-tuning, hence many practitioners don’t use models that require FP16 or can’t fit in memory on a single GPU. With the biggest models, you can’t even run inference without a pretty sophisticated setup. But that’s beyond the needs of 90%+ of users, even those calling themselves AI experts on LinkedIn or Twitter dot com.

Most applications of deep NLP, submitted to AI conferences, use BERT-base (they should probably be using RoBERTa). The GPT-3 API being a notable exception…

Here’s how it works.

  • pip install transformers from HuggingFace.
  • Call from_pretrained() to get a RoBERTa pretrained checkpoint, or T5 if you’re edgy.
  • Cast your downstream task as a RoBERTa or T5 output, and construct a balanced dataset. Graft something onto the final layer, if you really need to.
  • Trawl the HuggingFace forums for code snippets and advice.
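A hedged sketch of the first three steps, assuming a binary classification task. The helper names (`build_classifier`, `balance`) are made up for illustration, and `roberta-base` is just one checkpoint name on the HuggingFace hub. The `transformers` import lives inside the function so that nothing is downloaded until you actually call it.

```python
import random
from collections import defaultdict


def build_classifier(checkpoint="roberta-base", num_labels=2):
    """Load a pretrained checkpoint with a fresh classification head."""
    # Imported lazily: requires `pip install transformers` and network access.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model


def balance(examples, seed=0):
    """Downsample (text, label) pairs so every label is equally frequent."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    rng.shuffle(balanced)
    return balanced


# Toy data: three negatives, one positive.
toy = [("meh", 0), ("bad", 0), ("awful", 0), ("great", 1)]
toy_balanced = balance(toy)

# Usage (not run here, since it downloads the checkpoint):
# tokenizer, model = build_classifier()
# batch = tokenizer([t for t, _ in toy_balanced], padding=True,
#                   truncation=True, return_tensors="pt")
# logits = model(**batch).logits  # one row of scores per input text
```

Downsampling is the bluntest way to balance a dataset; class weights or oversampling are alternatives when you can't afford to throw data away.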

Since this is still too hard for many people, HuggingFace has started publishing fine-tuned models, both to download and to try out on the web via API. This is probably good enough for most of what you want, unless your task or data is truly unique. HuggingFace also provides ingestions for 400+ public NLP datasets, with formatting and examples. What a country!

So, is this all that I do, in my very important and super secret job — download HuggingFace models and pick the best one? In a sense, yes. You shouldn’t have to do original research or get a PhD, in order to use deep NLP models. We do write some custom model code, but that’s outside the scope, and probably not that interesting to those of you not stalking the HuggingFace forums.

What you need to do, even with all this pretrained firepower just a download away, is focus on the data.

  • Domain specific truth data — at least as a test set.
  • Calibrating your model outputs. Sometimes as important as finetuning.
  • (Maybe) finetune the language model itself — think of it as extra pre-training — if you have enough in-domain data [you don’t — if you have to ask].
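The post doesn't say which calibration method to use. One common option is temperature scaling: divide the logits by a single scalar T, chosen to minimize the negative log-likelihood on held-out data. A minimal sketch, with made-up logits and a simple grid search standing in for a proper optimizer:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on the grid that minimizes held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy held-out set: very confident logits, but only 3 of 4 are right.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
raw_confidence = softmax(val_logits[0])[0]
calibrated_confidence = softmax(val_logits[0], T)[0]
```

Dividing by a positive T never changes which class wins, so accuracy stays the same. Only the confidence moves, here down toward the 75% the model actually earns on this toy set.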

Sliding into predictions, I think we are going to see customization of “general purpose” large NLP models, to more specific domains and tasks. I’m a bit surprised this hasn’t become a focus sooner. I suppose the gains from scaling up the models and datasets were so large, that all the energy went into that, as it should have. Large models have many advantages.

At some point, you won’t want to keep building one huge model that’s good at everything. You may want to over-emphasize a context, which is present but rare in the overall training dataset. This is discussed often in terms of societal bias and impolite speech, and rightfully so. The internet has many other biases. Very little of the content in the giant OpenAI “Web Text” dataset, for example, is about science. Even less is about something niche like chess, math, or machine learning, ironically enough. Almost none of the dataset consists of financial reports, even though some of those are publicly available. Much of the content isn’t dated.

Back to Twitter, and to some extent Reddit. Imagine training a single large model to learn “general language” then customizing it to focus on your chosen community. Or even customizing the model to a single person. You would do so gradually, changing the mix of content as you go along, from random text to content more and more like your target.
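One way to picture "changing the mix as you go" is a sampling schedule. The sketch below is an assumption about how this could be done, not a description of any particular system: the linear ramp, the start and end fractions and the function names are all illustrative.

```python
import random


def in_domain_fraction(step, total_steps, start=0.01, end=0.9):
    """Linearly ramp the share of in-domain text from `start` to `end`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(general, in_domain, step, total_steps, batch_size, rng):
    """Draw a batch whose expected in-domain share follows the schedule."""
    p = in_domain_fraction(step, total_steps)
    return [
        rng.choice(in_domain) if rng.random() < p else rng.choice(general)
        for _ in range(batch_size)
    ]


rng = random.Random(0)
general = [f"web-{i}" for i in range(1000)]  # stand-in for random internet text
niche = [f"niche-{i}" for i in range(10)]    # stand-in for your community

early = sample_batch(general, niche, step=0, total_steps=100,
                     batch_size=200, rng=rng)
late = sample_batch(general, niche, step=100, total_steps=100,
                    batch_size=200, rng=rng)

early_share = sum(x.startswith("niche") for x in early) / len(early)
late_share = sum(x.startswith("niche") for x in late) / len(late)
```

Note the ratio in the toy data: 100 to 1 general versus niche, yet by the end of the schedule most of each batch comes from the niche.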

This may sound like a small change, but the result may not be so tiny. It’s hard to grok 100–1 ratios in volumes of content, much less 10,000–1 ratios, which are pretty common on the internet. It is great to know that the multi-billion parameter LMs are big enough to keep so much global context, and still have room for your niche, without forgetting anything, given the proper training procedure.

Reddit is the “easy” large dataset to organize by niche, and is already used to that effect, especially for chatbots.

Twitter would be better, if well organized. It has all the highest quality temporal content, albeit in short form.

The problem with Twitter data is you’d need to index it to make it useful, and no such index ships with the data. The vast majority of tweets are useless, including no information and getting no engagement. This isn’t obvious as a user, since the tweets you see are over-sampled from the best content, based on social proof and previous interactions.

Twitter will sell you the firehose and historical data, but it’s on you to sort this data, which you don’t own and can’t redistribute without Twitter’s prior approval. Moreover, a lot of the categorizations you’d want are by topic, by clout, etc. These are imprecise, and perhaps could not be offered by a central provider, much less Twitter itself. We are perhaps egalitarian at heart, but not all tweets are created equal. As it stands, I can’t train a model only on “full threads” about biology and tech, with content roughly as high in quality as Balaji’s, measured by reputation or by user impressions. There are public Twitter datasets, but they are nothing like what you’d want. You get all tweets, over a relatively short time, that match a specific keyword or hashtag, perhaps with a minimum number of retweets. It’s hard to describe how much more useful “the best of Twitter” would be as a dataset than this glorified random sample.

One of my few regrets is not trying harder to publish a good Twitter NLP dataset when I was on the Twitter Cortex team. We also trained graph-NN models (very simple — based on semi-random walks) that produced author embedding similarities, accurate down to micro communities — like specific programming languages, and bloggers for specific professional sports teams. It helps that Twitter authors mostly stay in their lane 😜

I’d also like to see NLP models trained to predict subtle and noisy targets. We dabbled with this on multi-emotion classification at NVIDIA — training models to predict elements of Plutchik’s wheel of emotions on “company tweets.”

Plutchik’s wheel of emotions (1979)

You can get pretty high agreement on positive/negative sentiment between human raters. Academic datasets like SST-2 will anyhow throw out any neutral sentiments, or any text with significant disagreement. Hence models are competing to get to 100% accuracy on difficult but clear examples. This satisfies the mathematical mind. Modeling a vague concept like Fear or Anticipation will never get you that kind of clarity, and frankly I don’t think that’s what you necessarily want in real world applications.

On real problems, your model should be making reasonable predictions in murky and ambiguous cases, not only ones with high human consensus.
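One way to train on murky cases instead of discarding them is to keep rater disagreement as a soft target. The sketch below is an illustration under assumptions, not the setup used in the NVIDIA work: the three-emotion subset, the votes and the predicted probabilities are all made up.

```python
import math

EMOTIONS = ["joy", "fear", "anticipation"]  # illustrative subset of the wheel


def soft_labels(rater_votes):
    """Fraction of raters who marked each emotion as present."""
    n = len(rater_votes)
    return [sum(v[i] for v in rater_votes) / n for i in range(len(EMOTIONS))]


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be fractional."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


# Three hypothetical raters on one tweet:
# all see joy, one sees fear, two see anticipation.
votes = [[1, 0, 1], [1, 1, 1], [1, 0, 0]]
targets = soft_labels(votes)

# A model that hedges on the contested emotions...
hedged_loss = bce([0.9, 0.3, 0.7], targets)
# ...versus one that commits hard to the majority vote.
overconfident_loss = bce([0.9, 0.01, 0.99], targets)
```

Under soft targets, the hedging model scores the lower loss, which is the behavior you'd want on ambiguous examples.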

Our Plutchik work was old enough that we compared big LSTMs to Transformers. On some of the harder categories, we noticed that the LSTM failed to learn anything at all during finetuning, while Transformers made significant progress.

These models have a huge reserve of richness and capacity, which we don’t know how to harness quite yet.

Starting to lean into that power, while the underlying technology improves [and HuggingFace integrates the improvements], is why I’m excited about this space for 2021.

What would an end of year post be without awards and predictions?

Paper of the Year: OpenAI’s GPT-3, not close.

You could argue that their performance on SuperGLUE isn’t state of the art, leading to scholarship clickbait comparing well-tuned small models to GPT-3. But who cares.

There’s no question that GPT-3 inspired the (hacker) masses to embrace large pre-trained NLP models. Especially during the first few weeks of the GPT-3 release, when people were lining up for access to the API, and tweeting cool demos with it on various tasks. I’m sure most of the startups raising money based on using GPT-3 as a general purpose AI are silly. You gotta start silly sometimes.

GPT-3 (almost) getting a job at Google
GPT-3 diagnosing (most) of your illnesses

Platform of the Year: HuggingFace Transformers, with shoutouts to PyTorch and PyTorch Lightning.

All three are Python-based open source projects, with 100,000 GitHub stars between them. Thousands of active users, hundreds of non-trivial code contributors, and high output professional teams leading the core development.

This seems to be the way: real community, not just a “mirror” of internal tools used by a large company. But also substantial funding and professional teams owning the project and doing much of the work, without necessarily contributing all of the ideas.

PyTorch Lightning is the newest and most uncertain here, from a deep NLP perspective. We do need a good-enough general purpose training library, which can at least support multi-GPU and Google Colab TPU, save and load models, report metrics, and possibly get deeper into the weeds like low-precision training. Even if the pros keep spinning their own training loops, just as they keep implementing their own language models, it’s nice to have a default standard for training.


Predictions for 2021:

  • More domain customization
  • More and better problem-specific datasets
  • More NLP for genomics and other “language like” problems
  • More “online” computation and simulations
  • Better methods for human-computer collaboration
  • Growing acceptance of “AI” and computer judgement as part of our work, and part of our lives

Most current NLP use cases boil down to classification (perhaps as a regression task, like click prediction), named entity extraction (more generally, content markup) and Q&A (which could be anything, but mostly means information retrieval from a document, and/or from knowledge embedded in the model parameters).

We may see a push toward something more generative, as well. The do-able text-generation problems to me mostly break down into:

  • Summarization
  • Editing and content-preserving rewriting (for style, brevity, grammar)
  • Original content suggestion (such as one line of a poem)

I don’t even consider that AI might write good original content yet, i.e. something a human would choose to read instead of Twitter. But it could help us rewrite existing content, and possibly make small semi-random contributions that a human might find useful when not sure what to say next.

All of these will benefit from context-focused models, raised (or finished — in the cattle sense) on high quality text in the domain of what you’re writing. These will also benefit from attentive humans picking from multiple machine generations, creating training data along the way. OpenAI tried a version of such a system for news summarization.

Models will also improve if we break the GPT-inspired straitjacket of single-shot generation. Compute costs aside, some problems are small enough to solve well with look-ahead, simulation and backtracking, but too big to get right in one shot. Think writing a paragraph or a stanza, as opposed to two words, or the whole novel.

If memory serves me right, Nabokov wrote in his autobiography that he could never write poems that worked right “in pen” the first time — although his father could, said the great writer.

It should be possible to correct your writing as you go along, to generate many continuations and choose the best one — without back-propagating model updates through multiple tokens. Beam search is a poor version of this, and Nucleus Sampling, even if Pareto-dominant over beam search, is still trying to do too much in a single pass.
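For reference, here is roughly what nucleus (top-p) sampling does at a single step: keep the smallest set of top tokens whose probability mass reaches p, renormalize, and sample from that. The next-token distribution below is made up, and a real decoder would repeat this for every generated token.

```python
import random


def nucleus(probs, top_p=0.9):
    """Keep the smallest top-ranked set with mass >= top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution.
next_token = {"crab": 0.5, "bitcoin": 0.3, "miami": 0.15, "zzz": 0.05}
filtered = nucleus(next_token, top_p=0.9)
draws = [sample(filtered, random.Random(i)) for i in range(20)]
```

The low-probability tail ("zzz") is cut off before sampling, but every choice is still made once, token by token, with no chance to go back and revise.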

Sometimes you just have to roll up your sleeves, generate 10,000 possible summaries, use a detached model to pick the best ten, with as much edit distance as possible. Then ask the humans to rank them and explain why. Eventually it becomes more of a GAN-type process with a generator and discriminator. We need a joke-GAN. Maybe.
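A toy version of the generate-then-pick step, with four candidates instead of 10,000. The scorer is a stand-in for the detached ranking model (here it just prefers shorter summaries), and the greedy selection with a minimum edit distance is one simple way to force variety, assumed for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def pick_diverse(candidates, score, k, min_distance):
    """Walk candidates from best score down, skipping near-duplicates."""
    chosen = []
    for cand in sorted(candidates, key=score, reverse=True):
        words = cand.split()
        if all(edit_distance(words, c.split()) >= min_distance for c in chosen):
            chosen.append(cand)
        if len(chosen) == k:
            break
    return chosen


def toy_score(summary):
    """Stand-in for a separately trained ranking model: shorter is better."""
    return -len(summary.split())


candidates = [
    "twitter has the best text",
    "twitter has the best content",
    "the best text is on twitter",
    "reddit is easier to organize",
]
top = pick_diverse(candidates, toy_score, k=2, min_distance=2)
```

The second candidate differs from the first by a single word, so it gets skipped in favor of something that actually says something different. The survivors are what you would hand to the human rankers.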

Russian in the loop

Coming from the games AI world, we expect simulated state-evaluation functions with multiple rollouts to beat their own single shot versions. Nobody would expect a chess, Go or poker program to give you the best answers without simulation. I’m surprised this doesn’t get more attention in the deep NLP world.

Last week I was at the great American institution, Walmart. A shock for us city folk.

Ringing up my meats and spices through the self-checkout line, the clerk overseeing those machines comments on how “the AI can’t keep up with what people want it to do.” This is what the word means now, my friends. It’s also a good development, I believe.

The self-checkout computer at Walmart is excellent — much better than Safeway’s, which seemed to be designed by someone who’d never been to a grocery store.

Normal people expect computer “AI” to make micro decisions and to handle more and more routine rich-data use cases, with less human intervention. Extrapolating a bit, this means not just computer vision (CV) systems to read barcodes and recognize broccoli, but something like good-enough NLP integration as well, both for printed text and for voice commands. Voice at the moment is mostly a hands-free command line, but even for that to work correctly, it helps if your speech-to-text system is grounded in something resembling common sense.

That common sense will come from NLP, since most human knowledge is encoded in text form, especially if you want to finetune the common sense to match your industry or use case. Unless you get your common sense largely through memes. [Another downside to the Reddit dataset.]

Lastly, I’m excited to see NLP work and the HuggingFace project start being used for non-text purposes, like chemistry and biology problems. As usual I’m talking my book, having spent a chunk of my time at NVIDIA learning genomics and contributing to a nascent deep genomics effort. DNA/RNA is more than just a sequence of letters, but if we knew what those sequences control, it would unlock more mysteries than we can probably imagine. Consider problems like how DNA methylation influences aging.

The thing that will crack our genetic codes, so to speak, won’t be a Transformer based NLP model exactly. But if you had a ton of symbol sequence data, and needed a toolset for how to train models on it — wouldn’t this be a good toolset to consider? Add to that an improving ability to generate coherent synthetic sequences — for data augmentation or even for IRL experimentation. Wouldn’t you think about using a Transformer based deep NLP model? I might.
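To make "symbol sequence data" concrete: one common way to feed DNA to an NLP-style model is to split it into overlapping k-mers and treat each k-mer as a token. This is an illustrative assumption, not a claim about any specific genomics pipeline, and the function names are made up.

```python
from itertools import product


def kmer_tokens(sequence, k=3):
    """Slide a window of width k over the sequence, one base at a time."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


vocab = build_vocab(k=3)             # 4**3 = 64 possible 3-mers
tokens = kmer_tokens("ACGTAC", k=3)  # ['ACG', 'CGT', 'GTA', 'TAC']
ids = [vocab[t] for t in tokens]     # [6, 27, 44, 49]
```

From here the ids can go through the same masking and training recipe as word tokens. Whether k-mers are the right unit for a given biological question is a separate matter.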

Oh, and put that on the blockchain.

Happy New Year everyone! 🎉



Nikolai Yakovenko

AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami