Mo’ data, mo’ problems: deep learning NLP into 2021

The best text content is all on Twitter.
  • Collect (or stream) a huge volume of (coherent) text from the internet.
  • Tokenize the text with a word/character scheme — details not important.
  • Mask part of the text, and train a model to predict the masked text from the remaining context.
  • [Generative models like GPT-3 are a special case of masking the last token.]
  • Once your model “converges” on the masked language modeling task, you can fine-tune it on tasks you actually care about, such as summarization, sentiment prediction, entity extraction, or something custom and secret that probably can’t be shared without an NDA. My favorite pipe dream is a personalized email editor. God knows we ramble on for too long.
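To make the masking step concrete, here is a minimal sketch of how training examples could be built. It is not any particular library's implementation: whitespace splitting stands in for a real tokenizer, and the `<mask>` symbol, the 15% masking rate, and the function names are all assumptions chosen for illustration.

```python
import random

MASK = "<mask>"  # placeholder symbol (assumption); real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def mask_last_token(tokens):
    """Generative special case: hide only the final token."""
    return tokens[:-1] + [MASK], {len(tokens) - 1: tokens[-1]}


# Whitespace tokenization is a stand-in for a real word/character scheme.
tokens = "the best text content is all on twitter".split()
print(mask_tokens(tokens))
print(mask_last_token(tokens))
```

The model sees `masked_tokens` and is scored only on the positions listed in `targets`. The generative case is the same setup with exactly one target, always at the end.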
Deep NLP model number of parameters
  1. You don’t want to spend the GPU credits.
  2. You don’t want to write multi-GPU training scripts, much less allocate N>8 GPUs, i.e. more than fits in a single 8-GPU NVIDIA DGX box.
  3. You don’t have access to a huge dataset, and lack the skills/interest/access to build one, much less to clean and de-duplicate it.
  4. You don’t know how to optimize memory on GPU, and all the Transformer models are GPU-memory bound. The options are gradient checkpointing, FP16, or even model parallelism, where your model is so big that a single layer doesn’t fit on the biggest GPU, so the matrix math has to be split between several machines.
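Some back-of-the-envelope arithmetic shows why memory is the binding constraint. The sketch below counts only weights, gradients, and two Adam moment buffers. Activations are ignored, so real training needs more. The function names are hypothetical, and the parameter counts (roughly 175 billion for GPT-3, roughly 355 million for RoBERTa-large) are the commonly published figures, not numbers from this post.

```python
def weights_memory_gb(n_params, bytes_per_value):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_value / 1e9


def training_memory_gb(n_params, bytes_per_value=4, copies=4):
    """Rough training footprint in GB.

    copies=4 assumes weights + gradients + two Adam moment buffers,
    all at the same precision. Activations are NOT included.
    """
    return n_params * bytes_per_value * copies / 1e9


# GPT-3-sized weights stored in FP16 (2 bytes per value)
print(weights_memory_gb(175e9, 2))   # 350.0

# RoBERTa-large-sized model trained in FP32 with Adam
print(training_memory_gb(355e6))     # 5.68
```

Even before activations, 350 GB of FP16 weights is far more than a single GPU holds, which is where model parallelism comes in. The RoBERTa-sized number fits on one card, so there the fight is over batch size and sequence length instead.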
  • pip install transformers from HuggingFace.
  • Call from_pretrained() to get a RoBERTa pretrained checkpoint, or T5 if you’re edgy.
  • Cast your downstream task as a RoBERTa or T5 output, and construct a balanced dataset. Graft something onto the final layer, if you really need to.
  • Trawl the HuggingFace forums for code snippets and advice.
  • Domain specific truth data — at least as a test set.
  • Calibrating your model outputs. Sometimes as important as finetuning.
  • (Maybe) finetune the language model itself — think of it as extra pre-training — if you have enough in-domain data [you don’t — if you have to ask].
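Calibration is easier to picture with an example. One common technique is temperature scaling: divide the logits by a single scalar T, chosen to minimize negative log-likelihood on held-out data. The sketch below is an illustration under assumptions, not necessarily the method intended here. The toy logits, the grid of candidate temperatures, and the function names are all made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Grid-search the temperature that minimizes held-out NLL."""
    if candidates is None:
        candidates = [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy held-out set: very confident logits, but one of four is wrong.
val_logits = [[8.0, 0.0], [7.0, 0.0], [0.0, 9.0], [6.0, 0.0]]
val_labels = [0, 0, 1, 1]

t = fit_temperature(val_logits, val_labels)
print("fitted temperature:", t)
print("before:", softmax(val_logits[0]))
print("after: ", softmax(val_logits[0], t))
```

Dividing by T never changes which class wins, so accuracy stays the same. Only the confidence moves, and an overconfident model like this toy one gets a fitted T above 1.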
Plutchik’s wheel of emotions (1979)
GPT-3 (almost) getting a job at Google
GPT-3 diagnosing (most of) your illnesses
  • More domain customization
  • More and better problem-specific datasets
  • More NLP for genomics and other “language like” problems
  • More “online” computation and simulations
  • Better methods for human-computer collaboration
  • Growing acceptance of “AI” and computer judgement as part of our work, and part of our lives
  • Summarization
  • Editing and content-preserving rewriting (for style, brevity, grammar)
  • Original content suggestion (such as one line of a poem)
Russian in the loop



Nikolai Yakovenko


AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami