The best end-of-year posts and predictions for 2021 — future of Tech, AI, Miami, etc

  • Companies don’t see the benefits of streaming: their systems aren’t at scale, and their applications don’t benefit.
  • Applications might benefit, but they don’t know because they have never tried online predictions.
  • This requires a high initial infrastructure investment — and a mentality shift, away from batching, for example.
  • DALL·E — a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions, using a large dataset of text–image pairs.
  • CLIP — an image-classification model trained entirely on web captions — which generalizes to datasets and tasks outside of its training distribution
  • Scale models and datasets — bigger than anyone else
  • Clean the data — better than anyone has ever done for massive datasets
  • Use relatively simple training schedules and loss functions, albeit with some magic around clipping and regularization, so that the model remains stable during training [this is not easy, as anyone training multi-billion parameter models can assure you]
  • Train across a large cluster of GPU machines — with model parallelism
  • Leverage sparsity and custom CUDA kernels where possible — it helps when Scott Gray is on your team
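The "magic around clipping" mentioned above usually means global-norm gradient clipping, which keeps huge models stable by rescaling all gradients together when their combined norm spikes. A minimal NumPy sketch (the function name and tolerance are my own, not from any of the papers):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is <= max_norm.

    Computes one global norm across every tensor, then applies a single
    shared scale factor — unlike per-tensor clipping, relative gradient
    directions are preserved.
    """
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-6))
    return [g * scale for g in grads], global_norm
```

If the global norm is already below `max_norm`, the scale factor is 1 and the gradients pass through unchanged.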
Bitter Lesson for ML Researchers
  • Disentangled attention — splitting token embeddings and relative position embeddings into separate calculations [and greatly speeding up the latter]
  • Adding absolute position embeddings to the model — at the end of the attention layers, instead of at the beginning
  • Parameter re-use — as in ALBERT but more nuanced
  • Increasing the vocabulary to 128k tokens and training with SiFT — although it’s unclear how much these changes matter
DeBERTa training speeds up convergence over RoBERTa — perhaps the most impressive result
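To make the disentangled-attention idea concrete, here is a simplified NumPy sketch of how the attention score decomposes into content-to-content, content-to-position, and position-to-content terms. All names are my own, the indexing convention is simplified relative to the DeBERTa paper, and this omits multi-head splitting entirely:

```python
import numpy as np

def disentangled_scores(H, P, Wq, Wk, Wqr, Wkr, k):
    """Sketch of DeBERTa-style disentangled attention scores.

    H: (n, d) content embeddings; P: (2k, d) relative-position embeddings,
    one row per clipped relative distance in [-k, k).
    """
    n, d = H.shape
    Qc, Kc = H @ Wq, H @ Wk   # content projections
    Qr, Kr = P @ Wqr, P @ Wkr  # position projections (shared across positions)

    # Map each (i, j) pair to a clipped relative-distance index in [0, 2k)
    idx = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                   # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, idx, 1)       # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, idx.T, 1).T   # position-to-content
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

The speedup comes from the position terms: `Qc @ Kr.T` is only `(n, 2k)` rather than `(n, n)`, since every pair at the same relative distance shares one position embedding.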
  • Each token is routed to just one expert (earlier mixture-of-experts work routed each token to 2+ experts so that the routing problem would remain fully differentiable)
  • If an expert’s batch is full, the overflow input is simply passed along unchanged (the full model consists of multiple Switch Transformer layers)
  • A single auxiliary loss is used to balance the experts, which is stable across many numbers of experts
  • With big enough batches of inputs, everything evens out well enough — not much overflow or underflow of experts takes place, nor does occasional overflow hurt the model
  • Switch Transformers appear to get only small gains (in speed or memory) from moving from float32 to bfloat16 — and the authors were not able to move the whole model to low precision, only parts of it.
  • It’s been suggested that this will be the last major “Noam Shazeer project” to be done in Mesh-Tensorflow. Another impressive but complicated technology that is not used much outside of Google, or perhaps even within Google, outside of Noam.
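The top-1 routing, capacity overflow, and auxiliary load-balancing loss described above can be sketched in a few lines of NumPy. This is a toy single-layer version under my own naming and with a trivial placeholder in place of the real expert FFNs:

```python
import numpy as np

def switch_route(x, router_w, capacity):
    """Toy Switch Transformer routing: top-1 expert per token.

    x: (tokens, d); router_w: (d, n_experts). Tokens beyond an expert's
    capacity simply pass through unchanged.
    """
    logits = x @ router_w
    probs = np.exp(logits - logits.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    expert = probs.argmax(1)                 # each token's single expert

    n_experts = router_w.shape[1]
    out = x.copy()                           # overflow tokens pass through
    counts = np.zeros(n_experts)
    for i, e in enumerate(expert):
        if counts[e] < capacity:
            counts[e] += 1
            # Placeholder "expert": scale by the gate probability;
            # the real model applies an expert feed-forward network here.
            out[i] = probs[i, e] * x[i]

    # Auxiliary load-balancing loss: n_experts * sum_e f_e * P_e, where
    # f_e = fraction of tokens routed to expert e, and
    # P_e = mean router probability assigned to expert e.
    f = np.bincount(expert, minlength=n_experts) / len(expert)
    P = probs.mean(0)
    aux_loss = n_experts * float(f @ P)
    return out, expert, aux_loss
```

The auxiliary loss is minimized (at 1.0) when tokens and router probabilities are spread uniformly across experts, which is what keeps top-1 routing balanced despite being non-differentiable.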



Nikolai Yakovenko

AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami