The best end-of-year posts and predictions for 2021 — the future of tech, AI, Miami, etc.

  • Companies don’t see the benefits of streaming: their systems aren’t at scale, and their applications don’t benefit.
  • Applications might benefit, but they don’t know because they have never tried online predictions.
  • This requires a high initial infrastructure investment — and a mentality shift, away from batching, for example.
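The batch-to-online shift above can be made concrete with a toy sketch. This is an illustration only, with hypothetical feature names: in batch mode you score a whole table offline and cache the results, while in online mode you assemble features at request time, so fresh signals (like activity in the last few minutes) can influence the prediction.

```python
from typing import Callable, Dict, List

def batch_predict(model: Callable[[Dict], float], rows: List[Dict]) -> List[float]:
    """Batch mode: score a whole table offline (e.g. nightly) and cache the results."""
    return [model(row) for row in rows]

def online_predict(model: Callable[[Dict], float], request: Dict, live: Dict) -> float:
    """Online mode: assemble features at request time, so fresh signals can be used."""
    return model({**request, **live})  # `live` features don't exist at batch time

# Hypothetical toy model: recent clicks (known only at request time) dominate the score.
toy_model = lambda f: 0.1 * f.get("age_days", 0) + 1.0 * f.get("clicks_last_5min", 0)

cached = batch_predict(toy_model, [{"age_days": 3}, {"age_days": 7}])
fresh = online_predict(toy_model, {"age_days": 3}, {"clicks_last_5min": 4})
```

The gap between `cached` and `fresh` for the same user is exactly the signal that a batch system can never see, which is why trying online prediction is the only way to learn whether your application benefits.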
  • DALL·E — a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions, using a large dataset of text–image pairs.
  • CLIP — an image-classification model trained entirely on web captions — which generalizes to datasets and tasks outside of its training distribution
  • Scale models and datasets — bigger than anyone else
  • Clean the data — better than anyone has ever done for massive datasets
  • Use relatively simple training schedules and loss functions, albeit with some magic around clipping and regularization, so that the model remains stable during training [this is not easy, as anyone training multi-billion-parameter models can assure you]
  • Train across large clusters of GPU machines — with model parallelism
  • Leverage sparsity and custom CUDA kernels where possible — it helps when Scott Gray is on your team
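The "magic around clipping" in the recipe above typically means something like global gradient-norm clipping, which keeps a single bad batch from blowing up a large model. The exact tricks used are not described in the post; this is just a minimal NumPy sketch of the standard technique, with made-up gradient values.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is <= max_norm."""
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (global_norm + 1e-6))  # small eps guards against zero norm
    return [g * scale for g in grads], global_norm

spiky = [np.full((2, 2), 10.0), np.full(3, 10.0)]   # a pathological "spiky" gradient step
clipped, norm_before = clip_by_global_norm(spiky, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Note that clipping is applied to the global norm across all parameter groups, not per tensor, so the direction of the update is preserved and only its magnitude is capped.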
Bitter Lesson for ML Researchers
https://twitter.com/hardmaru/status/1350285435830878211?s=20
  • Disentangled attention — splitting token embeddings and relative position embeddings into separate calculations [and greatly speeding up the latter]
  • Adding absolute position embeddings to the model — at the end of the attention layers, instead of at the beginning
  • Parameter re-use — as in ALBERT but more nuanced
  • Increasing the vocabulary to 128k tokens and training with SiFT — although it’s unclear how much these changes matter
DeBERTa converges faster in training than RoBERTa — perhaps the most impressive result
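The disentangled attention bullet can be unpacked with a small sketch: each attention score decomposes into a content-to-content term plus two terms that mix content with relative-position embeddings. This is a simplified NumPy illustration, not DeBERTa's actual implementation (the paper's version is vectorized, and details like bucketing are abbreviated here).

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, span = 4, 8, 4  # sequence length, head dimension, max relative distance

# Content projections and shared relative-position projections (one per distance bucket).
Qc, Kc = rng.standard_normal((L, d)), rng.standard_normal((L, d))
Qr, Kr = rng.standard_normal((2 * span, d)), rng.standard_normal((2 * span, d))

def rel_idx(i, j):
    """Bucket the relative distance j - i into [0, 2*span)."""
    return int(np.clip(j - i + span, 0, 2 * span - 1))

A = np.zeros((L, L))
for i in range(L):
    for j in range(L):
        c2c = Qc[i] @ Kc[j]              # content-to-content
        c2p = Qc[i] @ Kr[rel_idx(i, j)]  # content-to-position
        p2c = Kc[j] @ Qr[rel_idx(j, i)]  # position-to-content
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)  # scaled for the three terms

weights = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)  # row-wise softmax
```

Because the relative-position projections depend only on the distance bucket, not the token, they can be computed once and reused across the sequence — which is the speedup the bullet refers to.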
  • Each token is routed to just one expert (this was 2+ experts before, so that the problem could be fully differentiable)
  • If an expert’s batch is full, the overflow input is simply passed along unchanged (the full model consists of multiple Switch Transformer layers)
  • A single auxiliary loss is used to balance the experts, which is stable across many numbers of experts
  • With big enough batches of inputs, everything evens out well enough — not much overflow or underflow of experts takes place, nor does occasional overflow hurt the model
  • Switch Transformers appear to get pretty small gains (in speed or memory) from moving from float32 to bfloat16 — and the authors were not able to move all of the model to low precision, only parts of it.
  • It’s been suggested that this will be the last major “Noam Shazeer project” to be done in Mesh-TensorFlow: another impressive but complicated technology that is not used much outside of Google, or perhaps even within Google outside of Noam.
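The top-1 routing, overflow pass-through, and load-balancing loss described in the bullets above can be sketched in a few lines of NumPy. This is a rough single-layer illustration under simplifying assumptions (no residual connection, sequential dispatch, toy expert functions), not the paper's distributed implementation.

```python
import numpy as np

def switch_route(x, router_w, expert_fns, capacity):
    """Top-1 routing: each token goes to its argmax expert; once an expert's
    batch is full, overflow tokens are simply passed through unchanged."""
    logits = x @ router_w                                   # (tokens, experts)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    choice = probs.argmax(-1)                               # top-1 expert per token
    out = x.copy()                                          # default: pass through
    load = np.zeros(router_w.shape[1], dtype=int)
    for t, e in enumerate(choice):
        if load[e] < capacity:                              # expert still has room
            out[t] = probs[t, e] * expert_fns[e](x[t])      # gate-scaled expert output
            load[e] += 1
    # Auxiliary load-balancing loss: fraction of tokens routed to each expert,
    # dotted with the mean router probability per expert, scaled by num experts.
    frac = np.bincount(choice, minlength=router_w.shape[1]) / len(choice)
    aux_loss = router_w.shape[1] * float(frac @ probs.mean(0))
    return out, aux_loss

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))                             # 6 tokens, dim 4
router_w = rng.standard_normal((4, 2))                      # 2 experts
experts = [lambda v: 2.0 * v, lambda v: 0.5 * v]            # toy expert functions
out, aux_loss = switch_route(x, router_w, experts, capacity=2)
```

With 6 tokens, 2 experts, and a capacity of 2, at least two tokens must overflow and pass through unchanged — which is why, as the bullets note, big enough batches are needed for the load to even out.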

Nikolai Yakovenko


AI (deep learning) researcher. Moscow → NYC → Bay Area → Miami