The best end-of-year posts and predictions for 2021 — future of Tech, AI, Miami, etc

Nikolai Yakovenko
20 min read · Jan 18, 2021


I’ve been reading a lot lately. Maybe it’s the cold; certainly part of it is procrastination. As my friend Kyle Boddy likes to remind me I’ve been known to say — the only things that count as “work” are coding, writing technical reports, and giving presentations. Everything else is not real work.

In an effort to avoid real work — anything that might lead to a git commit — I’ve been training jiu jitsu, watching jiu jitsu videos, and reading a lot of good stuff. Both blogs/Substacks and AI papers — at times with a poker game on the iPad in the background. It’s amazing what a good read will do to prevent you from playing too many bad hands. Sorry guys — I’m not a nit, just got something better to read than playing this KJ offsuit under the gun. Oh, it’s on me? My bad.

I also went skiing in New Hampshire. And survived a skidding, harrowing drive back to New York. Don’t drive in active snowfall — especially without 4WD and snow tires, believe me! And check the forecast for your entire drive. Now I know why boomers watch the Weather Channel.

Allow me to justify some of that procrastination time by sharing the cool ideas I’ve read over the past month or so. I can’t believe it’s been almost a month since Christmas!

It’s been a great month for ideas, and I’m looking forward to some of these coming true. If you will it, it is no dream!

Happy Saturnalia everyone!

The first post is from Chip Huyen — a former colleague at NVIDIA, and now at Snorkel AI. Chip maintains an excellent ML blog, often drawing on conversations with practitioners across the industry. Everyone who’s been around her remembers that energy — asking questions, wanting to know what people are doing, what might be next.

This post is about “ML going real-time.” It makes a few interesting points. Mostly that ML practitioners are not thinking about their models running live. Models are trained statically, deployed, and often not re-trained on a regular schedule, much less in real time.

When I was at Twitter, some of the ML pipelines attempted to train with continuous learning. Let’s just say it was not a great success, and leave it at that.

In general, Chip is right. Training models “live” or in “streaming” mode requires a rethink. Summarizing from the article:

  • Companies don’t see the benefits of streaming: their systems aren’t at scale, and their applications don’t benefit.
  • Applications might benefit, but companies don’t know because they have never tried online predictions.
  • This requires a high initial infrastructure investment — and a mentality shift, away from batching, for example.

If you’re interested in some of the specifics, check out the piece. Chip is definitely right that in some cases, a stale model will become very stale, very quickly [unlike this blog, which is fresh as a daisy] — and this could be fixed with a little bit of online learning.

Up-front engineering costs aside, why not throw some compute at it? Your data starts getting stale as soon as it hits the lake.
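To make “a little bit of online learning” concrete, here is a minimal sketch. This isn’t Chip’s pipeline or anything Twitter ran, just an illustration of serving and updating a simple linear model on a stream with scikit-learn’s partial_fit; the event stream and features are made-up placeholders.

```python
# Minimal sketch of incremental ("online") updates to a simple linear model.
# get_event_stream() is a hypothetical placeholder for a real event stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()            # a linear model that supports incremental training
classes = np.array([0, 1])         # all labels must be declared up front for partial_fit

def get_event_stream():
    """Hypothetical stand-in for a stream of (features, label) events."""
    rng = np.random.default_rng(0)
    while True:
        x = rng.normal(size=(1, 16))       # one feature vector per event
        y = np.array([int(x[0, 0] > 0)])   # toy label
        yield x, y

for i, (x, y) in enumerate(get_event_stream()):
    if i > 0:
        _ = model.predict(x)                     # serve with the current weights
    model.partial_fit(x, y, classes=classes)     # then fold the new example in
    if i >= 1000:
        break
```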

Online learning is crucial for systems to adapt to rare events. Consider online shopping on Black Friday. Because Black Friday happens only once a year, there’s no way Amazon or other ecommerce sites can get enough historical data to learn how users are going to behave that day, so their systems need to continually learn on that day to adapt.

Or consider Twitter search when someone famous tweets something stupid. For example, as soon as the news about “Four Seasons Total Landscaping” went live, many people were going to search “total landscaping”. If your system doesn’t immediately learn that “total landscaping” here refers to the press conference, your users are going to get a lot of gardening recommendations.

Online learning can also help with the cold start problem. A user just joined your app and you have no information on them yet. If you don’t have the capacity for any form of online learning, you’ll have to serve your users generic recommendations until the next time your model is trained offline.

All great examples. Going back to Twitter — I remember the Ads team trained a big Word2Vec model for something Ads-related. Good idea! The model was trained pretty often, maybe weekly or so. It trained on a large number of fresh tweets… but the sampling was over something like the past six months. You see the problem, right?

Not only would the model not have good context for whoever just won this year’s Super Bowl, or for what is hot this year for Black Friday. It had almost no context at all for last year’s Super Bowl. Of course it had some context… but it’s hard to over-estimate how seasonal and event-driven Twitter discussion can be. It would have been prudent for us to sample tweets over 13 months, and also to augment with some real-time training.
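For what it’s worth, even a crude recency bias in the sampling would have helped. A toy sketch of what I mean, with hypothetical tweet records and a made-up half-life, nothing resembling the actual pipeline:

```python
# Toy recency-weighted sampling over a ~13-month window of tweets.
# Exponential decay keeps fresh tweets dominant while last year's events still appear.
import random
from datetime import datetime, timedelta

now = datetime(2021, 1, 18)
tweets = [{"text": f"tweet {i}",
           "created_at": now - timedelta(days=random.uniform(0, 395))}
          for i in range(100_000)]                 # hypothetical tweet records

HALF_LIFE_DAYS = 30.0                              # assumption: a tweet's weight halves every month

def weight(tweet):
    age_days = (now - tweet["created_at"]).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

sample = random.choices(tweets, weights=[weight(t) for t in tweets], k=10_000)
```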

Chip had another good post — an update of her overview of ML tools. Amazing how many tools are out there — hundreds of significant companies, many of them startups. [Keep in mind that “ML expert” you just hired probably only knows a few of these ML tools, and has used even fewer. Shoot, he probably hasn’t even read Chip’s blog!]

A worthy read, with many good points. Both this and the previous piece also get into the divergence between deployed ML in the US and in China. From the first piece:

I’ve read a lot about the AI race between the US and China, but most comparisons seem to focus on the number of research papers, patents, citations, funding. Only after I’ve started talking to both American and Chinese companies about real-time machine learning that I noticed a staggering difference in their MLOps infrastructures.

Few American Internet companies have attempted online learning, and even among these companies, online learning is used for simple models such as logistic regression. My impression from both talking directly to Chinese companies and talking with people who have worked with companies in both countries is that online learning is more common in China, and Chinese engineers are more eager to make the jump.

Read more, if you are interested in any of that. And you should be!

Going in a different direction, I really enjoyed Eli Dourado’s predictions for the next decade. Many others have said as much — if you follow “intellectual tech Twitter” you’ve probably seen people raving about the piece.

All the biotech stuff is very cool. DNA sequencing for the COVID vaccine, continued advances in CRISPR, and especially getting into some of the cool anti-aging stuff that might be around the corner.

Therapeutic plasma exchange is FDA-approved (not for aging, but for a bunch of other conditions). I imagine there remain prohibitions on advertising that it can add years to your life, but it is safe, and a doctor can prescribe it off label. It’s also cheap. An automated plasmapheresis machine — which lets you do treatment after treatment — can be bought online for under $3,000. That is less than the cost of a single transfusion of young blood sold by the startup Ambrosia. How long until someone opens a clinic offering plasma dilution? I bet someone tries it in 2021. If it works, people will get over the weirdness, and it could be commonplace by 2030.

Eli does not go too far down the aging-reversal rabbit hole. Balaji S. Srinivasan did a great podcast episode recently that goes deeper, including suggesting that basically everyone will be optimizing their hormones soon, that you should be able to take a pill that simulates good diet and exercise — and that Lance Armstrong’s team deserves a Nobel Prize in medicine.

I agree broadly that many people will opt for hormone therapy to help them get out of bad metabolic health. Once that becomes commonplace, we will wonder why we went so long without — and why are those guys not in the baseball Hall of Fame, again? The same goes for some of the more fringe-sounding treatments mentioned in Eli’s blog.

I’m intrigued — but a lot more skeptical — about the Space section. Cargo costs dropping by 100-1,000x would make a big difference. I’m always one to say that new technology is exciting not for what it does to make us better at what we already do now, but for the things we could not imagine with previous methods. That’s been my pitch for AI and DL for years, and it’s always my answer for what I’m most excited about.

Starship promises to take this trend much further. On Falcon 9, only the first stage is reusable, whereas on Starship, the entire system — both the booster and the space vehicle — is reusable. Starship runs on dirt cheap liquid methane instead of expensive rocket fuel. It is made out of stainless steel instead of more expensive traditional aerospace materials. SpaceX is talking about churning out Starships at a rate of one every 72 hours for a cost of $5 million each. Operating costs come down with a high flight rate, so Elon is figuring a $1.5-million fully burdened launch cost for 150 tons to LEO. That is $10/kg, more than 100 times cheaper than a Falcon 9 launch today.

It gets even more insane. Because Starship is designed to be refuelable on orbit, its 150-ton payload capacity to LEO equals its payload capacity to anywhere in the solar system. You will be able to launch 150 tons to LEO, load up on fuel while orbiting Earth, and then fly the same payload the rest of the way to the moons of Jupiter. The whole thing could cost less than one Falcon 9 launch — which is limited to 15 tons to LEO in a reusable configuration or 4 tons to Mars in an expendable configuration.

Let’s apply the gravity model of trade once more, this time to commerce between Earth and LEO. Meta-analyses have found that trade (on Earth) is roughly inverse-linear in transport costs. If that holds for space, a 200x cost reduction in travel between Earth and LEO should increase “trade” between Earth and LEO by 200x. Commerce between the Earth and the moon, or between the Earth and Mars, starting from a base close to zero, would be stimulated even more.

It’s worth noting a second-order effect of cheap launch costs. When launch is expensive, more engineering has to go into the payload to ensure reliability. You don’t want to spend $1.8 billion on launch, and then find out, as NASA did with the Hubble Space Telescope, that your new satellite needs repairs. This dynamic has caused over-engineering of space payloads. With launch for a new low price of $10–20/kg, companies and research agencies will be able to reduce engineering expenses by simply taking on the risk of paying for another (cheap) launch.
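Part of the appeal is that the arithmetic in those quotes is easy to sanity-check. A quick back-of-the-envelope pass, using only the numbers quoted above:

```python
# Back-of-the-envelope check on the quoted Starship numbers.
launch_cost = 1.5e6          # dollars, the quoted fully burdened launch cost
payload_kg = 150 * 1000      # 150 (metric) tons to LEO

cost_per_kg = launch_cost / payload_kg
print(cost_per_kg)           # 10.0 dollars per kg, as quoted

# If trade is roughly inverse-linear in transport costs (the gravity-model claim),
# a ~200x cost reduction implies roughly 200x more Earth-to-LEO "trade".
implied_current_cost = cost_per_kg * 200
print(implied_current_cost)  # ~$2,000/kg implied for today's baseline
```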

But like much of the rest of the predictions — this one depends on Elon Musk. Who, frankly, great as he is, talks a lot of smack, and makes a lot of promises about things that never ship [or slip for years, like Tesla’s FSD]. If anyone from $TSLAQ is reading, help with a good list of Elon’s whoppers please.

Either way, a great read. Fascinating predictions, and you’ll learn about a lot of cool technologies, with links but also explained, rather compactly. I like the idea that we’ll break out of this “Great Stagnation” through amazing technology. I hope Eli is right, and I don’t think that’s a crazy prediction over the next decade. This world belongs to the optimists. Sorry $TSLAQ.

Shortly after New Year, OpenAI released a pair of cool new models.

  • DALL·E — a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a large dataset of text–image pairs.
  • CLIP — an image-classification model trained entirely on web captions — which generalizes to datasets and tasks outside its training distribution
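For a sense of what CLIP’s zero-shot classification looks like in practice, here is a minimal sketch using the code OpenAI released (the openai/CLIP repo); the image path and the candidate labels are placeholders:

```python
# Minimal zero-shot classification with CLIP (github.com/openai/CLIP).
# "photo.jpg" and the candidate labels are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog",
          "a photo of a cat",
          "a press conference at Four Seasons Total Landscaping"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)     # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1)     # zero-shot class probabilities

print(dict(zip(labels, probs[0].tolist())))
```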

Both works, especially the generative model, got a lot of attention on Twitter, with users trying out its capabilities for all kinds of “not quite open” — but pretty wide-ranging — generations.

Generative work inspires the imagination — images even more so than text [although generating coherent images is easier than text, in a sense — your eye will fill in the missing details]. Good to see OpenAI is still releasing something amazing every few months, giving the rest of us envy, FOMO, and a 40-page paper to read.

Method-wise, it looks like, once again, the OpenAI recipe remains:

  • Scale models and datasets — bigger than anyone
  • Clean the data — better than anyone has ever done for massive datasets
  • Use relatively simple training schedules and loss functions, albeit with some magic around clipping and regularization, so that the model remains stable during training [this is not easy, as anyone training multi-billion parameter models can assure you]
  • Train across large clusters of GPU machines — with model parallelism
  • Leverage sparsity and custom CUDA kernels where possible — it helps when Scott Gray is on your team

Putting it simply, “quantity has a quality all its own,” to quote Joseph Stalin.

Bitter Lesson for ML Researchers
https://twitter.com/hardmaru/status/1350285435830878211?s=20

Closer to what I spend my time on — deep NLP — two significant new papers came out as well. Both of them supersede (or at least match) benchmarks set by Google’s T5 — the best Deep NLP model for over a year, according to both my opinion and the SuperGLUE leaderboard.

Well, perhaps no longer.

Microsoft’s DeBERTa edged out the updated submission from T5. I would not say DeBERTa is better than T5, but it does offer a nice alternative, as well as some significant improvements to the standard BERT model (or RoBERTa, which is better than BERT).

The main trick is how position and content embeddings are “disentangled” in the attention operation. This has been done before, in the XLNet paper for example, but DeBERTa offers a fresh take on the problem. In their SuperGLUE-leading model, they also add extra tricks, which are unfortunately a bit confusing to follow — and they do not include full ablation studies. The paper and SuperGLUE submission were clearly rushed. As best I can understand, the main improvements on BERT were:

  • Disentangled attention — splitting token embeddings and relative position embeddings into separate calculations [and greatly speeding up the latter; a rough sketch follows this list]
  • Adding absolute position embeddings to the model — at the end of the attention layers, instead of at the beginning
  • Parameter re-use — as in ALBERT but more nuanced
  • Increasing the vocabulary to 128k tokens and training with SiFT — although it’s unclear how much these changes matter
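Roughly, the disentangled score between a query at position i and a key at position j is a sum of three terms: content-to-content, content-to-relative-position, and relative-position-to-content. Here is a very simplified single-head sketch, not the paper’s implementation (the real version shares projections, buckets relative distances, and is heavily vectorized):

```python
# Very rough, single-head sketch of DeBERTa-style disentangled attention scores.
import math
import torch

def disentangled_scores(H, P_rel, Wq_c, Wk_c, Wq_r, Wk_r):
    """
    H:      (seq, d)        content hidden states
    P_rel:  (2*seq-1, d)    relative position embeddings; index seq-1 is distance 0
    W*_*:   (d, d)          projections for content (_c) vs. relative position (_r)
    """
    seq, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c              # content projections
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r      # relative-position projections

    scores = Qc @ Kc.T                        # content -> content term
    for i in range(seq):
        for j in range(seq):
            d_ij = (i - j) + (seq - 1)        # relative distance index, i -> j
            d_ji = (j - i) + (seq - 1)        # relative distance index, j -> i
            scores[i, j] += Qc[i] @ Kr[d_ij]  # content -> relative position
            scores[i, j] += Kc[j] @ Qr[d_ji]  # relative position -> content
    return scores / math.sqrt(3 * d)          # paper scales by 1/sqrt(3d)
```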

With all of these changes, DeBERTa is able to train a 1.5 billion parameter model that’s on par with an 11 billion parameter T5 model. Impressive!

Even if the compute is a bit slower, since not all of the new attention operations can run in fast GPU kernels, this is still quite a reduction in parameter count, saving memory and letting the model fit into GPU memory on a single machine.
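Rough back-of-the-envelope memory for the weights alone, assuming 2 bytes per parameter in half precision and ignoring activations, gradients, and optimizer state:

```python
# Weight memory at 2 bytes per parameter (fp16/bf16), ignoring everything else.
bytes_per_param = 2
for name, params in [("DeBERTa-1.5B", 1.5e9), ("T5-11B", 11e9)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")   # ~3 GB vs ~22 GB
```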

I look forward to DeBERTa — already on GitHub in PyTorch — becoming available in Hugging Face soon!

DeBERTa training speeds up convergence over RoBERTa — perhaps the most impressive result

The other new Deep NLP paper is from Google. And definitely not trying to reduce parameter count — quite the opposite.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

A few years ago, Google came up with a huge new model called “Mixture of Experts” that beat NLP perplexity metrics — estimating the next word given the previous words — but with a method that nobody else could use or replicate.

In this less heralded update, a small team of William Fedus, Barret Zoph, and the wizard Noam Shazeer slot the experts into a standard Transformer (sparse experts in place of the dense feed-forward blocks) and greatly simplify the expert routing system.

In simple terms, the point of an MoE system is to have different compute done by different machines, each with its own parameters. Over time, these machines become “experts” at different sections of the NLP problem, and the “routing” mechanism simply needs to choose the right expert for each token, while that “expert” does the expensive operations on it. The model uses GPU parallelism (each GPU has a different part of the model), but differently from models that split a large computation structurally across multiple machines. Instead, the compute is “routed” differently depending on context. Other than the routing, the rest of the computation is pretty normal.

The problem with MoE 1.0 was that the routing was complicated, and the GPU-to-GPU communication was kind of difficult to understand, much less implement. Moreover, a complicated loss was tuned to make sure that different experts got approximately the same amount of work.

In MoE 2.0, the process is simplified (a rough sketch follows the list):

  • Each token is routed to just one expert (this was 2+ experts before, so that the problem could be fully differentiable)
  • If an expert’s batch is full, the overflow input is simply passed along unchanged (the full model consists of multiple Switch Transformer layers)
  • A single auxiliary loss is used to balance the experts, which is stable across many numbers of experts
  • With big enough batches of inputs, everything evens out well enough — not much overflow or underflow of experts takes place, nor does occasional overflow hurt the model much
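To make the routing concrete, here is a rough sketch of a Switch-style layer: top-1 routing, a capacity limit with overflow passed through unchanged, and the load-balancing auxiliary loss. This is a simplified re-implementation for illustration, not the paper’s Mesh-TensorFlow code:

```python
# Rough sketch of top-1 ("switch") routing with expert capacity and an auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, capacity_factor=1.25):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts))
        self.capacity_factor = capacity_factor

    def forward(self, x):                                   # x: (tokens, d_model)
        tokens, _ = x.shape
        num_experts = len(self.experts)
        probs = F.softmax(self.router(x), dim=-1)           # (tokens, num_experts)
        gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
        capacity = int(self.capacity_factor * tokens / num_experts)

        out = x.clone()                                     # overflow tokens pass through unchanged
        for e in range(num_experts):
            ids = (expert_idx == e).nonzero(as_tuple=True)[0][:capacity]
            if len(ids) > 0:
                out[ids] = gate[ids, None] * self.experts[e](x[ids])

        # Load-balancing loss: fraction of tokens routed to each expert
        # times the mean router probability for that expert, summed and scaled.
        frac_tokens = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = num_experts * (frac_tokens * mean_probs).sum()
        return out, aux_loss

# Hypothetical usage: 64 tokens of width 128, routed across 4 experts.
layer = SwitchFFN(d_model=128, d_ff=512, num_experts=4)
y, aux_loss = layer(torch.randn(64, 128))
```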

Will this model be used outside of Google? Maybe. It seems more likely than MoE 1.0.

I would note two optimization details.

  • Switch Transformers get, from what I can tell, pretty small gains (in speed or memory) from moving from float32 to bfloat16 — and they were not able to move all of the model to low precision, only parts of it.
  • It’s been suggested that this will be the last major “Noam Shazeer project” to be done in Mesh-Tensorflow. Another impressive but complicated technology that is not used much outside of Google, or perhaps even within Google, outside of Noam.

I look forward to seeing future big Google Brain projects in JAX. See my teammate George’s blog for a good introduction to the various tensor computation accelerators from a programmer’s perspective, including JAX.

I’m sad to see low precision computation remain difficult. Not just unstable for softmax, but for other parts of large Transformer models. I won’t bore you with my troubles trying to finetune a pre-trained T5 model in HuggingFace with any part of the computation in FP16. In theory, low precision should save you a lot of compute, and a good chunk of communication and memory. But half-precision offers at best a theoretical linear improvement, and we live in a Moore’s Law world, so maybe you should just use more compute. NVIDIA’s latest GPU, the A100, puts great emphasis on improved performance in “full precision.”
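For anyone who wants to experiment anyway, the usual recipe on NVIDIA GPUs is PyTorch’s automatic mixed precision: FP32 master weights plus loss scaling to keep small gradients from underflowing. A generic sketch (not my T5 setup, and no guarantee it rescues your particular model from instability), where model, optimizer, and data_loader are assumed to already exist:

```python
# Generic PyTorch automatic mixed precision (AMP) training step.
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in data_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in mixed precision
        loss = model(**batch).loss            # e.g. a Hugging Face model that returns a loss
    scaler.scale(loss).backward()             # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)                    # unscales gradients, then steps the optimizer
    scaler.update()                           # adjust the scale factor for the next step
```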

Finally, for something more spicy 🌶️. An obscure blog with a provocative title has been making the rounds, reaching #1 on Hacker News.

It’s a fun read, even if many of the author’s points and predictions are 💩. Let’s not focus on those. I don’t think he’d be that offended by me saying so, given he peppers his predictions/analysis with hot takes and memes, including the AI researcher midwit graph meme shown above.

Elon Musk is the richest and perhaps the most significant person in the world. He says untrue things all the time, and posts spicy memes at 4:20 PM daily. That’s how genius works in 2021.

Back to the blog — ML is in a weird place, and deserves to be ridiculed. Enjoy the read, and don’t take any of it too seriously. I’d personally be willing to bet against most of his predictions. If I thought the author could afford to pay…

Also, Transformers are killin’ the game. I’ve been fine-tuning them for years. Somehow they didn’t tell me to HODL Tesla stonks.

What else happened? Tech is moving to Miami.

What started as a tweet is definitely happening. At this point, you’ve heard about it — and are possibly sick of hearing about it.

I know, you’ve heard about “the Silicon Valley of X” before, and it didn’t work out.

This time is different. VCs and angel investors are moving to Miami, along with hedge funds and all kinds of money men. New York’s best AI startup, Hugging Face, has moved to South Florida as well.

In the post-2020 world of “remote,” it’s hard to think of location the same way. It’s also hard to look at that office and not wish your co-working space looked like that. It’s the middle of winter, and we all need more sunshine.

I’m with Balaji S. Srinivasan here — Miami is not about being a better place to live than NYC and SF — or a cheaper one. It’s that the Bay has diminished as the undisputed tech capital of the world, to the benefit of “everywhere else.” Once you can live anywhere — or at least anywhere with density, infrastructure, and the ability to persuade tech talent to move there — Miami starts looking pretty good.

Spending most of 2020 in Brooklyn has been great. I’m glad I prioritized a view over space. I’m watching the sunset over 🗽 as I write this piece.

Going outside, I notice not only the cold and the shuttered businesses, but also the dozen regular drug addicts huddling in the subway entrance (they’ve been here since the city converted a hotel next door to a homeless shelter, back in March).

There’s always been a cost to living in The City. Critics point out that New York is dirty, smelly and expensive — compared to, say, Savannah, Georgia (or any other pretty, second-tier city in America) — and I’ve never disputed them on the facts. For some of us, the city was fun, and the cost was worth it — for others there was no choice. Imagine working for a hedge fund, a bank or Big Tech outside of NYC or the Bay Area. It wasn’t really an option — or if it was, you’d be out of the loop in your company, and compensated at a discount.

Things are different now. Remote has leveled the field — not completely, but to a great extent. Living in SF or New York is no longer a must. Especially when the quality of life is better elsewhere, and you’re happier and more productive.

Last Saturday night, many tuned in on Clubhouse, as the mayors of SF, Austin and Miami called in to pitch their cities and answer questions from the tech community. The differences between SF and the new tech hubs were striking.

There’s more to say, but others have written better pieces on tech, Miami and SF. Please comment below and I’ll link them here.

I think Miami is much more likely to succeed than other “tech hubs” did over the previous decade — primarily because existing working ecosystems are moving in, as opposed to starting from scratch. Just as the NYC startup scene was primed by Bay Area veterans moving to New York — now some of them are moving to Miami. Unlike, say, the Canadian tech scene, which suffers from the “cold start” problem, without existing “Good Angels willing to play infinite games,” as Alex Danco puts it in his Substack. You meet great 🇨🇦 founders and engineers in the Bay Area.

Keith Rabois, in Antonio García Martínez’s Substack Pull Request, envisions that many tech companies that need “some engineering” but don’t do cutting-edge technical work with huge teams can move over pretty quickly:

I think there’s a lot of healthcare innovation here, which I actually like, I like to fund healthcare. So that alone might be an area where I can double down without any complete transformation of Miami.

I think building engineering culture will be harder. I’m less convinced though we need a massive engineering culture here in the short term. I do think you can get designers to want to be here. This is a much better place for designers. The sense of art, style, and design here is so much better than Silicon Valley. It’s why I wanted to move here personally, as it’s something that’s important to me. And there’s none of it in the Bay Area. So I’m pretty excited about that. And if you can get designers here, you can certainly build iconic companies like Airbnb or Square, which are more design driven than engineering driven.

I think that’s an easier starting place. The designers, whether they’re in New York, whether they’re in the Bay Area, whether they’re just recent graduates, get them to come here and then build around them. I also think you can build companies that require less engineering. So Cameo, for example, I don’t think they’d mind me sharing, it’s basically only 50 engineers for a billion dollar company. So you could build that sort of company here. If you need 1000 engineers, though, that might be a problem.

Then again, you have Hugging Face moving to South Florida — and they are definitely doing cutting-edge engineering and AI work. It helps, perhaps, that they were always a part-remote company, split between Brooklyn and Paris. Well now, Fort Lauderdale and Paris, perhaps.

I’m looking forward to spending more time in South Florida, and glad to see it’s becoming a serious place for tech and business — with a morning run/bike by the beach year round. A “Tel Aviv of the West” as some have put it.

It also helps, perhaps, that a new and hungry startup location might be less sanctimonious. This is speculation — but everyone in the Bay Area tech community knows a bunch of non-mainstream thinkers with 🌶️ ideas, putting it gently.

Most people in tech are left-leaning pragmatists, a substantial number are further to the left, and a non-trivial minority are libertarian-leaning, or describe themselves as anarchists. These people and groups represent a number of heterodox views — way more than among what we call the “normie” population. Being able to work with, or at least have a technical conversation with people with divergent and even “out there” views — and different lifestyles — has been a part of the tech community I’ve grown up in. Perhaps an integral part of it — we’ll see.

SF-based tech is different now. While everyone will quietly work with anyone who is good at what they do — I’d work with Satan himself if his code compiles [and certainly with a Satanist like Gilfoyle] — and will often become friends with such people, prominent persons in tech have had no problem of late publicly denouncing and distancing themselves from large swathes of the population.

Miami and Austin are left-leaning, liberal cities, surrounded by and part of larger right-leaning “Red” states. The mayor of Miami, every bit a pragmatic moderate, runs as a Republican, and is proud of his cross-partisan support — getting 86% of the vote in the latest general election.

That kind of pragmatism, may I say tolerance, seems useful in a tech world that has traditionally over-represented unorthodox views and contrarian people.

These past few weeks, everyone has been moving to Signal, an encryption- and privacy-focused communication app. It grew out of an open version of Whisper Systems, an SF-based company bought by Twitter in 2011 — co-created by an anarchist who named himself Moxie Marlinspike, the current Signal CEO.

I’m not so sure that the next wave of tech, even communication tech, will come out of the Bay Area.

Friends and readers, what else would you like to see in 2021? I am trying to write more often. What topics would you be interested in? Should I move to Substack?

For now, I should probably try to get a little “real work” done — i.e., look at some code. Then jiu jitsu and a steak.

P.S. The steak was great.
