GANs will change the world

Nikolai Yakovenko
Jan 3, 2017 · 11 min read


It’s New Year’s 2017, so time to make predictions. Portfolio diversification has never been me, so I’ll make just one.

Generative Adversarial Networks — GANs for short — will be the next big thing in deep learning, and GANs will change the way we look at the world.

Specifically, adversarial training will change how we think about teaching AIs complex tasks. In a sense, they are learning how to imitate an expert. You know you’ve reached sufficient mastery of a task once an expert discriminator cannot tell the difference between your outputs and those you were learning to imitate. This probably does not apply to large tasks like writing a paper, where everyone’s final product is a little bit different, but adversarial training will be an essential tool for making progress on mid-sized outputs like sentences and paragraphs, and it’s already one of the keys to realistic image generation.

In a 🌰, GANs solve a problem by training two separate networks with competitive goals:

  • one network produces answers (generative)
  • another network distinguishes between the real and the generated answers (adversarial)

The concept is to train these networks competitively, so that after some time, neither network can make further progress against the other. Alternatively, the generator becomes so effective that the adversarial network cannot distinguish between real and synthetic solutions, even with unlimited time and substantial resources.
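
A minimal sketch of that see-saw in PyTorch, with toy one-layer networks and a Gaussian standing in for the real data; every name and shape here is illustrative, and only the alternating structure of the two updates matters:

```python
import torch
import torch.nn as nn

# Toy networks: a generator that maps noise to "answers", and a
# discriminator that scores answers as real (1) or generated (0).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
D = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=32):
    # Stand-in for real data: samples from a fixed Gaussian.
    return torch.randn(n, 8) * 0.5 + 2.0

for step in range(10000):
    # 1) Train the discriminator to tell real from generated.
    real = real_batch()
    fake = G(torch.randn(32, 16)).detach()  # don't backprop into G here
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(torch.randn(32, 16))
    loss_g = bce(D(fake), torch.ones(32, 1))  # "pretend these are real"
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```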

The details are interesting, but let’s put those aside for now. GANs are being used to draw images, given an image category and a random seed:

“draw me a woodpecker, and it can’t be one of the woodpeckers that I showed you before.”

Synthetic birds generated by StackGAN, available in open source.

For the mathematically inclined, scientists at Google Research used GANs to invent an “encryption” protocol. Generator Alice passes messages to Bob, encrypted using a convolutional network and a shared secret key. Eve acts as Eve always does, in the adversarial role, with access to the encrypted messages but not the shared key. Eve trains a network to distinguish Alice’s encrypted information from noise, but fails to tell the two apart.
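
The objective is easy to sketch. Below is a toy version of the setup, not the paper’s exact architecture or loss: fully-connected stand-ins replace its convolutional networks, and the “push Eve to chance” term is simplified. With ±1 bits, random guessing gives an average error of 1.0 per bit, which is exactly where Alice and Bob want Eve to end up:

```python
import torch
import torch.nn as nn

N = 16  # bits per plaintext, key, and ciphertext (illustrative size)

# Fully-connected stand-ins for the paper's convolutional networks.
alice = nn.Sequential(nn.Linear(2 * N, 64), nn.Tanh(), nn.Linear(64, N), nn.Tanh())
bob   = nn.Sequential(nn.Linear(2 * N, 64), nn.Tanh(), nn.Linear(64, N), nn.Tanh())
eve   = nn.Sequential(nn.Linear(N, 64), nn.Tanh(), nn.Linear(64, N), nn.Tanh())

opt_ab = torch.optim.Adam(list(alice.parameters()) + list(bob.parameters()), lr=1e-3)
opt_e  = torch.optim.Adam(eve.parameters(), lr=1e-3)

def random_bits(n, width):
    # Random +/-1 bit vectors for plaintexts and keys.
    return torch.randint(0, 2, (n, width)).float() * 2 - 1

for step in range(5000):
    # Eve trains on ciphertexts alone, trying to recover the plaintext.
    p, k = random_bits(64, N), random_bits(64, N)
    c = alice(torch.cat([p, k], dim=1))
    eve_err = (eve(c.detach()) - p).abs().mean()
    opt_e.zero_grad(); eve_err.backward(); opt_e.step()

    # Alice and Bob train so that Bob decrypts well while Eve is pushed
    # toward chance (an average error of 1.0 per +/-1 bit).
    p, k = random_bits(64, N), random_bits(64, N)
    c = alice(torch.cat([p, k], dim=1))
    bob_err = (bob(torch.cat([c, k], dim=1)) - p).abs().mean()
    eve_adv = (eve(c) - p).abs().mean()
    loss_ab = bob_err + (1.0 - eve_adv) ** 2
    opt_ab.zero_grad(); loss_ab.backward(); opt_ab.step()
```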

It’s early days, and I don’t know of a public GAN demo that writes compelling short text better than a plain LSTM language model. But if an LSTM (such as Karpathy’s character RNN) is the baseline, it’s hard to imagine that someone won’t soon build a GAN that beats it on something like synthesizing Amazon product reviews, given a product and a star rating.

Humans learn through directed feedback

To me, the adversarial process sounds closer to how humans learn than reinforcement learning (RL) does. Or perhaps I am just an adversarial person.

RL is training that maximizes expected eventual rewards. These rewards may occur several steps away from the current state, but the eventual payoffs must be reasonably well defined by a “reward function.” I’m down with RL, and it’s led to significant advances in our field. But unless you’re playing a game, it’s hard to come up with a reward function that accurately values all of the significant feedback from one’s environment.

Reinforcement learning led to breakthroughs in backgammon in the 1990s; it was an important component of DeepMind’s AlphaGo, and the DeepMind team has even used RL to save Google money on datacenter cooling.

It is possible to imagine that RL can find optimizations in the datacenter organization space, since the reward function (saving money while staying below a maximum temperature) is reasonably well defined. It’s an example, perhaps a rare one outside of Hollywood, of a real-world problem fully parameterized as a game.
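
As a purely hypothetical illustration of how little it takes when the problem really is a game, a reward function for one control interval might be as simple as the sketch below (every name and number is invented):

```python
def cooling_reward(energy_cost_usd, max_temp_c, temp_limit_c=27.0):
    """Hypothetical reward for one control interval of a datacenter
    cooling agent: spend less money, but never breach the temperature limit."""
    if max_temp_c > temp_limit_c:
        return -1000.0  # hard penalty: equipment safety dominates everything
    return -energy_cost_usd  # otherwise, cheaper is simply better
```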

For less game-y problems, what is the reward function? Even for game-like tasks like driving, the goal is not really to get there very fast, or to stay precisely within the dotted lines. It’s easy to think of the negative rewards (damaging the car, scaring the passenger with unreasonable rates of acceleration), but it’s harder to correctly define the positive rewards accrued during the drive.

Monkey see, monkey do

How do we learn something like handwriting? Unless you went to a very strict elementary school, the process was not about optimizing for the reward of writing letters correctly. More likely, you traced the lines that a teacher drew on the overhead screen, until you internalized the process.

Your generative network drew the letters, and your discriminator network learned to note the differences between your page and the Platonic ideal from McGraw-Hill.

Adversarial training for third graders.

Your own toughest critic

Five years ago I was paralyzed above the waist on the right side of my body, after taking a blow to the head in a Columbia University rugby game. Two weeks later and out of the ICU, I started teaching myself to write, back at my apartment in Brooklyn.

Teaching myself to write again, May 2012.

My brain had sustained severe damage to the left hemisphere, so I had forgotten how to move my right (dominant) arm. However, the rest of my brain was relatively undamaged or sufficiently replicated, so I knew what proper writing looked like. In other words, I had a corrupt generative writing model, but my discriminator network remained intact.

I joked that, on account of the process, I might learn a new (and better) handwriting. However, while I was able to teach myself to write again rather quickly, the style I taught myself ended up very close to the handwriting I had before.

I don’t know how our brains manage this “actor-critic” approach to learning, or whether this is really true or just a cute analogy. But it does appear that we learn more effectively when we are able to try something new and get immediate feedback from an expert.

When learning to code or to rock climb, you progress faster when receiving “beta” from an expert. Until you have enough experience to act as your own internal critic, it is much easier to train the generative part of your brain when a good external critic corrects every low-level mistake. Even with an internal critic, learning an effective generator takes deliberate practice. We can’t offload our own personal training to an AWS spot instance GPU.

Burn the ships?

GANs are working for some problems. They are used to add “realistic” effects, like sharp edges for generated images, even if said images don’t necessarily have the correct number of heads on every animal.

Placing the generative network in competition against a worthy adversary forces it to make hard choices. As a colleague put it: you can draw the parrot green, or you can draw the parrot blue, but it has to be one or the other. A supervised network trained on real parrots without an adversarial component will tend to draw some sort of average of the blue and the green. Hence the fuzzy lines. An adversarial network has to draw that parrot green. Or it has to draw the parrot blue. Or it can sample from a probability distribution over the {blue, green} parrot space. But it won’t draw an image with some midpoint color that doesn’t exist in the real parrot distribution. Which is by now the distribution of ex-parrots.
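
The averaging failure is one line of arithmetic: under a mean-squared-error loss, the best single answer to a target that is blue half the time and green half the time is the midpoint. A toy illustration, with parrot colors as made-up RGB triples:

```python
import torch

# Hypothetical parrot colors as RGB triples: half the real parrots are
# blue, half are green.
blue = torch.tensor([0.0, 0.0, 1.0])
green = torch.tensor([0.0, 1.0, 0.0])
targets = torch.stack([blue, green] * 50)  # 100 "real parrots"

# A supervised model free to output any constant color minimizes MSE at
# the mean of the targets: a teal that no real parrot actually has.
print(targets.mean(dim=0))  # tensor([0.0000, 0.5000, 0.5000])

# A discriminator trained on real parrots would flag that teal as fake,
# so an adversarially trained generator must commit to blue or green
# (or sample one of them), never the nonexistent midpoint.
```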

My colleague recently catalogued his thoughts about GANs, including pessimism about their ability to fully converge or to generalize.

In part, this is because the see-saw approach (train the generator for a while, train the discriminator for a while, repeat) is not guaranteed to converge on a stable solution, much less the optimal solution, as simply illustrated in this tweet from Alex J. Champandard.
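
The simplest version of the instability shows up in the two-player game where x tries to minimize x·y while y tries to maximize it. The equilibrium is at (0, 0), but simultaneous gradient steps orbit it and spiral outward. A minimal sketch:

```python
# Two-player game: x tries to minimize x*y while y tries to maximize it.
# The equilibrium is (0, 0), but simultaneous gradient steps orbit it
# and spiral outward instead of converging.
x, y, lr = 1.0, 1.0, 0.1
for step in range(101):
    grad_x, grad_y = y, x  # d(x*y)/dx and d(x*y)/dy
    x, y = x - lr * grad_x, y + lr * grad_y  # descent for x, ascent for y
    if step % 25 == 0:
        norm = (x * x + y * y) ** 0.5
        print(f"step {step:3d}: x={x:+.3f}  y={y:+.3f}  |(x, y)|={norm:.3f}")
# Each update multiplies the squared norm by (1 + lr**2), so the "players"
# drift further from equilibrium the longer they train.
```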

But let’s ignore these concerns for now, and dream a little. LSTM models can already write coherent product reviews and image captions, and tweet in the voice of President-elect Donald Trump [strangely silent since election night].

Doesn’t this suggest that even a slightly aware discriminator could improve performance on these tasks? We could use the generator we have now, and ask the discriminator to select amongst the top 20 choices offered, assuming that the LSTM generates outputs with some randomness. Isn’t this something that the man behind DeepDrumpf already does manually?
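
Here is a sketch of that rerank-by-discriminator idea; generate_candidates and discriminator_score are hypothetical stand-ins for a sampled LSTM and a trained critic, respectively:

```python
import random

def generate_candidates(prompt, n=20):
    # Stand-in for drawing n samples from an LSTM with temperature > 0.
    return [f"{prompt} sample #{i}" for i in range(n)]  # hypothetical

def discriminator_score(text):
    # Stand-in for a trained critic: higher means "looks more like real text".
    return random.random()  # hypothetical

def best_of_n(prompt, n=20):
    # Sample n candidates from the generator and keep the one the
    # discriminator finds most realistic. No joint training required.
    return max(generate_candidates(prompt, n), key=discriminator_score)

print(best_of_n("Five stars for this coffee grinder."))
```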

Generator or Discriminator — which knows best?

A natural question arises: which network internalizes the understanding of the problem, the generator or the discriminator? Is it the student or the teacher who knows more about writing letters?

In the real world it might be the teacher, but in the examples above, I think it would have to be the student. A discriminator for the product review generator could make itself useful simply by marking down grammatical mistakes that humans don’t usually make. What requires more skill, learning to paint like Michelangelo, or looking up at the Sistine Chapel?

As I understand it, the Prisma app trains a generative network for each of its styles, using an adversarial framework. That is how most of the styles generate crisp lines. I wish they could train the GAN a little longer, so that it would not only recognize the shadows in a photo and paint them different colors, but do so in a style that an impressionist might. Occasionally it gets the light and shadow just right, and when it does, the results can be pretty astounding.

Taking this line of thinking to its natural conclusion, the generative-adversarial approach gives AIs the ability to run experiments and A/B tests. An AI creates a fully functional generative solution, then collects feedback on how well this generator compares to a gold standard, or to other AIs that it is learning to replicate or has already internalized. You won’t have to design a loss function. It might take a while, but the AI will figure out its own evaluation rules.

Know when to hold ’em, know when to fold ’em

I write all of this having not trained adversarial networks myself. In the spirit of mimicry, I’m waiting for others to achieve noticeable improvements with GANs, ideally on the text generation problem. I predict that soon there will be accepted techniques that work well enough to get compelling results. Our field advances by building on top of previous improvements.

Instead of speculating about things I haven’t built, I should be spending this time improving my “PokerCNN” No Limit Hold’em AI for this year’s Annual Computer Poker Competition. Code is due January 13th, 2017.

For next year’s contest I do plan to add some adversarial training. It’s not hard to imagine that adversarial training might help learn a good poker strategy. Especially if one has access to a strong black-box poker AI to compete against.

Since the goal is science and my poker AI code is already in open source [by the time you see this, I should have cleaned up the repo and added a real README so that it’s a little bit easier to get started], please feel free to try this yourself.

Links: looking back, and looking ahead

I would be remiss not to point out some of my favorite advances in deep learning from 2016. A few of my favorite lists:

Others in the deep learning community are also making predictions for 2017. I’ll add more links below as I see them, so please post below if there are other “deep learning in 2017” predictions that I should read and include.

Ask AI researchers what their next big target is, and they are likely to mention language. The hope is that techniques that have produced spectacular progress in voice and image recognition, among other areas, may also help computers parse and generate language more effectively.

AI will be the new mobile. Investors will ask management what their “AI strategy” is before investing and will be wary of companies that don’t have one.

On that note, have a happy, healthy and productive 2017. Don’t get stuck on a bad problem; there are so many good problems waiting to be explored, and not enough people to try them all.

For a bit of pushback against the AI-everywhere hype, bradford cross gives his “Five AI Predictions for 2017” — with the benefit of posting these in March. Still, very insightful. And I agree. AI will become more of a commodity in the coming year. Full-stack businesses with AI at the core will emerge as the big winners. (This last point supports my theory that AI will create more jobs than it destroys in the medium term… but that’s a separate post.)

Update

This piece has been well received, but I did take some criticism about being too rosy on GANs, especially as I have — admittedly — not learned the dark arts of training GANs myself.

What’s happened over the past two months since I published this piece right after New Year’s?

  • Dev Nag has written up the generator-discriminator GAN for image generation as 50 lines of PyTorch code, along with an engaging and thorough explanation. I recommend that you check it out, as have many of my colleagues.

In other words, what started as a good idea with practical results and a messy implementation has become cleaner, more principled, and easier to use. No one should be surprised. The code, theory and algorithms will continue to improve.

The question remains: which real problems will GANs help us break through on next?

PS

GANs are starting to look spooky good.

The street scene below is generated from a segmentation map. The map can come from a real scene, from a video game, or straight from your imagination.

Demo and code on GitHub https://tcwang0509.github.io/pix2pixHD/

