Imagine a world where we can generate endless pokemon, (real-looking) fake human faces, fake cats, dogs, rabbits… the list goes on forever!! Well, that future is NOW. Pretty sure Ian Goodfellow, the good gentleman who introduced us to the world of GANs, wasn’t thinking about pokemon, dogs and cats when he came up with the idea of GANs (actually he was having an argument in a bar) but that’s all right… right?
I know we all want to jump right into running our own GANs (as did I), but let’s go a little into the background of GANs first so that we can fully appreciate their tremendous power…
Before GANs, how did we generate new images? Apart from picking up a pencil and drawing them ourselves, we attempted to teach machines to generate new images by showing them many many MANY images, looking at their results and telling them what is right, and what is wrong. Machines learn differently from humans - say a child could learn to identify an apple after he is shown an apple picture 5 times, but a machine needs upwards of 10,000 images (or more!) to learn to identify an apple! Imagine being a teacher to a machine… pretty painful. Also unfortunately (or fortunately?), there is only so much labelled data and human effort we can dedicate to teaching the machines.
In any case, that was what motivated Ian Goodfellow to develop GANs - he thought (disclaimer, not his exact words lol) why not create another machine to teach the machine? Hence, the GAN was born (more details here! Also listen to the podcast here - interesting stuff!).
Let’s try to understand this on a conceptual level. A GAN is made up of two models: a Generator that generates fake images (with the aim of fooling the Discriminator), and a Discriminator that tells real images apart from fake ones (ironically, the feedback it gives is exactly what lets the Generator learn to make better fakes, making life more difficult for itself!). Each party gets better at what it does as the two compete with each other (the Generator at producing real-looking fake images, and the Discriminator at classifying).
The GAN model converges when the Discriminator and the Generator reach a Nash Equilibrium - a state where neither player can do better by changing its own action while its opponent keeps its action unchanged (recall the prisoner’s dilemma - if the other prisoner confesses, I should confess; if he doesn’t, I am still better off confessing. So both confessing is an equilibrium: neither prisoner gains by switching alone).
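To make the idea of an equilibrium a bit more concrete, here is a minimal sketch that checks every pair of actions in the prisoner’s dilemma and keeps the pairs where neither prisoner gains by switching alone. The payoff numbers are made up for illustration (they are not from any source), and a real GAN plays a far more complicated game than this 2x2 table.

```python
from itertools import product

ACTIONS = ("confess", "stay_silent")

# Hypothetical payoffs: years in prison, negated so that higher is better.
# PAYOFFS[(a, b)] = (payoff to prisoner A, payoff to prisoner B)
PAYOFFS = {
    ("confess", "confess"): (-5, -5),
    ("confess", "stay_silent"): (0, -10),
    ("stay_silent", "confess"): (-10, 0),
    ("stay_silent", "stay_silent"): (-1, -1),
}

def is_nash(a, b):
    """True if neither prisoner gains by changing only their own action."""
    pay_a, pay_b = PAYOFFS[(a, b)]
    a_cannot_improve = all(PAYOFFS[(alt, b)][0] <= pay_a for alt in ACTIONS)
    b_cannot_improve = all(PAYOFFS[(a, alt)][1] <= pay_b for alt in ACTIONS)
    return a_cannot_improve and b_cannot_improve

nash_equilibria = [pair for pair in product(ACTIONS, ACTIONS) if is_nash(*pair)]
print(nash_equilibria)  # [('confess', 'confess')]
```

Note that both prisoners would be better off staying silent, yet that pair is not an equilibrium because each one is tempted to switch. An equilibrium is about stability, not about the best joint outcome.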
In reality, convergence is difficult (see here for more!), but let’s take this for now. In fact, since the introduction of GANs in 2014, the tech community has made significant progress in terms of improving the original model. You will be amazed at how fast people innovate - we can even generate images from text now!! Imagine reading a Harry Potter book and images generating as you go along…
Generating new Pokémon:
Credits: https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html
Generating fake bedrooms:
Credits: https://arxiv.org/abs/1511.06434v2
Generating flowers from text:
Credits: Generative Adversarial Text to Image Synthesis (Scott Reed et al. 2016), https://arxiv.org/pdf/1605.05396.pdf
General Model Structure
Let’s look at the general structure of a GAN from a more practical perspective. In a GAN, two models are trained at the same time. One model is the Generator: it takes random noise as input and produces fake images. The second model is the Discriminator: it takes both real images and fake images (created by the Generator) as input, and has to figure out how to identify which are real images and which are fake.
Credits: Chris Olah, https://twitter.com/ch402/status/793911806494261248/photo/1
Basically, the Generator tries to learn all the possibilities of a real image (say, of a dog), and produces images that are likely under what it has learnt. For example, it can learn that images of dogs with black, brown or white fur can all be real; an image of a dog with two eyes is real, but an image of a dog with three eyes is not! Hence it can return a two-eyed black dog or a two-eyed brown dog, but is unlikely to return a three-eyed black dog. In such a way, it tries to produce fake images that are as real as possible, so as to trick the Discriminator. In technical terms, the Generator is trying to capture the distribution of the training data and produce samples from it, so as to maximise the probability of the Discriminator making a mistake.
Below is an example with the red distribution being the fake distribution the Generator is coming up with, and the blue distribution being the actual one. The Generator is trying to move towards the blue distribution and, at equilibrium, to match it exactly.
But how does the Generator actually ‘learn’? Honestly, at the beginning things are quite random. The Generator generates random images (it can even be just randomly coloured pixels!) and gives them to the Discriminator. Perhaps the Discriminator gets tricked, and says that a few of the images are real! Aha, now the generator reviews the ‘real’ images and attempts to make more fake images similar to those. Slowly, it gets better at tricking the Discriminator.
You might wonder how the Discriminator can be so easily fooled as to think random images are real. In fact, at the beginning the Discriminator does not know what is real or fake either! It is picking randomly, until it receives feedback as to which images are actually real and which are fakes produced by the Generator. From this feedback, it begins to learn all the different possibilities of a real image (capturing the distribution of the true data), and from there learns to differentiate between real and fake images. Technically, it estimates the probability that the picture came from the true data distribution rather than from the Generator; the lower this probability, the lower the chances of it being a real image.
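To put rough numbers on this feedback, here is a minimal sketch of the losses typically used. It assumes the Discriminator outputs a probability between 0 and 1 that an image is real, and it uses binary cross-entropy together with the "non-saturating" Generator loss suggested in the original GAN paper. The probabilities below are invented for illustration; the function names are my own.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, fakes should score 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating Generator loss: the Generator wants its fakes to score 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical Discriminator outputs: probability that each image is real.
# At the start it is guessing (0.5 everywhere); later it becomes confident.
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

print(confused > confident)                          # True
print(generator_loss([0.9]) < generator_loss([0.1])) # True
```

A Discriminator that is just guessing has a high loss, and a Generator whose fakes are scored as "probably real" has a low loss. Training alternates between lowering each of these in turn.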
From the above example, we can see that it is important for the Generator and the Discriminator to be at similar levels, so that they can teach each other! Imagine if the Discriminator is so good at identifying fake images that it knows that all of the Generator images are fake and rejects all of them - the Generator will have no feedback to improve! This is one of the more troubling problems when training a GAN.
Zooming into the Generator model (which produces the fake image) - specifically, the Generator takes in random numbers (say, an array of 100 values) as input, then projects and reshapes them into a small 3D array (height x width x channels). From there, transposed convolutions (also known as fractionally strided convolutions or, somewhat loosely, deconvolutions) increase the size of the image step by step (e.g. from 2x2 to 4x4). The animations below show the different ways in which a smaller image (blue squares) can be expanded into a larger image (green). The final output is an image, e.g. a 56x56x3 array, which gives a 56x56 RGB image (3 colour channels).
For those interested, Adit did a fantastic writeup here explaining in detail the workings of CNNs.
Animations, left to right: No padding, no strides, transposed | No padding, strides, transposed | Padding, strides, transposed
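If the animations are hard to picture, here is a minimal pure-Python sketch of a transposed convolution without padding. Each input pixel "stamps" a scaled copy of the kernel onto the output, and overlapping stamps are added up. The function name and the all-ones kernel are my own choices for illustration; a real Generator would learn its kernels and use a deep learning library instead.

```python
def transposed_conv2d(image, kernel, stride=1):
    """Transposed convolution with no padding.

    Output size per side is (input - 1) * stride + kernel.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            # Stamp a copy of the kernel, scaled by this input pixel.
            for di in range(k_h):
                for dj in range(k_w):
                    output[i * stride + di][j * stride + dj] += image[i][j] * kernel[di][dj]
    return output

small = [[1.0, 2.0],
         [3.0, 4.0]]
ones_3x3 = [[1.0] * 3 for _ in range(3)]

large = transposed_conv2d(small, ones_3x3)            # 2x2 -> 4x4
larger = transposed_conv2d(small, ones_3x3, stride=2) # 2x2 -> 5x5
print(len(large), len(large[0]))    # 4 4
print(len(larger), len(larger[0]))  # 5 5
```

With a 3x3 kernel and no strides, the 2x2 input grows to 4x4, which matches the first animation. Adding a stride of 2 spreads the stamps further apart and gives 5x5.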
The Discriminator takes in an image (either real or fake), passes it through convolution layers and reduces it in size (e.g. from 4x4 to 2x2). The convolution gif below shows how the original number of values (in blue) is reduced (green). Eventually it returns a binary output, classifying the image as real or fake.
Credits: vdumoulin, https://github.com/vdumoulin/conv_arithmetic
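The Discriminator’s size reduction can be sketched the same way. Below is a minimal pure-Python convolution without padding (strictly speaking a cross-correlation, which is what deep learning libraries compute). The averaging kernel and the function name are assumptions for illustration only; a real Discriminator learns its kernels.

```python
def conv2d(image, kernel, stride=1):
    """Convolution with no padding.

    Output size per side is (input - kernel) // stride + 1.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the kernel-sized window starting at (i, j).
            output[i][j] = sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
    return output

# A 4x4 "image" holding the values 0..15, and a 3x3 averaging kernel.
image_4x4 = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
mean_3x3 = [[1.0 / 9.0] * 3 for _ in range(3)]

reduced = conv2d(image_4x4, mean_3x3)  # 4x4 -> 2x2
print(len(reduced), len(reduced[0]))   # 2 2
```

Each of the 4 output values summarises a 3x3 patch of the input, so 16 values are squeezed down to 4. Stacking several such layers is how the Discriminator eventually gets down to a single real-or-fake score.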
Types of GAN
Now that we understand in general how a GAN works, let’s look at the different types of GAN. Yes, just like pokemon (and all living things), they evolve. People got very excited after Ian introduced GANs, and given how difficult it was to train stable GANs (imagine all the different things you can tweak!), everyone started trying new things to improve GANs.
There are many different varieties of GAN - you can refer here for a (highly) extended list, but I shall briefly touch on a few of the more popular ones:
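GAN (the original GAN):
This is the original GAN by Ian. Basically, when the Discriminator is at its best, training the Generator amounts to minimising the Jensen-Shannon divergence (read: a measure of the difference between two distributions) between the real data distribution and the generated data distribution.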
DCGAN (Deep Convolutional GAN):
This was the first major improvement on the GAN architecture - basically the authors proposed a set of architectural constraints to make GANs more stable to train. DCGANs are often used as the baseline to compare newer GANs against (i.e. if your fancier GAN doesn’t perform better than a DCGAN, you’re out of the game, buddy).
cGAN (Conditional GAN):
This GAN is interesting: it gets by with a little bit of help. It takes in conditional information that describes some aspect of the data (e.g. a class label, or labelled points for the eyes and nose of a face), and both the Generator and the Discriminator get to see it. So we are giving our GANs a little help here, if you will.
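One common way to feed in the conditional information is simply to attach it to the Generator’s random input. Here is a minimal sketch, assuming the condition is a class label out of 10 classes and the noise has 100 values. Both numbers and both function names are hypothetical, and other kinds of conditions (such as facial keypoints) would be encoded differently.

```python
import random

def one_hot(label, num_classes):
    """Encode a class label as a vector with a single 1.0 in position `label`."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def conditional_generator_input(noise, label, num_classes):
    """Concatenate the noise vector with the one-hot encoded condition."""
    return noise + one_hot(label, num_classes)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]

g_input = conditional_generator_input(noise, label=3, num_classes=10)
print(len(g_input))  # 110
```

The Generator then sees 110 numbers instead of 100: the first 100 say "be creative", and the last 10 say "but make it class 3".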
WGAN (Wasserstein GAN):
Remember the original GAN just looks at the difference between the real data distribution and the generated data distribution? There’s a caveat - if the true and fake distributions do not overlap, the feedback given is stuck at a constant (or at infinity, depending on the divergence used), no matter how far apart the two distributions are. How can the Discriminator or Generator learn effectively in that case? The feedback says nothing about how they should change. That’s part of the reason why GANs are so difficult to train. Enter the Wasserstein-1 distance (Earth-Mover distance) - basically in this case, even if the two distributions have no overlap, at least it describes how far apart they are so that they can learn (instead of just returning a constant or infinity!)
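Here is a small sketch of that caveat, using two toy distributions that each put all their mass on a single point of a 10-slot grid (my own made-up setup, not from the WGAN paper). For the Jensen-Shannon divergence, the value for non-overlapping distributions is always log 2 (about 0.6931), while the Wasserstein-1 distance shrinks as the fake distribution moves closer to the real one.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping slots where p has no mass."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1(p, q):
    """W1 between two histograms on the grid 0, 1, 2, ... (slot width 1):
    the summed absolute difference between their cumulative distributions."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """A distribution with all of its mass in one slot."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for fake_position in (8, 4, 1):
    fake = point_mass(fake_position)
    print(fake_position, round(js_divergence(real, fake), 4), wasserstein_1(real, fake))
# 8 0.6931 8.0
# 4 0.6931 4.0
# 1 0.6931 1.0
```

The middle column never moves, so it gives the Generator no hint that it is getting warmer. The last column does, which is exactly the kind of feedback you want.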
You would be pleased to know that we will be running DCGAN and WGAN in our walkthrough later on. In the meantime, let me talk a little more about WGAN since we will be running a slight variation of that.
We will be using WGAN-GP (GP for Gradient Penalty) instead of WGAN - because even WGANs can fail to converge! In human terms: for the Wasserstein distance to work, the Discriminator has to be limited in how fast its output can change. To enforce this, the authors of WGAN forced (aka ‘clipped’) the Discriminator’s weights to stay within a certain threshold. For example, if the threshold is (-1,1), any weight greater than 1 will be set to 1, and any weight less than -1 will be set to -1. However, forcing the weights to stay within such a range can significantly impact the final results (in practical terms, your picture of a bedroom can suddenly have empty white spaces in the middle of the image!).
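Clipping itself is about as simple as it sounds. A minimal sketch, using the (-1,1) threshold from the example above and some made-up weights (the function name is my own):

```python
def clip_weights(weights, threshold):
    """Force every weight into the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, w)) for w in weights]

# Hypothetical Discriminator weights after a training step.
weights = [-2.5, -0.3, 0.0, 0.7, 4.0]
clipped = clip_weights(weights, threshold=1.0)
print(clipped)  # [-1.0, -0.3, 0.0, 0.7, 1.0]
```

Notice that -2.5 and 4.0 lose all information about how large they were. Doing this after every update is a fairly blunt instrument, which is where the trouble starts.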
In mathematical terms, for the approximation of the Wasserstein (Earth-Mover) distance to be valid, the Discriminator has to be 1-Lipschitz, and WGAN enforces this by clipping the Discriminator’s weights. This resulted in:
- The optimizer searching for the Discriminator in a space much smaller than the set of 1-Lipschitz functions, biasing the Discriminator toward simpler functions.
- Gradients vanishing or exploding as they back-propagate through network layers, depending on how the clipping threshold is chosen.
Hence, instead of clipping weights as in WGAN, WGAN-GP penalises the norm of the gradient of the Discriminator with respect to its input whenever that norm moves away from 1. In human terms, this means that the further your gradient norm (maximum rate of change) is from 1, the more you will be penalised. This nudges the Discriminator towards the required limit on how fast it can change, so that the WGAN approximation remains valid.
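Here is a minimal sketch of the penalty term for a single pair of real and fake samples. Following the WGAN-GP paper, the gradient is measured at a random point on the line between the real and the fake sample, the penalty is the squared distance of the gradient norm from 1, and it is weighted by a coefficient of 10. Everything else is an assumption for illustration: the toy linear critics, the sample values, and the finite-difference gradient (real implementations use automatic differentiation).

```python
import math
import random

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def gradient_penalty(critic, real, fake, lam=10.0, rng=random):
    """lam * (||gradient of critic at x_hat|| - 1)^2 for one interpolated point."""
    eps = rng.random()
    x_hat = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2

def make_linear_critic(weights):
    """Toy critic f(x) = w . x, whose gradient with respect to x is just w."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))

real_sample, fake_sample = [1.0, 2.0], [-1.0, 0.5]
steep = gradient_penalty(make_linear_critic([3.0, 4.0]), real_sample, fake_sample)
unit = gradient_penalty(make_linear_critic([0.6, 0.8]), real_sample, fake_sample)
print(round(steep, 3), round(unit, 3))  # 160.0 0.0
```

The first critic changes 5 times faster than allowed, so it picks up a hefty penalty of 10 * (5 - 1)^2 = 160. The second has a gradient norm of exactly 1 and gets off scot-free. In training, this penalty is added to the Discriminator’s loss.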
WHEW! You made it through all that - well done!! Let’s get down to business now (finally!) and head over to Part 2 - GAN Walkthrough, where we go through the actual steps of running your very own GAN!