Here's a fun little side project about training a GAN to Christmasify images.
The holiday season is almost here, and along with it come the Christmas trees, the presents, and the Santa costumes. This got us at hasty.ai curious about one simple yet, as it would turn out, very complicated question: Can we create a Christmas GAN?
In this article, we share our approach, results, and most importantly, our findings. It is a light read and not super technical. Thus, it can also serve as an introduction to building GANs for a practical use-case for image-to-image translation. We hope that it provides you with some intuition of how GANs work and a few actionable takeaways for your own GAN experiments, but most importantly, we hope the results are fun to see.
By now, probably everyone has heard of GANs, or 'Generative Adversarial Networks'. They've been used for many exciting things, such as image synthesis, human face generation, and image super-resolution, among others. We will deliberately try to avoid the technicalities in this post, but here's an overview of GANs for readers new to the topic:
GANs are generative models within the framework of deep learning. They consist of two players (two deep neural networks): the generator and the discriminator. Given some input, the generator tries to "generate" what is desired by the modeler (us). The discriminator's job is to look at the generated and real samples, then encourage the generator to generate images closer to the real ones. From another perspective, the generator is trying to generate good enough samples to fool the discriminator into thinking they are real, while the discriminator is trying not to be fooled.
Pictorially, this entire model is represented in figure 1:
Figure 1: A basic model of Generative Adversarial Networks (GANs)
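To make the two-player game a bit more concrete, here is a minimal sketch of the losses each player minimizes. It assumes the standard binary cross-entropy formulation with the commonly used "non-saturating" generator loss, and the discriminator scores are made-up numbers rather than outputs of a trained network. The models discussed later in this post may use different loss variants.

```python
import math

def bce(prediction, target):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 target."""
    eps = 1e-12  # guard against log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants to score real images as 1 and generated ones as 0."""
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    """The generator wants the discriminator to score its output as real (1)."""
    return bce(d_fake, 1)

# Hypothetical discriminator scores: the probability that an image is real.
not_fooled = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled = discriminator_loss(d_real=0.9, d_fake=0.8)

print(not_fooled < fooled)                        # True: being fooled costs the discriminator
print(generator_loss(0.8) < generator_loss(0.1))  # True: fooling it rewards the generator
```

The two losses pull in opposite directions on the same score `d_fake`, which is what "adversarial" refers to.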
Our idea was to feed a GAN with a random picture and then output a Christmassy version. The goal seemed quite intuitive and straightforward to us. It turns out, though, that it's quite hard to describe what makes an image look Christmassy in a way that a GAN would "understand". There is a lot of social and situational subtext around the concept.
Traditionally, GAN methods that model image-to-image translation do so between domains with a high degree of symmetry or similarity. One example would be the translation from horse to zebra, where the addition or the removal of stripes is enough to do the translation. Another one would be from apples to oranges, where color and texture are sufficient to complete the translation.
Figure 2: Horse to Zebra, accomplished via adding stripes — image taken from CycleGAN repository
Figure 3: Apples to Oranges, accomplished via color and texture changes — image taken from CycleGAN repository
Applying this logic to our use-case is tricky. We require a model that can achieve ‘normal to Christmas’. But this mapping is far too generic compared to the other use cases. Practically any object can be decorated with an ornament. People can wear Santa or elf costumes, houses and markets can be decorated, or a pet can be dressed up as a deer, etc.
Figure 4: Top row: diversity in normal images. Bottom row: diversity in Christmas images. The images are taken from free-to-use websites and the COCO dataset
But we still wanted to see how far we could get with existing work on image-to-image translation. Concretely then, our goal is to learn a mapping function that can map normal images to Christmassy images.
Let’s take a moment to think about what we are trying to do here:
We should also consider some constraints on this problem, namely:
Figure 5: Paired vs Unpaired data. The image is taken from the CycleGAN paper
Keeping in mind the goals and constraints of our problem, we thought that the only viable option to model a Christmas GAN is ‘unpaired image-to-image translation’. ‘Unpaired’ refers simply to having images in both domains (Christmas and not Christmas), without any explicit one-to-one mapping between them.
We mention only two of the many methods we tried, DiscoGAN and CycleGAN, as they seemed to work better on our dataset (the rest are excluded for brevity).
They both more or less follow the same basic idea, which is to learn a mapping between the two domains in the absence of paired images.
Figure 6: Redrawing of the full structure of the DiscoGAN.
Figure 7: Redrawing of half the structure of the CycleGAN. The figure shows the forward cycle of the CycleGAN; swap “normal” with “Christmas” along with the images to get the backward cycle. Alternatively, you can look at the paper for the full scheme.
In our case, the first domain, call it domain ‘A’, would be ‘normal’ images, and the second, call it domain ‘B’, would be ‘Christmas’ images. There are two generators and two discriminators in each scheme. The idea is to go from domain A to domain B using one generator, then go back from domain B to domain A using the second generator. The model is penalized for distorting the identity of the ‘normal’ image (returning to domain A, the image should be the same as the original). See figures 6 & 7.
DiscoGAN accomplishes this via a reconstruction loss, while CycleGAN accomplishes this via what they call a “cycle consistency loss”. If you are interested in learning more about them, we encourage you to go through their papers (see the links above).
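The round-trip penalty described above can be sketched in a few lines. This is a toy illustration, not either paper's implementation: images are flat lists of pixel values, the "generators" are hypothetical hand-written functions instead of trained networks, and the distance is the mean absolute (L1) difference, which is the distance the CycleGAN paper uses for its cycle consistency loss.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two images given as flat lists of pixels."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)

def cycle_consistency_loss(image, forward, backward):
    """Translate A -> B -> A and measure how far the round trip lands from the start."""
    reconstructed = backward(forward(image))
    return l1_distance(image, reconstructed)

# Toy stand-ins for the two generators (hypothetical, not trained networks).
def to_christmas(image):
    """'Normal -> Christmas': brighten every pixel."""
    return [p + 0.2 for p in image]

def to_normal_good(image):
    """'Christmas -> normal' that undoes the brightening."""
    return [p - 0.2 for p in image]

def to_normal_bad(image):
    """'Christmas -> normal' that ignores its input and loses the original content."""
    return [0.5 for _ in image]

normal_image = [0.1, 0.4, 0.7]
good = cycle_consistency_loss(normal_image, to_christmas, to_normal_good)
bad = cycle_consistency_loss(normal_image, to_christmas, to_normal_bad)
print(good < bad)  # True: losing the identity of the input is penalized
```

The same function gives the backward cycle if you start from a Christmas image and swap the two generators.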
This was the trickiest bit of this project. We are not so much trying to transfer style or transform objects in images. Our problem is too unconstrained for that (images are unpaired AND there is no single common property that maps normal images to Christmas ones). We are trying to transfer the property of Christmasiness itself without a lot of supervision.
To collect this highly unconstrained data, we asked our teammates to contribute a couple of hundred Christmassy and non-Christmassy images while keeping the broad domain the same (pictures of Christmassy and non-Christmassy people, markets, trees, decorations, etc.).
The motivation for asking many people to do it was to bring some variety into the data. People searched for interesting things, from “Julafton” to “Weihnachten” to “Blue Santa”, bringing in diversity. If we think about it now, it might have been a recipe for disaster. The problem becomes more and more unconstrained as we introduce cultural/regional connotations of the Christmas property.
But soldiering on: once the data collection was finished, we spent considerable effort removing bad images. One trick we employed was to run a pre-trained segmentation network on the images. Images found to contain people or trees were kept, while the rest were discarded. Then we added images of markets in both domains, which were filtered and selected manually. We augmented the non-Christmas domain with images from the COCO dataset to minimize data collection effort. Web browser extensions that download all images from a webpage, and scripts that use Selenium to search for and download images automatically, came in really handy for this. We took special care to download only free-to-use images.
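The segmentation-based filtering step can be sketched as follows. The sketch assumes the segmentation network has already been run and that its output has been reduced to a set of class names per image. The file names, class names, and the `filter_by_detected_classes` helper are hypothetical and only illustrate the keep-or-discard rule.

```python
KEEP_CLASSES = {"person", "tree"}

def filter_by_detected_classes(predictions, keep_classes=KEEP_CLASSES):
    """Keep the images whose predicted classes include at least one class of interest.

    `predictions` maps an image name to the set of class names found in it.
    """
    return [name for name, classes in predictions.items() if classes & keep_classes]

# Hypothetical per-image outputs of a pre-trained segmentation network.
predictions = {
    "img_001.jpg": {"person", "chair"},
    "img_002.jpg": {"car"},
    "img_003.jpg": {"tree", "sky"},
    "img_004.jpg": set(),
}

kept = filter_by_detected_classes(predictions)
print(kept)  # ['img_001.jpg', 'img_003.jpg']
```

A side benefit of keeping the detected classes around is that they can double as coarse category labels for each image.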
We had a dataset with about 6000 image pairs for ‘Christmas’ and ‘not Christmas’ after sorting and filtering. Now came the time to train the GAN to learn this mapping.
We started small and trained a CycleGAN model on images of trees only. Concretely, we tried to accomplish a ‘tree to Christmas tree’ GAN. Here are some results:
Figure 8: Trees to Christmas Trees GAN. 6 pairs of images. Left in each pair is the original image, on the right the Christmasfied version given by the generator. All result images will follow this ordering.
In the pairs above, the left image is the normal input and the right is the ‘Christmasfied’ output. The top right pair is an exception, since there we just wanted to see how the Christmas effect looks on an already Christmassy image. The results seem encouraging, especially because this ‘tree only’ subset of the entire data collection was only 190 images (we had more tree images, but they contained other things such as people, so we didn’t add them to this subset).
We then tried to see if we can get this Christmasification to work on more types of images (not just trees) by using the full dataset. Here, CycleGAN didn’t seem to achieve any transformation except for brightening up the shade of red.
We posit this is because the cycle consistency loss penalizes large changes, which are needed to add color shifts, Santa hats, beards, and decorations. Indeed, weighting the cycle consistency loss lower helped improve the results, but not by much.
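To illustrate why the weight matters, here is a small sketch of how the cycle term enters the generator's total objective. The loss values are invented for illustration and do not come from our experiments. The weight of 10 is the default used in the CycleGAN paper, and the lower value of 1 is an arbitrary example.

```python
def generator_objective(adversarial_loss, cycle_loss, lambda_cycle):
    """Total generator loss: fool the discriminator, but stay reconstructable."""
    return adversarial_loss + lambda_cycle * cycle_loss

# Hypothetical loss values for two candidate translations of the same image.
subtle_edit = {"adversarial": 1.5, "cycle": 0.05}  # barely changed, easy to reconstruct
bold_edit = {"adversarial": 0.4, "cycle": 0.30}    # adds hats and lights, harder to undo

def preferred(lambda_cycle):
    """Return the name of the candidate with the lower total loss."""
    subtle = generator_objective(subtle_edit["adversarial"], subtle_edit["cycle"], lambda_cycle)
    bold = generator_objective(bold_edit["adversarial"], bold_edit["cycle"], lambda_cycle)
    return "subtle" if subtle < bold else "bold"

print(preferred(lambda_cycle=10.0))  # subtle
print(preferred(lambda_cycle=1.0))   # bold
```

With a large weight, the cheapest way to lower the total loss is to change the image as little as possible, which matches the mild reddening described above.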
We attempted the experiment again, but this time with DiscoGAN. At the start, DiscoGAN training seemed not to converge as we had problems with mode collapse.
By now we figured, since the concept being learned is too abstract, we might as well “weakly pair” the images.
By weakly pairing the images, we mean that for any image in the ‘normal’ domain containing a person, the image sampled from the ‘Christmas’ domain also contains a person. If A has a tree, then B has a tree too. We can think of this as making the explicit assumption that there are individual mapping functions that the network can learn for one category. We chose to limit ourselves to 3 sampling categories: People, trees, and markets as they were the most dominant in our dataset.
By doing this, we make the assumption that there are individual mapping functions for ‘person to Santa’, for ‘tree to Christmas tree’ etc. This may not be strictly true for neural networks in general. But we reasoned that, since we were sometimes using instance normalization layers with single-image batches, this would improve gradient flow, as the feature similarity between the paired images is higher.
The motivation was to decouple these assumed mappings in the hope of making the GAN easier to train. This can also be thought of as adding label information to the GAN training, which generally improves performance. This seemed to do the trick, as the DiscoGAN started converging and led to some fascinating results.
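The weak pairing described above comes down to a small change in how training samples are drawn. The sketch below assumes each domain's images have already been grouped by category. The `sample_weak_pair` helper, the dictionary layout, and the file names are hypothetical.

```python
import random

CATEGORIES = ("person", "tree", "market")

def sample_weak_pair(normal_by_category, christmas_by_category, rng=random):
    """Draw one image from each domain so that both share the same category.

    The two images are still unpaired: any normal tree can meet any Christmas tree.
    """
    category = rng.choice(CATEGORIES)
    normal = rng.choice(normal_by_category[category])
    christmas = rng.choice(christmas_by_category[category])
    return category, normal, christmas

# Hypothetical file lists, grouped by the category found in each image.
normal_by_category = {
    "person": ["n_person_1.jpg", "n_person_2.jpg"],
    "tree": ["n_tree_1.jpg"],
    "market": ["n_market_1.jpg", "n_market_2.jpg"],
}
christmas_by_category = {
    "person": ["x_santa_1.jpg"],
    "tree": ["x_tree_1.jpg", "x_tree_2.jpg"],
    "market": ["x_market_1.jpg"],
}

rng = random.Random(0)  # seeded only to make the example repeatable
for _ in range(3):
    print(sample_weak_pair(normal_by_category, christmas_by_category, rng))
```

Compared with fully random sampling, the only extra information this needs is one coarse label per image.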
First, let’s apply the DiscoGAN trained on all the data to just the tree images we saw before and compare how the learned mapping differs:
Figure 9: Results from Normal to Christmas generator.
Nice, the Christmas effect has definitely been enhanced compared to the results from above.
Interesting to note is the addition of snow in the bottom left one. In all of these, the effect seems similar to style transfer, with the addition of many small Christmas lights. It’s almost glittery. The middle row right shows some artifacts. (It has been proposed that they can be fixed by removing batchnorm layers from the generator’s final layers. We didn’t test this hypothesis, but this could be one area to improve the model.)
Now let’s see what happens on a random assortment of ‘normal’ images containing many different things:
Figure 10: More Christmas
It seems the model loves to add the color red, warmer tones, and lights, lots and lots of lights everywhere, even on pizzas. It’s interesting to see how Christmas is hallucinated on things normally not decorated during Christmas, like food. I guess we will have to take what we can get.
Figure 11: Moar Christmas!
Markets are, in our opinion, believably Christmasified. Yep, nothing wrong here; let’s keep moving.
Figure 12: Even Moar Christmas
Interesting one here: the trees are very nicely Christmasified, though the food is not. This is forgivable, as food should not have lights on it anyway, we’re told. The person in the bottom left is interesting. It appears there was an attempt to hallucinate a Santa hat, but it wasn’t realistic enough. Or it could just be a nicely placed artifact.
Figure 13: Even Moar more Christmas
It fares alright here, even when dealing with trains. The bottom right result is also great as there is now snow. We wouldn’t recommend eating the pizza though. In general, it doesn’t seem to fare too well on human faces; they are distorted to a very high degree.
Observant readers might know what’s coming here. If you look carefully at the diagrams for DiscoGAN and CycleGAN (and read their captions), you’ll notice that there is also a “Christmas to normal” generator. Indeed, as a byproduct of the way we model this image translation problem, we get what we would like to dub: the GrinchGAN.
So for those of you for whom Christmas cheer is misery and your sheer dislike of the holiday season and joy rivals that of the Grinch, here are some Christmassy images, with the joy removed by our GrinchGAN:
Figure 14: GrinchGAN removing joy and warmth from the world.
It seems the mapping GrinchGAN has learned is colder hues, bluer colors, distort all cheer, and dial down the lights. To us, it has a nuclear winter vibe. But maybe that was the Grinch’s intention.
This experiment was an interesting side project that gave us insights into how GANs work. With minimal effort, we were able to achieve the effect to a decent extent.
The main learning was that it’s not easy to train a GAN to translate images based on an abstract mapping concept like Christmas, instead of specific entities like stripes on a zebra. Data has a significant impact in these scenarios, as the training formulation is loosely constrained, data is unpaired, and there is little to no supervision.
Also, looking back, there are a few other things we could have tried which would maybe have improved the results of our ChristmasGAN:
1. Tuning the dataset: the results would have been much nicer. Part of the difficulty is in gathering images from the two domains. Christmas images are almost always in great shape. The lighting is great, people are dressed up and posing, and the camera work is usually professional. In contrast, everyday images are not like this. It’s hard to gather such images with a high degree of similarity, unlike horses and zebras. Just try searching for images of “coniferous trees” versus “Christmas trees” to see the stark difference in how many images you end up with. Not a lot of people care about taking pictures of conifers before they are decorated.
That’s about it for this post. Whether you are joyful or mean spirited, we hope our Christmas or Grinch GAN could bring you some joy or at least some practical insights.
If you would like to discuss this blog post, give feedback, or share ideas, feel free to start a discussion in our community.
We’re a Berlin-based startup building the next-gen annotation tool for computer vision. We’re constantly trying out new stuff to see how we could improve the algorithms behind our AI-Assistants. They allow you to annotate data 10x faster and provide you with rapid feedback so you can validate and adapt your models as you work.