🧠 Demystifying AI: is the AI just making collages?

In the previous lecture you discovered AI has a hard time rendering things like humans and text. It also gets especially bad with hands and vector art.

But why does the AI seem to be unable to generate these? It all comes down to how the so-called diffusion models work.


By the end of this section, you'll understand which tasks AI struggles with and why. That will save you time and frustration, and it already puts you ahead of the majority of people creating images out there.


๐Ÿ‘ต๐Ÿป AI explained for your abuelita


Look, I want to keep it simple, so let's get a 30,000-foot overview of what you need to know to get a grasp on AI. An Artificial Intelligence is just a piece of software that can do tasks commonly associated with intelligent beings – and, most of the time, that's us humans. 💁🏻‍♂️

The software is able to do that because it has been previously trained (by a human) on how to do that particular task. Let's look at an example: say we want to build an AI to tell apart 🐱 from 🐶.

In order to train the AI, we need to show it thousands of images of what cats and dogs look like, something like this:

"Hey AI, this is what cats look like..."

"...and this is what doggos look like"

So the software goes image by image, reading every pixel and extracting patterns of what usually makes a cat look like a cat (long whiskers, pointy ears, snout shape, no leash...), and what makes a dog look like a dog.

In a practical sense, an AI is just a piece of software that takes an input, processes it, and provides an output. So, after training, you get a program that can answer either "cat" or "dog" depending on the input image you provide.

In this particular example, what we've described is called a classification model. It takes something (an image, in this case) and classifies it into one of two classes: either "cat" or "dog". The word "model" refers to the actual piece of software that implements the AI – you'll probably hear the words "model" and "AI" used interchangeably, and for our purposes we can assume they're the same.
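
If it helps, here's what that "input, processing, output" loop might look like as code. This is a toy sketch with made-up rules, not how a real classifier works: a real model learns millions of numeric patterns from the training images instead of three hand-written checks.

    # A toy stand-in for a trained cat-vs-dog classifier.
    # The "patterns" below are invented for illustration; a real model learns
    # its own patterns (as numbers) from thousands of labeled photos.
    def classify_pet(image_features: dict) -> str:
        """Take an input (what's visible in the photo), process it, return an output."""
        cat_score = 0
        if image_features.get("pointy_ears"):
            cat_score += 1
        if image_features.get("long_whiskers"):
            cat_score += 1
        if image_features.get("wearing_leash"):
            cat_score -= 1  # a leash is a strong hint we're looking at a dog
        return "cat" if cat_score > 0 else "dog"

    print(classify_pet({"pointy_ears": True, "long_whiskers": True}))  # cat
    print(classify_pet({"wearing_leash": True}))                       # dog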

But there are many types of AI.

Classification models are just a tiny subset. Other models can generate things instead of classifying them. The models that are most relevant to us are called Diffusion Models, a kind of generative model – and it's important to understand how they work in order to know their strengths and limitations.


๐Ÿ‘ฉ๐Ÿปโ€๐ŸŽจ Diffusion Models in a nutshell


Stable Diffusion, DALL-E 2, and Midjourney are all examples of Diffusion Models, and they all work in a similar way. Pretty much like our pet classifier, they have been trained using thousands (up to billions!) of images. The difference is that now we're not interested in which category an image belongs to – we're interested in what text best describes an image, and the other way round.

So what we do is show the AI many images, each along with the text that best describes it:

"A cat laying on the floor"

The idea is that after showing the AI enough of these image-and-text pairs, it will start learning what "cat" means, what "laying" means, what "floor" means... and the patterns in the image that represent those concepts. In the end, the AI learns not only what the word "cat" means, but also the best way of representing it in a picture.

Think about how you would teach a little kid how to draw a flower. You would show them many different flowers, and the kid would eventually start recognizing the patterns that make up a flower: a long green stem, a round-shaped center and some petals around it.

An AI works much like that: it doesn't "copy existing images" or "make collages" of different images. Instead, it extracts the patterns that make up the different concepts in the image and learns how to combine them in a brand new creation – just like a kid would do.

The way it does that, though, is a bit different to how we humans do it.


🖼 The diffusion process


It sounds a bit counterintuitive, but the way an AI learns to do this is through a process called diffusion, in which it adds noise to an image until it becomes unrecognizable:

The idea here is that it's easier to "de-noise" an image (make it just a bit less noisy) than to generate a brand new image from scratch. So if we de-noise something for long enough – the AI researchers figured – we'll end up with a clean image.

I know it sounds weird, but trust me, it's simple. Here's how the AI learns to generate images (there's a tiny code sketch right after this list if you want to see the first few steps in action):

  1. It takes the original image and adds a bit of noise.
  2. Then adds a bit more noise again.
  3. Keeps doing it until the image is all noise.
  4. Now, for every noisy image, it tries to learn how to de-noise it back into the previous, less noisy image.
  5. It keeps doing this until it learns to remove all the noise and generate the original image from pure noise. The whole time, the AI knows what the original image should represent ("a cat laying on the floor"), so it learns to remove noise in a way that gets closer to a cat on the floor, and not just random de-noising.
  6. Once the AI has repeated this process enough times with millions of images it learns how to go from random noise to a clear image, based on a description of what the image should look like. It effectively learns how to represent a cat, how to draw a flower, or how to paint in Van Gogh's style.
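
If you're curious what steps 1-3 look like in practice, here's a toy sketch in Python using NumPy. The 64x64 "image", the fifty steps, and the 0.1 noise blend are all made-up numbers for illustration; real diffusion models use carefully designed noise schedules and much richer images.

    import numpy as np

    rng = np.random.default_rng(0)

    # Pretend this tiny 64x64 grid of numbers is a grayscale photo of a cat.
    image = rng.random((64, 64))

    noisy = image.copy()
    for step in range(50):
        # Steps 1-3 of the recipe: blend in a little random noise each time.
        noise = rng.normal(0.0, 1.0, size=noisy.shape)
        noisy = 0.9 * noisy + 0.1 * noise

    # After enough small steps, the original image is essentially gone:
    print("similarity to original:", np.corrcoef(image.ravel(), noisy.ravel())[0, 1])

    # Training (steps 4-6) runs this movie in reverse: a neural network sees the
    # noisy version plus the caption "a cat laying on the floor" and learns to
    # predict the slightly less noisy one, over and over, for millions of images.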

The training process produces a file that stores all this "de-noising" knowledge, usually with the .ckpt (checkpoint) extension. That file is, in essence, a diffusion model: an AI that can generate images from text. The text you give the AI as input is called a prompt.
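
Just so you can picture what "giving a prompt to a diffusion model" looks like, here's a minimal sketch using the open-source Hugging Face diffusers library. Treat it as an illustration: the model name is just an example (it may have moved or been renamed by the time you read this), and you'd need the library installed and a GPU for it to run at a reasonable speed.

    # A minimal sketch: prompt in, image out, using a pre-trained diffusion model.
    # Assumes the diffusers and torch packages are installed; the model id below
    # is an example and may have moved by the time you read this.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # a trained checkpoint (the "de-noising knowledge")
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")  # diffusion is painfully slow without a GPU

    prompt = "a cat laying on the floor"
    image = pipe(prompt).images[0]  # pure noise -> gradual de-noising -> picture
    image.save("cat.png")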


🤚 Details and noise are not friends


Now that we know how the AI learned to create images, we can get an intuitive idea of why certain types of images are difficult for diffusion models. The devil, as always, is in the details.

When we add noise to images, we lose the fine details. Things that originally looked clear now look blurry, and it's hard to tell what's what: where something ends and where something else starts.

On top of that, these AIs don't use the full-resolution image for training. In order to work, they have to downscale the image, usually to 512x512 pixels. As if things weren't hard enough already, now the image is extra small too. Here's what the hand looks like:

The main takeaway here is: details get lost in the diffusion process. It's hard for the AI to "see" where something begins and where it ends. Any kind of image that relies heavily on sharp edges (like text, or vector images) and precise connections (like anatomically correct body parts) is going to be a struggle.
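
You can get a feel for this detail loss with a couple of lines of Python using the Pillow library. The file name here is a placeholder; try it with any high-resolution photo of hands or text.

    # Rough illustration of detail loss: shrink a detailed photo to 512x512
    # (the size many diffusion models were trained at), then enlarge it again.
    # "hands.jpg" is a placeholder; use any high-resolution photo you have.
    from PIL import Image

    original = Image.open("hands.jpg")
    small = original.resize((512, 512))     # roughly what the model gets to "see"
    restored = small.resize(original.size)  # blow it back up so you can compare

    restored.save("hands_after_downscale.jpg")
    # Open both files side by side: fingers, letters and other fine details
    # come out noticeably mushier in the downscaled-and-restored version.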

The AI just doesn't have enough "resolution" to understand fine details.

This holds especially true for hands, because they account for just a tiny proportion of the pixels in an image. Think about it this way: in a picture of a person, 95% of the pixels represent things other than hands. Only a tiny fraction of all the information in the image represents the hands. If the model already struggles with fine details, it struggles even more when those details occupy a tiny fraction of an already small image.
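
To put some rough numbers on that (the 5% share is just the illustrative guess from the paragraph above, not a measured statistic):

    # Back-of-the-envelope pixel budget for hands in a 512x512 image.
    # The 5% share of the frame is an assumption for illustration only.
    total_pixels = 512 * 512           # 262,144 pixels in the whole image
    hand_pixels = total_pixels * 0.05  # ~13,107 pixels for both hands
    per_hand = hand_pixels / 2         # ~6,553 pixels per hand

    print(int(per_hand))  # roughly an 80x80 patch to fit a whole hand, fingers and all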

If you absolutely need a high level of detail in a small part of the image, you're better off generating that particularly difficult part at full resolution and then outpainting the rest – later in the course we'll cover what that means and how to do it!




✅ Before you move on

Wow, you made it all the way down here! I'm proud of you. I know this might have been quite the info dump, so pat yourself on the back! You're doing great. Here, let's do something fun. How many people do you think will make it all the way down here? Say "Has anyone seen my macaroni?" in the #academy Discord chat – it'll be our own inside joke 😉 Only people who have made it here will know what it means!

