CS180 Project 5: The Power of Diffusion Models!
Author: Nicolas Rault-Wang (nraultwang at berkeley.edu)
Credit to Notion for this template.
Part A: The Power of Diffusion Models!
Part A0: Setup
"an oil painting of a snowy mountain village"
: In general, the outputs are consistent with the prompt but have some details that seem wrong. For example, the building roofs are covered with snow but, except for the 500-iteration output, the trees are not. Unless the villagers like to shake the snow off their trees, those trees should be snow-covered too.
"a man wearing a hat"
: The diffusion model seemed to interpret "a man" as "a middle-aged or old white man in formal clothes".
"a rocket ship"
: These were all cartoonish rockets with a single circular window, a torpedo-shaped body, and sharp rounded fins, just taking off from the ground. None of them looks very realistic.
0.1 Random seed
I seeded all of my work with YOUR_SEED = 180
Part A1: Sampling Loops
1.1 Implementing the Forward Process
The forward process traces out a path from a clean image $x_0$ to a purely noisy image $x_T$ in a given number of time steps $T$. Below we visualize the original clean image and some noisy images along this path. A diffusion model is given many of these paths played in reverse to learn how to iteratively denoise a purely noisy image into a clean image.
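For reference, here is a minimal sketch of the forward process under the standard DDPM parameterization; `alphas_cumprod` (holding the cumulative products $\bar\alpha_t$) is an assumed name for the scheduler's noise coefficients:

```python
import torch

def forward_process(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # i.i.d. standard Gaussian noise
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * eps
```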
1.2 Classical Denoising
Since independent, identically distributed Gaussian noise has no correlation between neighboring pixels, it consists largely of high-frequency information. Classical denoising techniques therefore suggest low-pass filtering the noise-degraded image. One problem with this method is that it is indiscriminate: it removes the highest frequencies of both the image and the noise, resulting in a less noisy but blurry image.
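A minimal sketch of this classical baseline using torchvision's Gaussian blur; the kernel size and sigma below are illustrative choices, not tuned values:

```python
import torch
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
    """Low-pass the noisy image; this removes high-frequency noise and image detail alike."""
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```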
1.3 One-Step Denoising
The DeepFloyd model is a diffusion model that has been trained to estimate the noise component added to a clean image. Given a noisy image, we can then use DeepFloyd's noise estimate to recover a noise-free image. Notice that at lower noise levels the recovered image is quite close to the input and free of noise. At higher noise levels, however, the recovered image departs further from the original and lacks realistic high-frequency details.
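Given the model's noise estimate, the clean-image estimate follows by inverting the forward-process equation. A sketch, where `eps_hat` stands in for the UNet's predicted noise:

```python
import torch

def estimate_x0(xt: torch.Tensor, eps_hat: torch.Tensor, t: int,
                alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps to recover x_0."""
    abar_t = alphas_cumprod[t]
    return (xt - torch.sqrt(1.0 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```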
1.4 Iterative Denoising
Iterative denoising is based on the idea that it's easier to remove a small amount of noise than a lot of noise. A model like DeepFloyd uses this strategy to gradually denoise an image in small steps. As the plots below show, iterative denoising can produce more realistic results than single-step denoising or Gaussian filtering.
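A sketch of one strided denoising step in the DDPM style; `t` is the current timestep, `t_prev` the next (less noisy) one, and the added variance term is omitted for brevity:

```python
import torch

def denoise_step(xt, x0_hat, t, t_prev, alphas_cumprod):
    """Blend the clean-image estimate with the current noisy image to step from t to t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev  # effective per-step alpha for the strided schedule
    beta = 1.0 - alpha
    x_prev = (torch.sqrt(abar_prev) * beta / (1.0 - abar_t)) * x0_hat \
           + (torch.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)) * xt
    return x_prev  # a variance term would be added here in the full update
```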
1.5 Diffusion Model Sampling
Since a diffusion model has learned to gradually denoise an image degraded by Gaussian noise, it can produce clean images from pure Gaussian noise if we start it at its maximum noise level $t = T$. This process lets us sample from the distribution of realistic images learned by the model.
1.6 Classifier-Free Guidance (CFG)
CFG leverages a diffusion model's learned associations between images and text embeddings to steer the sampling process toward outputs that are more consistent with a target text prompt.
It works by producing two noise estimates at each time step: $\epsilon_u$, conditioned on a null text prompt, and $\epsilon_c$, conditioned on the target text prompt. The noise estimate to remove, $\epsilon$, is then computed with the following equation:

$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u),$$

where $\gamma > 1$. The intuition for this equation is the following: $\epsilon_c - \epsilon_u$ is a vector in the direction of the target prompt, and when added to $\epsilon_u$ it produces a noise estimate pointing toward the region of images associated with the target prompt.
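A sketch of the per-step combination; the default scale here is only an example value:

```python
import torch

def cfg_noise_estimate(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                       gamma: float = 7.0) -> torch.Tensor:
    """eps = eps_u + gamma * (eps_c - eps_u); gamma > 1 amplifies the prompt direction."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```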
Below are samples from DeepFloyd created with CFG guidance scale $\gamma$ and target prompt "a high quality photo". Notice how the CFG samples are more photo-like and realistic compared to the non-CFG samples.
1.7 Image-to-Image Translation
Since a diffusion model hallucinates new image features as it denoises, we can use it to make edits to a given clean image. This is the SDEdit algorithm: first add Gaussian noise to a clean image, then denoise it with an iterative diffusion model using CFG for guidance.
Below are some example edits of the test image with the text prompt "a high quality photo" at noise levels [1, 3, 5, 7, 10, 20]. Notice that the degree of editing is roughly proportional to the amount of noise added, with lower noise levels staying closer to the original image.
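A sketch of SDEdit under these assumptions; `iterative_denoise_cfg` refers to the project's denoising loop, whose exact signature is assumed here:

```python
import torch

def sdedit(x0, i_start, prompt_embeds, timesteps, alphas_cumprod, iterative_denoise_cfg):
    """Add forward-process noise up to timesteps[i_start], then denoise back with CFG."""
    t = timesteps[i_start]
    abar = alphas_cumprod[t]
    xt = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * torch.randn_like(x0)
    return iterative_denoise_cfg(xt, i_start=i_start, prompt_embeds=prompt_embeds)
```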
1.7.1 Editing Hand-Drawn and Web Images
An extension of this idea is creating realistic-looking images starting from a simple hand-drawn sketch. As shown below, the diffusion model can build a high-quality image on top of a low-quality hand-drawn one.
1.7.2 Inpainting
Diffusion models can also be used to edit specific parts of an image indicated by a binary mask $\mathbf{m}$. At each time step $t$, the model's estimate of $x_{t-1}$ from $x_t$ is masked so that pixels outside $\mathbf{m}$ are replaced with the original image noised to level $t-1$:

$$x_{t-1} \leftarrow \mathbf{m}\, x_{t-1} + (1 - \mathbf{m})\,\text{forward}(x_{\text{orig}}, t-1).$$
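A sketch of this per-step masking; the names are illustrative:

```python
import torch

def inpaint_mask_step(xt, x_orig, mask, t, alphas_cumprod):
    """Keep the model's pixels inside the mask; outside it, force the original image at noise level t."""
    abar = alphas_cumprod[t]
    x_orig_t = torch.sqrt(abar) * x_orig + torch.sqrt(1.0 - abar) * torch.randn_like(x_orig)
    return mask * xt + (1.0 - mask) * x_orig_t
```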
The effect of this masking at each step is that the edited region stays consistent with the rest of the original image. Below are some inpainting examples with the edits conditioned to produce "a high quality photo":
1.7.3 Text-Conditioned Image-to-Image Translation
The SDEdit algorithm also works for prompts other than "a high quality photo", enabling image-to-image translation conditioned on any prompt. Below are some examples.
1.8 Visual Anagrams
Visual anagrams is a technique that steers a diffusion model's reconstruction toward two text prompts instead of one, such that one orientation of the output matches one prompt and the upside-down orientation matches the other. Algorithmically, at each step we create one CFG noise estimate per orientation, each conditioned on its respective prompt, and use their average as the combined estimate that guides the model.
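A sketch of the combined estimate, assuming `unet_eps` wraps the UNet's CFG noise estimate for a given prompt:

```python
import torch

def rot180(x: torch.Tensor) -> torch.Tensor:
    """Rotate an image tensor by 180 degrees (upside down)."""
    return torch.rot90(x, k=2, dims=[-2, -1])

def anagram_noise_estimate(xt, unet_eps, prompt1, prompt2):
    """Average the estimate for prompt1 with the un-rotated estimate for prompt2 on the rotated image."""
    eps1 = unet_eps(xt, prompt1)                  # normal orientation
    eps2 = rot180(unet_eps(rot180(xt), prompt2))  # upside-down orientation
    return 0.5 * (eps1 + eps2)
```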
Unflipped: 'an oil painting of people around a campfire'. Flipped: 'an oil painting of an old man'
Unflipped: 'a photo of a dog'. Flipped: 'a photo of a hipster barista'
Unflipped: 'an oil painting of a snowy mountain village'. Flipped: 'an oil painting of an old man'
1.9 Hybrid Images
Like visual anagrams, Factorized Diffusion steers a diffusion model's reconstruction toward two text prompts instead of one. But instead of orientations, it makes the low frequencies of the output look like one prompt and the high frequencies look like the other. At each time step $t$, the CFG noise estimate $\epsilon$ is obtained by first computing two noise estimates $\epsilon_1, \epsilon_2$ conditioned on the respective prompts, filtering them with a Gaussian (low-pass) or Laplacian (high-pass) filter, and adding the filtered estimates together:

$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2).$$
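A sketch of this combination, using a Gaussian blur as the low-pass filter and its residual as the high-pass; the kernel size and sigma are illustrative:

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps1, eps2, kernel_size: int = 33, sigma: float = 2.0):
    """Low frequencies of eps1 (prompt 1) plus high frequencies of eps2 (prompt 2)."""
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```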
Low-pass: 'a lithograph of a skull', High-pass: 'a lithograph of waterfalls'
Low-pass: 'a lithograph of a skull', High-pass: 'an oil painting of a snowy mountain village'
Low-pass: 'a rocket ship', High-pass: 'a man wearing a hat'
Low-pass: 'a lithograph of waterfalls', High-pass: 'a rocket ship'
Low-pass: 'an oil painting of an old man', High-pass: 'a lithograph of waterfalls'
Low-pass: 'a photo of a dog', High-pass: 'a lithograph of waterfalls'
Part A Bells & Whistles
Recursive Text-Conditioned Image-To-Image Translations
- I wanted to see what would happen if we recursively applied the SDEdit algorithm to an input image, conditioned on a given prompt. This was motivated by the idea of exploring DeepFloyd's image manifold in a neighborhood of an original clean image.
- My implementation feeds the output of iterative_denoise_cfg back into itself using the same prompt and the same starting index into the timesteps array. To stay in the neighborhood of the clean image, I only added a small amount of noise to be denoised, encouraging the SDEdit process to make a series of generally small edits each iteration.
- To display the sequence of changes, I created a looping gif with mediapy's show_video function.
- Below, after a short implementation sketch, are some examples of the process.
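A minimal sketch of the recursion, with `sdedit` standing in for one add-noise-then-denoise pass:

```python
def recursive_sdedit(x0, i_start, prompt_embeds, n_iters, sdedit):
    """Repeatedly feed each SDEdit output back in at the same starting index."""
    frames = [x0]
    for _ in range(n_iters):
        frames.append(sdedit(frames[-1], i_start, prompt_embeds))
    return frames  # convert to numpy and pass to mediapy's show_video for the gif
```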
Starting_idx ≥ 15
For small initial noise levels (large starting indices), the recursive edit procedure works well for a small number of iterations. However, the diffusion model's outputs become less and less realistic in later iterations. In particular, the outputs generally approach an image with rainbow colors and repeating patterns (the rainbow dimension).
- Decent examples
- The Rainbow dimension
10 ≤ Starting_idx < 15
- For this range of starting indices, the model output remains consistently realistic, even for a large number of iterations. While the generation process seems to jump away from the initial input image, it does settle into a stable small-editing process after the first few iterations. This may be because the higher initial noise level lets the model recover from deviations from realism that build up over successive iterations.
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising UNet
1.2 Using the UNet to Train a Denoiser
The forward process is used to trace out trajectories from a clean digit to a noisy digit. We train our denoiser on the reversed trajectory so it can learn how to gradually remove noise from the image. In Part 1, we start by training a single-step denoising UNet. In Part 2, we implement the iterative version using this forward process.
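A sketch of building one training pair at a fixed noise level $\sigma$:

```python
import torch

def make_training_pair(x: torch.Tensor, sigma: float):
    """Return (noisy input z, clean target x) with z = x + sigma * eps."""
    z = x + sigma * torch.randn_like(x)
    return z, x
```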
1.2.1 Training
1.2.2 Out-of-Distribution Testing
Our unconditional UNet was trained to denoise at a single noise level $\sigma$. The plots below show the model's performance denoising images at noise levels it wasn't trained to handle. As the image gets noisier beyond the training level, the model's performance degrades.
Part 2: Training a Diffusion Model
2.1 Adding Time Conditioning to UNet
Time-conditioning gives the UNet information about where along the reconstruction trajectory it is. This is useful for implementing a model to denoise a noisy image in small steps because the time-conditioning indicates how noisy the input image is and thus how much noise the model should try to remove in its next denoising step.
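One simple way to inject the timestep, in the spirit of the project's fully connected conditioning blocks; the layer sizes and names here are illustrative:

```python
import torch
import torch.nn as nn

class TimeEmbed(nn.Module):
    """Small MLP mapping a normalized timestep t/T to a per-channel modulation."""
    def __init__(self, num_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.GELU(),
                                 nn.Linear(hidden, num_channels))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: shape (batch,) with values in [0, 1]; output broadcasts over spatial dims
        return self.net(t[:, None])[:, :, None, None]
```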
Below are samples from the time-conditioned UNet after training epochs 1, 5, 10, 15, and 20.
2.2 Adding Class-Conditioning to UNet
Class-conditioning enables our diffusion model to learn associations between digit classes $c \in \{0, \dots, 9\}$ and their handwritten image forms. During sampling, we can then use classifier-free guidance (CFG) to steer the model's path on the MNIST image manifold so that it arrives in the neighborhood of a specific digit rather than an arbitrary one. As the samples below indicate, using both time-conditioning and class-conditioning results in more realistic, higher-quality generated images than time-conditioning alone.
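A sketch of the training-time conditioning vector; zeroing out the class some fraction of the time is what lets the model also learn the unconditional estimate that CFG needs (the 10% rate here is illustrative):

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels: torch.Tensor, num_classes: int = 10, p_uncond: float = 0.1):
    """One-hot class conditioning, dropped to the null (zero) vector with probability p_uncond."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1) >= p_uncond).float()
    return c * keep
```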
Below are samples from the class-conditioned UNet at epochs 1, 5, 10, 15, and 20.
Part B Bells & Whistles
Sampling Gif
- To create a sampling gif, we cache the intermediate denoising states of a mini-batch as the DDPM model denoises it. We get 4 samples of each digit by conditioning the model on four copies of each class. Then we use the media.show_video method from the mediapy library to create the looping gifs shown above.
Skip Connections in the Time-Conditioned Model
- In deep networks, information from earlier layers may be useful to later layers. ResNets add "skip connections," which add a layer's input to its output.
- This gives the network the option to easily learn an identity transform $x \mapsto x$ if a layer is not needed, which can help the model learn a better fit to the data.
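A minimal residual block of this kind; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv block with an additive skip: output = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # the identity is recovered when body(x) is near zero
```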
Here are the results of using skip connections in the time-only conditioned DDPM model. These figures show a slightly better fit to the data.