
CS180 Project 5: The Power of Diffusion Models!

Author: Nicolas Rault-Wang (nraultwang at berkeley.edu)

Credit to Notion for this template.

Part A: The Power of Diffusion Models!

Part A0: Setup

5 iterations
50 iterations
100 iterations
500 iterations

"an oil painting of a snowy mountain village": In general, the outputs are consistent with the prompt but have some details that seem wrong. For example, the building roofs are covered with snow but, except in the 500-iteration output, the trees are not. Unless the village people like to shake the snow off their trees, those trees should be covered with snow too.

"a man wearing a hatā€: The diffusion model seemed to think ā€œa manā€ means ā€œa middle aged or old white man in formal clothesā€.

"a rocket ship": These were all cartoonish rockets just taking off from the ground, each with a single circular window, a torpedo-shaped body, and sharp, rounded fins. None of them look very realistic.

0.1 Random seed

I seeded all of my work with YOUR_SEED = 180

Part A1: Sampling Loops

1.1 Implementing the Forward Process

The forward process traces out a path from a clean image ($t = 0$) to a purely noisy image ($t = T$) in a given number of time steps $T$. Below we visualize the original clean image and some noisy images along this path. A diffusion model is given many of these paths played in reverse to learn how to iteratively denoise a purely noisy image into a clean image.
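A minimal sketch of this step, assuming a DDPM-style schedule (the helper name alphas_cumprod and the function interface are my own, not DeepFloyd's API): sample $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): interpolate toward pure Gaussian noise.

    x0: clean image tensor (C, H, W); t: integer timestep in [0, T-1];
    alphas_cumprod: cumulative products of the noise schedule, shape (T,).
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # i.i.d. Gaussian noise, same shape as the image
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return xt, eps
```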

1.2 Classical Denoising

Since independent, identically distributed Gaussian noise has no correlation between neighboring pixels, it consists of lots of high-frequency information. Classical denoising techniques suggest that low-passing the noise-degraded image can help clean it up. One problem with this method is that it is indiscriminate: it removes the highest frequencies of both the image and the noise, resulting in a less noisy but blurry image.
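For example, a simple Gaussian blur (the kernel size and sigma below are illustrative choices, not the values from my experiments):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy, kernel_size=5, sigma=2.0):
    """Low-pass filter the noisy image with a Gaussian blur.

    Suppresses high-frequency noise, but also blurs genuine image detail.
    """
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```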

1.3 One-Step Denoising

The DeepFloyd model is a diffusion model that has been trained to estimate the noise component added to a clean image. Given a noisy image, we can use DeepFloyd's noise estimate to recover an approximately noise-free image. Notice that at lower noise levels the recovered image is quite close to the input and is free of noise. At higher noise levels, however, the recovered image both differs more from the original and lacks realistic high-frequency details.
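One-step denoising just inverts the forward-process equation using the model's noise estimate (a sketch; unet(xt, t) is an assumed simplified interface, whereas the real DeepFloyd call also takes prompt embeddings):

```python
def one_step_denoise(unet, xt, t, alphas_cumprod):
    """Estimate the clean image in a single step:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(xt, t)  # model's estimate of the noise in x_t
    return (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```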

1.4 Iterative Denoising

Iterative denoising is based on the idea that it's easier to remove a small amount of noise than a lot of noise. A model like DeepFloyd uses this strategy to gradually denoise an image in small steps. As the plots below show, iterative denoising can produce more realistic results than either single-step denoising or Gaussian filtering.
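A sketch of the strided denoising loop, reusing the helpers above (the interpolation follows the standard DDPM update; I omit the added-noise variance term for brevity):

```python
def iterative_denoise(unet, x, timesteps, alphas_cumprod):
    """Denoise in strided steps, e.g. timesteps = [990, 960, ..., 30, 0]."""
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1 - alpha_t
        eps_hat = unet(x, t)  # predicted noise at the current level
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        # Interpolate between the clean-image estimate and the current image.
        x = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
          + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x
    return x
```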

1.5 Diffusion Model Sampling

Since a diffusion model has learned to gradually denoise an image degraded by Gaussian noise, it can produce clean images from pure Gaussian noise if we condition it to start at its maximum noise level $T$. This process lets us sample from the distribution of realistic images learned by the model.

1.6 Classifier-Free Guidance (CFG)

CFG leverages a diffusion model's learned associations between images and text embeddings to steer the sampling process towards outputs that are more consistent with a target text prompt.

It works by producing two noise estimates at each time step: $\epsilon_u$, conditioned on a null text prompt, and $\epsilon_c$, conditioned on the target text prompt. Then the noise estimate to remove, $\epsilon$, is computed with the following equation

Equation A.4 from the project website.
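Reconstructed from the description below (so the notation, not the handout's exact typography, is mine), the equation reads:

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$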

where $\gamma > 1$. The intuition for this equation is the following: $\gamma(\epsilon_c - \epsilon_u)$ is a vector in the direction of the target prompt, and adding it to $\epsilon_u$ produces a noise estimate pointing towards the region of images associated with the target prompt.

Below are samples from DeepFloyd created with CFG guidance scale $\gamma = 7$ and target prompt "a high quality photo". Notice how the CFG samples are more photo-like and realistic compared to the non-CFG samples.

1.7 Image-to-Image Translation

Since a diffusion model can hallucinate new image features as it denoises a noisy image, we can use it to make edits to a given clean image. This is the SDEdit algorithm: first add Gaussian noise to a clean image, then denoise it with an iterative diffusion model using CFG for guidance.
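In terms of the helpers sketched earlier, SDEdit is just the forward process followed by a partial denoising loop (start_idx is my own name for the level at which denoising begins):

```python
def sdedit(unet, x_clean, start_idx, timesteps, alphas_cumprod):
    """Noise a clean image to timesteps[start_idx], then denoise from there.

    With a decreasing timesteps list, a smaller start_idx means more added
    noise and therefore larger edits.
    """
    x_noisy, _ = forward_process(x_clean, timesteps[start_idx], alphas_cumprod)
    return iterative_denoise(unet, x_noisy, timesteps[start_idx:], alphas_cumprod)
```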

Below are some example edits of the test image with the text prompt "a high quality photo" at noise levels [1, 3, 5, 7, 10, 20]. Notice that the degree of editing is roughly proportional to the amount of noise added: less noise yields results closer to the original image.

1.7.1 Editing Hand-Drawn and Web Images

An extension of editing an image with a diffusion model is creating realistic-looking images starting from a simple hand-drawn one. As shown below, the diffusion model can build a high-quality image on top of a low-quality hand-drawn sketch.

1.7.2 Inpainting

Diffusion models can also be used to edit specific parts of an image indicated by a binary mask $m$. At each time step $t$, the model's estimate of $x_t$ from $x_{t+1}$ is masked with the original image's pixels at noise level $t$.

Equation A.5 from the project website.
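Reconstructing the masking step from the description above (with $\text{forward}(\cdot, t)$ the forward process from section 1.1; this is my rendering of the handout's equation, not a verbatim copy):

$$x_t \leftarrow m\,x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$$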

The effect of this masking at each step is that the region outside the mask stays consistent with the original image, while new content is generated inside it. Below are some inpainting examples with the edits conditioned to produce "a high quality photo":

1.7.3 Text-Conditioned Image-to-Image Translation

The SDEdit algorithm also works for prompts other than "a high quality photo", enabling image-to-image translation conditioned on any prompt. Below are some examples.

Prompt: 'a rocket ship'
Prompt: 'a lithograph of waterfalls'
Prompt: 'a pencil'
Prompt: 'a lithograph of a skull'
Prompt: 'an oil painting of a snowy mountain village'

1.8 Visual Anagrams

Visual anagrams is a technique that steers a diffusion model's reconstruction towards two text-prompt targets instead of one, such that one orientation of the image looks like one prompt and the upside-down orientation looks like the other. Algorithmically, this is done by creating one noise estimate per image orientation at each step, each conditioned on its respective prompt with CFG, and guiding the model with the combined (averaged) noise estimate.

Visual anagrams algorithm, as presented on the project website.
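A sketch of the combined estimate (the cfg_noise_estimate helper and the unet(xt, t, emb) interface are assumptions, not DeepFloyd's actual API):

```python
import torch

def cfg_noise_estimate(unet, xt, t, emb, null_emb=None, gamma=7.0):
    """Two-pass CFG estimate, as in section 1.6."""
    eps_c = unet(xt, t, emb)       # conditioned on the target prompt
    eps_u = unet(xt, t, null_emb)  # conditioned on the null prompt
    return eps_u + gamma * (eps_c - eps_u)

def anagram_noise_estimate(unet, xt, t, emb_p1, emb_p2):
    """Average a CFG estimate for the upright image (prompt 1) with a CFG
    estimate computed on the flipped image (prompt 2), flipped back."""
    eps1 = cfg_noise_estimate(unet, xt, t, emb_p1)
    eps2 = torch.flip(
        cfg_noise_estimate(unet, torch.flip(xt, dims=[-2]), t, emb_p2),
        dims=[-2],
    )
    return (eps1 + eps2) / 2
```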

Unflipped: 'an oil painting of people around a campfire'. Flipped: 'an oil painting of an old man'

Unflipped: 'a photo of a dog'. Flipped: 'a photo of a hipster barista'

Unflipped: 'an oil painting of a snowy mountain village'. Flipped: 'an oil painting of an old man'

1.9 Hybrid Images

Like visual anagrams, Factorized Diffusion steers a diffusion model's reconstruction towards two text-prompt targets instead of one. But instead of orientations, it makes the low frequencies of the output look like one prompt and the high frequencies look like the other. At each time step $t$, the CFG noise estimate $\epsilon$ is obtained by first computing two noise estimates conditioned on the respective prompts, filtering one with a Gaussian (low-pass) filter and the other with a Laplacian (high-pass) filter, and adding the filtered estimates together:

Factorized diffusion noise estimate. Source: project website.
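A sketch using the cfg_noise_estimate helper from section 1.8 (the kernel size and sigma below are illustrative, not my tuned values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, emb_low, emb_high,
                          kernel_size=33, sigma=2.0):
    """Low frequencies follow emb_low's prompt, high frequencies emb_high's."""
    eps_low = cfg_noise_estimate(unet, xt, t, emb_low)
    eps_high = cfg_noise_estimate(unet, xt, t, emb_high)
    blur = lambda e: TF.gaussian_blur(e, kernel_size=kernel_size, sigma=sigma)
    # Gaussian low-pass of one estimate + Laplacian high-pass of the other.
    return blur(eps_low) + (eps_high - blur(eps_high))
```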

Low-pass: 'a lithograph of a skull', High-pass: 'a lithograph of waterfalls'

Low-pass: 'a lithograph of a skull', High-pass: 'an oil painting of a snowy mountain village'

Low-pass: 'a rocket ship', High-pass: 'a man wearing a hat'

Low-pass: 'a lithograph of waterfalls', High-pass: 'a rocket ship'

Low-pass: 'an oil painting of an old man', High-pass: 'a lithograph of waterfalls'

Low-pass: 'a photo of a dog', High-pass: 'a lithograph of waterfalls'

Part A Bells & Whistles

Recursive Text-Conditioned Image-to-Image Translations

starting_idx ≥ 15

With these starting indices, the recursive edit procedure works well for a small number of iterations. However, the diffusion model's outputs become less and less realistic over later iterations; in particular, the outputs generally approach an image of rainbow colors and repeating patterns (the "rainbow dimension").
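The recursion itself is just SDEdit applied to its own output (sketch, reusing the sdedit helper from section 1.7):

```python
def recursive_edit(unet, x, start_idx, timesteps, alphas_cumprod, niters=30):
    """Repeatedly re-noise and re-denoise, feeding each output back in."""
    frames = [x]
    for _ in range(niters):
        x = sdedit(unet, x, start_idx, timesteps, alphas_cumprod)
        frames.append(x)
    return frames  # one frame per iteration, for animation
```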

starting_idx=20, prompt='a high quality photo', guidance scale=7, niters=30
starting_idx=25, prompt='a lithograph of waterfalls', guidance scale=2
starting_idx=25, prompt='a lithograph of waterfalls', guidance scale=5
starting_idx=15, prompt='a lithograph of waterfalls', guidance scale=7

10 ≤ starting_idx < 15

prompt='a high quality photo', niters=15, guidance scale=7, starting_idx=10
prompt='a high quality photo', niters=15, guidance scale=7, starting_idx=13


Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

1.2 Using the UNet to Train a Denoiser

The forward process is used to trace out trajectories from a clean digit to a noisy digit. We train our denoiser by giving it the reversed trajectory so it can learn how to gradually remove noise from the image. In Part 1, we start by training a single-step UNet. In Part 2, we implement the iterative version using this forward process.

Figure 3: Varying levels of noise on MNIST digits
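In Part B the forward process is a single additive-noise step, $z = x + \sigma\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (a minimal sketch; the function name is mine):

```python
import torch

def add_noise(x, sigma):
    """Part B forward process: z = x + sigma * eps, eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```

Figure 3 sweeps sigma over a range of values; the single-step denoiser below is trained at sigma = 0.5.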

1.2.1 Training

1.2.2 Out-of-Distribution Testing

Our unconditional UNet was trained to denoise at noise level $\sigma = 0.5$. The plots below show the model's performance denoising images with noise levels it wasn't trained to handle. As the image gets noisier beyond $\sigma = 0.5$, the model's performance gets worse.

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

Time-conditioning gives the UNet information about where along the reconstruction trajectory it is. This is useful for denoising a noisy image in small steps, because the time-conditioning signal indicates how noisy the input image is and thus how much noise the model should try to remove in its next denoising step.
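A minimal sketch of how a scalar timestep can be injected into the network (the module and layer names here are my own, not necessarily the handout's architecture):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Tiny MLP mapping a normalized scalar timestep to a feature vector
    that can be broadcast-added to a convolutional feature map."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, out_channels), nn.GELU(),
            nn.Linear(out_channels, out_channels),
        )

    def forward(self, t):                    # t: (B, 1), values in [0, 1]
        return self.net(t)[..., None, None]  # (B, C, 1, 1), broadcasts over H, W

# Inside the UNet's forward pass (sketch):
#   t = t / T                                        # normalize the timestep
#   x = self.unflatten(bottleneck) + self.t_emb1(t)  # inject at the bottleneck
#   x = self.up1(x, skip) + self.t_emb2(t)           # and again while upsampling
```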

Training loss plot for time-only conditioning

Below are samples from the time-conditioned UNet after training epochs 1, 5, 10, 15, and 20.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20

2.2 Adding Class-Conditioning to UNet

Class-conditioning enables our diffusion model to learn associations between the digits $0, 1, \dots, 9$ and their handwritten image forms. Then, during sampling, we can use classifier-free guidance (CFG) to steer the model's path to the MNIST image manifold so that it arrives in the neighborhood of a specific digit rather than an arbitrary one. As the samples below indicate, using both time-conditioning and class-conditioning results in more realistic, higher-quality generated images than time-conditioning alone.
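A sketch of the class-conditioned CFG noise estimate (the interface and guidance scale are assumptions: unet(x, t, c) takes a one-hot class vector, and a zero vector stands in for the unconditional pass, matching dropout of the class label during training):

```python
import torch
import torch.nn.functional as F

def class_cfg_noise(unet, x, t, digit, num_classes=10, gamma=5.0):
    """CFG between a class-conditioned and an unconditioned noise estimate."""
    c = F.one_hot(torch.tensor([digit]), num_classes).float()
    eps_c = unet(x, t, c)                    # conditioned on the target digit
    eps_u = unet(x, t, torch.zeros_like(c))  # unconditional (null class vector)
    return eps_u + gamma * (eps_c - eps_u)
```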

Training loss plot for time and class conditioning.

Below are samples from the class-conditioned UNet at epochs 1, 5, 10, 15, and 20.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20

Part B Bells & Whistles

Sampling GIF

Skip Connections in the Time-Conditioned Model

  • In deep networks, information from layers earlier in the network may be useful for layers later in the network. ResNets add "skip connections", which add the input of a layer to its output, so the layer computes $F(x) + x$.
  • This gives the network the option to easily learn an identity transform (by driving $F(x)$ toward zero) when a layer is not needed, which can help the model learn a better fit to the data; see the sketch below.
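A minimal residual block in this style (an illustration of the idea, not my exact model code):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv block F wrapped so the output is F(x) + x (a skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.F(x) + x)  # identity path: F only learns a residual
```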

Here are the results of using skip connections in the time-only conditioned DDPM model. These figures show slightly better samples than the model without skip connections.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20