
CS180 Project 5: The Power of Diffusion Models!

Author: Nicolas Rault-Wang (nraultwang at berkeley.edu)

Credit to Notion for this template.

Part A: The Power of Diffusion Models!

Part A0: Setup

5 iterations
50 iterations
100 iterations
500 iterations

"an oil painting of a snowy mountain village": In general, the outputs are consistent with the prompt but have some details that seem wrong. For example, the building roofs are covered with snow but, except in the 500-iteration output, the trees are not. Unless the village people like to shake the snow off their trees, those trees should be covered with snow too.

"a man wearing a hatā€: The diffusion model seemed to think ā€œa manā€ means ā€œa middle aged or old white man in formal clothesā€.

"a rocket ship": These were all cartoonish rockets just taking off from the ground, each with a single circular window, a torpedo-shaped body, and sharp, rounded fins. None of them look very realistic.

0.1 Random seed

I seeded all of my work with YOUR_SEED = 180

Part A1: Sampling Loops

1.1 Implementing the Forward Process

The forward process traces out a path from a clean image ($t = 0$) to a purely noisy image ($t = T$) in a given number of time steps $T$. Below we visualize the original clean image and some noisy images along this path. A diffusion model is given many of these paths played in reverse to learn how to iteratively denoise a purely noisy image into a clean image.
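A minimal sketch of this step, assuming a DDPM-style schedule (the helper name alphas_cumprod and the function interface are my own, not DeepFloyd's API): sample $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): interpolate toward pure Gaussian noise.

    x0: clean image tensor (C, H, W); t: integer timestep in [0, T-1];
    alphas_cumprod: cumulative products of the noise schedule, shape (T,).
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # i.i.d. Gaussian noise, same shape as the image
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return xt, eps
```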

1.2 Classical Denoising

Since independent, identically distributed Gaussian noise has no correlation between neighboring pixels, it consists of lots of high-frequency information. Classical denoising techniques suggest that low-passing the noise-degraded image can help clean it up. One problem with this method is that it is indiscriminate: it removes the highest frequencies of both the image and the noise, resulting in a less noisy but blurry image.
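For example, a simple Gaussian blur (the kernel size and sigma below are illustrative choices, not the values from my experiments):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy, kernel_size=5, sigma=2.0):
    """Low-pass filter the noisy image with a Gaussian blur.

    Suppresses high-frequency noise, but also blurs genuine image detail.
    """
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```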

1.3 One-Step Denoising

The DeepFloyd model is a diffusion model that has been trained to estimate the noise component added to a clean image. Given a noisy image, we can use DeepFloyd's noise estimate to recover an approximately noise-free image. Notice that at lower noise levels the recovered image is quite close to the input and is free of noise. At higher noise levels, however, the recovered image both differs more from the original and lacks realistic high-frequency details.
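One-step denoising just inverts the forward-process equation using the model's noise estimate (a sketch; unet(xt, t) is an assumed simplified interface, whereas the real DeepFloyd call also takes prompt embeddings):

```python
def one_step_denoise(unet, xt, t, alphas_cumprod):
    """Estimate the clean image in a single step:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(xt, t)  # model's estimate of the noise in x_t
    return (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```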

1.4 Iterative Denoising

Iterative denoising is based on the idea that it's easier to remove a small amount of noise than a lot of noise. A model like DeepFloyd uses this strategy to gradually denoise an image in small steps. As the plots below show, iterative denoising can produce more realistic results than either single-step denoising or Gaussian filtering.
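A sketch of the strided denoising loop, reusing the helpers above (the interpolation follows the standard DDPM update; I omit the added-noise variance term for brevity):

```python
def iterative_denoise(unet, x, timesteps, alphas_cumprod):
    """Denoise in strided steps, e.g. timesteps = [990, 960, ..., 30, 0]."""
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev
        beta_t = 1 - alpha_t
        eps_hat = unet(x, t)  # predicted noise at the current level
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        # Interpolate between the clean-image estimate and the current image.
        x = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
          + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x
    return x
```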

1.5 Diffusion Model Sampling

Since a diffusion model has learned to gradually denoise an image degraded by Gaussian noise, it can produce clean images from pure Gaussian noise if we condition it to start at its maximum noise level $T$. This process lets us sample from the distribution of realistic images learned by the model.

1.6 Classifier-Free Guidance (CFG)

CFG leverages a diffusion model's learned associations between images and text embeddings to steer the sampling process towards outputs that are more consistent with a target text prompt.

It works by producing two noise estimates at each time step: $\epsilon_u$, conditioned on a null text prompt, and $\epsilon_c$, conditioned on the target text prompt. Then the noise estimate to remove, $\epsilon$, is computed with the following equation

Equation A.4 from the project website.
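Reconstructed from the description below (so the notation, not the handout's exact typography, is mine), the equation reads:

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$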

where $\gamma > 1$. The intuition for this equation is the following: $\gamma(\epsilon_c - \epsilon_u)$ is a vector in the direction of the target prompt, and adding it to $\epsilon_u$ produces a noise estimate pointing towards the region of images associated with the target prompt.

Below are samples from DeepFloyd created with CFG guidance scale $\gamma = 7$ and target prompt "a high quality photo". Notice how the CFG samples are more photo-like and realistic compared to the non-CFG samples.

1.7 Image-to-Image Translation

Since a diffusion model can hallucinate new image features as it denoises a noisy image, we can use it to make edits to a given clean image. This is the SDEdit algorithm: first add Gaussian noise to a clean image, then denoise it with an iterative diffusion model using CFG for guidance.
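In terms of the helpers sketched earlier, SDEdit is just the forward process followed by a partial denoising loop (start_idx is my own name for the level at which denoising begins):

```python
def sdedit(unet, x_clean, start_idx, timesteps, alphas_cumprod):
    """Noise a clean image to timesteps[start_idx], then denoise from there.

    With a decreasing timesteps list, a smaller start_idx means more added
    noise and therefore larger edits.
    """
    x_noisy, _ = forward_process(x_clean, timesteps[start_idx], alphas_cumprod)
    return iterative_denoise(unet, x_noisy, timesteps[start_idx:], alphas_cumprod)
```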

Below are some example edits of the test image with the text prompt "a high quality photo" at noise levels [1, 3, 5, 7, 10, 20]. Notice that the degree of editing is roughly proportional to the amount of noise added: less noise yields results closer to the original image.

1.7.1 Editing Hand-Drawn and Web Images

An extension of editing an image with a diffusion model is creating realistic-looking images starting from a simple hand-drawn one. As shown below, the diffusion model can build a high-quality image on top of a low-quality hand-drawn sketch.

1.7.2 Inpainting

Diffusion models can also be used to edit specific parts of an image indicated by a binary mask $m$. At each time step $t$, the model's estimate of $x_t$ from $x_{t+1}$ is masked with the original image's pixels at noise level $t$.

Equation A.5 from the project website.
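Reconstructing the masking step from the description above (with $\text{forward}(\cdot, t)$ the forward process from section 1.1; this is my rendering of the handout's equation, not a verbatim copy):

$$x_t \leftarrow m\,x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$$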

The effect of this masking at each step is that the region outside the mask stays consistent with the original image, while new content is generated inside it. Below are some inpainting examples with the edits conditioned to produce "a high quality photo":

1.7.3 Text-Conditioned Image-to-Image Translation

The SDEdit algorithm also works for prompts other than "a high quality photo", enabling image-to-image translation conditioned on any prompt. Below are some examples.

Prompt: 'a rocket ship'
Prompt: 'a lithograph of waterfalls'
Prompt: 'a pencil'
Prompt: 'a lithograph of a skull'
Prompt: 'an oil painting of a snowy mountain village'

1.8 Visual Anagrams

Visual anagrams is a technique that steers a diffusion model's reconstruction towards two text-prompt targets instead of one, such that one orientation of the image looks like one prompt and the upside-down orientation looks like the other. Algorithmically, this is done by creating one noise estimate per image orientation at each step, each conditioned on its respective prompt with CFG, and guiding the model with the combined (averaged) noise estimate.

Visual anagrams algorithm, as presented on the project website.
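A sketch of the combined estimate (the cfg_noise_estimate helper and the unet(xt, t, emb) interface are assumptions, not DeepFloyd's actual API):

```python
import torch

def cfg_noise_estimate(unet, xt, t, emb, null_emb=None, gamma=7.0):
    """Two-pass CFG estimate, as in section 1.6."""
    eps_c = unet(xt, t, emb)       # conditioned on the target prompt
    eps_u = unet(xt, t, null_emb)  # conditioned on the null prompt
    return eps_u + gamma * (eps_c - eps_u)

def anagram_noise_estimate(unet, xt, t, emb_p1, emb_p2):
    """Average a CFG estimate for the upright image (prompt 1) with a CFG
    estimate computed on the flipped image (prompt 2), flipped back."""
    eps1 = cfg_noise_estimate(unet, xt, t, emb_p1)
    eps2 = torch.flip(
        cfg_noise_estimate(unet, torch.flip(xt, dims=[-2]), t, emb_p2),
        dims=[-2],
    )
    return (eps1 + eps2) / 2
```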

Unflipped: 'an oil painting of people around a campfire'. Flipped: 'an oil painting of an old man'

Unflipped: 'a photo of a dog'. Flipped: 'a photo of a hipster barista'

Unflipped: 'an oil painting of a snowy mountain village'. Flipped: 'an oil painting of an old man'

1.9 Hybrid Images

Like visual anagrams, Factorized Diffusion steers a diffusion model's reconstruction towards two text-prompt targets instead of one. But instead of orientations, it makes the low frequencies of the output look like one prompt and the high frequencies look like the other. At each time step $t$, the CFG noise estimate $\epsilon$ is obtained by first computing two noise estimates conditioned on the respective prompts, filtering one with a Gaussian (low-pass) filter and the other with a Laplacian (high-pass) filter, and adding the filtered estimates together:

Factorized diffusion noise estimate. Source: project website.
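A sketch using the cfg_noise_estimate helper from section 1.8 (the kernel size and sigma below are illustrative, not my tuned values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, xt, t, emb_low, emb_high,
                          kernel_size=33, sigma=2.0):
    """Low frequencies follow emb_low's prompt, high frequencies emb_high's."""
    eps_low = cfg_noise_estimate(unet, xt, t, emb_low)
    eps_high = cfg_noise_estimate(unet, xt, t, emb_high)
    blur = lambda e: TF.gaussian_blur(e, kernel_size=kernel_size, sigma=sigma)
    # Gaussian low-pass of one estimate + Laplacian high-pass of the other.
    return blur(eps_low) + (eps_high - blur(eps_high))
```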

Low-pass: 'a lithograph of a skull', High-pass: 'a lithograph of waterfalls'

Low-pass: 'a lithograph of a skull', High-pass: 'an oil painting of a snowy mountain village'

Low-pass: 'a rocket ship', High-pass: 'a man wearing a hat'

Low-pass: 'a lithograph of waterfalls', High-pass: 'a rocket ship'

Low-pass: 'an oil painting of an old man', High-pass: 'a lithograph of waterfalls'

Low-pass: 'a photo of a dog', High-pass: 'a lithograph of waterfalls'

Part A Bells & Whistles

Recursive Text-Conditioned Image-to-Image Translations

starting_idx ≥ 15

With these starting indices, the recursive edit procedure works well for a small number of iterations. However, the diffusion model's outputs become less and less realistic over later iterations; in particular, the outputs generally approach an image of rainbow colors and repeating patterns (the "rainbow dimension").
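The recursion itself is just SDEdit applied to its own output (sketch, reusing the sdedit helper from section 1.7):

```python
def recursive_edit(unet, x, start_idx, timesteps, alphas_cumprod, niters=30):
    """Repeatedly re-noise and re-denoise, feeding each output back in."""
    frames = [x]
    for _ in range(niters):
        x = sdedit(unet, x, start_idx, timesteps, alphas_cumprod)
        frames.append(x)
    return frames  # one frame per iteration, for animation
```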

starting_idx=20, prompt='a high quality photo', guidance scale=7, niters=30
starting_idx=25, prompt='a lithograph of waterfalls', guidance scale=2
starting_idx=25, prompt='a lithograph of waterfalls', guidance scale=5
starting_idx=15, prompt='a lithograph of waterfalls', guidance scale=7

10 ≤ starting_idx < 15

prompt='a high quality photo', niters=15, guidance scale=7, starting_idx=10
prompt='a high quality photo', niters=15, guidance scale=7, starting_idx=13


Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

1.2 Using the UNet to Train a Denoiser

The forward process is used to trace out trajectories from a clean digit to a noisy digit. We train our denoiser by giving it the reversed trajectory so it can learn how to gradually remove noise from the image. In Part 1, we start by training a single-step UNet. In Part 2, we implement the iterative version using this forward process.

Figure 3: Varying levels of noise on MNIST digits
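In Part B the forward process is a single additive-noise step, $z = x + \sigma\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (a minimal sketch; the function name is mine):

```python
import torch

def add_noise(x, sigma):
    """Part B forward process: z = x + sigma * eps, eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```

Figure 3 sweeps sigma over a range of values; the single-step denoiser below is trained at sigma = 0.5.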

1.2.1 Training

1.2.2 Out-of-Distribution Testing

Our unconditional UNet was trained to denoise at noise level $\sigma = 0.5$. The plots below show the model's performance denoising images with noise levels it wasn't trained to handle. As the image gets noisier beyond $\sigma = 0.5$, the model's performance gets worse.

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

Time-conditioning gives the UNet information about where along the reconstruction trajectory it is. This is useful for denoising a noisy image in small steps, because the time-conditioning signal indicates how noisy the input image is and thus how much noise the model should try to remove in its next denoising step.
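A minimal sketch of how a scalar timestep can be injected into the network (the module and layer names here are my own, not necessarily the handout's architecture):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Tiny MLP mapping a normalized scalar timestep to a feature vector
    that can be broadcast-added to a convolutional feature map."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, out_channels), nn.GELU(),
            nn.Linear(out_channels, out_channels),
        )

    def forward(self, t):                    # t: (B, 1), values in [0, 1]
        return self.net(t)[..., None, None]  # (B, C, 1, 1), broadcasts over H, W

# Inside the UNet's forward pass (sketch):
#   t = t / T                                        # normalize the timestep
#   x = self.unflatten(bottleneck) + self.t_emb1(t)  # inject at the bottleneck
#   x = self.up1(x, skip) + self.t_emb2(t)           # and again while upsampling
```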

Training loss plot for time-only conditioning

Below are samples from the time-conditioned UNet after training epochs 1, 5, 10, 15, and 20.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20

2.2 Adding Class-Conditioning to UNet

Class-conditioning enables our diffusion model to learn associations between the digits $0, 1, \dots, 9$ and their handwritten image forms. Then, during sampling, we can use classifier-free guidance (CFG) to steer the model's path to the MNIST image manifold so that it arrives in the neighborhood of a specific digit rather than an arbitrary one. As the samples below indicate, using both time-conditioning and class-conditioning results in more realistic, higher-quality generated images than time-conditioning alone.
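A sketch of the class-conditioned CFG noise estimate (the interface and guidance scale are assumptions: unet(x, t, c) takes a one-hot class vector, and a zero vector stands in for the unconditional pass, matching dropout of the class label during training):

```python
import torch
import torch.nn.functional as F

def class_cfg_noise(unet, x, t, digit, num_classes=10, gamma=5.0):
    """CFG between a class-conditioned and an unconditioned noise estimate."""
    c = F.one_hot(torch.tensor([digit]), num_classes).float()
    eps_c = unet(x, t, c)                    # conditioned on the target digit
    eps_u = unet(x, t, torch.zeros_like(c))  # unconditional (null class vector)
    return eps_u + gamma * (eps_c - eps_u)
```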

Training loss plot for time and class conditioning.

Below are samples from the class-conditioned UNet at epochs 1, 5, 10, 15, and 20.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20

Part B Bells & Whistles

Sampling GIF

Skip Connections in the Time-Conditioned Model

  • In deep networks, information from layers earlier in the network may be useful for layers later in the network. ResNets add "skip connections", which add the input of a layer to its output, so the layer computes $F(x) + x$.
  • This gives the network the option to easily learn an identity transform (by driving $F(x)$ toward zero) when a layer is not needed, which can help the model learn a better fit to the data; see the sketch below.
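A minimal residual block in this style (an illustration of the idea, not my exact model code):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv block F wrapped so the output is F(x) + x (a skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.F(x) + x)  # identity path: F only learns a residual
```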

Here are the results of using skip connections in the time-only conditioned DDPM model. These figures show slightly better samples than the model without skip connections.

Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20