Programming Generative AI: From Variational Autoencoders to Stable Diffusion with PyTorch and Hugging Face

18h 17m | Intermediate | 2025-02-04

Authors

Pearson

Jonathan Dinu

Course details

In this course, Jonathan Dinu, a dedicated educator, author, and speaker, presents an interactive tour of deep generative modeling. Learn how to train your own generative models from scratch to create an infinity of images. Discover how you can generate text with large language models similar to the ones that power applications like ChatGPT. Write your own text-to-image pipeline to understand how prompt-based generative models actually work. Plus, personalize large pretrained models like Stable Diffusion to generate images of novel subjects in unique visual styles. This course offers you an applied resource to complement any theoretical or conceptual knowledge you have.

Learning objectives
Train a variational autoencoder with PyTorch to learn a compressed latent space of images.
Define how to generate and edit realistic human faces with unconditional diffusion models and SDEdit.
Use large language models such as GPT-2 to generate text with Hugging Face Transformers.
Perform text-based semantic image search using multimodal models such as CLIP.
Program your own text-to-image pipeline to understand how prompt-based generative models such as Stable Diffusion actually work.
Evaluate generative models, both qualitatively and quantitatively.
Identify how to caption images using pretrained foundation models.
Articulate how to generate images in a specific visual style by efficiently fine-tuning Stable Diffusion with LoRA.
Create personalized AI avatars by teaching pretrained diffusion models new subjects and concepts with DreamBooth.
Guide the structure and composition of generated images using depth- and edge-conditioned ControlNets.
Perform near real-time inference with SDXL Turbo for frame-based video-to-video translation.

Skills covered

Hugging Face, PyTorch, Artificial Intelligence for Design, Natural Language Processing (NLP), Programming Foundations, Generative AI, Video, Photography, Graphic Design, Artificial Intelligence (AI), Animation and Illustration, Open Source, Software Development, One-Off

Concepts

0. Introduction

01 - Programming generative AI - Introduction

1. The What, Why, and How of Generative AI

02 - Topics
03 - Generative AI in the wild
04 - Defining generative AI
05 - Multitudes of media
06 - How machines create
07 - Formalizing generative models
08 - Generative versus discriminative models
09 - The generative modeling trilemma
10 - Introduction to Google Colab

2. PyTorch for the Impatient

11 - Topics
12 - What is PyTorch
13 - The PyTorch layer cake
14 - The deep learning software trilemma
15 - What are tensors, really
16 - Tensors in PyTorch
17 - Introduction to computational graphs
18 - Backpropagation is just the chain rule
19 - Effortless backpropagation with torch.autograd
20 - PyTorch's device abstraction (i.e., GPUs)
21 - Working with devices
22 - Components of a learning algorithm
23 - Introduction to gradient descent
24 - Getting to stochastic gradient descent (SGD)
25 - Comparing gradient descent and SGD
26 - Linear regression with PyTorch
27 - Perceptrons and neurons
28 - Layers and activations with torch.nn
29 - Multi-layer feedforward neural networks (MLP)

3. Latent Space Rules Everything Around Me

30 - Topics
31 - Representing images as tensors
32 - Desiderata for computer vision
33 - Features of convolutional neural networks
34 - Working with images in Python
35 - The Fashion-MNIST dataset
36 - Convolutional neural networks in PyTorch
37 - Components of a latent variable model (LVM)
38 - The humble autoencoder
39 - Defining an autoencoder with PyTorch
40 - Setting up a training loop
41 - Inference with an autoencoder
42 - Look ma, no features
43 - Adding probability to autoencoders (VAE)
44 - Variational inference - Not just for autoencoders
45 - Transforming an autoencoder into a VAE
46 - Training a VAE with PyTorch
47 - Exploring latent space
48 - Latent space interpolation and attribute vectors

4. Demystifying Diffusion

49 - Topics
50 - Generation as a reversible process
51 - Sampling as iterative denoising
52 - Diffusers and the Hugging Face ecosystem
53 - Generating images with diffusers pipelines
54 - Deconstructing the diffusion process
55 - Forward process as encoder
56 - Reverse process as decoder
57 - Interpolating diffusion models
58 - Image-to-image translation with SDEdit
59 - Image restoration and enhancement

5. Generating and Encoding Text with Transformers

60 - Topics
61 - The natural language processing pipeline
62 - Generative models of language
63 - Generating text with transformers pipelines
64 - Deconstructing transformers pipelines
65 - Decoding strategies
66 - Transformers are just latent variable models for sequences
67 - Visualizing and understanding attention
68 - Turning words into vectors
69 - The vector space model
70 - Embedding sequences with transformers
71 - Computing the similarity between embeddings
72 - Semantic search with embeddings
73 - Contrastive embeddings with sentence transformers

6. Connecting Text and Images

74 - Topics
75 - Components of a multimodal model
76 - Vision-language understanding
77 - Contrastive language-image pretraining
78 - Embedding text and images with CLIP
79 - Zero-shot image classification with CLIP
80 - Semantic image search with CLIP
81 - Conditional generative models
82 - Introduction to latent diffusion models
83 - The latent diffusion model architecture
84 - Failure modes and additional tools
85 - Stable Diffusion deconstructed
86 - Writing your own Stable Diffusion pipeline
87 - Decoding images from the Stable Diffusion latent space
88 - Improving generation with guidance
89 - Playing with prompts

7. Post-Training Procedures for Diffusion Models

90 - Topics
91 - Methods and metrics for evaluating generative AI
92 - Manual evaluation of Stable Diffusion with DrawBench
93 - Quantitative evaluation of diffusion models with human preference predictors
94 - Overview of methods for fine-tuning diffusion models
95 - Sourcing and preparing image datasets for fine-tuning
96 - Generating automatic captions with BLIP-2
97 - Parameter-efficient fine-tuning with LoRA
98 - Inspecting the results of fine-tuning
99 - Inference with LoRAs for style-specific generation
100 - Conceptual overview of textual inversion
101 - Subject-specific personalization with DreamBooth
102 - DreamBooth versus LoRA fine-tuning
103 - DreamBooth fine-tuning with Hugging Face
104 - Inference with DreamBooth to create personalized AI avatars
105 - Adding conditional control to text-to-image diffusion models
106 - Creating edge and depth maps for conditioning
107 - Depth- and edge-guided Stable Diffusion with ControlNet
108 - Understanding and experimenting with ControlNet parameters
109 - Generative text effects with font depth maps
110 - Few-step generation with adversarial diffusion distillation (ADD)
111 - Reasons to distill
112 - Comparing SDXL and SDXL Turbo
113 - Text-guided image-to-image translation
114 - Video-driven frame-by-frame generation with SDXL Turbo
115 - Near real-time inference with PyTorch performance optimizations

Conclusion

116 - Programming generative AI - Summary