Programming Generative AI: From Variational Autoencoders to Stable Diffusion with PyTorch and Hugging Face
18h 17m | Intermediate | 2025-02-04
Authors

Pearson

Jonathan Dinu
Course details
In this course, Jonathan Dinu, a dedicated educator, author, and speaker, presents an interactive tour of deep generative modeling. Learn how to train your own generative models from scratch to create an infinity of images. Discover how you can generate text with large language models similar to the ones that power applications like ChatGPT. Write your own text-to-image pipeline to understand how prompt-based generative models actually work. Plus, personalize large pretrained models like Stable Diffusion to generate images of novel subjects in unique visual styles. This course offers you an applied resource to complement any theoretical or conceptual knowledge you have.
Learning objectives
Train a variational autoencoder with PyTorch to learn a compressed latent space of images.
Define how to generate and edit realistic human faces with unconditional diffusion models and SDEdit.
Use large language models such as GPT-2 to generate text with Hugging Face Transformers.
Perform text-based semantic image search using multimodal models such as CLIP.
Program your own text-to-image pipeline to understand how prompt-based generative models such as Stable Diffusion actually work.
Evaluate generative models, both qualitatively and quantitatively.
Identify how to caption images using pretrained foundation models.
Articulate how to generate images in a specific visual style by efficiently fine-tuning Stable Diffusion with LoRA.
Create personalized AI avatars by teaching pretrained diffusion models new subjects and concepts with DreamBooth.
Guide the structure and composition of generated images using depth- and edge-conditioned ControlNets.
Perform near real-time inference with SDXL Turbo for frame-based video-to-video translation.
Skills covered
Hugging Face, PyTorch, Artificial Intelligence for Design, Natural Language Processing (NLP), Programming Foundations, Generative AI, Video, Photography, Graphic Design, Artificial Intelligence (AI), Animation and Illustration, Open Source, Software Development, One-Off
Concepts
0. Introduction
- 01 - Programming generative AI - Introduction
1. The What, Why, and How of Generative AI
- 02 - Topics
- 03 - Generative AI in the wild
- 04 - Defining generative AI
- 05 - Multitudes of media
- 06 - How machines create
- 07 - Formalizing generative models
- 08 - Generative versus discriminative models
- 09 - The generative modeling trilemma
- 10 - Introduction to Google Colab
2. PyTorch for the Impatient
- 11 - Topics
- 12 - What is PyTorch
- 13 - The PyTorch layer cake
- 14 - The deep learning software trilemma
- 15 - What are tensors, really
- 16 - Tensors in PyTorch
- 17 - Introduction to computational graphs
- 18 - Backpropagation is just the chain rule
- 19 - Effortless backpropagation with torch.autograd
- 20 - PyTorch's device abstraction (i.e., GPUs)
- 21 - Working with devices
- 22 - Components of a learning algorithm
- 23 - Introduction to gradient descent
- 24 - Getting to stochastic gradient descent (SGD)
- 25 - Comparing gradient descent and SGD
- 26 - Linear regression with PyTorch
- 27 - Perceptrons and neurons
- 28 - Layers and activations with torch.nn
- 29 - Multi-layer feedforward neural networks (MLP)
3. Latent Space Rules Everything Around Me
- 30 - Topics
- 31 - Representing images as tensors
- 32 - Desiderata for computer vision
- 33 - Features of convolutional neural networks
- 34 - Working with images in Python
- 35 - The Fashion-MNIST dataset
- 36 - Convolutional neural networks in PyTorch
- 37 - Components of a latent variable model (LVM)
- 38 - The humble autoencoder
- 39 - Defining an autoencoder with PyTorch
- 40 - Setting up a training loop
- 41 - Inference with an autoencoder
- 42 - Look ma, no features
- 43 - Adding probability to autoencoders (VAE)
- 44 - Variational inference - Not just for autoencoders
- 45 - Transforming an autoencoder into a VAE
- 46 - Training a VAE with PyTorch
- 47 - Exploring latent space
- 48 - Latent space interpolation and attribute vectors
4. Demystifying Diffusion
- 49 - Topics
- 50 - Generation as a reversible process
- 51 - Sampling as iterative denoising
- 52 - Diffusers and the Hugging Face ecosystem
- 53 - Generating images with diffusers pipelines
- 54 - Deconstructing the diffusion process
- 55 - Forward process as encoder
- 56 - Reverse process as decoder
- 57 - Interpolating diffusion models
- 58 - Image-to-image translation with SDEdit
- 59 - Image restoration and enhancement
5. Generating and Encoding Text with Transformers
- 60 - Topics
- 61 - The natural language processing pipeline
- 62 - Generative models of language
- 63 - Generating text with transformers pipelines
- 64 - Deconstructing transformers pipelines
- 65 - Decoding strategies
- 66 - Transformers are just latent variable models for sequences
- 67 - Visualizing and understanding attention
- 68 - Turning words into vectors
- 69 - The vector space model
- 70 - Embedding sequences with transformers
- 71 - Computing the similarity between embeddings
- 72 - Semantic search with embeddings
- 73 - Contrastive embeddings with sentence transformers
6. Connecting Text and Images
- 74 - Topics
- 75 - Components of a multimodal model
- 76 - Vision-language understanding
- 77 - Contrastive language-image pretraining
- 78 - Embedding text and images with CLIP
- 79 - Zero-shot image classification with CLIP
- 80 - Semantic image search with CLIP
- 81 - Conditional generative models
- 82 - Introduction to latent diffusion models
- 83 - The latent diffusion model architecture
- 84 - Failure modes and additional tools
- 85 - Stable Diffusion deconstructed
- 86 - Writing your own Stable Diffusion pipeline
- 87 - Decoding images from the Stable Diffusion latent space
- 88 - Improving generation with guidance
- 89 - Playing with prompts
7. Post-Training Procedures for Diffusion Models
- 90 - Topics
- 91 - Methods and metrics for evaluating generative AI
- 92 - Manual evaluation of Stable Diffusion with DrawBench
- 93 - Quantitative evaluation of diffusion models with human preference predictors
- 94 - Overview of methods for fine-tuning diffusion models
- 95 - Sourcing and preparing image datasets for fine-tuning
- 96 - Generating automatic captions with BLIP-2
- 97 - Parameter-efficient fine-tuning with LoRA
- 98 - Inspecting the results of fine-tuning
- 99 - Inference with LoRAs for style-specific generation
- 100 - Conceptual overview of textual inversion
- 101 - Subject-specific personalization with DreamBooth
- 102 - DreamBooth versus LoRA fine-tuning
- 103 - DreamBooth fine-tuning with Hugging Face
- 104 - Inference with DreamBooth to create personalized AI avatars
- 105 - Adding conditional control to text-to-image diffusion models
- 106 - Creating edge and depth maps for conditioning
- 107 - Depth- and edge-guided Stable Diffusion with ControlNet
- 108 - Understanding and experimenting with ControlNet parameters
- 109 - Generative text effects with font depth maps
- 110 - Few-step generation with adversarial diffusion distillation (ADD)
- 111 - Reasons to distill
- 112 - Comparing SDXL and SDXL Turbo
- 113 - Text-guided image-to-image translation
- 114 - Video-driven frame-by-frame generation with SDXL Turbo
- 115 - Near real-time inference with PyTorch performance optimizations
Conclusion
- 116 - Programming generative AI - Summary