Generative AI for Audio and Images: Models and Applications
This course is part of Generative AI Fundamentals Specialization
Instructor: Anahita Doosti
Build your subject-matter expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate
There are 4 modules in this course
Generative AI for Audio and Images: Models and Applications offers an in-depth exploration of how modern generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion models are used to create, manipulate, and enhance audio, image, and video content.
Learners examine the architectures, training processes, and use cases of these models across different modalities, gaining both conceptual understanding and practical insights through hands-on activities. The course also highlights the ethical and societal implications of generative AI, including bias, transparency, intellectual property, and the challenges of deepfake technologies. By covering foundational theory as well as state-of-the-art approaches and applications, this course prepares learners to apply and develop generative AI creatively and responsibly for the audio and image modalities.
By the end of this course, learners will be able to:
- Outline core concepts, challenges, and the history of AI-generated audio.
- Analyze important foundational audio generation models, such as variational and vector quantized autoencoders (VAE and VQ-VAE).
- Examine how these models integrate with the latest GenAI technologies to form hybrid, state-of-the-art transformer and diffusion-based audio generation systems.
- Study the architecture and functionality of Generative Adversarial Networks (GANs) and their variations.
- Implement and train GAN models for creating and enhancing visual content.
- Explore cutting-edge techniques such as diffusion models and transformers for image and video creation.
- Discuss the ethical considerations regarding generative AI for audio and images.
This module introduces the foundations and core concepts of AI-generated audio. Learners explore why audio generation is uniquely challenging, including the difficulties of representing and evaluating audio. They learn how audio is represented and processed, compare waveform and symbolic formats, and survey common audio data formats and Python libraries for working with audio. The module also examines methods for evaluating generated audio and provides a framework for categorizing audio generation approaches by their functionality and level of human-AI collaboration. It concludes with a historical overview of AI-generated audio, tracing its evolution from early rule-based methods to modern deep generative models.
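The difference between symbolic and waveform representations can be illustrated with a minimal sketch. It is not taken from the course materials: the names `midi_to_hz`, `render`, and `SAMPLE_RATE`, the sample rate, and the three-note melody are all assumptions made for illustration. Only the standard MIDI-to-frequency convention (A4 = note 69 = 440 Hz) is relied on.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed; kept small so the example stays light)

def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to a frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def render(notes, sample_rate=SAMPLE_RATE):
    """Turn symbolic (midi_note, seconds) pairs into a list of waveform samples."""
    samples = []
    for note, seconds in notes:
        freq = midi_to_hz(note)
        n = int(seconds * sample_rate)
        # One sine-wave sample per time step; amplitudes stay within [-1, 1].
        samples.extend(math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n))
    return samples

# Symbolic representation: three note events (C4, E4, G4), half a second each.
melody = [(60, 0.5), (64, 0.5), (67, 0.5)]

# Waveform representation: the same melody as raw amplitude samples.
waveform = render(melody)

print(len(melody), "symbolic events ->", len(waveform), "waveform samples")
```

The symbolic form is compact and easy to edit but carries no timbre, while the waveform form is what is actually heard but is thousands of times longer. That gap in sequence length is one reason waveform generation is hard.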
What's included
21 videos, 3 readings, 4 assignments, 2 discussion prompts
21 videos•Total 135 minutes
- Course Introduction•6 minutes
- Meet your instructor: Anahita Doosti•1 minute
- Meet your instructor: Nasimeh Asgarian•1 minute
- Overview of AI for Audio and Music Generation•7 minutes
- Why Is Audio Generation Difficult?•9 minutes
- Data representation: Waveform vs Symbolic•8 minutes
- Data Formats•7 minutes
- Evaluation (part 1)•5 minutes
- Evaluation (part 2)•10 minutes
- Categorizing Audio Generation Approaches•6 minutes
- The Many Forms of Audio Generation•6 minutes
- Audio Functionality•9 minutes
- Human-AI Collaboration•7 minutes
- Putting It into Practice•3 minutes
- An Overview of the Progress Throughout the Years•7 minutes
- Pre-ML Approaches: Algorithmic, Rule-Based•10 minutes
- Early ML Approaches: HMMs, FF Neural Networks•7 minutes
- Modern Approaches 1: RNNs and CNNs•10 minutes
- Modern Approaches 2: Autoencoders/VAEs and GANs•6 minutes
- Modern Approaches 3: Transformers and Diffusion•9 minutes
- Module 1 Recap•2 minutes
3 readings•Total 140 minutes
- Terminology•10 minutes
- Python Libraries for Audio Data•10 minutes
- WaveNet Implementation (Hands-on Lab)•120 minutes
4 assignments•Total 145 minutes
- Module 1 Quiz•80 minutes
- Practice Quiz 1•30 minutes
- Practice Quiz 2•20 minutes
- Practice Quiz 3•15 minutes
2 discussion prompts•Total 20 minutes
- Learning Goal•10 minutes
- Is AI even capable of achieving true creativity?•10 minutes
Building on the fundamentals, this module dives into advanced models for audio generation. Learners study Variational Autoencoders (VAEs) and their variants, and how they apply to melody generation and speech synthesis. The module also explores transformer-based models, such as Music Transformer, AudioLM, and FastSpeech, as well as diffusion-based models like DiffWave and Stable Audio. Through these lessons, learners gain a comprehensive understanding of how modern generative architectures produce realistic, high-quality audio and music.
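As a rough illustration of what sets a VAE apart from a plain autoencoder, the sketch below shows the reparameterized latent sample and the KL regularization term for a diagonal Gaussian posterior. It is a minimal sketch under stated assumptions, not the course's implementation: the function names and the encoder outputs `mu` and `log_var` are made up, and the reconstruction term of the loss is omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

rng = random.Random(0)

# Made-up encoder outputs for a single input (two latent dimensions).
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

print("latent sample:", z)
print("KL term:", kl)
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, so gradients can flow through `mu` and `log_var` during training. The KL term is zero only when the posterior matches the standard normal prior.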
What's included
31 videos, 2 readings, 4 assignments
31 videos•Total 202 minutes
- Introduction to Variational Autoencoders•4 minutes
- Autoencoders•5 minutes
- Latent Space•8 minutes
- Inside the Encoder-Decoder Blocks•8 minutes
- Training VAEs (Part 1)•5 minutes
- Training VAEs (Part 2)•7 minutes
- Vector Quantized Variational Autoencoders (Part 1)•6 minutes
- Vector Quantized Variational Autoencoders (Part 2)•6 minutes
- Using VAE to Generate Melodies•7 minutes
- How to Condition VAEs with Additional Musical Information Such as Chord, Scale?•7 minutes
- Example: MusicVAE•8 minutes
- Attribute Vector Arithmetic for Melodies•8 minutes
- Example: Jukebox•6 minutes
- Example: Speech Synthesis•8 minutes
- Strengths and limitations of VAE-based approaches•5 minutes
- Transformer Primer•6 minutes
- Transformers for Audio Generation•6 minutes
- Example: Music Transformer•13 minutes
- Revisiting Jukebox: How Transformers Can Generate Waveform Audio! (Part 1)•9 minutes
- Revisiting Jukebox: How Transformers Can Generate Waveform Audio! (Part 2)•4 minutes
- A New Paradigm: Audio Codec + Language Model (Part 1)•6 minutes
- A New Paradigm: Audio Codec + Language Model (Part 2)•8 minutes
- Example: FastSpeech•8 minutes
- Strengths and Limitations of Transformer-Based Approaches•5 minutes
- What Are Diffusion Models, and How Can They Generate Audio?•5 minutes
- Example: Stable Audio•6 minutes
- Example: DiffWave•5 minutes
- Strengths and Limitations of Diffusion-Based Approaches•5 minutes
- How Do the Recent Models Compare to Each Other?•9 minutes
- What Is on the Horizon? Where Are We Headed?•7 minutes
- Module 2 Recap•3 minutes
2 readings•Total 130 minutes
- Resource Guide•10 minutes
- Audio Generation Models Inference and Comparison (Hands-on Lab)•120 minutes
4 assignments•Total 125 minutes
- Module 2 Quiz•80 minutes
- Practice Quiz•15 minutes
- Practice Quiz•15 minutes
- Practice Quiz•15 minutes
This module transitions from audio to image generation, introducing the principles and evolution of image and video synthesis. Learners examine key architectures like GANs and VAEs, explore how adversarial training works, and study variations such as Conditional and Progressive GANs, Pix2Pix, and CycleGAN. The module also connects theory to practice by showcasing creative and commercial applications—from art and design to data augmentation—demonstrating how generative models enhance realism and variety in visual outputs.
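The idea behind adversarial training can be sketched with the two loss functions that pull against each other. This is a simplified illustration rather than the course's lab code: it uses the commonly used non-saturating generator loss, and the discriminator outputs in `early` and `later` are hypothetical numbers, not results from a trained model.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -math.log(d_fake)

# Hypothetical discriminator outputs (probability that an image is real).
early = {"d_real": 0.9, "d_fake": 0.1}  # discriminator easily spots fakes
later = {"d_real": 0.6, "d_fake": 0.4}  # generator has become more convincing

for name, p in (("early", early), ("later", later)):
    d_loss = discriminator_loss(p["d_real"], p["d_fake"])
    g_loss = generator_loss(p["d_fake"])
    print(name, "D loss:", round(d_loss, 3), "G loss:", round(g_loss, 3))
```

As the generator's samples fool the discriminator more often, the generator loss falls and the discriminator loss rises. In practice the two networks are updated in alternation so that neither side wins outright.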
What's included
22 videos, 3 readings, 5 assignments
22 videos•Total 156 minutes
- Overview of AI for Image and Video Generation•8 minutes
- Applications of Image and Video Generation•8 minutes
- DALL-E and MidJourney Examples•8 minutes
- Sora Examples•5 minutes
- A Short History of Image Generation•8 minutes
- Revisit VAE•6 minutes
- Introducing GAN•8 minutes
- Discriminator•7 minutes
- Generator•9 minutes
- GAN Training•6 minutes
- Challenges and Best Practices for GAN Training•6 minutes
- Progressive GAN•8 minutes
- Conditional GANs•8 minutes
- Applications, Advantages and Limitations of cGANs•7 minutes
- Image-to-Image Translation•7 minutes
- Challenges and Applications of Image-to-Image Translation•5 minutes
- Text to Image GAN•9 minutes
- Other GAN Variations: CycleGAN, DCGAN, StyleGAN•10 minutes
- Creative design•9 minutes
- Commercial Use Cases•7 minutes
- Data Augmentation•7 minutes
- Module 3 Recap•2 minutes
3 readings•Total 140 minutes
- StyleGAN•10 minutes
- Data synthesis•10 minutes
- DCGAN from Scratch (Hands-on Lab)•120 minutes
5 assignments•Total 140 minutes
- Module 3 Quiz•80 minutes
- Practice Quiz 1•15 minutes
- Practice Quiz 2•15 minutes
- Practice Quiz 3•15 minutes
- Practice Quiz 4•15 minutes
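This module covers transformer-based and diffusion-based models for image and video generation, along with the ethical issues they raise. Learners study the Vision Transformer, the encoder-decoder design pattern, convolutional encoders, and self-attention, including spatial, channel, and temporal attention. They then examine the forward and reverse processes of diffusion models and how these models are trained. The module closes with ethical considerations (bias in training data, transparency, intellectual property, and data privacy) and deepfake technologies such as face swapping, voice cloning, and video deepfakes.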
What's included
21 videos, 1 reading, 4 assignments
21 videos•Total 146 minutes
- Overview on Key Models and Architectures•8 minutes
- High-Level Overview of Vision Transformer•8 minutes
- Encoder-Decoder Design Pattern•9 minutes
- Convolutional Encoders•10 minutes
- Self Attention•9 minutes
- Spatial vs. Channel vs. Temporal Attention•8 minutes
- Diffusion Model Architecture High-Level Overview•7 minutes
- Forward / Diffusion Process•7 minutes
- Reverse Process•7 minutes
- Diffusion Model Training•5 minutes
- Examples of Diffusion Model•6 minutes
- Bias in Training Data•8 minutes
- Transparency•9 minutes
- Intellectual Property•8 minutes
- Data Privacy•7 minutes
- Deepfake Intro•9 minutes
- Deepfake - Face Swap•5 minutes
- Voice Cloning•4 minutes
- Video Deepfake•6 minutes
- Module 4 Recap•2 minutes
- Course Wrap Up•3 minutes
1 reading•Total 120 minutes
- ViT vs. Diffusion (Hands-on Lab)•120 minutes
4 assignments•Total 158 minutes
- Module 4 Quiz•80 minutes
- Practice Quiz 1•30 minutes
- Practice Quiz 2•30 minutes
- Practice Quiz 3•18 minutes
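The forward (diffusion) process named in this module's video list can be sketched in a few lines. This is a minimal illustration, not the course's lab code: it assumes the linear noise schedule and closed-form sampling commonly used in denoising diffusion models, and the step count `T`, the schedule endpoints, and the four-value stand-in for an image `x0` are assumptions.

```python
import math
import random

def alpha_bar(betas):
    """Cumulative products of (1 - beta_t): how much of the original signal survives up to each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noisy_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * Gaussian noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

T = 1000
# Assumed linear schedule from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = alpha_bar(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for pixel values

print("signal scale at first step:", math.sqrt(abar[0]))
print("signal scale at last step: ", math.sqrt(abar[-1]))
print("x_t at last step:", noisy_sample(x0, T - 1, abar, rng))
```

By the last step almost none of the original signal remains, so the sample is close to pure Gaussian noise. The reverse process is trained to undo this corruption one step at a time.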
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Frequently asked questions
To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.
When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.
Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.
