URL: https://www.coursera.org/learn/generative-ai-for-audio-and-images-models-and-applications

Generative AI for Audio and Images: Models and Applications | Coursera


Gain insight into a topic and learn the fundamentals.
3 weeks to complete
at 10 hours a week
Flexible schedule
Learn at your own pace

Build your subject-matter expertise

This course is part of the Generative AI Fundamentals Specialization
When you enroll in this course, you'll also be enrolled in this Specialization.
  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate

There are 4 modules in this course

Generative AI for Audio and Images: Models and Applications offers an in-depth exploration of how modern generative models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion models are used to create, manipulate, and enhance audio, image, and video content.

Learners examine the architectures, training processes, and use cases of these models across different modalities, gaining both conceptual understanding and practical insights through hands-on activities. The course also highlights the ethical and societal implications of generative AI, including bias, transparency, intellectual property, and the challenges of deepfake technologies. By covering foundational theory as well as state-of-the-art approaches and applications, this course prepares learners to apply and develop generative AI creatively and responsibly for the audio and image modalities.

By the end of this course, learners will be able to:
  • Outline core concepts, challenges, and the history of AI-generated audio.
  • Analyze important foundational audio generation models, such as variational and vector quantized autoencoders (VAE and VQ-VAE).
  • Examine how these models integrate with the latest GenAI technologies to form hybrid, state-of-the-art transformer and diffusion-based audio generation systems.
  • Study the architecture and functionality of Generative Adversarial Networks (GANs) and their variations.
  • Implement and train GAN models for creating and enhancing visual content.
  • Explore cutting-edge techniques such as diffusion models and transformers for image and video creation.
  • Discuss the ethical considerations regarding generative AI for audio and images.

This module introduces the foundations and core concepts of AI-generated audio. Learners explore why audio generation is uniquely challenging, including challenges of representation and evaluation. They learn how audio is represented and processed, compare waveform and symbolic formats, and survey common audio data formats and Python libraries for working with audio. The module also examines methods for evaluating generated audio and provides a framework for categorizing audio generation approaches by their functionality and level of human-AI collaboration. It concludes with a historical overview of AI-generated audio, tracing its evolution from early rule-based methods to modern deep generative models.
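The difference between symbolic and waveform representations mentioned above can be illustrated with a minimal sketch. It is not taken from the course materials: the `(pitch, start, duration)` note format, the `render` helper, and the 8000 Hz sample rate are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed; real audio often uses 44100)

def midi_to_hz(pitch):
    """Convert a MIDI note number to a frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def render(notes, sample_rate=SAMPLE_RATE):
    """Turn symbolic notes (pitch, start_sec, duration_sec) into waveform samples."""
    total = max(start + dur for _, start, dur in notes)
    wave = [0.0] * int(round(total * sample_rate))
    for pitch, start, dur in notes:
        freq = midi_to_hz(pitch)
        first = int(round(start * sample_rate))
        count = int(round(dur * sample_rate))
        for i in range(count):
            if first + i < len(wave):
                # Add a plain sine tone for the duration of the note.
                wave[first + i] += math.sin(2 * math.pi * freq * i / sample_rate)
    return wave

# Symbolic form: three events describing a C major arpeggio (hypothetical data).
melody = [(60, 0.0, 0.25), (64, 0.25, 0.25), (67, 0.5, 0.5)]
# Waveform form: thousands of amplitude values for the same one second of music.
waveform = render(melody)
print(len(melody), "symbolic events ->", len(waveform), "waveform samples")
```

The symbolic form is compact and easy to edit but carries no timbre, while the waveform form captures the actual sound at the cost of a much longer sequence, which is one reason waveform generation is harder to model.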

What's included

21 videos, 3 readings, 4 assignments, 2 discussion prompts

21 videos, total 135 minutes
  • Course Introduction (6 minutes)
  • Meet your instructor: Anahita Doosti (1 minute)
  • Meet your instructor: Nasimeh Asgarian (1 minute)
  • Overview of AI for Audio and Music Generation (7 minutes)
  • Why Is Audio Generation Difficult? (9 minutes)
  • Data representation: Waveform vs Symbolic (8 minutes)
  • Data Formats (7 minutes)
  • Evaluation (part 1) (5 minutes)
  • Evaluation (part 2) (10 minutes)
  • Categorizing Audio Generation Approaches (6 minutes)
  • The Many Forms of Audio Generation (6 minutes)
  • Audio Functionality (9 minutes)
  • Human-AI Collaboration (7 minutes)
  • Putting It into Practice (3 minutes)
  • An Overview of the Progress Throughout the Years (7 minutes)
  • Pre-ML Approaches: Algorithmic, Rule-Based (10 minutes)
  • Early ML Approaches: HMMs, FF Neural Networks (7 minutes)
  • Modern Approaches 1: RNNs and CNNs (10 minutes)
  • Modern Approaches 2: Autoencoders/VAEs and GANs (6 minutes)
  • Modern Approaches 3: Transformers and Diffusion (9 minutes)
  • Module 1 Recap (2 minutes)
3 readings, total 140 minutes
  • Terminology (10 minutes)
  • Python Libraries for Audio Data (10 minutes)
  • WaveNet Implementation (Hands-on Lab) (120 minutes)
4 assignments, total 145 minutes
  • Module 1 Quiz (80 minutes)
  • Practice Quiz 1 (30 minutes)
  • Practice Quiz 2 (20 minutes)
  • Practice Quiz 3 (15 minutes)
2 discussion prompts, total 20 minutes
  • Learning Goal (10 minutes)
  • Is AI even capable of achieving true creativity? (10 minutes)
  • Is AI even capable of achieving true creativity?10 minutes

Building on the fundamentals, this module dives into advanced models for audio generation. Learners study Variational Autoencoders (VAEs) and their variants, and how they apply to melody generation and speech synthesis. The module also explores transformer-based models, such as Music Transformer, AudioLM, and FastSpeech, as well as diffusion-based models like DiffWave and Stable Audio. Through these lessons, learners gain a comprehensive understanding of how modern generative architectures produce realistic, high-quality audio and music.
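As a rough illustration of one VAE variant covered in this module, the vector quantized autoencoder (VQ-VAE), the sketch below shows only the quantization step: each continuous encoder output is replaced by its nearest entry in a codebook. The codebook values and the `quantize` helper are hypothetical; in an actual VQ-VAE the codebook is learned during training, and the course's own implementation may differ.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Return (index, code) for the codebook entry closest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]

# Hypothetical 2-D codebook with three entries (learned in a real model).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical encoder outputs for three audio frames.
encoder_outputs = [[0.9, 0.8], [-0.2, 0.1], [-0.7, 1.2]]

# The discrete token sequence is what a downstream model would consume.
tokens = [quantize(z, codebook)[0] for z in encoder_outputs]
print(tokens)
```

Turning audio into a sequence of discrete codebook indices is what allows sequence models such as transformers to generate audio in the same way they generate text tokens.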

What's included

31 videos, 2 readings, 4 assignments

31 videos, total 202 minutes
  • Introduction to Variational Autoencoders (4 minutes)
  • Autoencoders (5 minutes)
  • Latent Space (8 minutes)
  • Inside the Encoder-Decoder Blocks (8 minutes)
  • Training VAEs (Part 1) (5 minutes)
  • Training VAEs (Part 2) (7 minutes)
  • Vector Quantized Variational Autoencoders (Part 1) (6 minutes)
  • Vector Quantized Variational Autoencoders (Part 2) (6 minutes)
  • Using VAE to Generate Melodies (7 minutes)
  • How to Condition VAEs with Additional Musical Information Such as Chord, Scale? (7 minutes)
  • Example: MusicVAE (8 minutes)
  • Attribute Vector Arithmetic for Melodies (8 minutes)
  • Example: Jukebox (6 minutes)
  • Example: Speech Synthesis (8 minutes)
  • Strengths and Limitations of VAE-Based Approaches (5 minutes)
  • Transformer Primer (6 minutes)
  • Transformers for Audio Generation (6 minutes)
  • Example: Music Transformer (13 minutes)
  • Revisiting Jukebox: How Transformers Can Generate Waveform Audio! (Part 1) (9 minutes)
  • Revisiting Jukebox: How Transformers Can Generate Waveform Audio! (Part 2) (4 minutes)
  • A New Paradigm: Audio Codec + Language Model (Part 1) (6 minutes)
  • A New Paradigm: Audio Codec + Language Model (Part 2) (8 minutes)
  • Example: FastSpeech (8 minutes)
  • Strengths and Limitations of Transformer-Based Approaches (5 minutes)
  • What Are Diffusion Models, and How Can They Generate Audio? (5 minutes)
  • Example: Stable Audio (6 minutes)
  • Example: DiffWave (5 minutes)
  • Strengths and Limitations of Diffusion-Based Approaches (5 minutes)
  • How Do the Recent Models Compare to Each Other? (9 minutes)
  • What Is on the Horizon? Where Are We Headed? (7 minutes)
  • Module 2 Recap (3 minutes)
2 readings, total 130 minutes
  • Resource Guide (10 minutes)
  • Audio Generation Models Inference and Comparison (Hands-on Lab) (120 minutes)
4 assignments, total 125 minutes
  • Module 2 Quiz (80 minutes)
  • Practice Quiz (15 minutes)
  • Practice Quiz (15 minutes)
  • Practice Quiz (15 minutes)

This module transitions from audio to image generation, introducing the principles and evolution of image and video synthesis. Learners examine key architectures like GANs and VAEs, explore how adversarial training works, and study variations such as Conditional and Progressive GANs, Pix2Pix, and CycleGAN. The module also connects theory to practice by showcasing creative and commercial applications—from art and design to data augmentation—demonstrating how generative models enhance realism and variety in visual outputs.
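To make the idea of adversarial training concrete, here is a minimal sketch of the two competing objectives in a standard GAN formulation. It is an illustration under stated assumptions, not the course's lab code: the function names are hypothetical, the inputs are plain lists of discriminator probabilities rather than network outputs, and the generator uses the common non-saturating loss.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real samples scored near 1 and fakes near 0."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants fakes scored near 1."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that each sample is real).
d_on_real = [0.9, 0.8, 0.95]
d_on_fake = [0.2, 0.1, 0.3]

print("discriminator loss:", discriminator_loss(d_on_real, d_on_fake))
print("generator loss:", generator_loss(d_on_fake))
```

Because the same scores on fake samples lower one loss while raising the other, the two networks are trained in alternation, each improving against the other's current state.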

What's included

22 videos, 3 readings, 5 assignments

22 videos, total 156 minutes
  • Overview of AI for Image and Video Generation (8 minutes)
  • Applications of Image and Video Generation (8 minutes)
  • DALL-E and MidJourney Examples (8 minutes)
  • Sora Examples (5 minutes)
  • A Short History of Image Generation (8 minutes)
  • Revisit VAE (6 minutes)
  • Introducing GAN (8 minutes)
  • Discriminator (7 minutes)
  • Generator (9 minutes)
  • GAN Training (6 minutes)
  • Challenges and Best Practices for GAN Training (6 minutes)
  • Progressive GAN (8 minutes)
  • Conditional GANs (8 minutes)
  • Applications, Advantages and Limitations of cGANs (7 minutes)
  • Image-to-Image Translation (7 minutes)
  • Challenges and Applications of Image-to-Image Translation (5 minutes)
  • Text to Image GAN (9 minutes)
  • Other GAN Variations: CycleGAN, DCGAN, StyleGAN (10 minutes)
  • Creative Design (9 minutes)
  • Commercial Use Cases (7 minutes)
  • Data Augmentation (7 minutes)
  • Module 3 Recap (2 minutes)
3 readings, total 140 minutes
  • StyleGAN (10 minutes)
  • Data Synthesis (10 minutes)
  • DCGAN from Scratch (Hands-on Lab) (120 minutes)
5 assignments, total 140 minutes
  • Module 3 Quiz (80 minutes)
  • Practice Quiz 1 (15 minutes)
  • Practice Quiz 2 (15 minutes)
  • Practice Quiz 3 (15 minutes)
  • Practice Quiz 4 (15 minutes)

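This module explores transformer- and diffusion-based approaches to image and video generation. Learners get an overview of key models and architectures, including the Vision Transformer, the encoder-decoder design pattern, convolutional encoders, and attention mechanisms, and then study the forward and reverse processes and the training of diffusion models. The module concludes with ethical considerations for generative AI, including bias in training data, transparency, intellectual property, data privacy, and deepfakes such as face swaps, voice cloning, and video deepfakes.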

What's included

21 videos, 1 reading, 4 assignments

21 videos, total 146 minutes
  • Overview on Key Models and Architectures (8 minutes)
  • High-Level Overview of Vision Transformer (8 minutes)
  • Encoder-Decoder Design Pattern (9 minutes)
  • Convolutional Encoders (10 minutes)
  • Self Attention (9 minutes)
  • Spatial vs. Channel vs. Temporal Attention (8 minutes)
  • Diffusion Model Architecture High-Level Overview (7 minutes)
  • Forward / Diffusion Process (7 minutes)
  • Reverse Process (7 minutes)
  • Diffusion Model Training (5 minutes)
  • Examples of Diffusion Model (6 minutes)
  • Bias in Training Data (8 minutes)
  • Transparency (9 minutes)
  • Intellectual Property (8 minutes)
  • Data Privacy (7 minutes)
  • Deepfake Intro (9 minutes)
  • Deep Fake - Face Swap (5 minutes)
  • Voice Cloning (4 minutes)
  • Video Deep Fake (6 minutes)
  • Module 4 Recap (2 minutes)
  • Course Wrap Up (3 minutes)
1 reading, total 120 minutes
  • ViT vs. Diffusion (Hands-on Lab) (120 minutes)
4 assignments, total 158 minutes
  • Module 4 Quiz (80 minutes)
  • Practice Quiz 1 (30 minutes)
  • Practice Quiz 2 (30 minutes)
  • Practice Quiz 3 (18 minutes)
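
The forward (diffusion) process listed among this module's lessons can be sketched in a few lines. This is a minimal illustration assuming a DDPM-style formulation with a linear noise schedule, which may differ from what the course presents: the schedule endpoints, step count, and function names are assumptions, and the "image" is a short list of numbers.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule."""
    product = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product

def forward_diffuse(x0, t, rng, num_steps=1000):
    """Sample x_t directly from x_0: sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise."""
    a_bar = alpha_bar(t, num_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)         # fixed seed so repeated runs match
x0 = [0.5, -0.25, 1.0, 0.0]    # hypothetical "pixel" values
for t in (0, 100, 500, 1000):
    xt, _ = forward_diffuse(x0, t, rng)
    print(t, round(alpha_bar(t), 4), [round(v, 3) for v in xt])
```

As `t` grows, the signal coefficient shrinks toward zero and the sample approaches pure Gaussian noise. In this formulation, the reverse process is trained to predict the `noise` term returned here so that the corruption can be undone step by step.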

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Alberta Machine Intelligence Institute
2 Courses, 614 learners

Frequently asked questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.
