Transformer Architectures and Multimodal Models
This course is part of Advanced Deep Learning Architectures Specialization
What you'll learn
Understand attention mechanisms and complete transformer architectures.
Implement multi-head attention and positional encoding techniques.
Analyze and optimize efficient transformer components like Flash Attention and MoE.
Build multimodal and similarity-based models using transformer foundations.
Skills you'll gain
- Software Architecture
- Artificial Intelligence and Machine Learning (AI/ML)
- Memory Management
- Generative Model Architectures
- Artificial Intelligence
- Distributed Computing
- Computer Vision
- Scalability
- Unsupervised Learning
- Recurrent Neural Networks (RNNs)
- Deep Learning
- Natural Language Processing
- Embeddings
- Large Language Modeling
- Model Optimization
- Model Training
- Artificial Neural Networks
Details to know
March 2026
13 assignments
There are 4 modules in this course
This course explores the foundations and evolution of modern transformer architectures, taking you from early sequence models to advanced multimodal systems that power today’s AI breakthroughs. Combining strong conceptual depth with practical demonstrations, this course provides a structured journey through attention mechanisms, transformer design, efficiency innovations, and large-scale training strategies.
You will begin by understanding Recurrent Neural Networks (RNNs), LSTMs, and GRUs, examining their strengths and limitations in modeling sequential data. From there, you’ll transition into attention mechanisms and multi-head attention, uncovering how transformers overcame long-standing challenges like vanishing gradients and long-term dependency modeling. As the course progresses, you’ll build a deep understanding of encoder-decoder architectures, positional encoding techniques such as sinusoidal embeddings and RoPE, and efficiency innovations like Flash Attention, GQA, and Mixture of Experts (MoE).
The course then expands into multimodal learning and similarity-based systems. You’ll explore Vision Transformers (ViTs), embedding alignment techniques, contrastive learning, and large-scale distributed training strategies. Through demonstrations and analysis, you’ll see how modern transformer systems scale to massive datasets while maintaining performance and memory efficiency.
By the end of this course, you will be able to:
- Explain the limitations of traditional RNN-based sequence models and how attention mechanisms address them.
- Implement and analyze multi-head attention and transformer encoder-decoder architectures.
- Compare positional encoding strategies and understand their impact on model generalization.
- Evaluate efficiency techniques such as Flash Attention, GQA, and MoE for scaling transformers.
- Understand Vision Transformers and multimodal representation learning.
- Apply similarity learning concepts using embeddings and distance metrics.
- Design scalable transformer training systems using distributed and memory-optimized strategies.
- Architect transformer-based systems for real-world NLP and multimodal applications.
This course is ideal for AI engineers, machine learning practitioners, researchers, and advanced students who want a rigorous understanding of transformer systems beyond surface-level usage.
A foundational understanding of Python and basic neural networks will be helpful. Join us to master transformer architectures, explore multimodal intelligence, and build the technical depth required to understand and scale the models shaping modern AI.
Build a strong foundation in sequence modeling by exploring RNNs, LSTMs, GRUs, and the evolution toward attention mechanisms. Understand gradient challenges, long-term dependency solutions, and how self-attention transforms contextual learning. Through guided demonstrations, you’ll visualize sequence flow, attention behavior, and multi-head representations in action.
What's included
11 videos • 5 readings • 4 assignments
11 videos•Total 61 minutes
- Specialization Introduction•4 minutes
- Course Introduction•3 minutes
- Recurrent Neural Networks and Backpropagation•6 minutes
- Demonstration: Forward Pass in RNNs•7 minutes
- Demonstration: Vanishing Gradient Illustration in RNN•7 minutes
- LSTM and GRU: Gated Architectures•4 minutes
- Demonstration: LSTM Networks for Sequence Modeling•6 minutes
- Demonstration: GRU Based Sequence Modeling•7 minutes
- Self-Attention and Multi-Head Attention Explained•4 minutes
- Demonstration: Multi-Head Attention in Transformer•6 minutes
- Demonstration: Head Contribution Analysis•7 minutes
5 readings•Total 85 minutes
- Welcome to Transformer Architectures and Multimodal Models•10 minutes
- Understanding RNNs: Sequence Modeling and Gradient Challenges•20 minutes
- Gated Recurrent Networks: Solving Long-Term Dependency Problems•20 minutes
- Attention Mechanisms: From Context Weighting to Multi-Head Representations•20 minutes
- Module Summary: Sequence Models and Attention Foundations•15 minutes
4 assignments•Total 48 minutes
- Knowledge Check: Sequence Models and Attention Foundations•30 minutes
- Practice Knowledge Check: Recurrent Neural Networks (RNN) Foundations•6 minutes
- Practice Knowledge Check: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)•6 minutes
- Practice Knowledge Check: Attention and Multi-Head Attention Mechanisms•6 minutes
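The attention lessons in this module centre on self-attention, where each token builds its output as a weighted mix of every token in the sequence. The block below is a minimal sketch of one scaled dot-product attention head in plain Python, not the course's own demonstration code. The function names and the toy 2-dimensional token vectors are assumptions made for illustration, and queries, keys, and values are simply set to the raw tokens rather than learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of equal-length float vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


# Toy sequence of three hypothetical token vectors (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention would run several such heads in parallel on different learned projections of the tokens and concatenate their outputs; that step is left out here to keep the sketch short.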
Explore the full transformer architecture, from encoder–decoder models to positional encoding and efficiency optimizations. Learn how attention layers, masking, and autoregressive decoding work together to power modern language models. Through practical walkthroughs, you’ll analyze transformer blocks, positional strategies like RoPE, and scalable design techniques such as Flash Attention and Mixture of Experts.
What's included
14 videos • 4 readings • 4 assignments
14 videos•Total 66 minutes
- Encoder and Decoder Architecture•4 minutes
- Demonstration: Encoder Forward Pass in Transformer Encoders: Attention Foundations•4 minutes
- Demonstration: Encoder Forward Pass in Transformer Encoders: Encoder Stack•5 minutes
- Demonstration: Autoregressive Decoding in Transformer Decoders: Core Components•4 minutes
- Demonstration: Autoregressive Decoding in Transformer Decoders: Autoregressive Generation•5 minutes
- Sinusoidal and RoPE Encodings•3 minutes
- Demonstration: RoPE Implementation•7 minutes
- Demonstration: Encoding Comparison: Positional Encoding Mechanism•7 minutes
- Demonstration: Encoding Comparison: Encoding Impact Analysis•7 minutes
- Flash Attention, GQA, and MoE•4 minutes
- Demonstration: Memory Efficient Attention: Standard Attention Baseline•4 minutes
- Demonstration: Memory Efficient Attention: Optimized Attention•4 minutes
- Demonstration: Expert Routing Visualization: Token to Expert Routing•3 minutes
- Demonstration: Expert Routing Visualization: Capacity and Load Balancing•5 minutes
4 readings•Total 75 minutes
- Transformer Encoder Decoder Models•20 minutes
- Positional Encoding Methods•20 minutes
- Efficient Transformer Design•20 minutes
- Module Summary: Complete Transformer Architectures•15 minutes
4 assignments•Total 48 minutes
- Knowledge Check: Complete Transformer Architectures•30 minutes
- Practice Knowledge Check: Transformer Blocks•6 minutes
- Practice Knowledge Check: Positional Encoding Techniques•6 minutes
- Practice Knowledge Check: Efficient Transformer Components•6 minutes
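Among the positional strategies this module lists, sinusoidal encoding is the simplest to write down: each position gets a fixed vector of sines and cosines at geometrically spaced frequencies. The block below is a minimal sketch of that standard formulation in plain Python, assuming an even model dimension. The function name and the small table size are illustrative choices, not taken from the course materials.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Build a num_positions x d_model table of fixed position vectors.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following
    odd column holds the matching cosine, where i is the even column index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # column i
            row.append(math.cos(angle))  # column i + 1
        table.append(row)
    return table


# Small illustrative table: 4 positions, 4-dimensional encodings.
pe = sinusoidal_positional_encoding(num_positions=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(x, 4) for x in row])
```

In a transformer these vectors are typically added to the token embeddings before the first attention layer. RoPE takes a different route and rotates query and key vectors by position-dependent angles instead, which this sketch does not cover.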
Expand beyond text to understand how transformers power multimodal AI and semantic similarity systems. Learn how vision and language models align embeddings, how similarity learning structures semantic space, and how large models scale through distributed training. Through applied demos, you’ll explore embedding alignment, semantic search concepts, and large-scale transformer optimization strategies.
What's included
15 videos • 4 readings • 4 assignments
15 videos•Total 74 minutes
- Vision Transformers and Multimodal Learning•4 minutes
- Demonstration: Image and Text Embedding Alignment: Similarity Computation•7 minutes
- Demonstration: Image and Text Embedding Alignment: Retrieval Visualization•5 minutes
- Demonstration: Multimodal Representation Analysis: Similarity Evaluation•7 minutes
- Demonstration: Multimodal Representation Analysis: Representation Geometry•7 minutes
- Text Embeddings and Similarity Learning•4 minutes
- Demonstration: Semantic Text Similarity: Computation and Heatmap Analysis•5 minutes
- Demonstration: Semantic Text Similarity: Embedding Space Geometry•4 minutes
- Demonstration: Embedding Distance Metrics: Similarity Foundations•5 minutes
- Demonstration: Embedding Distance Metrics: Visualizing and Ranking Analysis•4 minutes
- Distributed Transformer Training•3 minutes
- Demonstration: Large Model Training Setup: Architecture Setup•6 minutes
- Demonstration: Large Model Training Setup: Training and Optimisation•5 minutes
- Demonstration: Memory Usage Optimization: Model Setup•5 minutes
- Demonstration: Memory Usage Optimization: Benchmark and Comparison•4 minutes
4 readings•Total 75 minutes
- Multimodal Deep Learning•20 minutes
- Similarity Learning for Text•20 minutes
- Scaling Transformer Systems•20 minutes
- Module Summary: Multimodal and Similarity-Based Models•15 minutes
4 assignments•Total 48 minutes
- Knowledge Check: Multimodal and Similarity-Based Models•30 minutes
- Practice Knowledge Check: Multimodal Models•6 minutes
- Practice Knowledge Check: Similarity Models•6 minutes
- Practice Knowledge Check: Scaling Strategies•6 minutes
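The similarity lessons in this module rest on comparing embeddings with a distance or similarity metric and ranking candidates by the result. The block below is a minimal sketch of cosine-similarity ranking in plain Python. The 3-dimensional vectors and their labels are made-up stand-ins for real model embeddings, and the function names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query, vector))
        for label, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real system would get these from a model.
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "cat photo": [1.0, 0.2, 0.0],
    "dog photo": [0.1, 1.0, 0.0],
    "stock chart": [0.0, 0.0, 1.0],
}

ranked = rank_by_similarity(query_embedding, candidate_embeddings)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common default for embedding retrieval; Euclidean distance or a raw dot product would be alternative choices with different behaviour on unnormalized vectors.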
Apply your knowledge of sequence models, transformers, multimodal learning, and scaling strategies in a comprehensive practice project. Integrate architectural concepts, embedding techniques, and efficiency optimizations into a cohesive system-level design. Through guided implementation and evaluation, you’ll strengthen your ability to analyze, compare, and optimize transformer-based AI systems in real-world scenarios.
What's included
1 video • 1 reading • 1 assignment
1 video•Total 2 minutes
- Course Summary•2 minutes
1 reading•Total 60 minutes
- Practice Project: Building a Multimodal Transformer-Based Knowledge and Similarity Engine•60 minutes
1 assignment•Total 30 minutes
- End Knowledge Check: Transformer Architecture and Multimodal Models•30 minutes
Frequently asked questions
Basic knowledge of Python, linear algebra, and neural networks is recommended.
The course covers RNNs, attention mechanisms, transformers, efficiency techniques, multimodal models, and scaling strategies.
The course is designed to be completed in approximately 6–8 weeks.
