Building Resilient Systems
Recommended experience
What you'll learn
Explain core resilience engineering principles and differentiate between failure types in modern distributed systems.
Analyze system architectures to identify single points of failure and resilience gaps that could impact availability.
Develop disaster recovery strategies aligned with defined business requirements such as RTO and RPO.
Evaluate monitoring, observability, and incident response practices to improve system reliability and operational resilience.
Skills you'll gain
- Network Monitoring
- Load Balancing
- System Monitoring
- Service Level
- Site Reliability Engineering
- System Design and Implementation
- System Implementation
- High Voltage
- Disaster Recovery
- Systems Architecture
- Risk Management Framework
- Solution Architecture
- Distributed Computing
- Systems Design
- Software Design Patterns
- Incident Management
- Continuous Monitoring
- Business Continuity
- Software Architecture
- Incident Response
Details to know
April 2026
4 assignments
There are 4 modules in this course
Building resilient systems requires more than knowing individual tools: it demands the ability to design architectures that anticipate failure and recover effectively. In this intermediate course, you will learn how to apply resilience engineering principles to modern distributed systems, focusing on high availability, fault tolerance, and disaster recovery planning.
You will analyze how and why systems fail, identify hidden risks in system architecture, and design strategies that improve uptime and reliability. The course connects key concepts such as load balancing, redundancy, observability, and incident response into a cohesive resilience strategy aligned with business goals like RTO and RPO. Designed for IT professionals, DevOps engineers, and system architects, this course emphasizes practical decision-making, trade-offs, and operational readiness. By the end, you will be able to design resilient architectures, strengthen system reliability, and lead effective incident management and continuous improvement practices.
This module introduces the core concepts behind resilient system design. Learners will explore why failures are inevitable, how resilient systems differ from traditional architectures, and the foundational principles used to build systems that can withstand, adapt to, and recover from disruptions. The module sets the mindset and technical baseline required for designing reliable and fault-aware systems.
What's included
11 videos, 2 readings, 1 assignment, 1 peer review, 1 discussion prompt
11 videos • Total 86 minutes
- Welcome to Building Resilient Systems • 7 minutes
- Module Introduction • 3 minutes
- Why Systems Fail • 9 minutes
- Failure Types and Their Impact • 10 minutes
- Learning from Real-World Outages • 10 minutes
- Defining Resilience in Modern Systems • 8 minutes
- Key Characteristics of Resilient Architectures • 8 minutes
- Resilience vs. Traditional Design Approaches • 8 minutes
- Core Principles of Resilience Engineering • 8 minutes
- Redundancy, Diversity, and Isolation • 7 minutes
- Trade-offs in Resilient Design • 9 minutes
2 readings • Total 10 minutes
- Welcome to the Course: Course Overview • 5 minutes
- Designing Resilient Systems • 5 minutes
1 assignment • Total 20 minutes
- Foundations of Resilient Systems • 20 minutes
1 peer review • Total 10 minutes
- Hands-On-Learning: Identifying Failure Risks in a System Design • 10 minutes
1 discussion prompt • Total 10 minutes
- Designing for Failure Before It Happens • 10 minutes
This module focuses on designing systems that remain available despite failures. Learners will explore high availability concepts, fault tolerance techniques, and architectural patterns used to eliminate single points of failure. The module emphasizes practical design decisions that improve uptime while balancing cost and complexity.
What's included
10 videos, 1 reading, 1 assignment, 1 peer review, 1 discussion prompt
10 videos • Total 77 minutes
- Module Introduction • 3 minutes
- What High Availability Really Means • 8 minutes
- Availability Metrics and SLAs • 7 minutes
- Eliminating Single Points of Failure • 13 minutes
- Active-Active vs. Active-Passive Designs • 6 minutes
- Load Balancing and Traffic Distribution • 8 minutes
- Failover Mechanisms and Health Checks • 7 minutes
- Designing for Partial Failures • 8 minutes
- Graceful Degradation and Backpressure • 8 minutes
- Containing Failures with Isolation • 9 minutes
1 reading • Total 5 minutes
- High Availability and Fault-Tolerant Architecture • 5 minutes
1 assignment • Total 20 minutes
- High Availability and Fault Tolerance Design • 20 minutes
1 peer review • Total 10 minutes
- Hands-On-Learning: Designing a High Availability Architecture • 10 minutes
1 discussion prompt • Total 10 minutes
- Balancing Availability, Cost, and Complexity • 10 minutes
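As a rough illustration of the availability arithmetic behind topics such as availability metrics and eliminating single points of failure, the sketch below compares components chained in series with redundant replicas. It is not taken from the course materials: the function names are hypothetical, and it assumes component failures are independent, which real systems often violate.

```python
# Illustrative sketch only; names are hypothetical and failures are
# assumed to be statistically independent.

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, replicas):
    """Availability of N identical replicas where any single one suffices."""
    return 1.0 - (1.0 - availability) ** replicas


def downtime_minutes_per_year(availability):
    """Expected yearly downtime implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

Under these assumptions, two 99% components in series yield about 98.01% availability, while two 99% replicas in parallel yield about 99.99%. This is the trade-off the module describes: redundancy raises uptime at the price of extra cost and complexity.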
This module focuses on preparing systems and teams to recover from major disruptions. Learners will explore backup and recovery strategies, define recovery objectives, design disaster recovery testing approaches, and create operational runbooks that support consistent and effective recovery. The module emphasizes planning, decision-making, and operational readiness rather than tool-specific implementation.
What's included
10 videos, 1 reading, 1 assignment, 1 peer review, 1 discussion prompt
10 videos • Total 74 minutes
- Module Introduction • 5 minutes
- Backup Strategies and Recovery Models • 8 minutes
- Understanding RTO and RPO • 8 minutes
- Designing Backup and Recovery Solutions • 8 minutes
- Why Disaster Recovery Testing Matters • 8 minutes
- Types of Disaster Recovery Tests • 6 minutes
- Developing Disaster Recovery Testing Procedures • 8 minutes
- What is an Operational Runbook • 7 minutes
- Runbook Structure and Best Practices • 8 minutes
- Creating Effective Recovery Runbooks • 10 minutes
1 reading • Total 5 minutes
- Disaster Recovery Planning and RTO or RPO Concepts • 5 minutes
1 assignment • Total 20 minutes
- Disaster Recovery Planning and Operational Readiness • 20 minutes
1 peer review • Total 10 minutes
- Hands-On-Learning: Creating a Disaster Recovery Plan • 10 minutes
1 discussion prompt • Total 10 minutes
- Evaluating Recovery Readiness in Real-World Environments • 10 minutes
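To make the recovery objectives in this module concrete, the sketch below checks one incident against an RPO (recovery point objective: the maximum tolerable window of lost data) and an RTO (recovery time objective: the maximum tolerable time to restore service). It is an illustration under stated assumptions rather than course material: the function names are hypothetical, and it treats the last successful backup as the recovery point.

```python
from datetime import datetime, timedelta

# Illustrative sketch only; function names are hypothetical.


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """Data written after the last good backup is lost; compare that window to the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """Compare the time from failure to restored service against the RTO."""
    return restored_time - failure_time <= rto
```

For example, with a one-hour RPO and RTO, a failure 30 minutes after the last backup meets the RPO, but a restore that finishes 90 minutes after the failure misses the RTO.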
This module focuses on maintaining system reliability through effective monitoring, observability, and structured incident management. Learners will explore how logs, metrics, and traces provide system visibility, how alerting strategies support timely response, and how post-incident reviews drive continuous improvement. The module emphasizes operational effectiveness and learning from incidents rather than tool-specific implementation.
What's included
11 videos, 1 reading, 1 assignment, 2 peer reviews, 1 discussion prompt
11 videos • Total 66 minutes
- Module Introduction • 2 minutes
- Monitoring vs. Observability • 6 minutes
- Observability Pillars: Logs, Metrics, and Traces • 6 minutes
- Implementing Comprehensive Observability • 7 minutes
- Principles of Effective Alerting • 6 minutes
- Alert Thresholds and Escalation Paths • 6 minutes
- Designing Effective Alerting Strategies • 6 minutes
- Incident Lifecycle and Response Review • 9 minutes
- Conducting Productive Post-Incident Reviews • 7 minutes
- Driving Continuous Improvement from Incidents • 6 minutes
- Course Wrap-Up • 5 minutes
1 reading • Total 5 minutes
- Observability and Incident Management Fundamentals • 5 minutes
1 assignment • Total 20 minutes
- Monitoring, Observability, and Incident Management • 20 minutes
2 peer reviews • Total 70 minutes
- Hands-On-Learning: Incident Analysis and Post-Incident Review • 10 minutes
- Project: Designing and Defending a Resilient System Architecture • 60 minutes
1 discussion prompt • Total 10 minutes
- Designing Observability and Alerting for Real Impact • 10 minutes
Frequently asked questions
To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.
When you purchase a Certificate you get access to all course materials, including graded assignments. Upon completing the course, your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.
Yes. In select learning programs, you can apply for financial aid or a scholarship if you can't afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you'll find a link to apply on the description page.
