Optimizing Spark and Cloud Data Storage for Analytics
This course is part of Open source Data Engineering with Spark, dbt & Airflow Professional Certificate
What you'll learn
Optimize Spark job performance through strategic partitioning and caching, achieving 30%+ runtime improvements using data access analysis.
Implement transactional data lakes with Delta format, enabling versioning, ACID operations, and schema evolution for reliable datasets.
Provision secure cloud data infrastructure using IAM policies, private networks, and encrypted storage following security best practices.
Evaluate and benchmark storage formats (Parquet, ORC, Avro) to select optimal solutions for analytical workloads and cost efficiency.
Skills you'll gain
- Data Storage Technologies
- Performance Tuning
- Transaction Processing
- Data Storage
- Cloud Infrastructure
- Cloud Security
- Cloud Deployment
- Data Warehousing
- Data Integrity
- Security Controls
- Cloud Computing
- Infrastructure Architecture
- Infrastructure as Code (IaC)
- Data Security
- Data Management
- Cloud Computing Architecture
Details to know
March 2026
Build your Data Analysis expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate from Coursera
There are 11 modules in this course
You will master advanced performance optimization techniques for large-scale data processing using Apache Spark and cloud storage technologies. In this hands-on course, you'll learn to diagnose and resolve performance bottlenecks that plague distributed data systems, implement strategic partitioning and caching strategies that can improve job performance by 30% or more, and design secure, cost-effective cloud data infrastructure.
You will gain expertise in transactional data lake technologies like Delta Lake, evaluate storage formats to optimize analytical workloads, and provision enterprise-grade cloud infrastructure with proper security controls. Through practical exercises, you'll analyze Spark execution plans, implement data versioning and ACID transactions, and benchmark different storage formats to make informed architectural decisions. By the end, you will have the skills to optimize data pipelines at scale, reduce cloud storage costs through intelligent format selection, and build robust data infrastructure that meets enterprise security requirements. This expertise directly addresses the performance challenges faced by data engineers working with petabyte-scale datasets in production environments.
You will discover why systematic performance analysis beats random configuration changes and master reading Spark UI metrics to identify bottlenecks.
What's included
3 videos, 1 reading, 1 assignment
3 videos • Total 14 minutes
- When Data Pipelines Crash: A Performance Crisis • 3 minutes
- Spark UI Fundamentals: Reading the Performance Story • 6 minutes
- Navigating the Spark UI for Performance Optimization • 5 minutes
1 reading • Total 7 minutes
- Performance Bottleneck Identification: Patterns and Solutions • 7 minutes
1 assignment • Total 3 minutes
- Spark UI Analysis Challenge • 3 minutes
You will implement partitioning and caching strategies to achieve measurable performance improvements in distributed data processing.
What's included
3 videos, 1 reading, 2 assignments, 1 ungraded lab
3 videos • Total 16 minutes
- From 4 Hours to 5 Minutes: Netflix's Optimization Success • 3 minutes
- Caching Strategies: Reducing Computation Costs • 7 minutes
- Implementing Partitioning and Caching Optimizations • 5 minutes
1 reading • Total 7 minutes
- Partitioning Strategies: Minimizing Data Movement • 7 minutes
2 assignments • Total 16 minutes
- Comprehensive Spark Optimization Assessment • 13 minutes
- Spark Performance Optimization and Analysis Techniques • 3 minutes
1 ungraded lab • Total 20 minutes
- Implement Caching and Partitioning for Spark Performance Optimization • 20 minutes
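To give a feel for the partitioning idea this module covers, here is a minimal plain-Python sketch of hash partitioning. It is not Spark code and does not reproduce Spark's partitioner; the function names, the sample records, and the use of CRC32 as the hash are all assumptions made for illustration. The point it shows is that records sharing a key always land in the same partition, which is what lets a later aggregation on that key avoid moving data again.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, num_partitions):
    """Map a key to a partition index with a deterministic hash (CRC32 here)."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(records, key, num_partitions):
    """Group records (dicts) into partitions by hashing the value under `key`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record[key], num_partitions)].append(record)
    return dict(partitions)


# Hypothetical sample data: three customers, several events each.
events = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 4},
    {"customer": "a", "amount": 7},
    {"customer": "c", "amount": 1},
    {"customer": "b", "amount": 9},
]

partitions = hash_partition(events, key="customer", num_partitions=4)
for index, rows in sorted(partitions.items()):
    print(index, [row["customer"] for row in rows])
```

Choosing the partition key to match the most common join or group-by column is the usual motivation; a key with very uneven value counts would produce uneven partitions in this sketch as well.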
You will develop foundational skills for analyzing distributed execution plans to identify performance bottlenecks caused by data shuffle and skew patterns in Spark applications.
What's included
3 videos, 3 readings, 1 assignment, 1 ungraded lab
3 videos • Total 14 minutes
- Why Performance Analysis Saves Data Teams from Pipeline Disasters • 3 minutes
- Understanding Spark's Distributed Execution Architecture • 6 minutes
- Interpreting Visual Execution Metrics and Performance Indicators • 6 minutes
3 readings • Total 22 minutes
- Data Shuffle and Skew: The Hidden Performance Killers • 8 minutes
- Navigating Spark's Execution Monitoring Interface • 7 minutes
- Identifying Bottleneck Patterns in Task Execution Metrics • 7 minutes
1 assignment • Total 3 minutes
- Knowledge Check: Execution Plan Analysis Fundamental • 3 minutes
1 ungraded lab • Total 20 minutes
- Diagnose Performance Bottlenecks Through Execution Plan Analysis • 20 minutes
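One common way to spot the skew pattern this module describes is to compare the slowest task in a stage with the typical task. The sketch below does that in plain Python on a list of task durations, as one might copy them from a stage's task table. The function names and the threshold of 3.0 are hypothetical choices for illustration, not Spark defaults or course material.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Return the slowest task duration divided by the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    typical = median(task_durations_s)
    slowest = max(task_durations_s)
    if typical == 0:
        # Avoid dividing by zero when most tasks finish instantly.
        return float("inf") if slowest > 0 else 1.0
    return slowest / typical


def looks_skewed(task_durations_s, threshold=3.0):
    """Flag a stage as skewed when the ratio reaches a hypothetical threshold."""
    return skew_ratio(task_durations_s) >= threshold


# Hypothetical durations in seconds for two stages of four tasks each.
balanced_stage = [10.0, 11.0, 9.5, 10.5]
skewed_stage = [10.0, 11.0, 9.5, 95.0]

print(looks_skewed(balanced_stage))
print(looks_skewed(skewed_stage))
```

A single straggler dominating the stage, as in the second list, is the signature worth investigating; the right threshold depends on the workload.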
You will apply advanced optimization strategies to resolve identified performance bottlenecks through partition tuning, broadcast joins, and configuration optimization techniques.
What's included
1 video, 1 reading, 3 assignments
1 video • Total 7 minutes
- Configuration Optimization: Tuning Spark for Maximum Performance • 7 minutes
1 reading • Total 10 minutes
- Partition Strategies and Broadcast Join Optimization Techniques • 10 minutes
3 assignments • Total 30 minutes
- Final Assessment: Comprehensive Performance Bottleneck Analysis and Resolution • 12 minutes
- Optimize Real-World Performance Scenario • 15 minutes
- Knowledge Check: Performance Optimization Strategies • 3 minutes
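The idea behind a broadcast join can be sketched without Spark: build a hash lookup from the small table once, then stream the large table past it, so the large side never has to be shuffled by key. The code below is a plain-Python illustration under that assumption; the function name and the sample tables are hypothetical, and it is not how Spark implements the join internally.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` by hashing the small side."""
    # Build phase: index the small table once. In a distributed engine this
    # lookup is the piece that would be copied to every worker.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: scan the large table and look up matches locally.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data: a large fact table and a small dimension table.
orders = [
    {"country_id": 1, "amount": 20},
    {"country_id": 2, "amount": 5},
    {"country_id": 9, "amount": 7},  # no matching country, dropped by the inner join
]
countries = [
    {"country_id": 1, "name": "DE"},
    {"country_id": 2, "name": "FR"},
]

result = broadcast_hash_join(orders, countries, key="country_id")
print(result)
```

The approach only pays off when the small side fits comfortably in memory on each worker, which is why it is usually paired with a size limit.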
You will understand why transactional features are essential for data lake reliability, explore the fundamental concepts of ACID transactions and versioning, and learn how to convert existing Parquet tables to transactional Delta format.
What's included
2 videos, 1 reading, 2 assignments
2 videos • Total 12 minutes
- Understanding ACID Transactions and Versioning in Data Lakes • 8 minutes
- Converting Parquet Tables to Delta Format • 5 minutes
1 reading • Total 10 minutes
- Delta Lake Architecture and Transaction Log Mechanics • 10 minutes
2 assignments • Total 33 minutes
- Implementing Transactional and Versioning Features in Data Lake Tables • 30 minutes
- Transactional and Versioning Features Knowledge Check • 3 minutes
You will execute atomic write and delete operations with conditions, query historical table versions for audit purposes, verify rollback capabilities through version history, and demonstrate mastery through hands-on lab work and comprehensive assessment.
What's included
2 videos, 2 assignments, 1 ungraded lab
2 videos • Total 14 minutes
- Atomic Write Operations and Conditional Deletes • 8 minutes
- Querying Historical Versions and Verifying Rollback Capabilities • 6 minutes
2 assignments • Total 13 minutes
- Transactional and Versioning Features Mastery Assessment • 10 minutes
- Versioning Operations Knowledge Check • 3 minutes
1 ungraded lab • Total 20 minutes
- Implementing Transactional Operations on Customer Analytics Tables • 20 minutes
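Querying a historical version rests on one idea: the table is an ordered log of commits, and any past state can be rebuilt by replaying the log up to that commit. The toy model below illustrates only that idea in plain Python. The class, its methods, and the file names are hypothetical; it is not Delta Lake's log format or API.

```python
class ToyTableLog:
    """Toy append-only commit log; each commit adds and/or removes data files."""

    def __init__(self):
        self._commits = []  # one (added_files, removed_files) pair per version

    def commit(self, add=(), remove=()):
        """Append one atomic commit and return its version number."""
        self._commits.append((frozenset(add), frozenset(remove)))
        return len(self._commits) - 1

    def files_at(self, version):
        """Replay commits 0..version to rebuild the set of live files."""
        if not 0 <= version < len(self._commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for added, removed in self._commits[: version + 1]:
            live -= removed
            live |= added
        return live


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
# A conditional delete rewrites one file: the old file is removed and its
# replacement is added in the same commit, so readers never see a half state.
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

print(sorted(log.files_at(v0)))
print(sorted(log.files_at(v1)))
```

Because old commits are never edited, reading version 0 after version 1 exists still returns the original files, which is the property rollback checks rely on.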
You will understand fundamental cloud security principles, encryption methods, and access control concepts needed to provision secure data infrastructure using Infrastructure as Code.
What's included
3 videos, 1 reading, 2 assignments
3 videos • Total 21 minutes
- Why Cloud Security Breaches Cost Companies Millions • 3 minutes
- Core Principles of Cloud Security Architecture • 13 minutes
- Analyzing Secure Terraform Configuration Patterns • 5 minutes
1 reading • Total 10 minutes
- Infrastructure as Code Security Best Practices • 10 minutes
2 assignments • Total 23 minutes
- Secure Cloud Infrastructure Design • 18 minutes
- Cloud Security Foundations Knowledge Check • 5 minutes
You will implement secure cloud data infrastructure using Terraform, creating encrypted storage with proper access controls and network isolation that demonstrates practical application of security principles.
What's included
2 videos, 1 reading, 2 assignments, 1 ungraded lab
2 videos • Total 15 minutes
- How Netflix Scales Secure Data Infrastructure • 4 minutes
- Implementing Secure S3 Storage with Terraform • 10 minutes
1 reading • Total 10 minutes
- IAM Security Patterns for Data Infrastructure • 10 minutes
2 assignments • Total 18 minutes
- Secure Cloud Infrastructure Mastery Assessment • 12 minutes
- Secure Infrastructure Implementation Check • 6 minutes
1 ungraded lab • Total 20 minutes
- Provision Secure Data Infrastructure with Terraform • 20 minutes
You will establish foundational understanding of storage format trade-offs and begin evaluating columnar versus row-oriented approaches for analytical workloads.
What's included
3 videos, 1 reading, 1 assignment
3 videos • Total 11 minutes
- Why Storage Format Decisions Make or Break Analytics Performance • 3 minutes
- Columnar vs Row-Oriented Storage: Core Concepts and Trade-offs • 5 minutes
- Analyzing Storage Format Trade-offs: A Systematic Approach • 3 minutes
1 reading • Total 7 minutes
- Storage Format Deep Dive: Performance Characteristics and Use Cases • 7 minutes
1 assignment • Total 6 minutes
- Storage Format Knowledge Check • 6 minutes
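The columnar versus row-oriented trade-off can be seen in miniature with plain Python structures. In the sketch below, the row layout keeps whole records together and the columnar layout keeps each field in its own list, so an aggregate over one field scans far fewer values. The function names and sample data are hypothetical, and counting values scanned is only a rough stand-in for bytes read from storage.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_scanned_row_oriented(rows):
    """A row layout reads whole records, so every field of every row is scanned."""
    return sum(len(row) for row in rows)


def values_scanned_columnar(columns, needed):
    """A columnar layout reads only the columns the query needs."""
    return sum(len(columns[name]) for name in needed)


# Hypothetical table with three fields; the query only needs `amount`.
rows = [
    {"user_id": 1, "country": "DE", "amount": 20.0},
    {"user_id": 2, "country": "FR", "amount": 5.0},
    {"user_id": 3, "country": "DE", "amount": 7.5},
]
columns = to_columns(rows)

total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

print(total_from_rows, values_scanned_row_oriented(rows))
print(total_from_columns, values_scanned_columnar(columns, ["amount"]))
```

The gap widens as tables gain columns, which is why analytical queries that touch a few fields of wide tables tend to favor columnar formats, while workloads that read or write whole records favor row-oriented ones.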
You will conduct hands-on performance benchmarking of storage formats and create evidence-based recommendations that mirror professional data engineering decision-making processes.
What's included
1 video, 2 readings, 3 assignments
1 video • Total 3 minutes
- Professional Recommendation Reports: Translating Benchmarks into Business Value • 3 minutes
2 readings • Total 18 minutes
- Enterprise Benchmarking Methodologies: From Netflix to Cloudflare • 10 minutes
- Interpreting Storage Format Performance Data • 8 minutes
3 assignments • Total 33 minutes
- Storage Architecture Decision: Comprehensive Analysis and Recommendation • 12 minutes
- Create Professional Storage Recommendation Report • 15 minutes
- Performance Benchmarking Knowledge Check • 6 minutes
You will create a comprehensive data infrastructure optimization project that integrates Spark performance tuning, cloud security provisioning, and storage architecture evaluation. This project synthesizes distributed computing optimization, cloud infrastructure design, and data warehousing principles into a realistic enterprise solution.
What's included
4 readings, 1 assignment
4 readings • Total 110 minutes
- Why This Project Matters • 10 minutes
- Project Requirements • 10 minutes
- Assignment: Data Infrastructure Optimization Project • 60 minutes
- Solution Key • 30 minutes
1 assignment • Total 15 minutes
- Graded Quiz: Optimizing Spark and Cloud Data Storage for Analytics • 15 minutes
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Frequently asked questions
To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.
When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.
¹ Some assignments in this course are AI-graded. For these assignments, your data will be used in accordance with Coursera's Privacy Notice.
