Introduction to PySpark | Distributed Computing with Apache Spark

Last Updated : 18 Jul, 2025

As data grows rapidly from sources like social media and e-commerce, traditional systems fall short. Distributed computing, with tools like Apache Spark and PySpark, enables fast, scalable data processing. This article covers the basics, key features and a hands-on PySpark.

What is Distributed Computing?

Distributed computing is a computing model where large computational tasks are divided and executed across multiple machines (nodes) that work in parallel. Think of it as breaking a huge job into smaller parts and assigning each part to a different worker. It's key features include:

Speed: Multiple nodes work simultaneously.
Scalability: Add more nodes to handle more data.
Fault tolerance: If one node fails, others continue.

What is Apache Spark?

Apache Spark is an open-source distributed computing engine developed by the Apache Software Foundation. It is designed to process large datasets quickly and efficiently across a cluster of machines. It's key features include:

High Performance: Much faster than Hadoop, thanks to in-memory computing.
Multi-language Support: Works with Python, Scala, Java and R.
All-in-One Engine: Handles batch, streaming, ML and graph processing.
Easy to Use: Offers simple, high-level APIs built on the MapReduce model.

What is Pyspark?

PySpark is the Python API for Apache Spark, allowing Python developers to use the full power of Spark’s distributed computing framework with familiar Python syntax. It bridges the gap between Python’s ease of use and Spark’s processing power. It's key features include:

Python-Friendly: Build Spark applications using pure Python great for data scientists and engineers.
Handles Big Data: Efficiently process huge datasets across multiple machines.
Rich Libraries: Includes modules for SQL (pyspark.sql), machine learning (pyspark.ml), and streaming (pyspark.streaming).
DataFrame & SQL API: Work with structured data using powerful, SQL-like operations.

PySpark Modules

PySpark is built in a modular way, offering specialized libraries for different data processing tasks:

Module	Description
pyspark.sql	Work with structured data using DataFrames and SQL queries.
pyspark.ml	Build machine learning pipelines (classification, regression, clustering, etc.).
pyspark.streaming	Process real-time data streams (e.g., Twitter feed, logs).
pyspark.graphx	Handle graph computations and social network analysis (Scala/Java primarily).

How PySpark Works

When you run a PySpark application, it follows a structured workflow to process large datasets efficiently across a distributed cluster. Here’s a high-level overview:

Driver Program: Your Python script that initiates and controls the Spark job.
SparkContext: Connects the driver to the Spark cluster and manages job configuration.
RDDs/DataFrames: Data structures that are distributed and processed in parallel.
Cluster Manager: Schedules and allocates resources to worker nodes (e.g., YARN, Mesos, Kubernetes).
Executor Nodes: Run the actual tasks in parallel and return results to the driver.

Basic Example: Word Count with PySpark

Here’s a simple PySpark example that reads a text file and counts the frequency of each word:

Output

[('PySpark', 1), ('makes', 1), ('big', 1), ('data', 1), ('processing', 1), ('fast', 1), ('and', 1), ('easy', 1), ('with', 1), ('Python', 1)]

Explanation:

sc.parallelize() creates an RDD from the text. flatMap() splits lines into words, map() creates (word, 1) pairs and reduceByKey() adds up word counts.
collect() gathers the final word count output from all Spark worker nodes to the driver.
sc.stop() stops the SparkContext to free up system resources.

Comment

Article Tags:

Big Data

python

URL: https://www.geeksforgeeks.org/big-data/introduction-pyspark-distributed-computing-apache-spark/