Creating a PySpark DataFrame

Last Updated : 23 Jul, 2025

PySpark processes large datasets through its DataFrame structure. This article covers several ways to create a PySpark DataFrame. Each one starts by initializing a SparkSession, the entry point for PySpark applications, as shown below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Let's start with an example that creates a DataFrame from a list of rows, where each row is represented as a Row object. This method is useful for small datasets that fit in the driver's memory.

spark = SparkSession.builder.getOrCreate(): Initializes a SparkSession which is the entry point for working with PySpark or retrieves an existing session if one is already created.
df = spark.createDataFrame([...]): Creates a PySpark DataFrame using a list of Row objects where each row contains values for the columns a, b, c, d and e.

Output:

[Figure: Basic example using a list of rows]

Syntax

pyspark.sql.SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters:

data: Data we want to load into the DataFrame, such as a list of rows or tuples, a Pandas DataFrame or an RDD.
schema: A DataType, a DDL-formatted string or a list of column names. It is optional; when omitted, Spark infers the schema from the data.
samplingRatio: Fraction of rows sampled when inferring the schema. Default is None.
verifySchema: Checks that the data types of each row match the schema. Default is True.

Returns: DataFrame

Different Methods to Create a PySpark DataFrame

1. Create PySpark DataFrame with an Explicit Schema

Here we specify the schema explicitly to define the structure of the DataFrame, which is useful when we want more control over column data types than schema inference gives.

df = spark.createDataFrame([...], schema='a long, b double, c string, d date, e timestamp'): Creates a PySpark DataFrame using a list of tuples and an explicit schema that defines the column names and data types.

Output:

[Figure: Explicit schema]

2. Create DataFrame from a Pandas DataFrame

We can convert a Pandas DataFrame into a PySpark DataFrame for large-scale data processing.

pandas_df = pd.DataFrame({...}): Creates a Pandas DataFrame pandas_df with columns a, b, c, d, and e using sample data.
df = spark.createDataFrame(pandas_df): Converts Pandas DataFrame pandas_df into a PySpark DataFrame df.

Output:

[Figure: Using a Pandas DataFrame]

3. Create DataFrame from an RDD

We can convert an existing RDD (Resilient Distributed Dataset) into a DataFrame for structured data processing.

rdd = spark.sparkContext.parallelize([ ... ]): Creates an RDD from a list of tuples where each tuple represents a row of data.
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e']): Converts RDD into a PySpark DataFrame and assigns column names (a, b, c, d, e) to DataFrame.

Output:

[Figure: Using an RDD]

4. Create DataFrame from a CSV File

A CSV file can be loaded into a PySpark DataFrame. Here we use a sample dataset, train_dataset, and read it through Pandas before converting the result.

df = spark.createDataFrame(pd.read_csv('/content/train_dataset-1.csv')): Reads the CSV file with the Pandas read_csv() function and then converts the resulting Pandas DataFrame into a PySpark DataFrame.

Output:

[Figure: Using a CSV file]

5. Create PySpark DataFrame from a Text File

If the data is stored in a plain text file, Spark's read.text() method can load each line as a row. The example here takes a different route: it reads a tab-delimited .txt file with Pandas and converts the result. The sample .txt file can be downloaded from here.

df = spark.createDataFrame(pd.read_csv('/content/text_file.txt', delimiter="\t")): Reads the text file with pandas.read_csv(), using a tab delimiter, into a Pandas DataFrame and then converts it into a PySpark DataFrame.

Output:

[Figure: Using a text file]

6. Create DataFrame from JSON

JSON is a common format for structured data. Spark's read.json() can load JSON files directly into a PySpark DataFrame; the example here reads the file with Pandas and converts the result. The file we are using can be downloaded from here.

df = spark.createDataFrame(pd.read_json('/content/json_data.json')): Reads the JSON file with pandas.read_json() into a Pandas DataFrame and then converts it into a PySpark DataFrame.

Output:

[Figure: Using a JSON file]

PySpark's ability to process large-scale datasets using DataFrames, together with its integration with Spark's distributed computing framework, makes it important for data science work.

URL: https://www.geeksforgeeks.org/python/creating-a-pyspark-dataframe/