In this article, we'll learn how to drop columns from a PySpark DataFrame when every value in the column is null.
First, we create a DataFrame with the pyspark.sql.SparkSession.createDataFrame() method.
Syntax
pyspark.sql.SparkSession.createDataFrame()
Parameters:
- data: An RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame.
- schema: A datatype string or a list of column names, default is None.
- samplingRatio: The sample ratio of rows used for inferring the schema.
- verifySchema: Verify data types of every row against schema. Enabled by default.
Returns: DataFrame
Output:
+---------+----------+---------+------+------+
|firstname|middlename|lastname |gender|salary|
+---------+----------+---------+------+------+
|James    |null      |Bond     |M     |6000  |
|Michael  |null      |null     |M     |4000  |
|Robert   |null      |Pattinson|M     |4000  |
|Natalie  |null      |Portman  |F     |4000  |
|Julia    |null      |Roberts  |F     |1000  |
+---------+----------+---------+------+------+
Here we want to drop every column whose values are all null. In the DataFrame above, middlename is null in every row, so that is the column to drop. The lastname column has only one null value, so it is kept.
Output:
{'firstname': 0, 'middlename': 5, 'lastname': 1, 'gender': 0, 'salary': 0}
['middlename']
+---------+---------+------+------+
|firstname|lastname |gender|salary|
+---------+---------+------+------+
|James |Bond |M |6000 |
|Michael |null |M |4000 |
|Robert |Pattinson|M |4000 |
|Natalie |Portman |F |4000 |
|Julia |Roberts |F |1000 |
+---------+---------+------+------+