In this article, we will discuss how to get a specific row from a PySpark dataframe.
Creating Dataframe for demonstration:
Output:
[Image: the demonstration dataframe displayed as a table]
collect() returns all rows of the dataframe as a list of Row objects, so a single row can be read by its list index.
Syntax: dataframe.collect()[index_position]
where,
- dataframe is the pyspark dataframe
- index_position is the zero-based index of the row in the dataframe (negative values count from the end)
Example: Python code to access rows
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
show() prints the top n rows of the pyspark dataframe as a table. It only displays the rows and does not return them.
Syntax: dataframe.show(no_of_rows)
where, no_of_rows is the number of rows to display
Example: Python code to get the data using show() function
Output:
[Image: table output of show()]
first() returns only the first row of the dataframe, as a single Row object.
Syntax: dataframe.first()
Example: Python code to select the first row in the dataframe.
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
head(n) returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.head(n)
where, n is the number of rows to be returned
Example: Python code to get the top n rows.
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
tail(n) returns the last n rows of the dataframe as a list of Row objects.
Syntax: dataframe.tail(n)
where, n is the number of rows to be returned from the end of the dataframe
Example: Python code to get the last n rows
Output:
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
select() chooses which columns appear in each row. Combined with collect(), it can be used to get a particular row by index.
Syntax: dataframe.select([columns]).collect()[index]
where,
- dataframe is the pyspark dataframe
- columns is the list of columns to be included in each row
- index is the zero-based index of the row to be returned
Example: Python code to select the particular row.
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')
take(n) also returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.take(n)
where, n is the number of rows to be returned
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]