URL: https://www.geeksforgeeks.org/r-language/data-manipulation-in-r-with-dplyr-package/

Data Manipulation in R with Dplyr Package

Last Updated : 15 Jul, 2025

Data manipulation in R involves cleaning, transforming, and organizing data to make it suitable for analysis. It includes tasks such as selecting, filtering, sorting, and creating new variables. The dplyr package provides functions that perform these operations with a simple, consistent syntax.

Common dplyr Functions

Some of the key functions in the dplyr package used for manipulating data in R are:

Function Name    Description
filter()         Keeps the rows that satisfy a condition.
distinct()       Removes duplicate rows from a data frame.
arrange()        Reorders the rows of a data frame.
select()         Keeps only the specified columns of a data frame.
rename()         Renames columns.
mutate()         Creates new columns while keeping the existing ones.
transmute()      Creates new columns and drops the existing ones.
summarize()      Reduces data to summary values such as an average or a sum.

1. filter() Method

We use the filter() method to extract rows that meet a given condition.

Syntax:

filter(dataframeName, condition)

  • dataframeName: The input data frame to filter.
  • condition: A logical expression evaluated for each row. Rows where it is TRUE are kept; rows where it is FALSE or NA are dropped.

Example:

We will create a sample data frame from which we extract rows where the runs column is greater than 100.

  • library(dplyr): Loads the dplyr package.
  • data.frame(): Creates a sample data frame.
  • filter(): Filters rows based on the condition runs > 100.
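
The code for this example is not shown above, so the following is a minimal sketch of what it could look like. The stats data frame and its values are illustrative assumptions, not data taken from the article.

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# Keep only the rows where runs is strictly greater than 100
high_scorers <- filter(stats, runs > 100)
print(high_scorers)
```

With this data, only players B and C remain, because a value of exactly 100 does not satisfy runs > 100.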

2. distinct() Method

We use the distinct() method to remove duplicate rows based on one or more columns.

Syntax:

distinct(dataframeName, col1, col2, ..., .keep_all = TRUE)

  • dataframeName: The input data frame.
  • col1, col2, ...: Optional. The column(s) to consider when identifying duplicates. If omitted, all columns are used.
  • .keep_all: If TRUE, keeps all columns in the output; if FALSE (the default), keeps only the specified columns.

Example: 

We remove duplicate rows from the stats data frame, first entirely and then based on the player column.

  • distinct(stats): Removes all duplicate rows.
  • distinct(stats, player, .keep_all = TRUE): Keeps only the first occurrence of each player and retains other columns.
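
A minimal sketch of both calls follows. The stats data frame here is an assumed example that deliberately contains a repeated row and a repeated player; it is not the article's original data.

```r
library(dplyr)

# Illustrative data: row 6 repeats row 1 exactly, and player "A" appears three times
stats <- data.frame(
  player  = c("A", "B", "C", "D", "A", "A"),
  runs    = c(100, 200, 408, 19, 56, 100),
  wickets = c(17, 20, NA, 5, 2, 17)
)

# Drop rows that are identical across every column
unique_rows <- distinct(stats)
print(unique_rows)

# Keep the first row seen for each player, retaining the other columns
unique_players <- distinct(stats, player, .keep_all = TRUE)
print(unique_players)
```

In this sketch the first call leaves five rows and the second leaves four, one per player. Without .keep_all = TRUE, the second call would return only the player column.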

3. arrange() Method

We use the arrange() method to sort rows in ascending order based on one or more columns.

Syntax:

arrange(dataframeName, columnName)

  • dataframeName: The input data frame.
  • columnName: The column by which to sort the data. Wrap it in desc() to sort in descending order.

Example:

We sort the rows of the stats data frame based on the runs column in ascending order.

  • arrange(stats, runs): Orders the rows by values in the runs column.
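
As a sketch of this example, assuming an illustrative stats data frame (the values are not from the article):

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# Sort rows by runs, smallest first
by_runs <- arrange(stats, runs)
print(by_runs)

# Sort rows by runs, largest first, using desc()
by_runs_desc <- arrange(stats, desc(runs))
print(by_runs_desc)
```

arrange() returns a new data frame and leaves stats unchanged, so the result has to be assigned if it is needed later.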

4. select() Method

We use the select() method to extract specific columns from a data frame.

Syntax:

select(dataframeName, col1, col2, ...)

  • dataframeName: The input data frame.
  • col1, col2, ...: The names of the columns to extract.

Example:

We select only the player and wickets columns from the stats data frame.

  • select(stats, player, wickets): Returns only the specified columns from the data frame.
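
A minimal sketch of this call, using an assumed stats data frame with illustrative values:

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# Keep only the player and wickets columns, in that order
player_wickets <- select(stats, player, wickets)
print(player_wickets)
```

The number of rows is unchanged; only the runs column is dropped.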

5. rename() Method

We use the rename() method to change the names of columns in a data frame.

Syntax:

rename(dataframeName, newName = oldName)

  • dataframeName: The input data frame.
  • newName = oldName: The new name for an existing column. The new name goes on the left and the current name on the right.

Example:

We rename the runs column to runs_scored in the stats data frame.

  • rename(stats, runs_scored = runs): Changes the column name runs to runs_scored.
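
The call could be sketched as follows, again with an assumed stats data frame whose values are illustrative:

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# Rename runs to runs_scored; the other columns keep their names and positions
renamed <- rename(stats, runs_scored = runs)
print(names(renamed))
```

The column data itself is untouched, so renamed$runs_scored holds the same values that stats$runs does.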

6. mutate() and transmute() Methods

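We use mutate() to create new columns while keeping the existing ones. transmute() also creates new columns, but it drops the existing ones and returns only the columns it computes.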

Syntax:

mutate(dataframeName, newVariable=formula)

transmute(dataframeName, newVariable=formula)

  • dataframeName: The input data frame.
  • newVariable: The name of the new column to create.
  • formula: The expression used to compute the values of the new column.

Example:

We add a new column avg as runs divided by 4 using both mutate() and transmute().
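
A minimal sketch of the two calls side by side, assuming an illustrative stats data frame (the values are not from the article):

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# mutate(): adds avg and keeps player, runs, and wickets
with_avg <- mutate(stats, avg = runs / 4)
print(with_avg)

# transmute(): returns only the newly computed avg column
only_avg <- transmute(stats, avg = runs / 4)
print(only_avg)
```

Here with_avg has four columns and only_avg has one. Recent dplyr releases mark transmute() as superseded in favor of mutate(..., .keep = "none"), so the mutate() form may be the safer choice in new code.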

7. summarize() Method

We use the summarize() method to reduce a set of values to a single summary value using aggregation functions like sum(), mean(), etc.

Syntax:

summarize(dataframeName, summaryName = aggregate_function(columnName))

  • dataframeName: The input data frame.
  • summaryName: The name of the output summary column.
  • aggregate_function(columnName): A function such as sum() or mean(), applied to a column.

Example:

We summarize the runs column to get its total and average.

  • sum(runs): Calculates the total runs.
  • mean(runs): Calculates the average runs.
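
This example could be sketched as below. The stats data frame and the summary column names total_runs and avg_runs are assumptions chosen for illustration.

```r
library(dplyr)

# Illustrative data frame (values are assumptions made for this sketch)
stats <- data.frame(
  player  = c("A", "B", "C", "D"),
  runs    = c(100, 200, 408, 19),
  wickets = c(17, 20, NA, 5)
)

# Collapse the whole data frame into one row of summary values
totals <- summarize(
  stats,
  total_runs = sum(runs),   # 100 + 200 + 408 + 19 = 727
  avg_runs   = mean(runs)   # 727 / 4 = 181.75
)
print(totals)
```

The result is a one-row data frame. Applying the same functions to wickets would return NA because that column contains a missing value; passing na.rm = TRUE to sum() or mean() skips it.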
