In the R programming language, mutate() is a function from the dplyr package that creates, modifies, and deletes columns in a dataset. It is most often used to add columns that are computed from existing variables.
mutate(x, expr)
Parameters:
x: data frame
expr: operation on variables, usually written as new_column = expression
Here we create a simple dataset and apply a basic mutate() operation to see how it works. The dataset has a column of values, and mutate() adds a new column holding those values squared.
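The step described above can be sketched as follows. This is a minimal sketch, not the article's original code: the column names `ID`, `Value`, and `Value_Squared` are assumptions, since the output does not show the column headers.

```r
library(dplyr)

# Column names are illustrative; the values match the article's output.
data <- data.frame(ID = 1:5, Value = c(10, 15, 20, 25, 30))

# mutate() returns a new data frame with the extra column;
# the original data frame is left unchanged.
mutated <- data %>%
  mutate(Value_Squared = Value^2)

print("Original Dataset")
print(data)
print("Mutated Dataset")
print(mutated)
```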
Output:
[1] Original Dataset
1 1 10
2 2 15
3 3 20
4 4 25
5 5 30
[1] Mutated Dataset
1 1 10 100
2 2 15 225
3 3 20 400
4 4 25 625
5 5 30 900
Before looking at conditional mutate in R, it helps to review the relational operators R provides.
Operator | is TRUE if |
|---|---|
A < B | A is less than B |
A <= B | A is less than or equal to B |
A > B | A is greater than B |
A >= B | A is greater than or equal to B |
A == B | A is equal to B |
A != B | A is not equal to B |
A %in% B | A is an element of B |
With mutate(), we can create and modify columns by applying conditions to the columns of a dataset. Conditional mutate in R can be done in two ways: with case_when() and with ifelse().
case_when() is a dplyr function used inside mutate() to create or modify a column based on conditions, for example to categorize values or to flag values for removal. It has a simple syntax:
case_when( X ~ Y)
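Parameters:
X: condition to be applied
~: tilde, which separates the condition from the value
Y: value to be set when the condition is TRUE
Here X is the condition applied to the dataset, '~' is the tilde separator, and Y on its right is the value placed in the column for rows where X is TRUE. Several X ~ Y pairs can be given, separated by commas.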
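As a minimal sketch of this syntax, case_when() can be called on a plain vector outside mutate(). The vector `x` and the labels "small" and "large" are made up for illustration.

```r
library(dplyr)

x <- c(5, 20, NA)

# Each X ~ Y pair is checked in order; elements that match
# no condition (here the NA) get NA in the result.
size <- case_when(
  x < 10  ~ "small",
  x >= 10 ~ "large"
)

print(size)
```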
Let's look at case_when() in detail with some examples.
Here we create a simple dataset for the conditional mutate examples. It contains the ID, Name, Age, Gender, and Education of 10 people, male and female, and the Age column includes some NA values. These missing values were added deliberately to show how to handle them with mutate().
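One way to build this dataset is sketched below. The values are copied from the output table; the use of tibble() and the variable name `data` are assumptions about how the original code was written.

```r
library(dplyr)  # dplyr re-exports tibble()

data <- tibble(
  ID = 1:10,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva",
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  # NA marks the deliberately missing ages
  Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33),
  Gender = c("Female", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Male"),
  Education = c("Bachelor's", "High School", "Bachelor's", "PhD", "Master's",
                "High School", "PhD", "Master's", "Bachelor's", "PhD")
)

print(data)
```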
Output:
A tibble: 10 × 5
ID Name Age Gender Education
<int><chr><dbl><chr><chr>
1 1 Alice 25 Female Bachelor's
2 2 Bob 18 Male High School
3 3 Charlie 22 Male Bachelor's
4 4 David NA Male PhD
5 5 Eva 35 Female Master's
6 6 Frank 16 Male High School
7 7 Grace 24 Female PhD
8 8 Hank NA Male Master's
9 9 Ivy 27 Female Bachelor's
10 10 Jack 33 Male PhD
We select the Age column from the dataset with select() and store it in a separate variable, age_data, which keeps the example easy to follow and leaves the original dataset unchanged. We then create a new column 'Age_Group' with mutate() and case_when(): people aged 18 or below are labeled 'Child' and people above 18 are labeled 'Adult'.
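A sketch of this step, assuming the ages from the dataset above (only the columns needed here are recreated so the snippet runs on its own):

```r
library(dplyr)

# Reduced copy of the dataset: ages copied from the article's table.
data <- tibble(ID = 1:10,
               Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33))

# Work on a separate variable so `data` stays unchanged.
age_data <- data %>% select(Age)

age_data <- age_data %>%
  mutate(Age_Group = case_when(
    Age <= 18 ~ "Child",
    Age > 18  ~ "Adult"
  ))

print(age_data)
```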
Output:
A tibble: 10 × 2
Age Age_Group
<dbl><chr>
1 25 Adult
2 18 Child
3 22 Adult
4 NA NA
5 35 Adult
6 16 Child
7 24 Adult
8 NA NA
9 27 Adult
10 33 Adult
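Here the NA ages produce NA in Age_Group, because a comparison with NA matches neither condition and case_when() returns NA when no condition matches. People aged 18 and below are labeled 'Child' and those above 18 are labeled 'Adult'. We handle NA values in the next sections.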
TRUE can be used as the final condition in case_when() to act as the default case. If all the earlier conditions are FALSE or NA for a row, the TRUE case is applied to it.
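The default case can be sketched like this, assuming the same ages as before and an 'Is_Child' column as shown in the output:

```r
library(dplyr)

age_data <- tibble(Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33))

age_data <- age_data %>%
  mutate(Is_Child = case_when(
    Age <= 18 ~ "Child",
    # TRUE matches every row not matched above, including NA ages
    TRUE      ~ "Not Child"
  ))

print(age_data)
```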
Output:
A tibble: 10 × 3
Age Age_Group Is_Child
<dbl><chr><chr>
1 25 Adult Not Child
2 18 Child Child
3 22 Adult Not Child
4 NA NA Not Child
5 35 Adult Not Child
6 16 Child Child
7 24 Adult Not Child
8 NA NA Not Child
9 27 Adult Not Child
10 33 Adult Not Child
Here we used the TRUE case. People aged 18 or below are labeled 'Child'. Everyone else, including the rows where Age is NA, falls through to the TRUE case and is labeled 'Not Child'.
We can add a separate condition for NA values in case_when() using the is.na() function. Here we create a new column 'New_Age_Group' based on three conditions: people aged 18 or below are labeled 'Child', those above 18 are labeled 'Adult', and rows with NA are labeled 'Age Missing'.
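A sketch of the three conditions, again recreating only the Age column so the snippet is self-contained:

```r
library(dplyr)

age_data <- tibble(Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33))

age_data <- age_data %>%
  mutate(New_Age_Group = case_when(
    is.na(Age) ~ "Age Missing",  # explicit label for missing ages
    Age <= 18  ~ "Child",
    Age > 18   ~ "Adult"
  ))

print(age_data)
```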
Output:
A tibble: 10 × 4
Age Age_Group Is_Child New_Age_Group
<dbl><chr><chr><chr>
1 25 Adult Not Child Adult
2 18 Child Not Child Child
3 22 Adult Not Child Adult
4 NA NA Not Child Age Missing
5 35 Adult Not Child Adult
6 16 Child Not Child Child
7 24 Adult Not Child Adult
8 NA NA Not Child Age Missing
9 27 Adult Not Child Adult
10 33 Adult Not Child Adult
Here you can see that the NA rows are labeled 'Age Missing' and the remaining rows are labeled according to the age conditions.
We can keep a column's existing values and change only specific entries by using the TRUE case. Here we create a new column 'Education_Level' from the Education column with case_when(), labeling Master's and PhD as 'Post Graduate'. This first version omits the TRUE case, so the remaining values are not carried over.
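A sketch of the version without a TRUE case, using only the Name and Education columns from the dataset. The use of %in% is an assumption; two == comparisons joined with | would behave the same here.

```r
library(dplyr)

data <- tibble(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva",
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  Education = c("Bachelor's", "High School", "Bachelor's", "PhD", "Master's",
                "High School", "PhD", "Master's", "Bachelor's", "PhD")
)

# No TRUE case: rows that match no condition become NA.
data <- data %>%
  mutate(Education_Level = case_when(
    Education %in% c("Master's", "PhD") ~ "Post Graduate"
  ))

print(data)
```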
Output:
A tibble: 10 × 6
ID Name Age Gender Education Education_Level
<int><chr><dbl><chr><chr><chr>
1 1 Alice 25 Female Bachelor's NA
2 2 Bob 18 Male High School NA
3 3 Charlie 22 Male Bachelor's NA
4 4 David NA Male PhD Post Graduate
5 5 Eva 35 Female Master's Post Graduate
6 6 Frank 16 Male High School NA
7 7 Grace 24 Female PhD Post Graduate
8 8 Hank NA Male Master's Post Graduate
9 9 Ivy 27 Female Bachelor's NA
10 10 Jack 33 Male PhD Post Graduate
In the above example, both Master's and PhD were categorized as 'Post Graduate', while the remaining values became NA because no TRUE case was supplied. Adding TRUE ~ Education as the last case keeps the original Education value for those rows.
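The same step with a TRUE case can be sketched as follows, again on a reduced copy of the dataset:

```r
library(dplyr)

data <- tibble(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva",
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  Education = c("Bachelor's", "High School", "Bachelor's", "PhD", "Master's",
                "High School", "PhD", "Master's", "Bachelor's", "PhD")
)

data <- data %>%
  mutate(Education_Level = case_when(
    Education %in% c("Master's", "PhD") ~ "Post Graduate",
    # Fall back to the existing value for every other row
    TRUE ~ Education
  ))

print(data)
```

Both branches return character values, which matters because case_when() requires all right-hand sides to have compatible types.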
Output:
A tibble: 10 × 6
ID Name Age Gender Education Education_Level
<int><chr><dbl><chr><chr><chr>
1 1 Alice 25 Female Bachelor's Bachelor's
2 2 Bob 18 Male High School High School
3 3 Charlie 22 Male Bachelor's Bachelor's
4 4 David NA Male PhD Post Graduate
5 5 Eva 35 Female Master's Post Graduate
6 6 Frank 16 Male High School High School
7 7 Grace 24 Female PhD Post Graduate
8 8 Hank NA Male Master's Post Graduate
9 9 Ivy 27 Female Bachelor's Bachelor's
10 10 Jack 33 Male PhD Post Graduate
Here you can see that all the remaining rows keep their original values from the Education column.
Here we apply conditions that involve more than one column in a single case_when() call. The conditions use the 'Education' and 'Gender' variables: males with a Master's or PhD are labeled 'Recruit to Male Category', females with a Master's or PhD are labeled 'Recruit to Female Category', and the TRUE case sets everyone else to 'Not Recruited'.
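A sketch of these combined conditions, using only the columns involved. Joining the two tests with & is one plausible way to write it, not necessarily the original code.

```r
library(dplyr)

data <- tibble(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva",
           "Frank", "Grace", "Hank", "Ivy", "Jack"),
  Gender = c("Female", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Male"),
  Education = c("Bachelor's", "High School", "Bachelor's", "PhD", "Master's",
                "High School", "PhD", "Master's", "Bachelor's", "PhD")
)

postgrad <- c("Master's", "PhD")

data <- data %>%
  mutate(Recruitment_Category = case_when(
    Education %in% postgrad & Gender == "Male"   ~ "Recruit to Male Category",
    Education %in% postgrad & Gender == "Female" ~ "Recruit to Female Category",
    TRUE                                         ~ "Not Recruited"
  ))

print(data)
```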
Output:
A tibble: 10 × 7
ID Name Age Gender Education Education_Level Recruitment_Category
<int><chr><dbl><chr><chr><chr><chr>
1 1 Alice 25 Female Bachelor's Bachelor's Not Recruited
2 2 Bob 18 Male High School High School Not Recruited
3 3 Charlie 22 Male Bachelor's Bachelor's Not Recruited
4 4 David NA Male PhD Post Graduate Recruit to Male Category
5 5 Eva 35 Female Master's Post Graduate Recruit to Female Category
6 6 Frank 16 Male High School High School Not Recruited
7 7 Grace 24 Female PhD Post Graduate Recruit to Female Category
8 8 Hank NA Male Master's Post Graduate Recruit to Male Category
9 9 Ivy 27 Female Bachelor's Bachelor's Not Recruited
10 10 Jack 33 Male PhD Post Graduate Recruit to Male Category
In case_when(), the order of the conditions is crucial, because each row takes the value of the first condition it matches. To illustrate, consider creating the column 'New_Age_Group' from the 'Age' column with the following order: ages 18 or below are labeled 'Child', ages below 30 'Young Adult', ages below 100 'Older Adult', and missing values 'Age Missing'.
The conditions are ordered from the narrowest range to the broadest, so each row falls into the first category that fits.
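The ordering described above can be sketched like this, with the Age column recreated so the snippet runs on its own:

```r
library(dplyr)

age_data <- tibble(Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33))

# Narrowest range first: a row stops at the first condition it matches.
age_data <- age_data %>%
  mutate(New_Age_Group = case_when(
    Age <= 18  ~ "Child",
    Age < 30   ~ "Young Adult",
    Age < 100  ~ "Older Adult",
    is.na(Age) ~ "Age Missing"
  ))

print(age_data)
```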
Output:
A tibble: 10 × 4
Age Age_Group Is_Child New_Age_Group
<dbl><chr><chr><chr>
1 25 Adult Not Child Young Adult
2 18 Child Not Child Child
3 22 Adult Not Child Young Adult
4 NA NA Not Child Age Missing
5 35 Adult Not Child Older Adult
6 16 Child Not Child Child
7 24 Adult Not Child Young Adult
8 NA NA Not Child Age Missing
9 27 Adult Not Child Young Adult
10 33 Adult Not Child Older Adult
If we change the order and place the 'Age less than 100' condition at the top, the output changes significantly. Because that condition now has the highest priority and every non-missing age in the dataset is below 100, all values except the NA rows are categorized as 'Older Adult', and the 'Child' and 'Young Adult' conditions are never reached. To avoid this, list the narrower conditions before the broader ones.
Note: The TRUE case should always come last, because any condition placed after it is never reached.
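The faulty ordering can be sketched as follows. It uses the same ages and labels as the earlier examples, with only the position of the broadest condition changed.

```r
library(dplyr)

age_data <- tibble(Age = c(25, 18, 22, NA, 35, 16, 24, NA, 27, 33))

# Broadest condition first: it matches every non-missing age,
# so the narrower conditions below it are never used.
age_data <- age_data %>%
  mutate(New_Age_Group = case_when(
    Age < 100  ~ "Older Adult",
    Age <= 18  ~ "Child",
    Age < 30   ~ "Young Adult",
    is.na(Age) ~ "Age Missing"
  ))

print(age_data)
```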
ifelse() is similar to case_when(), but it takes a single condition together with an explicit value for the FALSE case. It is used inside mutate() to create and modify columns based on that condition: where the condition is TRUE the result is set to one value, otherwise it is set to another.
ifelse(Con, X, Y)
Parameters:
Con: Condition
X: value to be returned if condition is TRUE
Y: value to be returned if condition is FALSE
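Here, ifelse() creates a new column from a condition. For example, for an 'Army_Eligibility' column, individuals whose height is greater than 165 are marked as eligible for the army and everyone else as not eligible. The output below shows the same approach applied to the running dataset, which has no height column: it adds a 'New_Education' column in which rows with a missing Age are labeled 'Age Missing', 'High School' is kept as it is, and the other education levels are grouped together.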
Output:
A tibble: 10 × 8
ID Name Age Gender Education Education_Level Recruitment_Category New_Education
<int><chr><dbl><chr><chr><chr><chr><chr>
1 1 Alice 25 Female Bachelor's Bachelor's Not Recruited College or H…
2 2 Bob 18 Male High School High School Not Recruited High School
3 3 Charlie 22 Male Bachelor's Bachelor's Not Recruited College or H…
4 4 David NA Male PhD Post Graduate Recruit to Male Category Age Missing
5 5 Eva 35 Female Master's Post Graduate Recruit to Female Catego… College or H…
6 6 Frank 16 Male High School High School Not Recruited High School
7 7 Grace 24 Female PhD Post Graduate Recruit to Female Catego… College or H…
8 8 Hank NA Male Master's Post Graduate Recruit to Male Category Age Missing
9 9 Ivy 27 Female Bachelor's Bachelor's Not Recruited College or H…
10 10 Jack 33 Male PhD Post Graduate Recruit to Male Category College or H
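The height-based eligibility rule mentioned above can be sketched on a small made-up table. The `people` tibble, its `Height` values, and the labels "Eligible" and "Not Eligible" are all hypothetical, since the article's dataset has no height column.

```r
library(dplyr)

# Hypothetical data for illustration only.
people <- tibble(
  Name   = c("A", "B", "C"),
  Height = c(170, 160, NA)
)

people <- people %>%
  mutate(Army_Eligibility = ifelse(Height > 165, "Eligible", "Not Eligible"))

print(people)
```

Note that ifelse() returns NA where the condition itself is NA, so a missing height is neither eligible nor not eligible unless it is handled separately, for example with is.na().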
In conclusion, conditional mutate in R relies on two functions, case_when() and ifelse(), which create and modify columns based on the conditions provided. case_when() sets a value only for the rows where one of its conditions is TRUE, while ifelse() takes a single condition and also specifies the value for the FALSE case. Both can be used inside mutate(), conditions can involve one variable or several, and in case_when() the order of the conditions determines which one applies. The TRUE case should always be the last condition.