Tidy data is a way of arranging data systematically and consistently so that it is easier to work with and analyze using tools such as R. It is a central part of Hadley Wickham's approach to data science, which he popularized through the "tidyverse," a set of R packages for data manipulation, visualization, and analysis. This introduction covers the basics of tidy data in R and why it matters for good data analysis.
Tidy data is a concept popularized by Hadley Wickham, the creator of the ggplot2 and dplyr packages in the R Programming Language. It's an approach to structuring and organizing data in a consistent and standardized manner to simplify data manipulation, analysis, and visualization.
Tidy data follows three principles: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Datasets that adhere to these principles are structured in a way that simplifies analysis and visualization, and they work directly with functions from R packages like dplyr, ggplot2, tidyr, and others for data manipulation and exploration.
The term "tidy data" refers to a specific format or organization of data, while "normal data" is a more general term and does not refer to any specific data format. Let's clarify the differences between these two concepts.
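Consider a dataset of daily sales for two stores, stored in a wide format. The columns ProductA, ProductB, and ProductC each contain the sales values for one product, so the product name is encoded in the column header rather than stored as data.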
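A minimal sketch of how such a wide data frame might be built in base R follows. The name `sales_wide` is an assumption chosen for illustration, and the dates are kept as character strings.

```r
# Wide ("normal") sales data: one column per product.
# `sales_wide` is a hypothetical name used only for this sketch.
sales_wide <- data.frame(
  Date = c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"),
  Store = c("Store1", "Store2", "Store1", "Store2"),
  ProductA = c(50, 55, 60, 65),
  ProductB = c(30, 20, 25, 35),
  ProductC = c(20, 45, 35, 50),
  stringsAsFactors = FALSE  # keep Date and Store as character columns
)

print(sales_wide)
```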
Output:
Date Store ProductA ProductB ProductC
1 2024-01-01 Store1 50 30 20
2 2024-01-01 Store2 55 20 45
3 2024-01-02 Store1 60 25 35
4 2024-01-02 Store2 65 35 50
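The data is untidy because the product names, which are values of a single variable (the product), are spread across the column headers ProductA, ProductB, and ProductC. As a result, each row holds three sales observations instead of one, which conflicts with the principles that "each variable forms a column" and each observation forms a row.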
In tidy data, each variable is stored in its own column, and each row represents a single observation or data point. To transform the normal data into tidy data, we can use the tidyr package.
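One way to do this reshaping is with `pivot_longer()` from tidyr, sketched below. The object names `sales_wide` and `sales_tidy` are assumptions for illustration; the column names `Product` and `Sales` are chosen to match the printed result.

```r
library(tidyr)

# Wide sales data, one column per product (hypothetical object name).
sales_wide <- data.frame(
  Date = c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"),
  Store = c("Store1", "Store2", "Store1", "Store2"),
  ProductA = c(50, 55, 60, 65),
  ProductB = c(30, 20, 25, 35),
  ProductC = c(20, 45, 35, 50),
  stringsAsFactors = FALSE
)

# Gather the three product columns into a Product/Sales pair of columns.
sales_tidy <- pivot_longer(
  sales_wide,
  cols = c(ProductA, ProductB, ProductC),
  names_to = "Product",   # former column headers become values here
  values_to = "Sales"     # former cell values go here
)

print(sales_tidy)
```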
Output:
# A tibble: 12 × 4
Date Store Product Sales
<chr> <chr> <chr> <dbl>
1 2024-01-01 Store1 ProductA 50
2 2024-01-01 Store1 ProductB 30
3 2024-01-01 Store1 ProductC 20
4 2024-01-01 Store2 ProductA 55
5 2024-01-01 Store2 ProductB 20
6 2024-01-01 Store2 ProductC 45
7 2024-01-02 Store1 ProductA 60
8 2024-01-02 Store1 ProductB 25
9 2024-01-02 Store1 ProductC 35
10 2024-01-02 Store2 ProductA 65
11 2024-01-02 Store2 ProductB 35
12 2024-01-02 Store2 ProductC 50
Now the product names are values in a single Product column, and the sales figures are in the Sales column, so each row is one observation: the sales of one product at one store on one date. This tidy format is easier to manipulate, analyze, and visualize using tools like dplyr and ggplot2.
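As an illustration of why the long format is convenient, the sketch below totals sales per product with dplyr. The object names (`sales_wide`, `sales_tidy`, `sales_by_product`) and the `TotalSales` column are assumptions for illustration, not part of the original example.

```r
library(dplyr)
library(tidyr)

# Rebuild the tidy sales data so the example is self-contained.
sales_wide <- data.frame(
  Date = c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"),
  Store = c("Store1", "Store2", "Store1", "Store2"),
  ProductA = c(50, 55, 60, 65),
  ProductB = c(30, 20, 25, 35),
  ProductC = c(20, 45, 35, 50),
  stringsAsFactors = FALSE
)
sales_tidy <- pivot_longer(
  sales_wide,
  cols = c(ProductA, ProductB, ProductC),
  names_to = "Product",
  values_to = "Sales"
)

# Because Product is a column, grouping by it is a single step.
sales_by_product <- sales_tidy %>%
  group_by(Product) %>%
  summarise(TotalSales = sum(Sales))

print(sales_by_product)
# ProductA: 230, ProductB: 110, ProductC: 150
```

With the wide layout, the same totals would require naming each product column separately, and the code would have to change whenever a product was added.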
Consider a dataset related to students’ test scores across multiple subjects (Math, Science, and English) over different semesters. The data is structured in a wide format:
In this normal (untidy) data representation, the subjects appear as separate columns, and each row holds all three scores for one student in one semester.
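A minimal sketch of how this wide table might be created in base R is shown here. The name `scores_wide` is an assumption chosen for illustration.

```r
# Wide ("normal") score data: one column per subject.
# `scores_wide` is a hypothetical name used only for this sketch.
scores_wide <- data.frame(
  Student = c("Alice", "Alice", "Bob", "Bob"),
  Semester = c("Fall", "Spring", "Fall", "Spring"),
  Math = c(85, 89, 78, 82),
  Science = c(90, 94, 85, 88),
  English = c(88, 92, 80, 84),
  stringsAsFactors = FALSE  # keep Student and Semester as character columns
)

print(scores_wide)
```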
Output:
Student Semester Math Science English
1 Alice Fall 85 90 88
2 Alice Spring 89 94 92
3 Bob Fall 78 85 80
4 Bob Spring 82 88 84
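Math, Science, and English are values of a single variable (the subject), yet each appears as a column header, so every row contains three observations. This conflicts with the tidy data principles that each variable forms a column and each observation forms a row.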
In the tidy version, the subject is stored in a single Subject column and the score in a Score column, and each row represents one observation (i.e., a single student’s score for a particular subject in a given semester).
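The reshaping can be sketched with `pivot_longer()` from tidyr. The object names `scores_wide` and `scores_tidy` are assumptions for illustration; `Subject` and `Score` are chosen to match the printed result.

```r
library(tidyr)

# Wide score data, one column per subject (hypothetical object name).
scores_wide <- data.frame(
  Student = c("Alice", "Alice", "Bob", "Bob"),
  Semester = c("Fall", "Spring", "Fall", "Spring"),
  Math = c(85, 89, 78, 82),
  Science = c(90, 94, 85, 88),
  English = c(88, 92, 80, 84),
  stringsAsFactors = FALSE
)

# Move the subject names into a Subject column and the scores into Score.
scores_tidy <- pivot_longer(
  scores_wide,
  cols = c(Math, Science, English),
  names_to = "Subject",
  values_to = "Score"
)

print(scores_tidy)
```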
Output:
# A tibble: 12 × 4
Student Semester Subject Score
<chr> <chr> <chr> <dbl>
1 Alice Fall Math 85
2 Alice Fall Science 90
3 Alice Fall English 88
4 Alice Spring Math 89
5 Alice Spring Science 94
6 Alice Spring English 92
7 Bob Fall Math 78
8 Bob Fall Science 85
9 Bob Fall English 80
10 Bob Spring Math 82
11 Bob Spring Science 88
12 Bob Spring English 84
This example shows the transformation of untidy (wide) data into tidy (long) data. By following the tidy data principles, the data is now structured in a way that simplifies data manipulation and analysis. Each variable (Student, Semester, Subject, and Score) has its own column, and each observation (a student’s score for a specific subject in a given semester) is stored in a separate row.
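To illustrate how the long format feeds into visualization, the sketch below maps each column of the tidy scores to a plot aesthetic with ggplot2. The object names and the choice of a dodged bar chart faceted by semester are assumptions for illustration only.

```r
library(ggplot2)
library(tidyr)

# Rebuild the tidy score data so the example is self-contained.
scores_wide <- data.frame(
  Student = c("Alice", "Alice", "Bob", "Bob"),
  Semester = c("Fall", "Spring", "Fall", "Spring"),
  Math = c(85, 89, 78, 82),
  Science = c(90, 94, 85, 88),
  English = c(88, 92, 80, 84),
  stringsAsFactors = FALSE
)
scores_tidy <- pivot_longer(
  scores_wide,
  cols = c(Math, Science, English),
  names_to = "Subject",
  values_to = "Score"
)

# Each variable is a column, so each can be mapped to one aesthetic.
score_plot <- ggplot(scores_tidy, aes(x = Subject, y = Score, fill = Student)) +
  geom_col(position = "dodge") +  # side-by-side bars per student
  facet_wrap(~ Semester)          # one panel per semester

print(score_plot)
```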
The key difference between tidy data and normal data lies in their organization and adherence to specific principles. Tidy data is structured according to principles that facilitate data analysis, while normal data can take on various formats that may require additional effort to prepare for analysis. Tidy data is particularly useful when working with tools and packages designed to operate on well-structured data, such as those in the tidyverse ecosystem in R.