R is an open-source programming language used to build statistical software and data analysis tools. It is an important tool for Data Science. It is highly popular and is the first choice of many statisticians and data scientists.
In R, we use the <- operator to assign values to variables, though = is also commonly used. You can also add comments in your code to explain what’s happening, using the # symbol. It’s great practice to comment your code so that it’s easier to understand later.
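As a minimal sketch of assignment and comments: the values of x and y below are assumptions, chosen so that the printed messages match the output shown in this section.

```r
# Assign values with <- (the conventional assignment operator)
x <- 5
y <- 3  # writing y = 3 would also work here

# Build the messages first so they can be printed or reused
sum_msg <- paste("Sum of x and y:", x + y)
prod_msg <- paste("Product of x and y:", x * y)

print(sum_msg)
print(prod_msg)
```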
[1] "Sum of x and y: 8"
[1] "Product of x and y: 15"
In R, data is stored in various structures, such as vectors, matrices, lists and data frames. Let’s break each one down.
1. Vectors: Vectors are like simple arrays that hold multiple values of the same type. You can create a vector using the c() function:
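A short sketch of creating a vector; the variable name numbers is only illustrative.

```r
# c() combines values into a vector; every element shares one type
numbers <- c(1, 2, 3, 4, 5)
print(numbers)
```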
[1] 1 2 3 4 5
2. Matrices: Matrices are two-dimensional arrays where each element has the same data type. You create a matrix using the matrix() function:
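One way to build a 3 x 3 matrix is sketched here; the name m is an assumption. By default, matrix() fills values column by column.

```r
# Fill the values 1..9 into 3 rows and 3 columns (column-major order)
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
```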
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
3. Lists: Lists can contain elements of different types, including numbers, strings, vectors and another list inside it. Lists are created using the list() function:
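A minimal sketch of a list that mixes types; the name my_list and its contents are illustrative.

```r
# A list can hold a string, a number, a logical and a vector side by side
my_list <- list("Red", 20, TRUE, 1:5)
print(my_list)

# Double brackets extract a single element
first_item <- my_list[[1]]
```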
[[1]]
[1] "Red"
[[2]]
[1] 20
[[3]]
[1] TRUE
[[4]]
[1] 1 2 3 4 5
4. Data Frames: Data frames are the most commonly used data structure in R. They’re like tables, where each column can contain different data types. Use data.frame() to create one:
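A small data frame can be sketched as follows; the name df is an assumption.

```r
# Each argument becomes a column; columns may have different types
df <- data.frame(
  Name = c("Alice", "Bob"),
  Age  = c(24, 28)
)
print(df)
```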
Name Age
1 Alice 24
2 Bob 28
These foundational concepts are a great starting point for your journey into data science. To dive deeper, consider exploring the following tutorial: R Programming Tutorial
In R, data science work relies on several libraries for tasks such as data manipulation, statistical modeling, visualization and machine learning. The key libraries include:
R Libraries are effective for data manipulation, enabling analysts to clean, transform and summarize datasets efficiently.
The dplyr package provides a set of functions that make it easy to manipulate data frames in a clean and readable manner. Some of the key functions in dplyr include:
Let's perform data manipulation with these functions on a sample dataset:
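The sketch below assumes the dplyr package is installed. The sample data frame is reconstructed from the output shown in this section, and it uses filter() and select() because those are the two operations that output reflects.

```r
library(dplyr)  # assumes dplyr is installed

# Sample dataset (values reconstructed from the printed output)
data <- data.frame(
  Name   = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age    = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

# filter() keeps the rows that satisfy a condition
filtered_data <- filter(data, Age > 25)
print("Filtered Data (Age > 25):")
print(filtered_data)

# select() keeps only the named columns
selected_data <- select(data, Name, Salary)
print("Selected Data (Name and Salary columns):")
print(selected_data)
```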
Output:
[1] "Filtered Data (Age > 25):"
Name Age Salary
1 Bob 28 60000
2 Charlie 35 70000
3 David 40 80000
[1] "Selected Data (Name and Salary columns):"
Name Salary
1 Alice 50000
2 Bob 60000
3 Charlie 70000
4 David 80000
5 Eve 45000
Data cleaning involves correcting or removing errors and transforming data into a usable format. Key transformations include:
Now, we will be using the previous dataset to perform data transformation:
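A possible version of this transformation is sketched below, assuming dplyr is installed. The column Salary_per_year keeps the name used in the printed output, although dividing by 12 (which is what reproduces those figures) actually gives a monthly amount.

```r
library(dplyr)  # assumes dplyr is installed

data <- data.frame(
  Name   = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age    = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

transformed <- data %>%
  # rename() uses new_name = old_name
  rename(Employee_Name = Name, Employee_Age = Age) %>%
  # mutate() adds a derived column
  mutate(Salary_per_year = Salary / 12)

print("Renamed Data (Name to Employee_Name, Age to Employee_Age):")
print(transformed)
```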
Output:
[1] "Renamed Data (Name to Employee_Name, Age to Employee_Age):"
Employee_Name Employee_Age Salary Salary_per_year
1 Alice 24 50000 4166.667
2 Bob 28 60000 5000.000
3 Charlie 35 70000 5833.333
4 David 40 80000 6666.667
5 Eve 22 45000 3750.000
Dealing with missing values is an essential part of data preparation. R provides several functions to identify, handle and replace missing values in datasets. Key functions include:
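As a sketch, is.na() from base R flags missing entries and fill() from the tidyr package (assumed to be installed) carries the last observed value forward. The data frame data_na is an assumption, reconstructed from the output shown in this section.

```r
library(tidyr)  # assumes tidyr is installed

# Dataset with one missing Name, one missing Age and one missing Salary
data_na <- data.frame(
  Name   = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age    = c(24, 28, 35, NA, 22),
  Salary = c(50000, NA, 70000, 80000, 45000)
)

# is.na() returns a logical matrix marking each missing cell
print("Identifying Missing Values:")
print(is.na(data_na))

# fill() replaces NA in Age with the previous non-missing value
filled <- fill(data_na, Age, .direction = "down")
print("Data After Filling Missing Values in Age (Downward Direction):")
print(filled)
```

Only the Age column is filled here, so the missing Name and Salary values remain NA.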
Output:
[1] "Identifying Missing Values:"
Name Age Salary
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE FALSE
[4,] TRUE TRUE FALSE
[5,] FALSE FALSE FALSE
[1] "Data After Filling Missing Values in Age (Downward Direction):"
Name Age Salary
1 Alice 24 50000
2 Bob 28 NA
3 Charlie 35 70000
4 <NA> 35 80000
5 Eve 22 45000
R provides tools for performing both descriptive and inferential statistical analysis, making it a preferred choice for statisticians and data scientists.
Descriptive statistics provide a summary of the data's key characteristics using measures like mean, median, variance and standard deviation.
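A minimal sketch using base R functions; the vector values is an assumption, chosen so that the mean, median and sum agree with the output shown in this section.

```r
values <- c(10, 20, 30, 40, 50)

# Central tendency and total
print(paste("Mean:", mean(values)))
print(paste("Median:", median(values)))
print(paste("Sum:", sum(values)))

# Spread: sample variance and standard deviation
values_var <- var(values)
values_sd <- sd(values)
```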
[1] "Mean: 30"
[1] "Median: 30"
[1] "Sum: 150"
Inferential statistics allow you to make predictions or generalizations about a population based on sample data.
1. Hypothesis Testing
Hypothesis Testing evaluates assumptions (hypotheses) about population parameters. In R, common hypothesis tests include:
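The following sketch runs a two-sample t-test and a chi-square test with base R. The names group1, group2 and data_chisq come from the output shown in this section, but the values are assumptions: the two groups were chosen to be consistent with the reported means and t statistic, and the contingency table is a hypothetical one with perfectly proportional counts.

```r
# Two independent samples (assumed values)
group1 <- c(1, 2, 3, 4, 5)
group2 <- c(6, 7, 8, 9, 10)

# t.test() performs Welch's two-sample t-test by default
t_result <- t.test(group1, group2)
print("T-test Result:")
print(t_result)

# Hypothetical 2 x 2 contingency table with proportional rows,
# so the observed counts equal the expected counts
data_chisq <- matrix(c(10, 20, 20, 40), nrow = 2)
chisq_result <- chisq.test(data_chisq)
print("Chi-Square Test Result:")
print(chisq_result)
```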
Output:
[1] "T-test Result:"
Welch Two Sample t-test
data: group1 and group2
t = -5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.306004 -2.693996
sample estimates:
mean of x mean of y
3 8
[1] "Chi-Square Test Result:"
Pearson's Chi-squared test
data: data_chisq
X-squared = 0, df = 1, p-value = 1
2. Correlation and Regression Analysis
Correlation and regression analysis explore relationships between variables:
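A small sketch with cor() and lm() from base R; the vectors x and y are assumptions, built so that y falls exactly as x rises and the correlation is -1.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(10, 8, 6, 4, 2)  # y = 12 - 2 * x, a perfect negative relationship

# Pearson correlation coefficient
correlation <- cor(x, y)
print("Correlation Between x and y:")
print(correlation)

# Simple linear regression of y on x
model <- lm(y ~ x)
print(coef(model))
```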
Output:
[1] "Correlation Between x and y:"
[1] -1
Machine learning in R enables analysts to build predictive models, perform classification and uncover patterns in data.
1. Linear Regression: Linear regression is used for predicting continuous numeric outcomes based on one or more predictors. In R, a linear model is fitted with lm().
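The original dataset is not shown, so this sketch uses simulated data and its error value will not match the output printed in this article. The column names, coefficients and the 80/20 split are all assumptions.

```r
set.seed(42)
n <- 100

# Simulated predictors and a target that depends on them linearly
sim <- data.frame(
  predictor1 = rnorm(n, mean = 50, sd = 10),
  predictor2 = rnorm(n, mean = 30, sd = 5)
)
sim$target <- 3 * sim$predictor1 + 2 * sim$predictor2 + rnorm(n, sd = 10)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
lm_model <- lm(target ~ predictor1 + predictor2, data = train)
predictions <- predict(lm_model, newdata = test)

# Mean squared error on the test set
mse <- mean((test$target - predictions)^2)
print(mse)
```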
Output:
197.509197666493
2. Logistic Regression: Logistic regression is used for binary classification tasks where the outcome variable is categorical (e.g., 0 or 1). In R, it is performed using the glm() function with family = binomial.
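A hedged sketch on simulated data (the article's data is not shown, so the accuracy will differ from the output printed here). The variable names, the true coefficients and the 0.5 threshold are assumptions.

```r
set.seed(42)
n <- 200

# Simulate two predictors and a binary outcome
sim <- data.frame(predictor1 = rnorm(n), predictor2 = rnorm(n))
prob <- plogis(1.5 * sim$predictor1 - 1 * sim$predictor2)
sim$outcome <- rbinom(n, size = 1, prob = prob)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# family = binomial makes glm() fit a logistic regression
logit_model <- glm(outcome ~ predictor1 + predictor2,
                   data = train, family = binomial)

# type = "response" returns predicted probabilities
pred_prob <- predict(logit_model, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

accuracy <- mean(pred_class == test$outcome)
print(accuracy)
```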
Output:
0.63
3. Decision Trees: Decision trees are used for both classification and regression tasks. In this example, we perform classification using the rpart() function:
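Since the article's dataset is not shown, the sketch below substitutes the built-in iris dataset, so its accuracy will not match the output printed here. It assumes the rpart package is available (it ships with standard R distributions).

```r
library(rpart)

set.seed(42)

# 80/20 train-test split of the built-in iris data
train_idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# method = "class" requests a classification tree
tree_model <- rpart(Species ~ ., data = train, method = "class")

# type = "class" returns predicted labels instead of probabilities
tree_pred <- predict(tree_model, newdata = test, type = "class")

tree_accuracy <- mean(tree_pred == test$Species)
print(tree_accuracy)
```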
Output:
0.72
4. Random Forest: Random forest is an ensemble learning technique that combines many decision trees for classification and regression. In R, it is available through randomForest().
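A sketch assuming the randomForest package is installed. It uses the built-in iris dataset in place of the article's data, which is not shown, and ntree = 100 is an arbitrary choice.

```r
library(randomForest)  # assumes randomForest is installed

set.seed(42)

# 80/20 train-test split of the built-in iris data
train_idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# A factor response makes randomForest() build a classifier
rf_model <- randomForest(Species ~ ., data = train, ntree = 100)
rf_pred <- predict(rf_model, newdata = test)

rf_accuracy <- mean(rf_pred == test$Species)
cat("Random Forest Accuracy:", rf_accuracy, "\n")
```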
Output:
Random Forest Accuracy: 1
Unsupervised learning involves learning patterns in data without labeled outputs. Common techniques include clustering and dimensionality reduction.
1. K-means Clustering: K-means partitions the data into K clusters based on the distance between data points. In R, the kmeans() function is used to perform clustering.
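A sketch on simulated data: the column names follow the output shown in this section, but the values, the seed and the choice of nstart are assumptions, so the centers and assignments will differ from that output.

```r
set.seed(42)

# Simulated two-column dataset
sim <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5)
)

# centers = 3 asks for three clusters; nstart tries several random starts
km <- kmeans(sim, centers = 3, nstart = 10)

print("Cluster Centers:")
print(km$centers)
print("Cluster Assignments:")
print(km$cluster)
print("Total Within-Cluster Sum of Squares:")
print(km$tot.withinss)
```

Because k-means relies on distances, it is often worth standardizing the columns with scale() first when they are measured on very different ranges.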
Output:
[1] "Cluster Centers:"
predictor1 predictor2
1 62.48318 27.73121
2 51.24186 30.80630
3 41.05266 29.10471
[1] "Cluster Assignments:"
[1] 3 2 1 2 2 1 2 3 3 2 1 2 2 2 3 1 2 3 1 3 3 2 3 3 3 3 1 2 3 1 2 2 1 1 1 2 1
[38] 2 2 3 3 2 3 1 1 3 3 3 2 2 2 2 2 1 2 1 3 2 2 2 2 3 3 3 3 2 2 2 1 1 3 3 1 3
[75] 3 1 2 3 2 2 2 2 3 1 2 2 1 2 2 1 1 2 2 3 1 3 1 1 2 3
[1] "Total Within-Cluster Sum of Squares:"
[1] 3809.048
2. Principal Component Analysis (PCA): PCA transforms the data into a new coordinate system where the axes represent the directions of maximum variance. In R, PCA is performed using the prcomp() function.
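A sketch with three simulated columns (an assumption; the article's data is not shown, so the figures will differ from the output printed here). Centering and scaling put all variables on the same footing before the components are computed.

```r
set.seed(42)

# Three simulated numeric variables (hypothetical names)
sim <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  predictor3 = rnorm(100, mean = 10, sd = 2)
)

# center and scale. standardize each column before the rotation
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# summary() reports the standard deviation and variance share per component
pca_summary <- summary(pca)
print(pca_summary)
```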
Output:
Importance of components:
PC1 PC2 PC3
Standard deviation 1.0726 0.9900 0.9324
Proportion of Variance 0.3835 0.3267 0.2898
Cumulative Proportion 0.3835 0.7102 1.0000
After building a model, it’s essential to evaluate its performance. We can evaluate models using the following metrics:
1. Classification Evaluation Metrics
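The list of metrics is not shown here, so this sketch assumes four common ones: accuracy, precision, recall and F1 score. They are computed from a confusion matrix in base R, and the actual and predicted labels are made-up values for illustration.

```r
# Hypothetical true and predicted labels (1 = positive, 0 = negative)
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["1", "1"]  # true positives
tn <- cm["0", "0"]  # true negatives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```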
2. Regression Evaluation Metrics
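The list of metrics is not shown here either, so this sketch assumes three common ones: RMSE, MAE and R-squared. The actual and predicted values are made-up numbers for illustration.

```r
# Hypothetical observed values and model predictions
actual    <- c(3, 5, 7, 9)
predicted <- c(2.5, 5.5, 7, 10)

errors <- actual - predicted

# Root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean(errors^2))

# Mean absolute error: average size of the errors
mae <- mean(abs(errors))

# R-squared: share of variance in actual explained by the predictions
ss_res <- sum(errors^2)
ss_tot <- sum((actual - mean(actual))^2)
r_squared <- 1 - ss_res / ss_tot

print(c(RMSE = rmse, MAE = mae, R2 = r_squared))
```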
R provides multiple functions for creating, manipulating and analyzing time series data.
For more advanced decomposition, you can use STL (Seasonal and Trend decomposition using Loess), which lets the seasonal pattern change over time and can be made robust to outliers. It is implemented by the stl() function.
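As a sketch of these tools, the example below uses the built-in AirPassengers dataset (an assumption; the article does not name a dataset). It creates a monthly series with ts(), applies classical decomposition with decompose() and then STL with stl(). The log transform and s.window = "periodic" are illustrative choices.

```r
# Build a monthly time series starting in January 1949
ts_data <- ts(as.numeric(AirPassengers), start = c(1949, 1), frequency = 12)

# Classical decomposition into trend, seasonal and random components
decomp <- decompose(ts_data, type = "multiplicative")

# STL is additive, so the series is log-transformed first;
# s.window = "periodic" keeps the seasonal pattern fixed across years
stl_fit <- stl(log(ts_data), s.window = "periodic")

# Columns: seasonal, trend, remainder
print(stl_fit$time.series[1:3, ])
```

Passing a numeric s.window instead of "periodic" would let the seasonal component evolve over time, and robust = TRUE reduces the influence of outliers.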
| Feature | R | Python |
|---|---|---|
| Introduction | R is a language and environment designed for statistical programming, computing and graphics. | Python is a general-purpose programming language used for data analysis and scientific computing. |
| Objective | Focuses on statistical analysis and data visualization. | Supports a wide range of applications, including GUI development, web development and embedded systems. |
| Workability | Offers numerous easy-to-use packages for statistical tasks. | Excels in matrix computation, optimization and general-purpose tasks. |
| Integrated Development Environment (IDE) | Popular IDEs include RStudio, RKWard and R Commander. | Common IDEs are Spyder, Eclipse+PyDev, Atom and more. |
| Libraries and Packages | Includes packages like ggplot2 for visualization and caret for machine learning. | Features libraries like Pandas, NumPy and SciPy for data manipulation and analysis. |
| Scope | Primarily used for complex statistical analysis and data science projects. | Offers a streamlined approach for data science, along with versatility in other domains. |
R is ideal for statistical computing and visualization, while Python provides a more versatile platform for diverse applications, including data science.
To get a detailed overview of R Programming for Data Science, you can refer to: Data Science Tutorial with R