Predicting flight delays matters for both airlines and passengers: it supports better time management and customer satisfaction. Delays can cause significant dissatisfaction among passengers and may even lead them to choose a different airline for future flights. Using machine learning and data analysis, we can estimate and predict flight delays with R, a popular statistical programming language. This article covers how to predict flight delays with the R programming language.
Flight delay prediction involves forecasting whether a flight will be delayed and by how much, based on various factors such as weather conditions, flight schedule, aircraft specifics, and air traffic control constraints. The goal is to build a predictive model that can assist stakeholders in making informed decisions.
Predicting flight delays in R involves the following steps.
Start by installing and loading the required packages, since they simplify data processing, plotting, and modelling.
Make sure you have R and RStudio installed on your PC.
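As a minimal sketch of the setup step, the snippet below checks which packages are still missing before installing anything. The package list is an assumption: the only add-on package the later modelling step clearly needs is `randomForest`, so extend the vector with whatever plotting or data-manipulation packages you prefer.

```r
# Assumed package list for this walkthrough; extend it as needed.
required_pkgs <- c("randomForest")

# Check which of the required packages are already installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Install only what is missing (uncomment to run; needs internet access).
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

print(missing_pkgs)
```

Once everything is installed, each package is loaded with `library()`, for example `library(randomForest)`.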
We use an external dataset from Kaggle that covers the delays of US airline flights departing from New York City (NYC) airports in 2013.
Dataset Link: NYC Flight Data
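Loading the data and printing a summary might look like the sketch below. The file name `flight_data.csv` is hypothetical, so point it at wherever you saved the Kaggle file. The fallback rows are made up and exist only so the snippet runs without the file.

```r
# Hypothetical file name: change this to the path of your downloaded CSV.
csv_path <- "flight_data.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE makes summary() report counts for text columns.
  flights <- read.csv(csv_path, stringsAsFactors = TRUE)
} else {
  # Made-up fallback rows so the sketch still runs without the file.
  flights <- data.frame(
    month     = c(1, 1, 2, 3),
    day       = c(1, 2, 14, 30),
    dep_delay = c(-3, 12, NA, 45),
    arr_delay = c(-10, 8, NA, 50),
    carrier   = factor(c("UA", "B6", "DL", "UA")),
    origin    = factor(c("EWR", "JFK", "LGA", "EWR")),
    distance  = c(1400, 1005, 544, 2475)
  )
}

str(flights)      # column names and types
summary(flights)  # per-column summary statistics and NA counts
```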
Output:
      year          month             day           dep_time      sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   2.0   Min.   : 600
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 856.8   1st Qu.: 902
 Median :2013   Median : 7.000   Median :16.00   Median :1401.5   Median :1359
 Mean   :2013   Mean   : 6.715   Mean   :16.06   Mean   :1340.3   Mean   :1350
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1745.0   3rd Qu.:1732
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2344.0   Max.   :2359
                                                 NA's   :9

   dep_delay         arr_time     sched_arr_time    arr_delay         carrier
 Min.   :-14.00   Min.   :   1   Min.   :   4   Min.   :-57.000   B6     :59
 1st Qu.: -5.00   1st Qu.:1107   1st Qu.:1114   1st Qu.:-17.000   UA     :59
 Median : -1.00   Median :1541   Median :1550   Median : -6.000   DL     :52
 Mean   : 12.58   Mean   :1511   Mean   :1541   Mean   :  7.443   EV     :51
 3rd Qu.: 13.25   3rd Qu.:1960   3rd Qu.:2000   3rd Qu.: 16.000   AA     :30
 Max.   :253.00   Max.   :2352   Max.   :2359   Max.   :330.000   MQ     :27
 NA's   :9        NA's   :9                     NA's   :10        (Other):59

     flight        tailnum      origin        dest        air_time        distance
 Min.   :  11   N738MQ :  4   EWR:125   ATL    : 18   Min.   : 29.0   Min.   :  94
 1st Qu.: 623   N723TW :  3   JFK:111   BOS    : 16   1st Qu.: 90.0   1st Qu.: 544
 Median :1444   N0EGMQ :  2   LGA:101   CLT    : 16   Median :139.0   Median :1005
 Mean   :1943   N11551 :  2             SFO    : 15   Mean   :157.3   Mean   :1091
 3rd Qu.:3361   N12238 :  2             FLL    : 14   3rd Qu.:194.0   3rd Qu.:1400
 Max.   :5978   (Other):320             LAX    : 14   Max.   :661.0   Max.   :4963
                NA's   :  4             (Other):244   NA's   :10

      hour           minute              time_hour
 Min.   : 6.00   Min.   : 0.00   13-08-2013 08:00:  3
 1st Qu.: 9.00   1st Qu.: 8.00   21-10-2013 20:00:  2
 Median :13.00   Median :30.00   22-02-2013 18:00:  2
 Mean   :13.23   Mean   :26.74   22-04-2013 06:00:  2
 3rd Qu.:17.00   3rd Qu.:41.00   22-11-2013 17:00:  2
 Max.   :23.00   Max.   :59.00   28-02-2013 17:00:  2
                                 (Other)         :324
Data preprocessing is one of the most important steps in machine learning, since it cleans the data and makes it ready for training. Missing values, such as the NA's reported in the summary above, can distort the predictions, so they must be dealt with first. Data preprocessing includes data cleaning, handling missing values, feature engineering, and so on.
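A minimal preprocessing sketch is shown below. The toy rows are made up for illustration, and the choice of model columns and the rule "delayed means a departure delay above 0 minutes" are assumptions rather than the article's confirmed setup.

```r
# Made-up toy rows standing in for the real dataset.
flights <- data.frame(
  month     = c(1, 1, 2, 3, 4),
  day       = c(1, 2, 14, 30, 9),
  dep_delay = c(-3, 12, NA, 45, 0),
  carrier   = c("UA", "B6", "DL", "UA", "B6"),
  distance  = c(1400, 1005, 544, NA, 760)
)

# Columns assumed to be used by the model.
model_cols <- c("month", "day", "dep_delay", "carrier", "distance")

# Data cleaning: keep only rows with no missing value in the model columns.
clean <- flights[complete.cases(flights[, model_cols]), model_cols]

# Categorical columns are stored as factors for modelling.
clean$carrier <- factor(clean$carrier)

# Feature engineering: a binary flag derived from the delay in minutes.
clean$delayed <- factor(ifelse(clean$dep_delay > 0, "Yes", "No"),
                        levels = c("No", "Yes"))

print(clean)
```

Dropping incomplete rows is the simplest option; imputing missing values is an alternative when too many rows would otherwise be lost.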
Exploratory data analysis (EDA) helps us understand the structure and patterns in a dataset. The following plots give a better view of the data and support informed modelling decisions.
First, we plot the distribution of departure delay in minutes to see how delays are spread.
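The distribution plot can be sketched with base R's `hist()`. The delays here are simulated, so the shape is illustrative only, and the original figure may have been drawn with a different plotting package.

```r
set.seed(42)

# Simulated departure delays in minutes (illustrative, not the real data),
# with two missing values appended to mimic the NA's in the dataset.
dep_delay <- c(round(rexp(300, rate = 1 / 15)) - 10, NA, NA)

# hist() cannot bin missing values meaningfully, so drop them first.
delays <- dep_delay[!is.na(dep_delay)]

h <- hist(delays,
          breaks = 30,
          main   = "Distribution of departure delay",
          xlab   = "Departure delay (minutes)",
          col    = "steelblue",
          border = "white")
```

Delay data is typically right-skewed: most flights leave close to schedule, while a small number are very late.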
Output:
Next, we compare the average departure delay per airline, which shows which carriers tend to be delayed the most.
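One way to compute and plot the per-carrier average is with base R's `aggregate()` and `barplot()`, as in this sketch. The rows are made up for illustration.

```r
# Made-up rows: carrier code and departure delay in minutes.
flights <- data.frame(
  carrier   = c("UA", "UA", "B6", "B6", "DL", "DL"),
  dep_delay = c(10, 20, 5, NA, -2, 4)
)

# Mean delay per carrier; the formula interface drops rows with NA by default.
avg_delay <- aggregate(dep_delay ~ carrier, data = flights, FUN = mean)

# Sort so the carrier with the highest average delay comes first.
avg_delay <- avg_delay[order(-avg_delay$dep_delay), ]

barplot(avg_delay$dep_delay,
        names.arg = avg_delay$carrier,
        main = "Average departure delay by carrier",
        xlab = "Carrier",
        ylab = "Average delay (minutes)",
        col  = "tomato")

print(avg_delay)
```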
Output:
The heatmap shows which days of the month and which months have higher or lower average departure delays. Lighter colors represent lower delays, and darker colors represent higher delays.
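A heatmap of this kind can be sketched by averaging the delay for each month and day combination and drawing the resulting matrix with `image()`. The data is simulated, and the palette is an assumption chosen so that light cells mean low delays and dark cells mean high delays.

```r
set.seed(1)

# Simulated flights: month, day of month, and departure delay in minutes.
flights <- data.frame(
  month     = sample(1:12, 500, replace = TRUE),
  day       = sample(1:28, 500, replace = TRUE),
  dep_delay = round(rnorm(500, mean = 12, sd = 30))
)

# Month-by-day matrix of average delays (NA where no flight exists).
avg <- tapply(flights$dep_delay,
              list(month = flights$month, day = flights$day),
              mean, na.rm = TRUE)

# Light yellow for low delays, dark red for high delays.
palette_low_high <- colorRampPalette(c("lightyellow", "darkred"))(20)

image(x = as.numeric(rownames(avg)),
      y = as.numeric(colnames(avg)),
      z = avg,
      col  = palette_low_high,
      xlab = "Month",
      ylab = "Day of month",
      main = "Average departure delay by month and day")
```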
Output:
We plot a correlation matrix of the numeric variables to see which factors are related to flight delay.
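The matrix itself can be computed with base R's `cor()`, as sketched below on simulated data. The column names follow the dataset, but the values and their relationships are made up.

```r
set.seed(7)
n <- 200

# Simulated numeric columns with some built-in relationships.
distance  <- runif(n, min = 100, max = 2500)
air_time  <- distance / 8 + rnorm(n, mean = 0, sd = 10)
dep_delay <- rnorm(n, mean = 12, sd = 30)
arr_delay <- dep_delay + rnorm(n, mean = -5, sd = 10)

flights <- data.frame(dep_delay, arr_delay, air_time, distance)
flights$air_time[c(3, 50)] <- NA  # mimic missing values

# Keep numeric columns only and ignore rows with missing values.
num_cols <- vapply(flights, is.numeric, logical(1))
corr <- cor(flights[, num_cols], use = "complete.obs")

print(round(corr, 2))
```

The printed matrix can then be passed to a plotting function to draw it as a coloured grid; the `corrplot` package is one commonly used option.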
Output:
Random Forest is an ensemble learning method that combines many decision trees for classification or regression. Averaging over many trees typically reduces overfitting and improves accuracy compared with a single decision tree.
Here, we predict flight delays with a Random Forest model. The trained model predicts the delay in minutes, and each flight is then labelled as delayed or not. In the output below, Residual is Actual minus Predicted, and Delayed is Yes when the predicted delay is positive.
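The modelling step could look like the following sketch, which assumes the `randomForest` package is installed. The data is simulated, and the predictor columns, the 70/30 split, and `ntree = 200` are illustrative assumptions rather than the article's confirmed settings.

```r
library(randomForest)  # assumed to be installed

set.seed(123)
n <- 300

# Simulated flights; later departure hours tend to have longer delays.
flights <- data.frame(
  month    = sample(1:12, n, replace = TRUE),
  hour     = sample(6:23, n, replace = TRUE),
  distance = round(runif(n, min = 100, max = 2500)),
  carrier  = factor(sample(c("UA", "B6", "DL"), n, replace = TRUE))
)
flights$dep_delay <- round(0.8 * (flights$hour - 12) + rnorm(n, mean = 5, sd = 15))

# 70/30 train/test split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- flights[train_idx, ]
test  <- flights[-train_idx, ]

# Regression forest: the target is the delay in minutes.
rf_model <- randomForest(dep_delay ~ month + hour + distance + carrier,
                         data = train, ntree = 200)

# Predict on held-out flights and derive the delayed flag.
predicted <- predict(rf_model, newdata = test)
results <- data.frame(
  Actual    = test$dep_delay,
  Predicted = predicted,
  Residual  = test$dep_delay - predicted,
  Delayed   = ifelse(predicted > 0, "Yes", "No")
)

head(results)
```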
Output:
Actual Predicted Residual Delayed
2 -6 -0.4863333 -5.513667 No
3 -3 -1.9815000 -1.018500 No
4 -1 1.9500000 -2.950000 Yes
5 -2 5.0131667 -7.013167 Yes
9 -10 -3.5148333 -6.485167 No
14 -9 28.9585000 -37.958500 Yes
To understand the predictions better, we visualize the predicted delays against the actual delays.
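One possible visualization is a scatter plot of actual versus predicted delay, sketched below using the six rows from the output table above. The plot type and colours are assumptions, since the original figure is not available.

```r
# The six rows shown in the prediction output above.
results <- data.frame(
  Actual    = c(-6, -3, -1, -2, -10, -9),
  Predicted = c(-0.4863333, -1.9815000, 1.9500000,
                5.0131667, -3.5148333, 28.9585000)
)

# A flight is flagged as delayed when its predicted delay is positive.
results$Delayed <- ifelse(results$Predicted > 0, "Yes", "No")

plot(results$Actual, results$Predicted,
     col  = ifelse(results$Delayed == "Yes", "red", "darkgreen"),
     pch  = 19,
     xlab = "Actual delay (minutes)",
     ylab = "Predicted delay (minutes)",
     main = "Actual vs predicted departure delay")

# Points on the dashed line would be perfect predictions.
abline(a = 0, b = 1, lty = 2)
legend("topleft",
       legend = c("Predicted delayed", "Predicted on time"),
       col = c("red", "darkgreen"), pch = 19)
```

The further a point lies from the dashed line, the larger its residual.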
Output:
Evaluation metrics tell us how well the trained model performs and how far we can trust its predictions.
RMSE measures the error of the predicted delay itself, while accuracy, precision, recall, F1-score, and AUC assess the delayed / not-delayed classification derived from it.
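These metrics can be computed in base R, as in the sketch below. The helper `evaluate_delays()` is hypothetical, and it assumes a flight counts as actually delayed when its actual delay is above 0 minutes, which may differ from the rule used to produce the article's numbers.

```r
# Hypothetical helper: regression and classification metrics in one place.
evaluate_delays <- function(actual, predicted, threshold = 0) {
  # Regression error, in the same unit as the delay (minutes).
  rmse <- sqrt(mean((actual - predicted)^2))

  # Turn both vectors into delayed / not-delayed labels.
  actual_pos <- actual > threshold
  pred_pos   <- predicted > threshold

  tp <- sum(pred_pos & actual_pos)
  fp <- sum(pred_pos & !actual_pos)
  fn <- sum(!pred_pos & actual_pos)
  tn <- sum(!pred_pos & !actual_pos)

  accuracy  <- (tp + tn) / length(actual)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)

  # AUC via the rank-sum (Mann-Whitney) identity,
  # using the predicted delay as the ranking score.
  r     <- rank(predicted)
  n_pos <- sum(actual_pos)
  n_neg <- sum(!actual_pos)
  auc   <- (sum(r[actual_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(RMSE = rmse, Accuracy = accuracy, Precision = precision,
    Recall = recall, F1 = f1, AUC = auc)
}

# Tiny made-up example.
actual    <- c(5, -3, 10, -1)
predicted <- c(2, -1, -4, 3)
metrics   <- evaluate_delays(actual, predicted)
print(metrics)
```

Precision and recall are undefined when there are no predicted or no actual delays, so a production version would need to guard against division by zero.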
Output:
RMSE: 13.0528251991879
Accuracy: 0.6145833
Precision: 0.8461538
Recall (Sensitivity): 0.4
F1-score: 0.5432099
AUC: 0.7933481
RMSE indicates how close the model's predictions are to the actual values, in the same unit as the target (minutes of delay here), with lower values indicating better performance. An RMSE of about 13 corresponds to a typical prediction error of roughly 13 minutes.
This article discussed flight delay prediction using the R programming language and how machine learning can play an important role in travel planning. We used an external dataset to predict flight delays and used visualization to understand the relationships between the variables.