Exploratory Data Analysis (EDA): A Practical Approach Using YOUR Uber Rides Dataset
An exploratory data analysis in Python
Exploring data is certainly one of the most important stages in any Data Science process. Despite its simplicity, it can be a powerful tool to give you a head start on understanding the data and its business context, as well as to determine crucial treatments before creating machine learning models.
To make things a little more interesting, I’ve decided to have some fun with Python on my personal Uber rides data and see which insights I could extract.
In this post, I will guide you through the following steps:
- Problem Definition
- Data Discovery
- Data Preparation
- Data Analysis & Storytelling
Note: Data Preparation is usually a stage that requires lots of work around data formatting, cleansing, and manipulation, but making your data CONSISTENT is a success factor for your analysis and future modeling.
Requesting and downloading your personal dataset
Uber’s data download feature provides you with in-depth information about your rides. You can request access to your data through the following link: https://myprivacy.uber.com/privacy/exploreyourdata/download
After your request is done, an email with the download link will be sent to you (usually on the same day).
For security purposes, your data is only available for 7 days.
1. Problem Definition: First things first!
Before you start manipulating and analyzing data, the first thing you should do is think about the purpose, that is, the reasons why you are conducting the analysis at all. If you are uncertain about this, simply start formulating questions about your subject: What? When? Where? Who? Which? How? How many? How much?
Depending on how much data and how many features you have, the analysis could go to infinity and beyond. That’s why, after some thought, I decided to focus on the following questions:
a. How many trips have I done over the years?
b. How many trips were Completed and Canceled?
c. Where did most of the dropoffs occur?
d. What product type was usually chosen?
e. What was the avg. fare, distance, amount and time spent on rides?
f. Which weekdays had the highest average fares?
g. Which were the longest/shortest and most expensive/cheapest rides?
h. What was the average lead time before beginning a trip?
Clearly, these are only questions to guide you through the analysis. But let’s assume I am a manager and my problem definition is: I do not have visibility into the main business metrics (problem) that I need to start driving insights and decisions for the business (purpose).
To better guide you, always try to think about 3 main questions:
WHAT? (The Problem: what is the need, the intention, the pain?)
WHY? (The Purpose: why do you want this? what is your goal?)
HOW? (The Product: how will your problem be solved? which tool?)
Note: Never proceed to the next stages before the problem and purpose are crystal clear to you.
2. Data Discovery
Importing libraries and dataset.
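A minimal sketch of this step, assuming the export is a CSV file. The file name mentioned in the comment is hypothetical, and a tiny inline sample with made-up values stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Inline stand-in for the exported file (made-up values). With the real export
# you would pass the path instead, e.g. pd.read_csv("trips_data.csv"), where
# the file name is an assumption.
sample_csv = io.StringIO(
    "City,Product Type,Trip or Order Status,Distance (miles),Fare Amount\n"
    "458,UberX,COMPLETED,2.5,12.0\n"
    "458,UberX,CANCELED,0.0,0.0\n"
)
rides = pd.read_csv(sample_csv)

# Data types, non-null counts and dimensions in one call.
rides.info()
```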
Checking basic dataset information (data types and dimensions)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 554 entries, 0 to 553
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   City                  554 non-null    int64
 1   Product Type          551 non-null    object
 2   Trip or Order Status  554 non-null    object
 3   Request Time          554 non-null    object
 4   Begin Trip Time       554 non-null    object
 5   Begin Trip Lat        525 non-null    float64
 6   Begin Trip Lng        525 non-null    float64
 7   Dropoff Time          554 non-null    object
 8   Dropoff Lat           525 non-null    float64
 9   Dropoff Lng           525 non-null    float64
 10  Distance (miles)      554 non-null    float64
 11  Fare Amount           554 non-null    float64
 12  Fare Currency         551 non-null    object
dtypes: float64(6), int64(1), object(6)
memory usage: 56.4+ KB
The .rename() method lets you rename axis labels (indexes and columns). In this case, I decided to normalize the column names to clean up the code, since I could then reference columns simply by typing <data_frame>.<column>
Use the .head() method to get a better feel for how the data is formatted and to understand the overall structure of the dataset values.
Looking at the continuous variables, we notice the presence of some outliers. However, these outliers do not seem to be abnormal values (e.g. fare_amount = 1000 BRL), which makes us a little more comfortable.
P.S. In case abnormal values are found, some treatment should probably be considered (e.g. outlier replacement/removal).
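Here is a sketch of what such a renaming could look like. The names product_type, status, request_time, begin_time, dropoff_time, distance_miles, fare_amount and fare_currency are the ones used later in this post; the remaining ones (city, begin_lat, begin_lng, dropoff_lat, dropoff_lng) are my assumptions.

```python
import pandas as pd

# Empty frame carrying the 13 original column names from the info() output.
original_columns = [
    "City", "Product Type", "Trip or Order Status", "Request Time",
    "Begin Trip Time", "Begin Trip Lat", "Begin Trip Lng", "Dropoff Time",
    "Dropoff Lat", "Dropoff Lng", "Distance (miles)", "Fare Amount",
    "Fare Currency",
]
rides = pd.DataFrame(columns=original_columns)

# Old name -> normalized name (some targets are assumed, see the text above).
new_names = {
    "City": "city",
    "Product Type": "product_type",
    "Trip or Order Status": "status",
    "Request Time": "request_time",
    "Begin Trip Time": "begin_time",
    "Begin Trip Lat": "begin_lat",
    "Begin Trip Lng": "begin_lng",
    "Dropoff Time": "dropoff_time",
    "Dropoff Lat": "dropoff_lat",
    "Dropoff Lng": "dropoff_lng",
    "Distance (miles)": "distance_miles",
    "Fare Amount": "fare_amount",
    "Fare Currency": "fare_currency",
}
rides = rides.rename(columns=new_names)
print(list(rides.columns))
```

With names like these, a column can be reached as rides.fare_amount instead of rides["Fare Amount"].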
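To make the outlier check a bit more concrete, here is a small sketch using .describe() plus the 1.5 * IQR rule of thumb. The IQR rule is just one common convention that I am assuming here, and the fares are made-up values.

```python
import pandas as pd

# Made-up fares in BRL, for illustration only.
fares = pd.Series(
    [8.5, 10.0, 12.0, 13.5, 15.0, 18.0, 22.0, 60.0], name="fare_amount"
)

# Count, mean, std, min, quartiles and max in one table.
print(fares.describe())

# Flag values lying more than 1.5 * IQR outside the quartiles.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Whether a flagged value is really abnormal is still a judgment call: a 60 BRL fare may simply be a long ride.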
The charts below show a different perspective on the distribution of the variables. In this case, we see that both variables have a positively skewed (right-skewed) distribution. For distance, the most frequent values are the shorter distances, and fare amount shows the same behavior.
Additionally, we notice that the standard deviation is high relative to the mean. This means that the values of both variables are widely dispersed.
Not surprisingly, there is a strong correlation between ‘fare_amount’ and ‘distance_miles’, suggesting that the longer the ride, the higher the fare.
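Both observations can also be checked numerically. The sketch below, on made-up distances, computes the skewness (positive means a longer right tail) and the ratio of standard deviation to mean as a rough dispersion measure.

```python
import pandas as pd

# Made-up distances: many short rides and one long one.
distances = pd.Series(
    [1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 12.0], name="distance_miles"
)

# Positive skewness indicates a right-skewed (positively asymmetric) shape.
skewness = distances.skew()

# Standard deviation relative to the mean (coefficient of variation).
dispersion = distances.std() / distances.mean()

print(f"skewness: {skewness:.2f}")
print(f"std / mean: {dispersion:.2f}")
```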
sns.scatterplot(x='distance_miles',y='fare_amount',data=df1);
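The strength of that relationship can be quantified with the Pearson correlation coefficient. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up trips where the fare grows almost linearly with the distance.
trips = pd.DataFrame({
    "distance_miles": [1.0, 2.0, 3.5, 5.0, 8.0],
    "fare_amount": [8.0, 11.0, 15.5, 21.0, 30.0],
})

# Pearson correlation: close to 1 means a strong positive linear relation.
correlation = trips["distance_miles"].corr(trips["fare_amount"])
print(f"correlation: {correlation:.3f}")
```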
3. Data Preparation
I decided to remove the column fare_currency, since all my trips happened inside a single country (Brazil).
Now let’s check the existence of missing values.
Besides the empty Lng and Lat values (29 in total), 3 records were found without product_type. As shown below, these records are insignificant to my dataset, since practically none of their columns are filled in.
So now, let’s get rid of these 3 records before proceeding.
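The three steps above (dropping the currency column, counting missing values and removing the records without product_type) could be sketched as follows. The sample rows are made up, and "Black" is only a placeholder product name.

```python
import numpy as np
import pandas as pd

# Made-up sample: one record has no product_type, two have no coordinates.
rides = pd.DataFrame({
    "product_type": ["UberX", None, "UberX", "Black"],
    "begin_lat": [-23.55, np.nan, np.nan, -23.56],
    "fare_amount": [12.0, 0.0, 0.0, 30.0],
    "fare_currency": ["BRL", None, "BRL", "BRL"],
})

# A single-currency column carries no information, so drop it.
rides = rides.drop(columns=["fare_currency"])

# Number of missing values per column.
missing_per_column = rides.isnull().sum()
print(missing_per_column)

# Remove only the records that lack a product type.
rides = rides.dropna(subset=["product_type"]).reset_index(drop=True)
```

Using subset= keeps the rows that only lack coordinates, which matches the decision to drop just the records without product_type.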
3.1 Data Cleansing: Categorical features
While analyzing the first categorical column <product_type>, I could clearly see that some work was necessary, since different values referred to the same category. I therefore consolidated the 15 original categories into 5.
As the scope of this analysis is only around Uber rides, I removed UberEATS records from my dataset.
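A sketch of how such a consolidation could be done with a mapping dictionary. Apart from UberX and UberEATS, the raw labels and target categories below are invented for illustration; the real export will contain different variants.

```python
import pandas as pd

# Invented raw labels that refer to the same underlying products.
rides = pd.DataFrame({
    "product_type": [
        "UberX", "uberX", "uberx", "Black", "UberBLACK",
        "UberEATS Marketplace",
    ]
})

# Variant label -> consolidated category (hypothetical mapping).
category_map = {
    "uberX": "UberX",
    "uberx": "UberX",
    "UberBLACK": "Black",
    "UberEATS Marketplace": "UberEATS",
}
rides["product_type"] = rides["product_type"].replace(category_map)

# Keep rides only: food delivery is out of scope.
rides = rides[rides["product_type"] != "UberEATS"].reset_index(drop=True)
print(rides["product_type"].value_counts())
```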
Our second categorical feature <status> seems well classified in 3 statuses, which will not require any treatment.
3.2 Data Transformation: Handling dates
Dates usually add a lot of analytical power, since you can break them down into different parts and generate insights from different perspectives. As previously shown, our date features are in fact object data types, so we need to convert them into datetime format.
Now, let’s break down the <request_time> feature into different date parts. I only did that for <request_time>, since I’m assuming that all rides were completed on the same day they were requested (believe me, I have already checked that! 😀 ).
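A minimal sketch of the conversion with pd.to_datetime. The timestamp format used here is an assumption; the strings in the real export may carry extra pieces (such as a timezone suffix) that need their own format string.

```python
import pandas as pd

# Made-up timestamps stored as plain strings (object dtype).
rides = pd.DataFrame({
    "request_time": ["2019-03-01 08:15:00", "2019-03-02 18:40:00"],
    "begin_time": ["2019-03-01 08:20:00", "2019-03-02 18:44:30"],
    "dropoff_time": ["2019-03-01 08:45:00", "2019-03-02 19:05:00"],
})

# Convert each date column from string to datetime64.
for column in ["request_time", "begin_time", "dropoff_time"]:
    rides[column] = pd.to_datetime(rides[column], format="%Y-%m-%d %H:%M:%S")

print(rides.dtypes)
```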
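Once the column is a datetime, the .dt accessor exposes the individual parts. The new column names below (year, month, weekday, hour) are my own choices, and the timestamps are made up.

```python
import pandas as pd

rides = pd.DataFrame({
    "request_time": pd.to_datetime(
        ["2019-03-01 08:15:00", "2020-11-15 22:05:00"]
    )
})

# Break the request timestamp into separate date parts.
rides["year"] = rides["request_time"].dt.year
rides["month"] = rides["request_time"].dt.month
rides["weekday"] = rides["request_time"].dt.day_name()
rides["hour"] = rides["request_time"].dt.hour

print(rides)
```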
3.3 Feature Engineering: Creating new features
Based on the <fare_amount> and <distance_miles> features, I’ve created a new feature called <amount_km>, which helps us understand how much was paid per kilometer ridden.
The delta time between <request_time> and <begin_time> tells us how much time (in minutes) I usually waited for Uber cars to arrive at my pickup location.
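Since the distance comes in miles, it needs converting to kilometers first. A sketch on made-up values; the intermediate distance_km column and the handling of zero-distance records are my assumptions.

```python
import numpy as np
import pandas as pd

KM_PER_MILE = 1.609344  # exact definition of the international mile

# Made-up rides; the last one has zero distance (e.g. a canceled trip).
rides = pd.DataFrame({
    "fare_amount": [20.0, 10.0, 7.0],
    "distance_miles": [5.0, 2.0, 0.0],
})

rides["distance_km"] = rides["distance_miles"] * KM_PER_MILE

# Treat zero distances as missing so the division yields NaN instead of inf.
rides["amount_km"] = rides["fare_amount"] / rides["distance_km"].replace(0, np.nan)

print(rides)
```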
Similarly, the delta time between <dropoff_time> and <begin_time> will let us know how much time (in minutes) was spent on each trip.
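Both deltas come from subtracting datetime columns and converting the result to minutes. The new column names (request_lead_time, trip_duration) are hypothetical, and the timestamps are made up.

```python
import pandas as pd

rides = pd.DataFrame({
    "request_time": pd.to_datetime(["2019-03-01 08:15:00", "2019-03-02 18:40:00"]),
    "begin_time": pd.to_datetime(["2019-03-01 08:20:00", "2019-03-02 18:44:30"]),
    "dropoff_time": pd.to_datetime(["2019-03-01 08:45:00", "2019-03-02 19:05:00"]),
})

# Minutes between requesting the ride and the trip actually starting.
rides["request_lead_time"] = (
    rides["begin_time"] - rides["request_time"]
).dt.total_seconds() / 60

# Minutes between the trip starting and the drop-off.
rides["trip_duration"] = (
    rides["dropoff_time"] - rides["begin_time"]
).dt.total_seconds() / 60

print(rides[["request_lead_time", "trip_duration"]])
```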
As the features in records with CANCELED and DRIVER_CANCELED status will not be useful for my analysis, I set them to null values to clean up my dataset a little more.
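One way to do this is a boolean mask combined with .loc. Which features get nulled is an assumption on my side; here I use distance and fare on made-up rows.

```python
import numpy as np
import pandas as pd

rides = pd.DataFrame({
    "status": ["COMPLETED", "CANCELED", "DRIVER_CANCELED"],
    "distance_miles": [3.0, 0.0, 0.0],
    "fare_amount": [14.0, 0.0, 5.0],
})

# Rows that did not end in a completed trip.
canceled = rides["status"].isin(["CANCELED", "DRIVER_CANCELED"])

# Trip features that are meaningless for canceled rides (assumed list).
trip_features = ["distance_miles", "fare_amount"]
rides.loc[canceled, trip_features] = np.nan

print(rides)
```

Nulls are skipped by aggregations such as .mean(), so the canceled rides no longer drag the averages toward zero.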
4. Data Analysis & Storytelling: It’s show time!
RECOMMENDATION: Do not start your analysis without completing the Business Problem Definition, since it determines your analysis’ focus and quality. Besides that, this process will help you to think about new possibilities/questions while trying to answer the previous ones set.
NOTE: To better organize my analysis, I will create an additional data frame without the trips with status CANCELED and DRIVER_CANCELED, since they should be disregarded for some questions.
a. How many trips have I done over the years?
A total of 444 trips were completed from Apr’16 to Jan’21. If we disregard 2016 and 2021 (not full years), we can clearly see that from 2017 to 2019 the average number of rides per year is 124, and that there is a huge drop from 2019 to 2020 (-51%). This is easily explained by the COVID outbreak.
Now, imagine if we extrapolate this result to all Uber users…
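The yearly counts and the year-over-year change can be sketched like this. The sample is made up, and completed_rides is a hypothetical name for the additional data frame without canceled trips.

```python
import pandas as pd

rides = pd.DataFrame({
    "status": ["COMPLETED"] * 6 + ["CANCELED"],
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020],
})

# Additional data frame holding only completed trips.
completed_rides = rides[rides["status"] == "COMPLETED"]

# Number of completed trips per year.
trips_per_year = completed_rides.groupby("year").size()

# Year-over-year change in percent.
yearly_change = trips_per_year.pct_change() * 100

print(trips_per_year)
print(yearly_change)
```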
b. How many trips were completed or canceled?
Looking at the stacked bars below, we can see that, excluding 2016 and 2021 (due to low trip volume), 2020 has the highest cancelation rate. This could be an alarming indicator, considering the drastic impact on businesses after the Covid outbreak. Overall, the cancelation rate was 17.9% (considering both RIDER and DRIVER cancelations).
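The numbers behind such a chart can be obtained with a cross-tabulation of year against status, plus the share of canceled rides. A sketch on made-up rows:

```python
import pandas as pd

rides = pd.DataFrame({
    "year": [2019] * 4 + [2020] * 4,
    "status": [
        "COMPLETED", "COMPLETED", "COMPLETED", "CANCELED",
        "COMPLETED", "COMPLETED", "CANCELED", "DRIVER_CANCELED",
    ],
})

# Trip counts per year and status (the data behind stacked bars).
status_by_year = pd.crosstab(rides["year"], rides["status"])
print(status_by_year)

# Rider and driver cancelations both count as canceled.
is_canceled = rides["status"].isin(["CANCELED", "DRIVER_CANCELED"])
cancel_rate_by_year = is_canceled.groupby(rides["year"]).mean() * 100
overall_cancel_rate = is_canceled.mean() * 100

print(cancel_rate_by_year)
print(f"overall: {overall_cancel_rate:.1f}%")
```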
c. Where did most of the dropoffs occur?
The following heatmap dynamically shows the most frequented areas through different hues and intensities. This could be valuable information for Uber to adjust prices and optimize demand in certain regions, also combining time-space data to track users’ behaviors.
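An interactive map needs a mapping library, so as a simple stand-in here is a pandas-only sketch that bins drop-off coordinates into a coarse grid and counts the rides per cell. This is not the heatmap itself, just a plain-table version of the same idea. The column names and coordinates are made up.

```python
import pandas as pd

# Made-up drop-off coordinates: four close together, one farther away.
rides = pd.DataFrame({
    "dropoff_lat": [-23.561, -23.562, -23.559, -23.601, -23.562],
    "dropoff_lng": [-46.656, -46.657, -46.658, -46.700, -46.656],
})

# Rounding to 2 decimals groups points into cells of roughly 1 km.
cells = rides.assign(
    lat_cell=rides["dropoff_lat"].round(2),
    lng_cell=rides["dropoff_lng"].round(2),
)

# Rides per grid cell, busiest cells first.
hotspots = (
    cells.groupby(["lat_cell", "lng_cell"]).size().sort_values(ascending=False)
)
print(hotspots)
```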
d. What product type was usually chosen?
UberX was by far the preferred product type with a frequency of 90.3%. So I could probably infer that I am the type of user who usually looks for affordable prices.
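A frequency like this comes straight from value_counts with normalize=True. The sample below is made up, and "Black" is a placeholder product name.

```python
import pandas as pd

# Made-up sample: 9 UberX rides and 1 ride of another product.
product_types = pd.Series(["UberX"] * 9 + ["Black"], name="product_type")

# Share of each product type in percent, most frequent first.
share = product_types.value_counts(normalize=True) * 100
print(share.round(1))
```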
e. What was the average fare, distance, amount, and time spent on rides?
Considering all trips, the average amount spent per trip is 19.2 BRL, covering approx. 8.1 km. So, if we do a quick simulation of how much I would spend in a year on daily round trips, we would have: 365 days * 2 trips * 19.2 BRL/fare = 14,016 BRL/year
Also on average, approx. 2.4 BRL/km and 21.4 minutes were spent per trip.
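The yearly simulation is plain arithmetic and can be reproduced in a few lines. Only the average fare comes from the analysis; the variable names are mine.

```python
average_fare_brl = 19.2  # average fare per trip reported above
trips_per_day = 2        # one round trip per day
days_per_year = 365

yearly_cost_brl = days_per_year * trips_per_day * average_fare_brl
print(f"{yearly_cost_brl:,.0f} BRL/year")  # prints "14,016 BRL/year"
```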
f. Which weekdays had the highest average fares per km ridden?
According to the chart below, Mondays, Wednesdays, Fridays and Sundays were on average the most expensive weekdays. This helps us better understand the weekly seasonality and find the days with higher profitability for Uber and its drivers.
g. Which were the longest/shortest and most expensive/cheapest rides?
The table below shows the records with the longest (31.77 km) and shortest (0.24 km) rides.
Analyzing the amount paid per km ridden, we have the most expensive ride (46.96 BRL/km) and the cheapest (0 BRL/km). The most expensive value is basically driven by the fixed minimum fare in high-demand periods, since the total distance of that ride was only 0.24 km.
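The figures behind such a chart can be computed with a groupby on the weekday. A sketch on made-up values, reusing the <amount_km> feature and a hypothetical weekday column:

```python
import pandas as pd

rides = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Tuesday", "Friday", "Friday"],
    "amount_km": [3.0, 2.0, 1.5, 2.5, 3.5],
})

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Average fare per km by weekday, in calendar order; days without rides drop out.
avg_by_weekday = (
    rides.groupby("weekday")["amount_km"].mean().reindex(weekday_order).dropna()
)
print(avg_by_weekday)
```

Reindexing matters because groupby sorts the weekday names alphabetically by default.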
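Records like these can be located with idxmax and idxmin. A sketch on made-up values, where distance_km is an assumed column name:

```python
import pandas as pd

rides = pd.DataFrame({
    "distance_km": [0.5, 4.0, 20.0, 7.5],
    "fare_amount": [9.0, 12.0, 55.0, 0.0],
})
rides["amount_km"] = rides["fare_amount"] / rides["distance_km"]

# Row labels of the longest, shortest, most expensive and cheapest rides.
extreme_rows = [
    rides["distance_km"].idxmax(),
    rides["distance_km"].idxmin(),
    rides["amount_km"].idxmax(),
    rides["amount_km"].idxmin(),
]
extremes = rides.loc[extreme_rows]
print(extremes)
```

In this toy sample the shortest ride is also the most expensive per km, the same pattern described above.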
h. What was the average lead time before beginning a trip?
On average, it took approximately 5 minutes (4.9 minutes) for a trip to start after it was requested.
Conclusion
Exploratory Data Analysis is not a trivial task! It requires lots of work and patience. However, it is surely a powerful tool if correctly applied to your business context.
This post briefly demonstrated some tips and steps to make analysis easier, and highlighted the crucial importance of a well-defined business problem, which guides all coding efforts toward a specific objective and brings out important insights. This business case also tried to reflect a practical application of Python in daily business activities, showing how fun, valuable, and interesting it can become.
Thank you so much for getting here! I hope you liked it! 😃