Exploratory Data Analysis (EDA): A practical approach using YOUR Uber rides dataset

Felipe Alves Santos

Feb 16, 2021

A Practical Approach Using YOUR Uber Rides Dataset

An exploratory data analysis in Python

Picture taken by Benjamin Voros

Exploring data is certainly one of the most important stages of any data science process. Despite its simplicity, it can be a powerful tool for getting up to speed on the data and its business context, as well as for deciding which treatments are needed before creating machine learning models.

To make things a little more interesting, I decided to have some fun with Python on my personal Uber rides data and see which insights I could extract.

In this post, I will guide you through the following steps:

Problem Definition
Data Discovery
Data Preparation
Data Analysis & Storytelling

Note: Data Preparation is usually a stage that requires lots of work around data formatting, cleansing, and manipulation, but making your data CONSISTENT is a success factor for your analysis and future modeling.

Requesting and downloading your personal dataset

Uber’s data download feature provides you with in-depth information about your rides. You can request access to your data through the following link: https://myprivacy.uber.com/privacy/exploreyourdata/download

After your request is done, an email with the download link will be sent to you (usually on the same day).

For security purposes, your data is only available for 7 days.

1. Problem Definition: First things first!

Before you start manipulating and analyzing data, the first thing you should do is think about the purpose. In other words, think about the reasons why you are conducting the analysis in the first place. If you are uncertain about this, simply start formulating questions about your subject: What? When? Where? Who? Which? How? How many? How much?

Depending on how much data and how many features you have, the analysis could go to infinity and beyond. That is why, after some thought, I decided to focus on the following questions:

a. How many trips have I done over the years?
b. How many trips were Completed and Canceled?
c. Where did most of the dropoffs occur?
d. What product type was usually chosen?
e. What was the average fare, distance, amount, and time spent on rides?
f. Which weekdays had the highest average fares?
g. Which were the longest/shortest and most expensive/cheapest rides?
h. What was the average lead time before beginning a trip?

Clearly, these questions are only there to guide the analysis. But let’s assume I am a manager and my problem definition is: I have no visibility into the main business metrics (problem), so I cannot start driving insights and decisions for the business (purpose).

To better guide you, always try to think about 3 main questions:

WHAT? (The Problem: what is the need, the intention, the pain?)

WHY? (The Purpose: why do you want this? what is your goal?)

HOW? (The Product: how will your problem be solved? which tool?)

Note: Never proceed to the next stages before the problem and purpose are crystal clear to you.

2. Data Discovery

Importing libraries and dataset.
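A minimal sketch of this step, assuming the export is a plain CSV file. The sample rows below are made up for illustration (only the column names come from the dataset summary in this section), and `load_rides` is a hypothetical helper name, not something from the original code.

```python
import io

import pandas as pd

# Made-up sample standing in for the real export. Only the column names are
# taken from the article; every value here is invented for illustration.
SAMPLE_CSV = """City,Product Type,Trip or Order Status,Request Time,Begin Trip Time,Begin Trip Lat,Begin Trip Lng,Dropoff Time,Dropoff Lat,Dropoff Lng,Distance (miles),Fare Amount,Fare Currency
458,UberX,COMPLETED,2021-01-13 22:06:46,2021-01-13 22:11:10,-23.59,-46.68,2021-01-13 22:29:13,-23.55,-46.65,3.5,29.63,BRL
458,UberX,CANCELED,2021-01-13 21:50:02,2021-01-13 21:50:02,,,2021-01-13 21:50:02,,,0.0,0.0,BRL
458,,COMPLETED,2020-12-11 23:16:33,2020-12-11 23:27:32,-23.56,-46.66,2020-12-11 23:44:38,-23.59,-46.68,3.1,20.86,
"""


def load_rides(source):
    """Read a trips export into a DataFrame.

    `source` may be a file path or any file-like object.
    """
    return pd.read_csv(source)


# In-memory sample so the snippet runs on its own; with real data you would
# pass the path to the CSV inside your Uber download instead.
rides = load_rides(io.StringIO(SAMPLE_CSV))
print(rides.shape)
```

With your own download, replace the `io.StringIO(...)` argument with the path to the trips CSV file from the archive Uber sends you.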

Checking basic dataset information (data types and dimensions)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 554 entries, 0 to 553
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   City                  554 non-null    int64
 1   Product Type          551 non-null    object
 2   Trip or Order Status  554 non-null    object
 3   Request Time          554 non-null    object
 4   Begin Trip Time       554 non-null    object
 5   Begin Trip Lat        525 non-null    float64
 6   Begin Trip Lng        525 non-null    float64
 7   Dropoff Time          554 non-null    object
 8   Dropoff Lat           525 non-null    float64
 9   Dropoff Lng           525 non-null    float64
 10  Distance (miles)      554 non-null    float64
 11  Fare Amount           554 non-null    float64
 12  Fare Currency         551 non-null    object
dtypes: float64(6), int64(1), object(6)
memory usage: 56.4+ KB
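A summary like the one above is what `DataFrame.info()` prints. Here is a small sketch of the calls that give you dimensions, data types, and missing-value counts. The tiny `sample` frame and its values are invented for illustration and use only a few of the real columns.

```python
import pandas as pd

# Invented rows with a subset of the real columns; None marks missing values.
sample = pd.DataFrame(
    {
        "City": [458, 458, 458],
        "Product Type": ["UberX", None, "UberX"],
        "Begin Trip Lat": [-23.59, None, -23.56],
        "Fare Amount": [29.63, 0.0, 20.86],
    }
)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type of each column
sample.info()         # combined view: non-null counts, dtypes, memory usage

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
print(missing[missing > 0])
```

In the real dataset, the same check would point at the columns with fewer than 554 non-null values, such as Product Type and the latitude/longitude fields.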

The .rename() method lets you rename axis labels (index and columns). In this case, I decided to normalize the column names to keep the code cleaner, since I could then access a column simply by typing <data_frame>.<column>
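One way to do this normalization is sketched below. The snake_case convention is an assumption on my part (the only renamed column this section shows is fare_amount), and `to_snake_case` is a hypothetical helper; passing an explicit dictionary such as `{"Fare Amount": "fare_amount"}` to `columns=` works just as well.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Turn a label like 'Distance (miles)' into 'distance_miles'."""
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then trim stray underscores at the ends and lowercase the result.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return cleaned.strip("_").lower()


# Empty frame that only carries a few of the original column names.
rides = pd.DataFrame(
    columns=["City", "Product Type", "Trip or Order Status",
             "Distance (miles)", "Fare Amount"]
)

# rename() accepts a function and applies it to every column label.
rides = rides.rename(columns=to_snake_case)
print(list(rides.columns))
```

After this, a column can be reached with attribute access, for example `rides.fare_amount`, because the names no longer contain spaces or parentheses.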

Use the .head() method to get a better feel for how the data is formatted and to understand the overall structure of the dataset's values.
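A quick sketch of that call on an invented six-row frame. The column names follow the snake_case style used in this article, and the values are made up.

```python
import pandas as pd

# Invented rows, just enough to show what head() returns.
rides = pd.DataFrame(
    {
        "product_type": ["UberX", "UberX", "Pool", "UberX", "Black", "UberX"],
        "fare_amount": [29.63, 0.0, 20.86, 12.40, 55.10, 9.90],
    }
)

first_rows = rides.head()   # first 5 rows by default
first_two = rides.head(2)   # or ask for a specific number of rows
print(first_rows)
```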

Taking a look at the continuous variables, we notice some outliers. However, none of them looks like an abnormal value (e.g. fare_amount = 1000 BRL), which leaves us a little more comfortable.
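A common way to take that look is `DataFrame.describe()`, which summarizes each numeric column. The sketch below assumes the normalized column names distance_miles and fare_amount and uses invented values, so the numbers are not the ones from my dataset.

```python
import pandas as pd

# Invented values for the two continuous columns discussed in the text.
rides = pd.DataFrame(
    {
        "distance_miles": [3.5, 0.0, 3.1, 1.2, 18.4, 2.7],
        "fare_amount": [29.63, 0.0, 20.86, 12.40, 95.10, 15.30],
    }
)

# count, mean, std, min, quartiles, and max for each selected column.
summary = rides[["distance_miles", "fare_amount"]].describe()
print(summary)
```

A max that sits far above the 75% quartile is the usual hint that a column contains outliers worth inspecting.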

P.S. If abnormal values are found, some treatment should probably be considered (e.g. replacing or removing the outliers).
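As an illustration of such a treatment, here is a sketch based on the 1.5 × IQR rule of thumb. This is one common convention and not necessarily the right one for every dataset; `iqr_outlier_mask` is a hypothetical helper and the fares are invented, with a 1000 BRL value planted as the abnormal one.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented fares; 1000.0 plays the role of the abnormal value.
fares = pd.Series([29.63, 0.0, 20.86, 12.40, 1000.0, 15.30], name="fare_amount")
mask = iqr_outlier_mask(fares)

# Option 1: removal, keeping only the rows that were not flagged.
removed = fares[~mask]

# Option 2: replacement, swapping flagged values for the median of the rest.
replaced = fares.where(~mask, removed.median())

print(fares[mask])
print(replaced)
```

Whether to remove or replace depends on the question being asked: removal shrinks the sample, while replacement keeps the row but changes its value.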

URL: https://towardsdatascience.com/exploratory-data-analysis-eda-a-pratical-approach-using-your-uber-rides-dataset-5e9f0e892149/

3. Data Preparation

3.1 Data Cleansing: Categorical features

3.2 Data Transformation: Handling dates

3.3 Feature Engineering: Creating new features

4. Data Analysis & Storytelling: It’s show time!

a. How many trips have I done over the years?

b. How many trips were completed or canceled?

c. Where did most of the dropoffs occur?

d. What product type was usually chosen?

e. What was the average fare, distance, amount, and time spent on rides?

f. Which weekdays had the highest average fares per km ridden?

g. Which were the longest/shortest and most expensive/cheapest rides?

h. What was the average lead time before beginning a trip?

Conclusion
