In this analysis, we will explore car sales data to uncover patterns and relationships that influence car prices and sales. Our dataset includes variables such as Price_in_thousands, Engine_size, Horsepower, Fuel_efficiency, Sales_in_thousands and many more.
We aim to understand how features such as engine size, horsepower and fuel efficiency influence car prices and sales. We will load and inspect the data, check for missing values and duplicates, visualize the key relationships and train a Random Forest regression model to predict sales.
By the end of this project, we will have a deeper understanding of the car market dynamics and the factors driving sales in the automotive industry.
Dataset Link: Car Sales Data
We will begin by loading the necessary packages and importing the dataset into R. We installed and loaded the required libraries (tidyverse, ggplot2, plotly and lubridate) to handle data manipulation, visualization and time-related operations, and then loaded the car sales dataset using read.csv().
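The loading step could look like the sketch below. The CSV written to a temporary file is a made-up stand-in so the example runs on its own; its rows are illustrative values, not records from the real dataset, and `csv_path` is a hypothetical name. Packages that are not installed are skipped rather than causing an error.

```r
# Load the packages named above, skipping any that are not installed.
pkgs <- c("tidyverse", "ggplot2", "plotly", "lubridate")
for (pkg in pkgs) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
}

# Hypothetical stand-in file (made-up values) so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Sales_in_thousands,Price_in_thousands,Engine_size,Horsepower,Fuel_efficiency",
  "16.9,21.5,1.8,140,28",
  "39.4,28.4,3.2,225,25",
  "14.1,,3.2,225,26"
), csv_path)

# With the real download, point read.csv() at that file instead of csv_path.
car_data <- read.csv(csv_path, stringsAsFactors = FALSE)
head(car_data)
```

An empty field in a numeric column is read as `NA`, which is how the missing values discussed later show up.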
Output:
Next, we will inspect the structure of the dataset to verify the data types of each column. We used str() to check the structure of the dataset and ensure that all columns have appropriate data types.
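A minimal sketch of this check, using a small made-up data frame in place of the real dataset (the column names come from the article, the values do not):

```r
# Toy stand-in for the car sales data; values are illustrative only.
car_data <- data.frame(
  Price_in_thousands = c(21.5, 28.4, NA, 42.0),
  Engine_size        = c(1.8, 3.2, 3.2, 3.5),
  Horsepower         = c(140, 225, 225, 210),
  Fuel_efficiency    = c(28, 25, 26, 22),
  Sales_in_thousands = c(16.9, 39.4, 14.1, 8.6)
)

# str() prints the number of observations, the number of variables,
# and the type and first values of each column.
str(car_data)

# The same type information as a named character vector.
col_types <- sapply(car_data, class)
print(col_types)
```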
Output:
The dataset consists of 157 observations and 5 variables and the columns have appropriate data types, though some columns contain missing values.
We will generate summary statistics to understand the distribution of each variable. We used summary() to check basic statistics such as the minimum, maximum, mean and quartiles of each column, which shows the range and spread of the data.
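As a sketch, summary() can be applied to the whole data frame or to a single column. The data frame below is a made-up stand-in, not the real dataset:

```r
# Toy stand-in for the car sales data; values are illustrative only.
car_data <- data.frame(
  Price_in_thousands = c(21.5, 28.4, NA, 42.0),
  Horsepower         = c(140, 225, 225, 210),
  Sales_in_thousands = c(16.9, 39.4, 14.1, 8.6)
)

# Minimum, quartiles, mean and maximum for every column;
# columns with missing values also get an "NA's" count.
summary(car_data)

# Summary of one column, kept as an object for later use.
price_summary <- summary(car_data$Price_in_thousands)
print(price_summary)
```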
Output:
The summary statistics revealed key insights, such as the wide range of car prices (Price_in_thousands) and the presence of missing values in several columns. We also observed that the sales data (Sales_in_thousands) ranges from 0.11 to 540.56.
We will check for any missing values and duplicate records in the dataset. We used colSums(is.na()) to count the missing values in each column and duplicated() to identify any duplicate rows.
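Both checks use base R only. A small sketch on a made-up data frame (illustrative values, not the real dataset):

```r
# Toy stand-in for the car sales data; values are illustrative only.
car_data <- data.frame(
  Price_in_thousands = c(21.5, 28.4, NA, 42.0),
  Horsepower         = c(140, 225, 225, 210),
  Sales_in_thousands = c(16.9, 39.4, 14.1, 8.6)
)

# Number of missing values per column.
missing_counts <- colSums(is.na(car_data))
print(missing_counts)

# Number of rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(car_data))
print(n_duplicates)
```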
Output:
There were missing values in several columns, but no duplicate records were found. We will handle these missing values in the subsequent step.
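The article does not say which method was used to handle the missing values, so the sketch below only illustrates two common options: median imputation of numeric columns and dropping incomplete rows. Both the choice of method and the toy values are assumptions.

```r
# Toy stand-in for the car sales data; values are illustrative only.
car_data <- data.frame(
  Price_in_thousands = c(21.5, 28.4, NA, 42.0),
  Horsepower         = c(140, 225, 225, 210),
  Sales_in_thousands = c(16.9, 39.4, 14.1, 8.6)
)

# Option 1 (assumed, not confirmed by the article):
# replace each numeric NA with the median of its column.
num_cols <- sapply(car_data, is.numeric)
car_clean <- car_data
car_clean[num_cols] <- lapply(car_clean[num_cols], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

# Option 2: drop every row that contains at least one NA.
car_complete <- na.omit(car_data)

print(car_clean)
print(nrow(car_complete))
```

Imputation keeps all rows at the cost of inventing values, while dropping rows keeps only observed values at the cost of a smaller sample.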
We will create a scatter plot to visualize the relationship between Horsepower and Price_in_thousands, color-coded by Fuel_efficiency. The plot shows how Horsepower correlates with Price_in_thousands, and the color coding highlights differences in Fuel_efficiency.
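A possible ggplot2 version of this plot is sketched below. The data frame holds made-up values, and the blue-to-red gradient is an assumption chosen so that higher Fuel_efficiency appears red.

```r
library(ggplot2)

# Toy stand-in for the car sales data; values are illustrative only.
car_data <- data.frame(
  Horsepower         = c(110, 140, 175, 210, 225, 300),
  Price_in_thousands = c(14.5, 21.5, 24.0, 33.0, 28.4, 62.0),
  Fuel_efficiency    = c(33, 28, 27, 23, 25, 19)
)

# Horsepower on x, price on y, points colored by fuel efficiency.
p <- ggplot(car_data, aes(x = Horsepower, y = Price_in_thousands,
                          color = Fuel_efficiency)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_gradient(low = "blue", high = "red") +  # assumed palette
  labs(title = "Horsepower vs. Price",
       x = "Horsepower", y = "Price (thousands)",
       color = "Fuel efficiency") +
  theme_minimal()

print(p)
```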
Output:
The scatter plot shows that higher Horsepower tends to correlate with higher Price_in_thousands, with more fuel-efficient cars appearing in the red spectrum. This suggests that more powerful and fuel-efficient cars are generally priced higher.
We will visualize the distribution of Price_in_thousands using a histogram with a density curve overlaid on it.
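One way to draw a histogram with a density overlay in ggplot2 is to put the histogram on the density scale so both layers share a y-axis. The prices below are simulated stand-ins, and the bin count and colors are arbitrary choices.

```r
library(ggplot2)

# Simulated stand-in prices (not the real dataset).
set.seed(1)
car_data <- data.frame(
  Price_in_thousands = round(rlnorm(100, meanlog = log(25), sdlog = 0.4), 1)
)

# after_stat(density) rescales bar heights so the density curve lines up.
p <- ggplot(car_data, aes(x = Price_in_thousands)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "steelblue", color = "white") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of car prices",
       x = "Price (thousands)", y = "Density") +
  theme_minimal()

print(p)
```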
Output:
The histogram and density plot reveal that Price_in_thousands is somewhat normally distributed, with a peak around 20-30 thousand, though there are a few high-price outliers.
We will create a horizontal boxplot of Price_in_thousands to identify potential outliers and understand the distribution of car prices.
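A sketch of the boxplot, together with base R's boxplot.stats() to list the points that fall outside the whiskers (1.5 times the interquartile range beyond the hinges by default). The prices are made-up values with one deliberately extreme entry.

```r
library(ggplot2)

# Toy stand-in prices; the last value is an intentional outlier.
car_data <- data.frame(
  Price_in_thousands = c(18, 20, 22, 24, 25, 26, 28, 30, 32, 85)
)

# Mapping the variable to x makes the boxplot horizontal.
p <- ggplot(car_data, aes(x = Price_in_thousands)) +
  geom_boxplot(fill = "lightblue", outlier.colour = "red") +
  labs(title = "Boxplot of car prices", x = "Price (thousands)") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

print(p)

# Values flagged as outliers by the default 1.5 * IQR rule.
outliers <- boxplot.stats(car_data$Price_in_thousands)$out
print(outliers)
```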
Output:
The boxplot identifies several high-price outliers, which likely represent luxury or specialty cars.
We will create a time series plot of Sales_in_thousands to analyze sales trends over time, overlaying a Simple Moving Average (SMA) to smooth the fluctuations and make the trend easier to see.
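The article does not show which date column or SMA function was used, so the sketch below makes two assumptions: a hypothetical `launch_date` column supplies the time axis, and the moving average is computed with base R's stats::filter() as a trailing mean over a 3-point window. The sales values are made up.

```r
library(ggplot2)

# Toy stand-in series; launch_date is a hypothetical column name.
sales_ts <- data.frame(
  launch_date = seq(as.Date("2020-01-01"), by = "month", length.out = 12),
  Sales_in_thousands = c(12, 18, 15, 22, 30, 26, 35, 28, 40, 38, 45, 50)
)

# Trailing simple moving average: mean of the current and previous 2 points.
# The first (window - 1) entries are NA because the window is incomplete.
window <- 3
sales_ts$SMA <- as.numeric(
  stats::filter(sales_ts$Sales_in_thousands, rep(1 / window, window), sides = 1)
)

p <- ggplot(sales_ts, aes(x = launch_date)) +
  geom_line(aes(y = Sales_in_thousands), color = "grey40") +
  geom_line(aes(y = SMA), color = "red", linewidth = 1, na.rm = TRUE) +
  labs(title = "Sales over time with moving average",
       x = "Date", y = "Sales (thousands)") +
  theme_minimal()

print(p)
```

Calling stats::filter() with its namespace avoids a clash with dplyr::filter() when the tidyverse is loaded.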
Output:
The time series plot shows fluctuations in sales, with a clear trend visible when the moving average (red line) is applied.
We will train a Random Forest regression model to predict Sales_in_thousands based on other features like Horsepower, Engine_size and Fuel_efficiency.
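A training step along these lines could be written with the randomForest package, which is an assumption since the article does not name the package. The data is simulated, the 80/20 split and `ntree = 200` are arbitrary choices, and the "Mapping Accuracy" metric is not reproduced because the article does not define it.

```r
library(randomForest)

# Simulated stand-in data (not the real dataset).
set.seed(42)
n <- 150
car_data <- data.frame(
  Engine_size        = runif(n, 1, 6),
  Horsepower         = runif(n, 60, 400),
  Fuel_efficiency    = runif(n, 15, 45),
  Price_in_thousands = runif(n, 10, 80)
)
car_data$Sales_in_thousands <-
  pmax(0.1, 80 - 0.8 * car_data$Price_in_thousands + rnorm(n, sd = 10))

# Hold out 20% of the rows for evaluation.
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- car_data[train_idx, ]
test_data  <- car_data[-train_idx, ]

# Regression forest: Sales_in_thousands predicted from all other columns.
rf_model <- randomForest(Sales_in_thousands ~ ., data = train_data, ntree = 200)

# RMSE on the held-out rows.
pred <- predict(rf_model, newdata = test_data)
rmse <- sqrt(mean((test_data$Sales_in_thousands - pred)^2))
cat("Root Mean Squared Error (RMSE):", round(rmse, 4), "\n")
```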
Output:
Root Mean Squared Error (RMSE): 99.3076
Mapping Accuracy: 0.0542
The model reports an RMSE of 99.3076 and a Mapping Accuracy of 0.0542. Since Sales_in_thousands ranges from 0.11 to 540.56, an error of this size is substantial, so the model's predictions are best read as rough estimates rather than precise forecasts.
We will use the trained model to predict future car sales for a new observation. We created a new observation and used the trained model to predict the future sales for this hypothetical car.
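Predicting for a new observation amounts to building a one-row data frame with the same predictor columns and passing it to predict(). The sketch below retrains a small forest on simulated data so it runs on its own; the randomForest package, the training values and the feature values of `new_car` are all assumptions for illustration.

```r
library(randomForest)

# Simulated stand-in training data (not the real dataset).
set.seed(42)
n <- 100
train_data <- data.frame(
  Horsepower      = runif(n, 60, 400),
  Engine_size     = runif(n, 1, 6),
  Fuel_efficiency = runif(n, 15, 45)
)
train_data$Sales_in_thousands <-
  pmax(0.1, 20 + 1.5 * train_data$Fuel_efficiency + rnorm(n, sd = 10))

rf_model <- randomForest(
  Sales_in_thousands ~ Horsepower + Engine_size + Fuel_efficiency,
  data = train_data, ntree = 200
)

# Hypothetical new car; column names must match the training predictors.
new_car <- data.frame(Horsepower = 200, Engine_size = 3.0, Fuel_efficiency = 27)

# predict() returns a named vector; the name "1" is the row name of new_car.
future_sales <- predict(rf_model, newdata = new_car)
print(future_sales)
```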
Output:
1
22.58741
The trained model predicts the future sales for a hypothetical car, providing an estimate based on the car's features.
From our analysis, we conclude that car prices are associated with horsepower and fuel efficiency, while the sales data shows trends that can inform forecasting. Our regression model provides an initial estimate of future sales, though its high RMSE means the predictions should be interpreted with caution.