![]() |
VOOZH | about |
In this article, we will learn how to predict whether a person has diabetes or not using the Diabetes dataset. This is a classification problem and we'll utilize Logistic Regression in R Programming Language to make our predictions.
In this project, we aim to predict whether a person has diabetes based on their medical information. We will cover the following steps:
The diabetes.csv dataset contains health metrics used to predict diabetes. Key features include:
Dataset Link: Diabetes Dataset
We will begin by loading the necessary libraries and importing the dataset.
Output:
This dataset contains eight features, each of which determines a result of 0 or 1. 0 means the patient does not have diabetes, whereas a score of 1 means they do.
In this step, we preprocess the data by handling missing values, scaling features and splitting the dataset:
Output:
We ensured that the dataset was clean by checking for missing values and scaling the columns. We then split the data into training and testing sets for later evaluation.
Now we will create some visualization for this dataset to get explore our dataset.
We will visualize the correlations between various features using a correlation heatmap. This will help us understand the relationships between different variables in the dataset.
Output:
The heatmap visualizes the correlation matrix and provides insights into the strength and direction of relationships between the features.
Now, we will visualize the distribution of the Outcome variable (whether or not a person has diabetes).
Output:
This bar plot shows the distribution of Outcome classes (0 for no diabetes and 1 for diabetes), helping us understand the balance between classes.
Next, we'll plot histograms for various features, split by the Outcome class, to understand how the distribution of each feature varies between diabetic and non-diabetic individuals.
Output:
We visualized the distribution of Pregnancies with respect to the Outcome to identify patterns between the feature and the target variable.
To understand the distribution of BMI for both diabetic and non-diabetic individuals, we can plot a boxplot.
Output:
This boxplot shows how BMI varies for individuals with and without diabetes. It helps to identify if there's a noticeable difference in BMI between the two classes.
Now, we will build a Logistic Regression model to predict diabetes based on medical features.
Output:
To evaluate the model, we will calculate various performance metrics like accuracy, precision, recall and F1 score using the test data.
Output:
We will now use the trained model to predict the likelihood of diabetes for a new patient based on their medical data. This function allows us to predict diabetes risk for a new patient by entering their medical parameters. The prediction is based on the trained logistic regression model.
Output:
Based on the model's prediction, there is a higher chance of diabetes.
From our analysis, we:
We concluded that the logistic regression model is effective for predicting diabetes. However, further model optimization and data balancing techniques could improve the recall, making the model more robust in identifying diabetic patients.