Logistic Regression is a widely used supervised machine learning algorithm for classification tasks. It models the relationship between input features and a categorical outcome by estimating class probabilities, which makes it simple, efficient and easy to interpret.
We will build a classifier in Python that predicts whether a tumour is malignant or benign from medical measurements.
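This step loads the Breast Cancer dataset from scikit-learn. The dataset is provided as a structured object that bundles everything needed for a classification task. It contains the feature matrix (data), the class labels (target), the feature names (feature_names), the class names (target_names) and a text description of the dataset (DESCR).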
This structure makes the dataset easy to explore, preprocess and use directly for model training.
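A minimal sketch of the loading step, assuming scikit-learn is installed. The variable name `data` is our own choice; `load_breast_cancer` and the attributes printed below are part of scikit-learn's dataset API.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as a Bunch object (a dict-like container).
data = load_breast_cancer()

print(data.data.shape)         # feature matrix: (n_samples, n_features)
print(data.target.shape)       # one label per sample
print(data.target_names)       # class names, indexed by label value
print(data.feature_names[:5])  # first few feature names
```

In this dataset, label 0 corresponds to malignant and label 1 to benign.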
The raw dataset is converted into pandas structures for easier handling and analysis. A DataFrame improves readability, supports exploratory data analysis and matches the real-world data workflows commonly used in production.
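One way to do the conversion is sketched below. The names `X` and `y` and the Series name `"target"` are illustrative choices, not taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Features as a DataFrame with named columns, labels as a Series.
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print(X.head())          # first five rows of the feature table
print(y.value_counts())  # number of samples per class
```

With the features in a DataFrame, methods such as `X.describe()` or `X.isna().sum()` can be used for a quick look at the data before modelling.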
This step splits the data into training and testing sets. Here we use 25% of the data for testing and the remaining 75% for training.
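A sketch of the split using scikit-learn's `train_test_split`. The `random_state=42` argument is an assumption added so the split is reproducible; the text above does not specify a seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as a DataFrame and a Series.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 25% of the rows for testing; the rest is used for training.
# random_state is an assumed value, used only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```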
This step standardizes feature values so they are on a similar scale (zero mean and unit variance). Logistic Regression is fitted with gradient-based optimization, which is sensitive to feature magnitudes, so scaling helps the solver converge reliably. The scaler is fit only on the training data and then applied to the test data to avoid data leakage.
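The fit-on-train, transform-on-test pattern can be sketched with `StandardScaler`. The variable names and the `random_state` value are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test set; do not refit here.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(2)[:3])  # close to 0
print(X_train_scaled.std(axis=0).round(2)[:3])   # close to 1
```

The test set will not have exactly zero mean and unit variance after scaling, which is expected because its statistics were never used.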
This step trains the Logistic Regression model on the scaled training data.
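A minimal training sketch follows. The `max_iter=1000` setting is an assumption that raises the iteration limit above scikit-learn's default of 100 as a safety margin; the original parameter choices are not known.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# max_iter is an assumed value, not taken from the original notebook.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

print(model.coef_.shape)  # one learned weight per feature
print(model.intercept_)   # learned bias term
```

Because the features are standardized, the magnitudes in `model.coef_` are roughly comparable across features, which helps when interpreting the model.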
At this stage, the trained model is used to make predictions on unseen test data. It computes probabilities for each class and applies a threshold (default = 0.5) to convert them into class labels.
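The prediction step might look like the sketch below, which shows both the class probabilities and the resulting labels. Variable names, `random_state` and `max_iter` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# One row per test sample; columns follow the order of model.classes_.
y_prob = model.predict_proba(X_test_scaled)
# Predicted labels; in the binary case this matches P(class 1) > 0.5.
y_pred = model.predict(X_test_scaled)

print(y_prob[:5].round(3))
print(y_pred[:5])
```

To use a threshold other than 0.5, compare `y_prob[:, 1]` against the chosen value instead of calling `predict`.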
At this stage, predictions are available. We now evaluate how well the model performs using multiple metrics, since each metric highlights a different aspect of performance.
Accuracy measures how often the model makes correct predictions overall.
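A sketch of the accuracy computation with `accuracy_score`. The seed and `max_iter` are assumed values, so the exact figure depends on how the data is split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Fraction of test samples whose predicted label equals the true label.
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```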
Output:
accuracy: 0.98
A confusion matrix shows where the model is right and where it makes mistakes by comparing predicted labels with actual labels.
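The matrix can be computed with `confusion_matrix`, as in this sketch. Rows are actual labels and columns are predicted labels, both ordered 0 (malignant) then 1 (benign). The seed and `max_iter` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

cm = confusion_matrix(y_test, y_pred)
# With label 1 treated as the positive class, the cells unpack as follows.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```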
Precision is the fraction of predicted positives that are actually positive.
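A sketch using `precision_score`. By default scikit-learn treats label 1 as the positive class, which is benign in this dataset. The seed and `max_iter` are assumed values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Precision = TP / (TP + FP), with label 1 as the positive class.
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")
```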
Output:
Precision: 0.99
Recall is the fraction of actual positives that the model correctly identifies.
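A sketch using `recall_score`, again with label 1 (benign) as the default positive class. Passing `pos_label=0` would report recall for the malignant class instead. The seed and `max_iter` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Recall = TP / (TP + FN), with label 1 as the positive class.
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")
```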
Output:
Recall: 0.98
The F1 score combines precision and recall into a single metric, their harmonic mean.
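A sketch using `f1_score`, which also checks the result against the harmonic mean of precision and recall. The seed and `max_iter` are assumed values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

f1 = f1_score(y_test, y_pred)

# F1 = 2 * P * R / (P + R): the harmonic mean of precision and recall.
p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
f1_manual = 2 * p * r / (p + r)

print(f"f1 score: {f1:.2f}")
```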
Output:
f1 score: 0.98
ROC-AUC measures how well the model separates the classes across all classification thresholds.
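A sketch using `roc_auc_score`. Unlike the metrics above, it takes the predicted probability of the positive class rather than hard labels. The seed and `max_iter` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed seed
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Probability of class 1 for each test sample (second column).
y_prob = model.predict_proba(scaler.transform(X_test))[:, 1]

# 1.0 means perfect separation; 0.5 is no better than random ranking.
roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {roc_auc:.2f}")
```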