Scikit-learn is an open-source Python library that simplifies the process of building machine learning models. It offers a clean and consistent interface that helps both beginners and experienced users work efficiently.
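Before we start building models, we need to install Scikit-learn. It depends on two important libraries, NumPy and SciPy, and the minimum supported Python version depends on the release: older releases ran on Python 3.8, while current ones require a newer Python 3, so check the installation guide for the version you plan to use. pip installs any missing dependencies automatically.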
To install Scikit-learn run the following command:
pip install -U scikit-learn
This downloads and installs the latest version of Scikit-learn along with its dependencies. Let's walk through the steps involved in building a model with Scikit-learn.
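A dataset consists of two parts: the features (the input variables, usually stored as a matrix X with one row per sample) and the response (the target the model should predict, usually stored as a vector y).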
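Scikit-learn provides built-in datasets such as Iris, Digits and Wine. (Boston Housing, which older tutorials often mention, was removed from the library in version 1.2.) Here we use the Iris dataset, which has 150 samples, four numeric features and three species labels.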
We can inspect the first few rows to understand the structure of the data.
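As a minimal sketch of the loading and inspection step, the snippet below uses `load_iris` from `sklearn.datasets`. The variable names `iris`, `X` and `y` are illustrative choices, not something the library requires.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset that ships with Scikit-learn (no download needed).
iris = load_iris()

X = iris.data    # feature matrix: one row per flower, one column per feature
y = iris.target  # response vector: integer class label for each flower

print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
print("First 5 rows of X:\n", X[:5])
```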
Sometimes we need to work with our own data, in which case we load an external dataset. The pandas library makes it easy to load and manipulate such data, for example by reading a CSV file into a DataFrame.
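One possible way to load custom data with pandas is sketched below. To keep the example self-contained, a small in-memory string stands in for a CSV file on disk; the column names and values are hypothetical, and in practice you would pass your own file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are made up.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
6.2,2.9,versicolor
7.7,3.0,virginica
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Separate the feature columns from the response column.
X = df.drop(columns="species")
y = df["species"]
```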
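To evaluate a model fairly, we split the data into a training set, used to fit the model, and a test set, used to measure how well it performs on data it has not seen.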
Using train_test_split, we split the Iris dataset so that 60% is for training and 40% for testing (test_size=0.4). random_state=1 ensures reproducibility.
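After splitting, we get four arrays: X_train and y_train (the training features and labels) and X_test and y_test (the test features and labels).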
Checking the shapes of the split data confirms that both sets have the expected proportions, which avoids errors later in training and evaluation. With 150 samples and test_size=0.4, the training set should have 90 rows and the test set 60.
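The split and the shape check described above can be sketched as follows, using the `test_size=0.4` and `random_state=1` values from the text. Loading the data with `return_X_y=True` is just a convenience chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and labels directly as two arrays.
X, y = load_iris(return_X_y=True)

# Hold out 40% of the samples for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

print("X_train shape:", X_train.shape)  # (90, 4)
print("X_test shape:", X_test.shape)    # (60, 4)
print("y_train shape:", y_train.shape)  # (90,)
print("y_test shape:", y_test.shape)    # (60,)
```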
Machine learning algorithms work with numerical inputs, so categorical (text) data must be converted into numbers. If not encoded properly, models can misinterpret categories. Scikit-learn provides multiple encoding methods:
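1. Label Encoding: It converts each category into a unique integer. Scikit-learn's LabelEncoder assigns the integers in sorted order of the category names, so in a column with the categories 'cat', 'dog' and 'bird', 'bird' becomes 0, 'cat' becomes 1 and 'dog' becomes 2. Because integers imply an order, this kind of encoding suits categories with a meaningful order such as “Low”, “Medium” and “High”. Note, however, that the sorted order (High, Low, Medium) does not match the meaningful one, so for ordered feature columns OrdinalEncoder with an explicit category order is usually the better fit, while LabelEncoder is mainly intended for target labels.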
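A minimal sketch of label encoding with `LabelEncoder` is shown below. The sample list and the variable names are illustrative assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical column.
categorical_feature = ["cat", "dog", "dog", "cat", "bird"]

encoder = LabelEncoder()
# fit_transform learns the categories and maps each value to its integer code.
encoded_feature = encoder.fit_transform(categorical_feature)

print("Encoded feature:", encoded_feature)
```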
Output:
Encoded feature: [1 2 2 1 0]
2. One-Hot Encoding: It creates a separate binary column for each category, which is useful when the categories have no natural ordering. For example, a column with the values cat, dog and bird becomes three new columns (one per category), and each row holds a 1 in the column for its category and 0s elsewhere.
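One-hot encoding can be sketched with `OneHotEncoder` as below. The sample data is illustrative, and calling `toarray()` on the default sparse result is simply one way to get a dense array for printing.

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature.
categorical_feature = [["cat"], ["dog"], ["dog"], ["cat"], ["bird"]]

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() converts it to a dense array.
encoded_feature = encoder.fit_transform(categorical_feature).toarray()

# Columns follow the sorted category order: bird, cat, dog.
print("Categories:", encoder.categories_[0])
print("One-hot encoded feature:\n", encoded_feature)
```

Each row contains a single 1, in the column of its category.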
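Besides Label Encoding and One-Hot Encoding, there are other techniques such as Mean Encoding (also known as target encoding), which replaces each category with the average target value observed for it.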
Now that our data is ready, it’s time to train a machine learning model. Scikit-learn has many algorithms with a consistent interface for training, prediction and evaluation. Here we’ll use Logistic Regression as an example.
Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.
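A minimal sketch of the training step is shown below. It repeats the Iris split so that it runs on its own. The name `log_reg` and the `max_iter=200` setting are assumptions made for this example, the latter to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# max_iter=200 is an assumed setting, not a required one.
log_reg = LogisticRegression(max_iter=200)

# fit() learns the model parameters from the training data only.
log_reg.fit(X_train, y_train)
```

Other Scikit-learn estimators follow the same pattern: create the object, then call `fit` with the training features and labels.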
Once the model is trained, we use it to make predictions on the test data X_test by calling the predict method, which returns the predicted labels y_pred.
We then check how well the model performs by comparing y_test with y_pred. Here we use the accuracy_score function from the sklearn.metrics module, which returns the fraction of test samples that were predicted correctly.
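The prediction and evaluation steps can be sketched together as follows. The setup repeats the earlier split and training so the snippet runs on its own, and `max_iter=200` is again an assumed setting. The exact accuracy may vary slightly between library versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

log_reg = LogisticRegression(max_iter=200)  # max_iter is an assumed setting
log_reg.fit(X_train, y_train)

# Predict a label for every sample in the held-out test set.
y_pred = log_reg.predict(X_test)

# Fraction of test samples whose predicted label matches the true label.
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy:", accuracy)
```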
Output:
Logistic Regression model accuracy: 0.9666666666666667
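Finally, we want the model to make predictions on new sample data. New samples are passed to predict in the same way as any feature matrix: as a 2D structure with one row per sample and one column per feature. Here we use sample = [[3, 5, 4, 2], [2, 3, 5, 4]] and map the predicted integer labels back to species names.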
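A sketch of predicting on the new samples is given below, again repeating the assumed setup so it runs on its own. The integer predictions are mapped to names through `iris.target_names`. Because those names are NumPy strings, recent NumPy versions print them wrapped in `np.str_(...)`; converting each one with `str()` would give plain Python strings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)

log_reg = LogisticRegression(max_iter=200)  # max_iter is an assumed setting
log_reg.fit(X_train, y_train)

# Two new flowers, each with the same four features used in training.
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)

# Map each predicted integer label to its species name.
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
```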
Output:
Predictions: [np.str_('virginica'), np.str_('virginica')]
Scikit-learn is used because it makes building machine learning models straightforward and efficient. Here are some important reasons: