Why write a Solution Description for a machine-learning problem
Done developing a machine learning model? Well, you are not done yet
You have finished solving a machine learning problem. The accuracy of your model is awesome. You are done. Wait! Not so fast!
Until now, your work is probably a Jupyter notebook that is full of code, has a few visuals, and has very little documentation. If you look at your work after a month or so, you might struggle to understand your own creation. To make matters worse, the Jupyter notebook does not record all the decisions and assumptions you made in the solution.
It is a good practice to make a solution description document of the amazing work you have done. Such a document can have the following benefits:
- When all is said and done, a well-written document is much easier to understand than a Jupyter notebook.
- You can document why you took a particular approach. It helps to put your solution into perspective.
- It can help the operationalization team understand your solution. Generally, an operationalization or IT team is more comfortable with a solution that it understands well.
Let me illustrate with an example. Let us take the Kaggle House Price Prediction problem (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). Here is a snapshot image of the solution in the Jupyter notebook.
Now let us see how to write a solution description of the work done. A solution description is a way to document how you are solving a given problem. The document can have three main parts:
Business objective: This part explains the problem and why you need machine learning to solve the problem.
Solution Summary: This part has a summary of the solution. It can be a table that lists the main steps. Looking at such a summary table is a great way to quickly understand the solution.
Solution Details: This part has a brief description of the solution, assumptions taken, and visuals that explain the approach.
Let me illustrate it with the example of the house price prediction problem.
Business Objective
The objective is to predict house prices for residential homes in Ames, Iowa. There are 79 variables that describe each house. It would be practically impossible for a person to work out the relationship between so many variables and the price, so we adopt a machine learning approach.
Solution Summary
The summary has a list of solution steps documented in a concise way, as shown below.
Such a table documents the solution clearly and concisely. You can go one step further and make a tab for each problem you solve. This will give you insight into how you generally approach a machine learning problem, help you reuse your approach, and help you improve your own game.
Solution Details
This section describes the solution with explanations and visuals. It contains the important steps as well as the assumptions you have made. In the House Price prediction problem, we summarize the solution under three headings: Data Processing, Feature Engineering, and Machine Learning.
Data Processing
Remove outliers in target variable: In the scatterplot of price against one of the input variables, living area, we observe that some prices are outliers. We can delete these observations.
Remove skew in target variable: The target variable price is skewed, so we can apply a log transform to price to bring it closer to a normal distribution.
Impute Missing values: The house price dataset has many missing values. The features which have a very high number of missing values are Pool, Miscellaneous feature, Alley, Fence, Fireplace, Front area.
Here is the way to impute missing values.
Pool: We assume that a missing value means that there is no pool. This is a good assumption, as most of the houses do not have pools. We replace missing values with NONE.
Miscellaneous, Alley, Fence, Fireplace: We can use a similar method as Pool and we replace missing values with NONE.
Basement Surface Area features: We assume that a missing value means that there is no basement, so we replace it with zero.
Lot Front Area: The lot front area of a house is most likely similar to that of other houses in its neighborhood, so we can fill in missing values with the median lot front area of the neighborhood.
MSZoning (the general zoning classification): The percentage of missing values is 0.13%, which is very small. So we can fill the missing values with the most common value, which is ‘RL’.
Electrical, Kitchen Quality, Exterior, Sale Type: We follow the same approach as for MSZoning, as the percentage of missing values is very small. We replace missing values with the most common value.
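The data processing rules above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's actual code: the column names are the Kaggle dataset's names for the features mentioned, TotalBsmtSF stands in for the basement area features, the helper name clean_houses is hypothetical, and the outlier cutoffs are illustrative assumptions, since the text does not state them.

```python
import numpy as np
import pandas as pd


def clean_houses(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the outlier, skew, and imputation rules to a copy of df."""
    df = df.copy()

    # Outliers: a very large living area paired with a low price.
    # The cutoffs (4000 and 300000) are assumptions for illustration.
    is_outlier = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)
    df = df.loc[~is_outlier].reset_index(drop=True)

    # Skewed target: replace the price with its natural log.
    df["SalePrice"] = np.log(df["SalePrice"])

    # Missing means "the house does not have this feature".
    for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
        df[col] = df[col].fillna("NONE")

    # Missing basement area means "no basement".
    df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(0)

    # Lot front area: median of the houses in the same neighborhood.
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.median())
    )

    # Rarely missing categoricals: most common value in the column.
    for col in ["MSZoning", "Electrical", "KitchenQual", "Exterior1st", "SaleType"]:
        df[col] = df[col].fillna(df[col].mode()[0])

    return df
```

Keeping each rule next to a comment that states its assumption means the code and the solution description can be checked against each other line by line.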
Feature Engineering
Here we apply some ‘common-sense’ and ‘creative’ feature engineering.
Transforming numerical variables that are really categorical: The data has features such as building class (MSSubClass), overall condition (OverallCond), year sold (YrSold), and month sold (MoSold). Though they have numeric values, they are actually categorical, so we can convert them into non-numeric values. We can then one-hot encode them rather than normalize them.
Label encoding features that represent an ordered set: In the dataset, there are features that are related to quality, for example Fireplace Quality, Basement Quality, and Garage Quality. Low values indicate low quality and high values indicate high quality. This means the values represent an ordering, from lowest quality to highest quality. Such features are good candidates for label encoding, which converts them to 1, 2, 3, etc.
Combining features: We can create a new feature called Total Surface area which is Total Basement Area + Total 1st-floor area + Total 2nd-floor area.
Converting highly skewed features to a normal distribution: Similar to removing the skew of the target variable price, as we have seen above, we can also bring highly skewed features closer to a normal distribution. For the price, we applied a log transformation because the price values are high. However, the input features that are skewed are mostly related to areas, such as the Pool area, Basement area, and Lot area. As area values are not numerically very high, the Box-Cox transformation is a good approach.
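The four feature engineering steps above might look like this in code. Treat it as a sketch under assumptions: the helper name engineer_features is hypothetical, the quality codes follow the dataset's data description with "NONE" as the imputed "not present" level, and the skew threshold (0.75) and Box-Cox lambda (0.15) are illustrative values that the text does not specify. It uses scipy's boxcox1p, which applies Box-Cox to 1 + x so that zero areas are handled.

```python
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# Quality levels from worst to best; "NONE" marks an absent feature.
QUALITY_ORDER = ["NONE", "Po", "Fa", "TA", "Gd", "Ex"]


def engineer_features(df: pd.DataFrame, skew_threshold=0.75, lam=0.15) -> pd.DataFrame:
    """Apply the feature engineering steps to a copy of df."""
    df = df.copy()

    # Numeric codes that are really categories become strings,
    # so they can be one-hot encoded later.
    for col in ["MSSubClass", "OverallCond", "YrSold", "MoSold"]:
        df[col] = df[col].astype(str)

    # Ordered quality labels become integers that preserve the order.
    rank = {level: i for i, level in enumerate(QUALITY_ORDER)}
    for col in ["FireplaceQu", "BsmtQual", "GarageQual"]:
        df[col] = df[col].map(rank)

    # Combined feature: total surface area (computed before any transform).
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

    # Box-Cox (on 1 + x) for area features whose skew exceeds the threshold.
    for col in ["PoolArea", "TotalBsmtSF", "LotArea"]:
        values = df[col].to_numpy(dtype=float)
        if abs(skew(values)) > skew_threshold:
            df[col] = boxcox1p(values, lam)

    return df
```

Note that the combined area is computed before the Box-Cox step, so it is a sum of raw areas rather than of transformed ones.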
Machine Learning
In this section, we will see how machine learning is used in the solution.
Cross-Validation strategy: Deciding on a cross-validation strategy is one of the first decisions to make when training a machine learning model. In the solution, we select K-Fold (with K=5) with shuffling.
Accuracy Metrics: As this is a continuous-value prediction problem, the accuracy metric is RMSE (root-mean-square error), where lower values are better.
Base-Models: The base models apply different machine learning algorithms: Lasso Regression, Elastic Net regression, Kernel Ridge Regression, Gradient Boosting Regression, XGBoost, and LightGBM.
Ensemble Models: This approach combines multiple models, as described here.
Average of all base models: One way is to take an average of all base models to generate the final price prediction as shown in the diagram here.
The RMSE with the ensemble average model is 0.1081, which is better than the base models.
Stacking of all base models: In this approach, the base models make predictions on a held-out fold of the training data. These predictions are then used to train another model, which is used to make the test data predictions.
This method gives an RMSE of 0.07, which is better than the average method.
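To make the machine learning steps concrete, here is a minimal scikit-learn sketch of the cross-validation strategy, the RMSE metric, and the two ensembles. Several things here are assumptions rather than the notebook's code: the data is synthetic, the hyperparameters are placeholders, XGBoost and LightGBM are left out to avoid extra dependencies, and scikit-learn's VotingRegressor and StackingRegressor stand in for the averaging and stacking described above. The helper names cv_rmse and build_models are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import KFold, cross_val_score


def cv_rmse(model, X, y, n_splits=5, seed=42):
    """Mean RMSE over shuffled K-fold cross-validation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=folds
    )
    return float(np.sqrt(-neg_mse).mean())


def build_models():
    """Return the base models, their average, and a stacked ensemble."""
    base = [
        ("lasso", Lasso(alpha=0.01)),
        ("enet", ElasticNet(alpha=0.01, l1_ratio=0.9)),
        ("kridge", KernelRidge(alpha=1.0, kernel="linear")),
        ("gboost", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ]
    # Averaging: the final prediction is the mean of the base predictions.
    averaged = VotingRegressor(estimators=base)
    # Stacking: the final model is trained on cross-validated
    # (out-of-fold) predictions of the base models.
    stacked = StackingRegressor(
        estimators=base, final_estimator=Lasso(alpha=0.01), cv=5
    )
    return dict(base), averaged, stacked


# Synthetic stand-in for the processed house price data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

base, averaged, stacked = build_models()
scores = {name: cv_rmse(model, X, y) for name, model in base.items()}
scores["average"] = cv_rmse(averaged, X, y)
scores["stacked"] = cv_rmse(stacked, X, y)

for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Scoring every model with the same cv_rmse helper keeps the comparison between base models and ensembles on the same folds, which is what makes the numbers in a solution description comparable.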
Conclusion
Writing such a solution document is a great way to capture your intellectual property concisely. It makes it easier for you to explain your solution to others, as well as to understand your own solution later.