Why write a Solution Description for a machine-learning problem

Done developing your machine learning model? Well, you are not done yet

Feb 4, 2022

You have finished solving a machine learning problem. The accuracy of your model is awesome. You are done. Wait! Not so fast!

Until now, your work is probably a Jupyter notebook that is full of code, has a few visuals, and contains very little documentation. If you revisit your work after a month or so, you might struggle to understand your own creation. To make matters worse, the Jupyter notebook does not record all the decisions and assumptions you made in the solution.

It is a good practice to make a solution description document of the amazing work you have done. Such a document can have the following benefits:

A nicely written document is much easier to understand than a Jupyter notebook.
You can document why you took a particular approach. This helps put your solution into perspective.
It can help the operationalization team understand your solution. Generally, an operationalization or IT team feels more comfortable with a solution that it understands well.

Let me illustrate with an example. Let us take the Kaggle House Price Prediction problem (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). Here is a snapshot image of the solution in the Jupyter notebook.

snapshot image of the solution in the Jupyter notebook (image by author)

Now let us see how to write a solution description of the work done. A solution description is a way to document how you are solving a given problem. The document can have three main parts:

Business objective: This part explains the problem and why you need machine learning to solve the problem.

Solution Summary: This part has a summary of the solution. It can be a table that lists the main steps. Looking at such a summary table can be a great way to quickly understand the solution.

Solution Details: This part has a brief description of the solution, assumptions taken, and visuals that explain the approach.

Let me illustrate it with the example of the house price prediction problem.

Business Objective

The objective is to predict house prices for residential homes in Ames, Iowa. There are 79 variables related to each house. It would be practically impossible for a person to work out the relationship between such a large number of variables and the price, so we adopt a machine learning approach.

Solution Summary

The summary has a list of solution steps documented in a concise way, as shown below.

Solution Summary (image by author)

Such a table helps to document the solution clearly and concisely. You can go one step further and make a tab for each problem you solve. This will also give you insight into how you generally approach a machine learning problem, let you re-use your approach, and help you improve your own game.

Solution Details

This section describes the solution with explanations and visuals. It contains the important steps as well as the assumptions you have made. In the house price prediction problem, we summarize the solution under three headings: Data Processing, Feature Engineering, and Machine Learning.

Data Processing

Remove outliers in target variable: Looking at the scatterplot between price and one of the input variables, living area, we observe that some prices are outliers. We can remove them by deleting the corresponding rows.

House price – outlier removal (image by author)
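The outlier removal above can be sketched in pandas as follows. This is only an illustration: the column names `GrLivArea` and `SalePrice` follow the Kaggle dataset, while the toy rows and the two cut-off values are assumptions that you would normally read off your own scatterplot.

```python
import pandas as pd

# Toy rows standing in for the training data (values are made up).
train = pd.DataFrame({
    "GrLivArea": [1500, 1700, 2100, 4700, 5600],
    "SalePrice": [180000, 200000, 260000, 185000, 160000],
})

# Hypothetical cut-offs: very large houses that sold at unusually low prices.
AREA_LIMIT = 4000
PRICE_LIMIT = 300000

# Flag the outlier rows, then keep everything else.
is_outlier = (train["GrLivArea"] > AREA_LIMIT) & (train["SalePrice"] < PRICE_LIMIT)
train_clean = train.loc[~is_outlier].reset_index(drop=True)

print(f"Removed {int(is_outlier.sum())} rows, {len(train_clean)} remain")
```

Writing the cut-offs as named constants makes the assumption easy to spot and to copy into the solution description.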

Remove skew in target variable: The target variable price is skewed, so we can apply a log transform to price to bring it closer to a normal distribution.

Sales price – removing the skew (image by author)
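Here is a minimal sketch of the log transform with NumPy. The prices are made-up values, and the choice of `log1p` (log of 1 + x) rather than a plain log is an assumption; the small `skewness` helper is written here only to show that the skew goes down.

```python
import numpy as np

# Made-up prices with one expensive house pulling the right tail.
prices = np.array([100000.0, 150000.0, 200000.0, 250000.0, 755000.0])

def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# log1p computes log(1 + x); expm1 is its inverse.
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

print(f"skew before: {skewness(prices):.2f}, after: {skewness(log_prices):.2f}")
```

Remember that a model trained on the transformed price predicts on the log scale, so the predictions need the inverse transform (`np.expm1` here) before they are reported as prices.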

Impute Missing values: The house price dataset has many missing values. The features with a very high number of missing values are Pool, Miscellaneous feature, Alley, Fence, Fireplace, and Lot Front Area.

Features with a missing value (image by author)

Here is the way to impute missing values.

Pool: We assume that a missing value means that there is no pool. This is a good assumption, as most houses do not have pools. We replace missing values with NONE.

Miscellaneous, Alley, Fence, Fireplace: We use the same method as for Pool and replace missing values with NONE.

Basement Surface Area features: We assume that a missing value means that there is no basement, so we replace it with zero.

Lot Front Area: The length of street connected to a property is most likely similar to that of other houses in its neighborhood, so we can fill in missing values with the median lot front area of the neighborhood.

MSZoning (the general zoning classification): The percentage of missing values is 0.13%, which is very small. So we can fill the missing values with the most common value, which is ‘RL’.

Electrical, Kitchen Quality, Exterior, Sale Type: We follow the same approach as for MSZoning, as the percentage of missing values is very small. We replace missing values with the most common value.
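The imputation rules above could look like this in pandas. It is a sketch on a tiny made-up table: the column names follow the Kaggle dataset, only one column per rule is shown, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with one column per imputation rule.
houses = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "PoolQC": [np.nan, "Gd", np.nan, np.nan, np.nan],
    "TotalBsmtSF": [850.0, np.nan, 1100.0, 600.0, np.nan],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, np.nan],
    "MSZoning": ["RL", "RL", np.nan, "RM", "RL"],
})

# Rule 1: missing means "the house does not have this feature".
houses["PoolQC"] = houses["PoolQC"].fillna("NONE")

# Rule 2: missing basement area means no basement, so the area is zero.
houses["TotalBsmtSF"] = houses["TotalBsmtSF"].fillna(0)

# Rule 3: fill with the median of the same neighborhood.
houses["LotFrontage"] = houses.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Rule 4: very few missing values, so fill with the most common value.
houses["MSZoning"] = houses["MSZoning"].fillna(houses["MSZoning"].mode()[0])

print(houses)
```

Each rule is one line, which makes it easy to list the rules next to their assumptions in the solution description.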

Feature Engineering

Here we apply some ‘common-sense’ and ‘creative’ feature engineering.

Transforming some numerical variables that are really categorical: The data has features such as building class (MSSubClass), overall condition (OverallCond), year sold (YrSold), and month sold (MoSold). Though they have numeric values, they are actually categorical. So we can convert them into non-numeric values. We will then be able to one-hot encode them rather than normalize them.
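As a small sketch of this step, we can cast the columns to strings and then one-hot encode them with `pd.get_dummies`. The three rows are made up, and `LotArea` is included only to show that a genuinely numeric column is left alone.

```python
import pandas as pd

# Made-up rows: MSSubClass and MoSold look numeric but are really categories.
df = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "MoSold": [2, 5, 2],
    "LotArea": [8450, 9600, 11250],
})

categorical_cols = ["MSSubClass", "MoSold"]

# Cast to string so that no model treats the codes as quantities.
df = df.astype({col: str for col in categorical_cols})

# One-hot encode only the categorical columns; LotArea stays numeric.
encoded = pd.get_dummies(df, columns=categorical_cols)

print(encoded.columns.tolist())
```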

Label Encoding features that can represent an ordered set: In the dataset, there are features that are related to quality, for example Fireplace Quality, Basement Quality, and Garage Quality. Low values indicate low quality and high values indicate high quality. This means the values represent an ordering, from lowest quality to highest quality. Such features are good candidates for label encoding, which converts them to 1, 2, 3, etc.
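One way to sketch this encoding is an explicit mapping from quality label to rank. The codes Po, Fa, TA, Gd, Ex come from the Kaggle data description; using NONE for "no fireplace" and giving it rank 0 is an assumption made for this example.

```python
import pandas as pd

# Quality labels from worst to best; "NONE" (feature absent) is an assumed extra level.
quality_order = ["NONE", "Po", "Fa", "TA", "Gd", "Ex"]
quality_to_rank = {label: rank for rank, label in enumerate(quality_order)}

df = pd.DataFrame({"FireplaceQu": ["Gd", "NONE", "TA", "Ex"]})

# Replace each label with its position in the ordering.
df["FireplaceQu_encoded"] = df["FireplaceQu"].map(quality_to_rank)

print(df)
```

An explicit mapping is used here because scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which would not follow the quality ordering.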

Combining features: We can create a new feature called Total Surface area which is Total Basement Area + Total 1st-floor area + Total 2nd-floor area.
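In pandas, the combined feature is a single line. The column names follow the Kaggle dataset, the new name `TotalSF` is just a label chosen for this sketch, and the two rows are made up.

```python
import pandas as pd

# Made-up rows; the second house has no basement and no second floor.
df = pd.DataFrame({
    "TotalBsmtSF": [856, 0],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
})

# Total surface area = basement + first floor + second floor.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

print(df["TotalSF"].tolist())
```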

Converting highly skewed features to a normal distribution: As with the target variable price above, we can transform highly skewed input features so that they are closer to a normal distribution. For price, we applied a log transformation because the price values are large. However, the skewed input features are mostly related to areas, such as the pool area, basement area, and lot area. As area values are not numerically very high, a Box-Cox transformation is a good approach.
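A possible sketch of this transformation uses `boxcox1p` from SciPy, which applies Box-Cox to 1 + x and therefore also works when an area is zero (for example, a house with no pool). The sample values and the lambda of 0.15 are assumptions for illustration only, not tuned values.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Made-up area values with a long right tail.
lot_area = np.array([1000.0, 2000.0, 3000.0, 5000.0, 8000.0, 13000.0, 40000.0, 120000.0])

# Assumed Box-Cox parameter; in practice it would be chosen per feature.
LAMBDA = 0.15

# boxcox1p(x, lmbda) computes ((1 + x) ** lmbda - 1) / lmbda for non-zero lmbda.
transformed = boxcox1p(lot_area, LAMBDA)

print(f"skew before: {skew(lot_area):.2f}, after: {skew(transformed):.2f}")
```

A simple way to decide which features to transform is to compute the skew of every numeric column and apply the transformation only above a chosen threshold; that threshold is another assumption worth writing down.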

Machine Learning

In this section, we describe the machine learning part of the solution.

Cross-Validation strategy: Deciding on a cross-validation strategy is one of the first decisions to make when training a machine learning model. In the solution, we select K-Fold (with K=5) with a shuffle.
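With scikit-learn, this strategy can be sketched with `KFold`. The ten-row array and the `random_state` value are assumptions made so that the example is small and repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up rows with two features each.
X = np.arange(20).reshape(10, 2)

# K=5 with shuffling; random_state is fixed only to make the split repeatable.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
validation_rows = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X), start=1):
    fold_sizes.append(len(valid_idx))
    validation_rows.extend(valid_idx.tolist())
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on {len(valid_idx)} rows")
```

Every row lands in exactly one validation fold, so each model is always scored on data it did not see during training.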

Accuracy Metrics: As this is a continuous value prediction problem, the accuracy metric is RMSE (root-mean-square error).
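For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. Below is a minimal NumPy sketch; the three sample values are made up and are assumed to be on a log-price scale.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted values.
score = rmse([12.0, 12.5, 11.8], [12.1, 12.3, 11.8])
print(f"RMSE: {score:.4f}")
```

A lower RMSE means that the predictions are closer to the actual values, and a perfect model would score zero.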

Base-Models: The base model consists of applying different machine learning algorithms: Lasso Regression, Elastic Net regression, Kernel Ridge Regression, Gradient Boosting Regression, XGBoost, and LightGBM.

Base model accuracy (image by author)

Ensemble Models: This approach combines multiple models, as described below.

Average of all base models: One way is to take an average of all base models to generate the final price prediction as shown in the diagram here.

Average of price prediction for all base models

The RMSE with the ensemble average model is 0.1081, which is better than the base models.
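The averaging step can be sketched with NumPy as below. The model names and prediction values are placeholders for three houses, not results from the actual solution.

```python
import numpy as np

# Placeholder predictions from three base models for the same three houses.
base_predictions = {
    "lasso": np.array([12.10, 11.85, 12.40]),
    "elastic_net": np.array([12.06, 11.90, 12.35]),
    "gradient_boosting": np.array([12.14, 11.80, 12.45]),
}

# One column per model, one row per house.
prediction_matrix = np.column_stack(list(base_predictions.values()))

# The ensemble prediction is the row-wise mean across models.
ensemble_prediction = prediction_matrix.mean(axis=1)

print(ensemble_prediction)
```

Averaging tends to help when the base models make different kinds of errors, because those errors partly cancel out.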

Stacking of all base models: In this approach, each base model makes predictions on one held-out fold of the training data. These predictions are then used to train another model, which is used to make the test data predictions.

Stacking of all base models

This method gives an RMSE of 0.07, which is better than the average method.
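Below is a sketch of one common way to implement stacking with scikit-learn, using out-of-fold predictions. It is not necessarily the exact procedure behind the numbers above: the synthetic data, the two base models, and the linear meta-model are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for the processed house features and log prices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 5))

# Two base models for brevity; the meta-model learns how to combine them.
base_models = [Lasso(alpha=0.01), GradientBoostingRegressor(n_estimators=50, random_state=0)]
meta_model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_predictions = np.zeros((X_train.shape[0], len(base_models)))
for col, model in enumerate(base_models):
    for train_idx, holdout_idx in kfold.split(X_train):
        fitted = clone(model).fit(X_train[train_idx], y_train[train_idx])
        oof_predictions[holdout_idx, col] = fitted.predict(X_train[holdout_idx])

# Train the meta-model on the base models' out-of-fold predictions.
meta_model.fit(oof_predictions, y_train)

# For test data: refit each base model on all training rows, then combine.
test_features = np.column_stack(
    [clone(model).fit(X_train, y_train).predict(X_test) for model in base_models]
)
stacked_prediction = meta_model.predict(test_features)

print(stacked_prediction[:3])
```

Using out-of-fold predictions keeps the meta-model from being trained on predictions that the base models made for rows they had already seen.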

Conclusion

Making such a solution document is a great way to capture your intellectual property in a clear and concise way. It makes it easier for you to explain your solution to others, as well as to understand your own solution later.

Additional Resources

Website

You can visit my website to make analytics with zero coding. https://experiencedatascience.com

Youtube channel

Here is a link to my YouTube channel https://www.youtube.com/c/DataScienceDemonstrated

Written By

Pranay Dave

URL: https://towardsdatascience.com/why-write-a-solution-description-for-a-machine-learning-problem-3ce60649d0a5/