A Classification Model for Source Code Languages

With around 97% accuracy

Amal Hasni

Dec 13, 2020

Written by: Amal Hasni & Dhia Hmila

Photo by Sharon McCutcheon on Unsplash

We recently needed to write an extension of Python’s Markdown package. For this purpose, we needed to detect the programming language of each code block to apply specific modifications. Luckily, in addition to being programming enthusiasts, we also happen to be data scientists. So we decided to use Natural Language Processing techniques to build ourselves a classification model and we will explain exactly how we did that!

Before diving into the details of how we built our model, you can try it out on your own code snippets via this demo. Bonus: You also get to see which parts of your code snippets had a decisive effect on the classification.

Now let’s jump into the model-building part!

Building the Dataset

How to find a good Dataset?

Since we’re building a classification model, we need to get ourselves a nice labeled Dataset. Unfortunately, those kinds of Datasets aren’t usually available and ready to use and we need some effort and digging to get them.

What you can do in those circumstances is start with a simple search on websites like Kaggle. If that doesn’t work out for you, you can widen the search to other websites with a Google search.

The steps we used to build our custom Dataset

While searching for a dataset of code snippets, we were lucky enough to find a perfect one on Kaggle. The dataset out of which we created our own is called GitHub Repos and contains code and comments from 2.8 million GitHub repositories.

To create our custom Dataset, we used the BigQuery helper module. Our query needed to meet the following specifications:

Select only the code snippets and the script file names
Select a specific number of rows for each language (to get a balanced Dataset)
Identify a language by the different file extensions it could have

The resulting query looks something like this:

(SELECT sample_path, content
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE (binary = False AND (sample_path LIKE "%.bat" OR sample_path LIKE "%.cmd" OR sample_path LIKE "%.btm"))
LIMIT 5000)

UNION ALL

(SELECT sample_path, content
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE (binary = False AND (sample_path LIKE "%.c"))
LIMIT 5000)

UNION ALL

...

To see the whole code for how we created our Dataset and/or download it, check the following Kaggle Notebook. You just need to click on "copy and edit", then run all cells, and you’ll get your download link at the bottom of the Notebook page.

Data exploration and transformation

Our resulting Dataset contains 131455 code snippets distributed over 34 programming languages:

Plot by Author

The first step is to make sure there are no duplicate rows or NaN values:

data.dropna(inplace=True)
data.drop_duplicates(inplace=True)

The next step is to create a "language" column containing the programming language, derived from the file name extension. To match the extension with the corresponding programming language, we built the following JSON file. The code to perform the matching is the following:
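
A minimal sketch of this matching step, assuming the mapping goes from a language name to its list of extensions. The EXTENSIONS dictionary is a hypothetical three-language excerpt standing in for the JSON file (the language names and the .py entry are assumptions), and language_from_path is an illustrative helper name, not the authors’ code:

```python
import os

# Hypothetical excerpt of a language -> extensions mapping;
# the real JSON file covers all 34 languages.
EXTENSIONS = {
    "Batchfile": [".bat", ".cmd", ".btm"],
    "C": [".c"],
    "Python": [".py"],
}

# Invert the mapping once so each extension points to its language.
EXT_TO_LANGUAGE = {
    ext: language
    for language, extensions in EXTENSIONS.items()
    for ext in extensions
}


def language_from_path(path):
    """Return the language for a file path, or None if the extension is unknown."""
    _, ext = os.path.splitext(path)
    return EXT_TO_LANGUAGE.get(ext.lower())


print(language_from_path("scripts/build.CMD"))  # Batchfile
print(language_from_path("src/main.c"))         # C
```

With a pandas DataFrame, the same helper could be applied to the file name column, for example data["language"] = data["sample_path"].map(language_from_path).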

Now we’re all set to start building our model. We start by splitting our Dataset into test and train sets:
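
A minimal sketch of the split, assuming the 80/20 proportions used for the final evaluation and column names "content" and "language". The toy DataFrame, the stratification, and the seed are assumptions made for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real Dataset: one code column and one label column.
data = pd.DataFrame({
    "content": [f"print({i})" for i in range(10)]
    + [f"int x{i} = {i};" for i in range(10)],
    "language": ["Python"] * 10 + ["C"] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    data["content"],
    data["language"],
    test_size=0.2,              # 80% train / 20% test
    stratify=data["language"],  # assumption: keep language proportions in both sets
    random_state=42,            # arbitrary seed for reproducibility
)

print(len(X_train), len(X_test))  # 16 4
```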

Preprocessing

As you might know, most machine learning classifiers take a feature vector of fixed length as their input. So, if you’re not accustomed to NLP concepts, you may wonder how we’re going to convert unstructured textual data into a meaningful numerical array.

Diagram by Author

Tokenization

The first step toward vectorizing our data is Tokenization! It’s usually the first step of every NLP project and basically consists of cutting each textual "document" (in this case, a code snippet) into character substrings called Tokens.

Image by Author

From these tokens, we can build a set of words (or vocabulary) to be used for the vectorization of the text.

For regular English text, tokenization can be as simple as splitting on spaces and punctuation (even though this has many limitations). If you’re working on such data, packages such as nltk or spacy offer state-of-the-art Tokenizers.

Since we are working on source code instead of plain English text, we’re going to customize scikit-learn’s tokenizer to suit our needs.

For Python code, this is the result we should expect:

Image by Author

For this purpose, we are going to use regular expressions to describe the pattern of a token. You will see that even though regexes might not seem "sexy", they’re actually very powerful and useful for these kinds of tasks and for NLP in general.

If you don’t know how to use regex or you’re in need of a refresher, you can check out this guide in Python’s documentation or this quick cheat sheet.

For this project, we initially had 3 types of tokens in mind (identifiers, operators, and brackets). The pattern we used is the following:

([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])

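It is a single group with three alternatives, one per token type: identifiers (a letter or underscore followed by word characters), runs of operator characters, and single bracket, separator, quote, or whitespace characters.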

You can try it with different code snippet examples on Regex101 which is the perfect place to play around with regex.
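
To try the pattern locally rather than in the browser, here is a short sketch using Python’s re module. The pattern is written with its regex escapes restored (backslashes are assumed wherever the expression needs them to compile), and tokenize is an illustrative helper name:

```python
import re

# Token pattern: identifiers | runs of operator characters | single brackets,
# separators, quotes, spaces and tabs.
TOKEN_PATTERN = r"""([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])"""


def tokenize(code):
    """Return the list of tokens found in a code snippet."""
    return re.findall(TOKEN_PATTERN, code)


print(tokenize("x += foo(y);"))
# ['x', ' ', '+=', ' ', 'foo', '(', 'y', ')', ';']
```

Note that digits are skipped entirely, which matches the choice to ignore numerical values.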

We can see that with this approach, we miss numerical values such as 100% or 12px, which can be useful for identifying languages such as HTML and CSS. If we include a regex to catch such tokens, we’ll end up with a vocabulary full of different numerical values and with no way to tell our model that the role played by "12px" is actually the same as that of "14px".

For this reason, we chose to ignore the numerical features, especially since doing so had no visible effect on the model’s accuracy.

Some cleaning-up

Looking through the different tokens we obtained, we noticed a lot of variable names made of a single character, or of the same character repeated (such as xxx), which do not add much information to the classifier.

For this reason, we chose to treat those as stop words and remove them.

For these kinds of treatments, scikit-learn provides a neat class called FunctionTransformer that we can use as follows:
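
A minimal sketch of such a clean-up step, assuming the stop words are removed with a regular expression before vectorization. The regex and the function name remove_noise_tokens are assumptions for illustration, not the authors’ exact code:

```python
import re

from sklearn.preprocessing import FunctionTransformer

# Identifiers made of one letter, possibly repeated: "x", "xxx", "aa".
REPEATED_LETTER = re.compile(r"\b([A-Za-z])\1*\b")


def remove_noise_tokens(snippets):
    """Blank out single-letter and repeated-letter identifiers in each snippet."""
    return [REPEATED_LETTER.sub("", snippet) for snippet in snippets]


# FunctionTransformer wraps a plain function so it can be used as a Pipeline step.
cleaner = FunctionTransformer(remove_noise_tokens)

print(cleaner.fit_transform(["x = xxx + value"]))
# [' =  + value']
```

Wrapping the function this way lets the clean-up be cross-validated together with the vectorizer and the classifier.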

Vectorization

The usual next step, in a typical NLP project, is lemmatization/stemming. But since we’re dealing with code that is not a "natural" language, altering words’ structures would result in information loss. Since what we really want to do is to be able to spot discriminating keywords for each language, we’re going to directly move to Vectorization.

At this point, we have a vocabulary of size M (with M being the number of unique tokens across all of our code snippets) and our Dataset is composed of lists of tokens instead of raw text. We can, therefore, associate each token with its index i in the vocabulary! Now what we want to do is to convert each code snippet into an array of size M.

Diagram by Author

The vector representation we are going to go with is called TF-IDF (term frequency-inverse document frequency). It associates each document (source code) with an array of size M, where the i-th element of the array corresponds to the scaled frequency of the i-th token in the document.

The formula for computing the TF-IDF factor for a given token i in a document d is the following:

Image by Author

Where TF is the term frequency and IDF the inverse document frequency.

The scale factor IDF (inverse document frequency) tries to scale down the weight associated with tokens that appear frequently across all documents such as the = operator and that, therefore, give less information on the nature of the document.
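
To make the scaling concrete, here is a small worked example in plain Python. It assumes the textbook variant tf-idf(i, d) = tf(i, d) × log(N / df(i)), where N is the number of documents and df(i) the number of documents containing token i. scikit-learn uses a smoothed IDF and normalizes each vector by default, so its values will differ. The three token lists are hypothetical:

```python
import math
from collections import Counter

# Three tiny tokenized "documents" (hypothetical snippets).
docs = [
    ["x", "=", "1"],
    ["def", "f", "=", "def"],
    ["y", "=", "x"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
df = Counter(token for doc in docs for token in set(doc))


def tf_idf(token, doc):
    """Textbook TF-IDF weight of a token in one document."""
    tf = doc.count(token)               # raw term frequency in this document
    idf = math.log(n_docs / df[token])  # rarer tokens get a larger weight
    return tf * idf


print(tf_idf("=", docs[1]))    # 0.0, because "=" appears in every document
print(tf_idf("def", docs[1]))  # 2 * log(3), about 2.197
```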

Let’s move on to coding the Vectorizer:
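
A minimal sketch of the vectorizer, using the two settings discussed in this section (token_pattern and max_features). The pattern is written with its regex escapes restored, and the two example snippets are only there to make the block runnable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

TOKEN_PATTERN = r"""([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])"""

vectorizer = TfidfVectorizer(
    token_pattern=TOKEN_PATTERN,  # custom tokens instead of the default word pattern
    max_features=3000,            # cap on the vocabulary size
)

snippets = ['print("Hello World")', "let x: number = 1;"]
X = vectorizer.fit_transform(snippets)  # sparse matrix, one row per snippet

print(X.shape[0])                   # 2
print(sorted(vectorizer.vocabulary_))  # tokens kept as features
```

When the pattern contains a single capturing group, scikit-learn uses the content of that group as the token.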

You can see that, in addition to token_pattern, we used another parameter called max_features, which limits the vocabulary to the 3000 tokens with the highest term frequency across the corpus. This reduces the number of features, which lowers the computational cost and helps avoid overfitting.

We can see the feature names (or tokens) used by our vectorizer to make sure everything works the way it was intended and to assess the quality of the results:

# List extracted tokens (use get_feature_names() on scikit-learn versions before 1.0)
print(vectorizer.get_feature_names_out())

Choice of Model

Now that our data is in the shape we need, it’s time to choose a model to feed it to. Since no algorithm works well for all data science problems, it’s up to us to explore different options and select the best-performing classifier. We are going to use scikit-learn’s GridSearchCV and Pipeline classes to automate the whole workflow and the model search. If you’re not familiar with these two classes (and several other scikit-learn tools), you’ll find this article extremely useful. Meanwhile, here are, in a nutshell, the definitions provided by the scikit-learn documentation:

Pipeline: Pipeline can be used to chain multiple estimators into one. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

GridSearchCV : GridSearchCV can be used for exhaustive search over specified parameter values for an estimator.

To perform the grid search over the four models we chose to compare, we used 30% of the Dataset to keep the computational cost down. The corresponding code is the following:

You might have noticed that, in the code, a second Pipeline was defined for the GaussianNB algorithm. The reason is that this classifier doesn’t accept sparse matrices as input (it throws the error TypeError: A sparse matrix was passed, but dense data is required). The solution is to add an extra step that densifies the input data, using FunctionTransformer and the SciPy sparse-matrix method toarray() (todense() returns a numpy.matrix, which recent scikit-learn versions reject):

FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)

Now, to evaluate performance and select the winning classifier, use the attributes of the grid_search object:
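
A minimal sketch of such a comparison, limited to two candidates (Random Forest and GaussianNB) on a toy corpus. The candidate list, the tiny parameter grid, and the use of one GridSearchCV per pipeline are assumptions for illustration. The attributes read at the end (best_score_, best_params_) are the standard ones exposed by a fitted GridSearchCV:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy corpus standing in for the 30% sample of the Dataset.
X = ["def f(): return 1", "import os", "print(x)", "for i in range(3): pass",
     "int main() { return 0; }", "#include <stdio.h>", "char *s = 0;", "void f(void);"]
y = ["Python"] * 4 + ["C"] * 4


def to_dense(matrix):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return matrix.toarray()


candidates = {
    "RandomForest": Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ]),
    # GaussianNB needs dense input, hence the extra "densify" step.
    "GaussianNB": Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
        ("clf", GaussianNB()),
    ]),
}

# Illustrative grid; "vectorizer__" routes the parameter to that pipeline step.
param_grid = {"vectorizer__max_features": [1000, 3000]}

results = {}
for name, pipeline in candidates.items():
    grid_search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
    grid_search.fit(X, y)
    # best_score_ is the mean cross-validated score of the best parameter set.
    results[name] = grid_search.best_score_
    print(name, grid_search.best_score_, grid_search.best_params_)

best_name = max(results, key=results.get)
print("Selected classifier:", best_name)
```

The per-candidate details (mean and standard deviation of the scores for each parameter set) are available in grid_search.cv_results_.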

The results of our GridSearch conclude our quest for the perfect model 👑 : We will be using the Random Forest classifier for our prediction part.

Parameters tuning

We want our classification model to have as good a performance as possible. For this reason, we will be performing a grid search to optimize the model’s hyperparameters:
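
A minimal sketch of this tuning step on a toy corpus, reusing the names pipe_RF and grid_search_RF from this section. The hyperparameters and values in param_grid are illustrative assumptions, not the grid the authors searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the training set.
X = ["def f(): return 1", "import os", "print(x)", "for i in range(3): pass",
     "int main() { return 0; }", "#include <stdio.h>", "char *s = 0;", "void f(void);"]
y = ["Python"] * 4 + ["C"] * 4

pipe_RF = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The "clf__" prefix routes each hyperparameter to the classifier step.
param_grid = {
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 5],
}

grid_search_RF = GridSearchCV(pipe_RF, param_grid, cv=2, scoring="accuracy")
grid_search_RF.fit(X, y)

print(grid_search_RF.best_params_)
```

Because the keys of best_params_ already carry the step prefix, the dictionary can be passed straight to the pipeline’s set_params method.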

To display the best hyperparameters chosen after the grid search, you just need to type:

best_params = grid_search_RF.best_params_

And for our model, the result is:

To use the tuned model, we just need to unpack the parameters dictionary before deploying our classifier:

pipe_RF.set_params(**best_params)

Final Results

We finally have our best model and its best hyperparameters. After training it on the full training set (80% of the Dataset), the accuracy it reached on the remaining 20% was:

Accuracy = 96.8 %

These results are more than satisfying and we will be analyzing more metrics in the next section.

Result Analysis

With 96.8% accuracy, our model is functional and ready to go! We could be content with the results and stop here. But why not dig deeper to understand which language-specific syntax features influence the prediction most, and why the model struggles with some languages more than others?

Let’s start by analyzing the confusion matrix:

Confusion Matrix generated by Author

You might have noticed in the matrix above that the model tends to confuse some languages with specific others. A good example of that is the pair JavaScript and TypeScript:

Image by Author

In this zoom over the confusion matrix, we can see for example that the model tends to confuse TypeScript snippets with JavaScript snippets approximately 4% of the time.
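
Percentages like this one can be read from a row-normalized confusion matrix. Here is a short sketch with scikit-learn, assuming row normalization was used for the plot. The five labels are hypothetical and only illustrate the computation:

```python
from sklearn.metrics import confusion_matrix

labels = ["JavaScript", "TypeScript"]

# Hypothetical true and predicted labels for five snippets.
y_true = ["TypeScript", "TypeScript", "TypeScript", "TypeScript", "JavaScript"]
y_pred = ["TypeScript", "TypeScript", "TypeScript", "JavaScript", "JavaScript"]

# normalize="true" divides each row by the number of true samples of that class,
# so each row reads as "share of this language predicted as each label".
matrix = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(matrix)
# Row 0 (JavaScript): [1.0, 0.0]
# Row 1 (TypeScript): [0.25, 0.75], i.e. 25% of TypeScript snippets predicted as JavaScript
```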

These results are not surprising, since TypeScript is a superset of JavaScript and, therefore, JavaScript code is also valid TypeScript. The misclassifications probably come from code snippets with no discriminating syntax.

To test our intuition and further understand why certain confusions happen, we can use the lime package (local interpretable model-agnostic explanations) to explain our predictions. If you’re not familiar with it, LIME is, in a nutshell, a technique for explaining what machine learning classifiers (or models) are doing.

Explaining some predictions

Using Lime for text prediction is very easy. All you need to do is define an Explainer object. If you are using a custom tokenizer (like us), you can pass it as an argument like this:

The next step is to try it on some code by giving the explain_instance method the following arguments:

the code snippet to explain
the predict_proba method
the number of features to show in the explanation

exp = explainer.explain_instance(YOUR_CODE, model.predict_proba, num_features=6)

You can now visualize your explanation in four formats:

as a list: exp.as_list()
as a plot: exp.as_pyplot_figure()
as HTML: exp.as_html()
as a notebook output: exp.show_in_notebook(text=True) (only works if you’re in a Jupyter Notebook)

Now that we’ve got that out of the way, let’s see what it gives us on the following TypeScript code:

Image by Author

We can directly see that, as expected, the tokens ":" and "number" (both part of TypeScript’s type annotation syntax) played a decisive role in correctly classifying the code as TypeScript.

Below, you can find another example:

print("Hello World")

Being a Python adept, I immediately tried the model on this piece of code, expecting to get Python as a result. To my surprise, I got Swift instead. Contrary to what I thought at first, it wasn’t a mistake: printing "Hello World" in Swift and Lua is identical to Python’s.

Image by Author

This can actually explain some of the mistakes we get on short code snippets that do not provide enough discriminating information.

We can verify that hypothesis by comparing the distribution of code snippet lengths over the whole Dataset with that of the misclassified code snippets. In the graph below, we can spot a visible link between the length of a code snippet and the frequency of misclassification.
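
A minimal sketch of this comparison in plain Python, assuming length is measured in characters and summarized with the median. The four evaluation records are hypothetical and only illustrate the computation:

```python
from statistics import median

# Hypothetical evaluation records: (snippet, true label, predicted label).
records = [
    ('print("Hello World")', "Python", "Swift"),
    ("x = 1", "Python", "Ruby"),
    ("def add(a, b):\n    return a + b\n\nprint(add(1, 2))", "Python", "Python"),
    ("#include <stdio.h>\nint main(void) { return 0; }", "C", "C"),
]

# Snippet lengths over the whole set and over the misclassified subset.
all_lengths = [len(code) for code, _, _ in records]
wrong_lengths = [len(code) for code, true, pred in records if true != pred]

print("median length, all snippets:  ", median(all_lengths))
print("median length, misclassified: ", median(wrong_lengths))
```

On real evaluation data, plotting both length distributions as histograms gives the kind of comparison described above.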

Plot by Author

Final thoughts

Finally, we can conclude that our model gives generally excellent results but struggles a bit (understandably) on extremely short code snippets with few or no discriminating syntax features.

Now that you saw how we built our end-to-end project, you can use what you hopefully learned for other NLP project ideas. You’ll find the complete code here:

GitHub repository

If you have questions, please don’t hesitate to leave them in the responses section and we’ll be more than happy to answer.

Thank you for sticking with us this far. Stay safe, and we will see you in our next article! 😊

References

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. https://nlp.stanford.edu/IR-book/

[2] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. https://arxiv.org/abs/1602.04938

[3] Scikit-learn’s documentation https://scikit-learn.org/stable/

URL: https://towardsdatascience.com/classification-model-for-source-code-programming-languages-40d1ab7243c2/