Written by: Amal Hasni & Dhia Hmila
We recently needed to write an extension of Python’s Markdown package. For this purpose, we needed to detect the programming language of each code block to apply specific modifications. Luckily, in addition to being programming enthusiasts, we also happen to be data scientists. So we decided to use Natural Language Processing techniques to build ourselves a classification model and we will explain exactly how we did that!
Before diving into the details of how we built our model, you can try it out on your own code snippets via this demo. Bonus: You also get to see which parts of your code snippets had a decisive effect on the classification.
Now let’s jump into the model-building part!
Building the Dataset
How to find a good Dataset?
Since we’re building a classification model, we need to get ourselves a nice labeled Dataset. Unfortunately, those kinds of Datasets aren’t usually available and ready to use and we need some effort and digging to get them.
In those circumstances, you can start with a simple search on websites like Kaggle. If that doesn’t work out, you can widen the search to other websites using Google.
The steps we used to build our custom Dataset
Looking for a dataset of code snippets, we were lucky enough to find a perfect one on Kaggle. The dataset from which we created our own is called GitHub Repos and contains code and comments from 2.8 million GitHub repositories.
To create our custom Dataset, we used the BigQuery helper module. Our query needed to meet the following specifications:
- Select code snippets and scripts file names only
- Select a specific number of rows for each language (to get a balanced Dataset)
- Identify a language by the different file extensions it could have
The resulting query looks something like this:
(SELECT sample_path, content
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE (binary = False AND (sample_path LIKE "%.bat" OR sample_path LIKE "%.cmd" OR sample_path LIKE "%.btm"))
LIMIT 5000)
UNION ALL
(SELECT sample_path, content
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE (binary = False AND (sample_path LIKE "%.c"))
LIMIT 5000)
UNION ALL
...
To see the whole code for how we created our Dataset and/or download it, check the following Kaggle Notebook. You just need to click on "copy and edit" then run all cells and you’ll get your downloading link at the bottom of the Notebook page.
Data exploration and transformation
Our resulting Dataset contains 131455 code snippets distributed over 34 programming languages:
The first thing we need to do is make sure there are no duplicate rows or NaN values:
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)
The next step is to create a "language" column containing the programming language, derived from the file name extension. To match each extension with the corresponding programming language, we built the following JSON file. The code to perform the matching is the following:
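As a minimal sketch of that matching step: the mapping below is a hypothetical excerpt (the language names and the `language_from_path` helper are assumptions for illustration, not the authors' JSON file), using only extensions that appear in the query above plus one extra.

```python
import os

# Hypothetical excerpt of an extension-to-language mapping. The authors' real
# JSON file covers 34 languages and is not reproduced here.
EXTENSION_TO_LANGUAGE = {
    ".bat": "Batchfile",
    ".cmd": "Batchfile",
    ".btm": "Batchfile",
    ".c": "C",
    ".py": "Python",
}


def language_from_path(path):
    """Return the language for a file path, or None if the extension is unknown."""
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_LANGUAGE.get(extension)


# With a pandas DataFrame, the same function could be applied as:
#   data["language"] = data["sample_path"].map(language_from_path)
print(language_from_path("scripts/build.CMD"))
print(language_from_path("src/main.c"))
```

Lower-casing the extension is a design choice made here so that `build.CMD` and `build.cmd` end up with the same label.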
Now we’re all set to start building our model. We start by splitting our Dataset into test and train sets:
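A split along these lines could look like the sketch below. The toy lists stand in for the real snippet and language columns, and stratifying on the label and fixing `random_state` are assumptions made here, not something the article states. The 80/20 ratio matches the one used for the final results.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real columns (snippet text and its language label).
snippets = [
    "print('a')", "import os", "def f(): pass", "x = [1, 2]", "for i in y: pass",
    "int main() {}", "#include <stdio.h>", "char *s;", "return 0;", "void f(void);",
]
labels = ["Python"] * 5 + ["C"] * 5

# Keep 20% of the data for testing; stratify so both sets keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    snippets, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```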
Preprocessing
As you might know, most machine learning classifiers get a feature vector of fixed length as their input. So, if you’re not accustomed to NLP concepts, you may wonder how we’re going to convert unstructured textual data into a numerical array that’s going to make sense?
Tokenization
The first key to vectorizing our data is Tokenization! It’s usually the first step of every NLP project and basically consists of cutting each textual "document" (in this case, a code snippet) into character substrings called Tokens.
From these tokens, we can build a set of words (or vocabulary) to be used for the vectorization of the text.
For regular English text, it can be as simple as splitting words on spaces and punctuation (even though this has many limitations). If you’re working on such data, packages such as nltk or spacy offer state-of-the-art Tokenizers.
Since we are working on source code instead of plain English text, we’re going to customize Sklearn’s tokenizer to suit our needs.
For Python code, this is the result we should expect:
For this purpose, we are going to use regular expressions to describe the pattern of a token. You will see that even though Regex might not seem "sexy", they’re actually very powerful and useful for these kinds of tasks or for NLP in general.
If you don’t know how to use regex or you’re in need of a refresher, you can check out this guide on python’s documentation or this quick cheat sheet.
For this project, we initially had 3 types of tokens in mind (identifiers, operators, and brackets). The pattern we used is the following:
([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])
It is a group with 3 alternatives corresponding to the different types:
You can try it with different code snippet examples on Regex101 which is the perfect place to play around with regex.
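To get a feel for what the pattern captures, here is a small sketch using Python's `re` module. It assumes the escaped form of the pattern (with `\w`, `\b`, `\t` and an escaped hyphen and square brackets); the `tokenize` helper is a name introduced here for illustration.

```python
import re

# Three alternatives: identifiers, runs of operator characters, and single
# whitespace/bracket/quote characters.
TOKEN_PATTERN = r"""([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])"""


def tokenize(code):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return re.findall(TOKEN_PATTERN, code)


print(tokenize("def add(a, b):"))
# ['def', ' ', 'add', '(', 'a', ',', ' ', 'b', ')', ':']

# Numerical values are not matched by any alternative, so they are skipped.
print(tokenize("x >= 10;"))
# ['x', ' ', '>=', ' ', ';']
```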
We can see that with this pattern, we miss numerical values such as 100% or 12px, which can be useful for identifying languages such as HTML and CSS. If we included a regex to catch such tokens, we would end up with a vocabulary full of different numerical values and no way to tell our model that the role played by ’12px’ is actually the same as that of ’14px’.
For this reason, we chose to ignore the numerical features, especially since doing so had no visible effect on the model’s accuracy.
Some cleaning-up
Looking through the different tokens we obtained, we noticed a lot of variable names made of a single character or of the same character repeated, such as xxx, which do not add much information to the classifier.
For this reason, we chose to treat those as stop words and remove them.
For these kinds of treatments, scikit-learn provides a neat class called FunctionTransformer that we can use as follows:
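One way this could look is sketched below. The regular expression and the `remove_repeated_letter_words` helper are assumptions for illustration (the authors' exact cleaning function is not shown here); the idea is simply to delete whole words made of one letter repeated one or more times.

```python
import re

from sklearn.preprocessing import FunctionTransformer

# Matches a whole word made of a single letter repeated: x, ii, xxx, ...
REPEATED_LETTER_WORD = re.compile(r"\b([A-Za-z])\1*\b")


def remove_repeated_letter_words(snippets):
    """Strip single-letter and repeated-letter identifiers from each snippet."""
    return [REPEATED_LETTER_WORD.sub("", snippet) for snippet in snippets]


# FunctionTransformer wraps a plain function so it can be used as a Pipeline step.
stop_word_remover = FunctionTransformer(remove_repeated_letter_words)

cleaned = stop_word_remover.transform(["for i in range(n): xxx = i + 1"])
print(cleaned)
```

Identifiers such as `in` or `range` are left untouched because they contain more than one distinct letter.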
Vectorization
The usual next step, in a typical NLP project, is lemmatization/stemming. But since we’re dealing with code that is not a "natural" language, altering words’ structures would result in information loss. Since what we really want to do is to be able to spot discriminating keywords for each language, we’re going to directly move to Vectorization.
At this point, we have a vocabulary of size M (with M being the number of unique tokens across all of our code snippets) and our Dataset is composed of lists of tokens instead of raw text. We can, therefore, associate each token with its index i in the vocabulary! Now what we want to do is to convert each code snippet into an array of size M.
The vector representation we are going to go with is called TF-IDF (term frequency-inverse document frequency). It associates each document (source code) with an array of size M where the i-th element corresponds to the scaled frequency of the i-th token in the document.
The formula for computing the TF-IDF factor for a given token i in a document d is the following:
Where TF is the term frequency and IDF the inverse document frequency.
The scale factor IDF (inverse document frequency) tries to scale down the weight associated with tokens that appear frequently across all documents such as the = operator and that, therefore, give less information on the nature of the document.
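For reference, a common textbook form of this weighting can be written as follows, with N the number of documents and df(i) the number of documents containing token i. This is a sketch of the standard definition, not necessarily the exact variant used in the article.

```latex
\mathrm{tfidf}(i, d) = \mathrm{tf}(i, d) \times \mathrm{idf}(i),
\qquad
\mathrm{idf}(i) = \log \frac{N}{\mathrm{df}(i)}
```

Note that scikit-learn's TfidfVectorizer uses, by default, a smoothed variant, idf(i) = ln((1 + N) / (1 + df(i))) + 1, and then normalizes each document vector to unit L2 norm, so the values it produces differ slightly from the textbook formula.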
Let’s move on to coding the Vectorizer:
Now you can see that in addition to the token_pattern, we used another parameter called max_features, which limits the vocabulary to the top 3000 tokens ordered by term frequency across the corpus. This reduces the number of features, which lowers computational cost and helps avoid overfitting.
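A vectorizer configured this way might look like the following sketch. The two-snippet corpus is a toy example, the pattern is assumed to be the escaped form of the regex shown earlier, and every other TfidfVectorizer parameter is left at its default, which may differ from the authors' settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Escaped form of the token pattern (one capturing group, as scikit-learn requires).
TOKEN_PATTERN = r"""([A-Za-z_]\w*\b|[!#$%&*+:\-./<=>?@^_|~]+|[ \t(),;{}\[\]`"'])"""

vectorizer = TfidfVectorizer(token_pattern=TOKEN_PATTERN, max_features=3000)

# Toy corpus standing in for the training snippets.
corpus = [
    "def add(a, b):\n    return a + b",
    "int add(int a, int b) { return a + b; }",
]

X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per snippet

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```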
We can see the feature names (or tokens) used by our vectorizer to make sure everything works the way it was intended and to assess the quality of the results:
# List extracted tokens (on scikit-learn versions before 1.0, use get_feature_names() instead)
print(vectorizer.get_feature_names_out())
Choice of Model
Now that we have our data in the shape we need, comes the part where we choose a model to feed it to. Since no algorithm works well for all data science problems, it’s up to us to explore different options and select the best performing classifier. We are going to be using Sklearn’s GridSearchCV and Pipeline classes to automate the whole deployment and model search. If you’re not familiar with these two classes (and several other scikit-learn tools) you’ll find this article extremely useful. Meanwhile, these are, in a nutshell, the definitions provided by scikit-learn documentation:
Pipeline: Pipeline can be used to chain multiple estimators into one. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
GridSearchCV : GridSearchCV can be used for exhaustive search over specified parameter values for an estimator.
To perform the GridSearch over the four models we’ve chosen to compare, we used 30% of the Dataset to limit computational cost. The corresponding code is the following:
You might have noticed that in the code below, a second Pipeline was defined for the GaussianNB algorithm. The reason is that this classifier doesn’t accept sparse matrices as input (the following error is thrown: TypeError: A sparse matrix was passed, but dense data is required). The solution is to add an extra step that densifies the input data using FunctionTransformer and the SciPy sparse-matrix method toarray() (todense() also works on older scikit-learn versions, but it returns a numpy.matrix, which recent versions reject):
FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)
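To show where such a densifying step sits, here is a minimal sketch of a GaussianNB pipeline. The step names, the default TfidfVectorizer settings and the four toy snippets are assumptions for illustration. `toarray()` is used so that the step returns a plain NumPy array.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix into a dense NumPy array."""
    return X.toarray()


pipe_gnb = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # GaussianNB rejects sparse input, so densify right before the classifier.
    ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", GaussianNB()),
])

# Toy training data.
snippets = ["def f(): return 1", "import os", "int main() { return 0; }", "#include <stdio.h>"]
labels = ["Python", "Python", "C", "C"]

pipe_gnb.fit(snippets, labels)
predictions = pipe_gnb.predict(["import sys", "int x;"])
print(predictions)
```

A named function is used instead of a lambda so that the fitted pipeline can be pickled. Keep in mind that densifying a large TF-IDF matrix can use a lot of memory.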
Now, to evaluate performance and select the winning classifier, use the grid_search object’s attributes:
The results of our GridSearch conclude our quest for the perfect model 👑 : We will be using the Random Forest classifier for our prediction part.
Parameters tuning
We want our classification model to have as good a performance as possible. For this reason, we will be performing a grid search to optimize the model’s hyperparameters:
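The sketch below illustrates what such a search could look like. The hyperparameter grid, the toy data, the step names and `cv=2` are all assumptions chosen to keep the example small; they are not the values the authors searched over.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe_RF = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; "<step name>__<parameter>" targets a pipeline step.
param_grid = {
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 10],
}

# Toy data: four snippets per language so that 2-fold stratified CV is possible.
snippets = [
    "def f(): return 1", "import os", "print('hi')", "for x in y: pass",
    "int main() { return 0; }", "#include <stdio.h>", "char *s = 0;", "void f(void);",
]
labels = ["Python"] * 4 + ["C"] * 4

grid_search_RF = GridSearchCV(pipe_RF, param_grid, cv=2, scoring="accuracy")
grid_search_RF.fit(snippets, labels)

print(grid_search_RF.best_params_)
print(grid_search_RF.best_score_)
```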
To display the best hyperparameters chosen after the grid search, you just need to type:
best_params = grid_search_RF.best_params_
And for our model, the result is:
To use the tuned model, we just need to unpack the parameters dictionary before deploying our classifier :
pipe_RF.set_params(**best_params)
Final Results
We finally have our best model and the best corresponding hyperparameters. After training it on 80% of the Dataset, the accuracy it gave when tested on the remaining 20% was:
Accuracy = 96.8 %
These results are more than satisfying and we will be analyzing more metrics in the next section.
Result Analysis
With 96.8% accuracy, our model is functional and ready to go! We could be content with the results and stop here. But why not dig deeper to understand which language-specific syntax features most influence the prediction, and why the model struggles with some languages more than others?
Let’s start by analyzing the confusion matrix :
You might have noticed in the matrix above that the model tends to confuse some languages with specific others. A good example of that is the pair JavaScript and TypeScript:
In this zoom over the confusion matrix, we can see for example that the model tends to confuse TypeScript snippets with JavaScript snippets approximately 4% of the time.
These results are not surprising since TypeScript is a superset of JavaScript and therefore, all JavaScript code is totally valid TypeScript. The false positives are probably related to code snippets with no discriminating syntax.
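Per-language confusion rates like the one quoted above can be read from a row-normalized confusion matrix. The sketch below uses made-up labels and predictions purely to show the mechanics; the numbers are not the article's results.

```python
from sklearn.metrics import confusion_matrix

languages = ["JavaScript", "TypeScript"]

# Toy ground truth and predictions (made up for illustration).
y_true = ["JavaScript"] * 4 + ["TypeScript"] * 4
y_pred = [
    "JavaScript", "JavaScript", "JavaScript", "TypeScript",
    "TypeScript", "TypeScript", "TypeScript", "JavaScript",
]

# normalize="true" divides each row by the number of true samples of that class,
# so cm[i][j] is the share of language i snippets predicted as language j.
cm = confusion_matrix(y_true, y_pred, labels=languages, normalize="true")

print(cm)
print("TypeScript predicted as JavaScript:", cm[1][0])
```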
To test our intuition and further understand why certain confusions happen, we can use the lime package (Local Interpretable Model-agnostic Explanations) to explain our predictions. If you’re not familiar with the package, Lime is, in a nutshell, a tool made to explain what machine learning classifiers (or models) are doing.
Explaining some predictions
Using Lime for text prediction is very easy. All you need to do is define an Explainer object. If you are using a custom tokenizer (like us), you can pass it as an argument like this:
The next step is to try it on some code by giving the explain_instance method the following arguments:
- the code snippet to explain
- the predict_proba method
- the number of features to show in the explanation
exp = explainer.explain_instance(YOUR_CODE, model.predict_proba, num_features=6)
You can now visualize your explanation in several formats:
- as a list: exp.as_list()
- as a plot: exp.as_pyplot_figure()
- as HTML: exp.as_html()
- as a notebook output: exp.show_in_notebook(text=True) (only works if you’re in a Jupyter Notebook)
Now that we’ve got that out of the way, let’s see what it gives us on the following TypeScript code:
We can directly see that, as expected, the tokens : and number (which are exclusively part of TypeScript syntax) had a decisive role in correctly classifying the code as TypeScript.
Below, you can find another example:
print("Hello World")
Being a Python adept, I immediately tried the model on that piece of code, expecting to get Python as a result. To my surprise, I got Swift instead. Contrary to what I thought at first, it wasn’t really a mistake, since printing "Hello World" in Swift and Lua is identical to Python’s.
This can actually explain some of the mistakes we have on short code snippets that do not provide enough discriminating information.
We can verify that hypothesis by comparing the length distribution of code snippets across the whole Dataset with that of the misclassified snippets. In the graph below, we can spot a visible link between the length of the code snippet and the frequency of misclassification.
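A quick numerical version of that comparison is sketched below. The records are made up for illustration, and using the median as the summary statistic is an assumption; a histogram of both length distributions would serve the same purpose.

```python
from statistics import median

# Toy, made-up evaluation records: (snippet, true language, predicted language).
records = [
    ('print("Hello World")', "Python", "Swift"),
    ("x = 1", "Python", "Ruby"),
    ("def add(a, b):\n    return a + b\n\nprint(add(1, 2))", "Python", "Python"),
    ("#include <stdio.h>\nint main(void) { return 0; }", "C", "C"),
    ("function add(a: number, b: number): number { return a + b; }", "TypeScript", "TypeScript"),
]

all_lengths = [len(snippet) for snippet, _, _ in records]
misclassified_lengths = [
    len(snippet) for snippet, true, predicted in records if true != predicted
]

print("median length, all snippets:", median(all_lengths))
print("median length, misclassified:", median(misclassified_lengths))
```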
Final thoughts
Finally, we can conclude that our model gives generally excellent results but struggles a bit (understandably) on extremely short code snippets with few or no discriminating syntax features.
Now that you saw how we built our end-to-end project, you can use what you hopefully learned for other NLP project ideas. You’ll find the complete code here:
If you have questions, please don’t hesitate to leave them in the responses section and we’ll be more than happy to answer.
Thank you for sticking with us this far, stay safe and we will see you in our next article! 😊
References
[1] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze (2008). Introduction to Information Retrieval. https://nlp.stanford.edu/IR-book/
[2] Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier. https://arxiv.org/abs/1602.04938
[3] Scikit-learn’s documentation https://scikit-learn.org/stable/