Hackathons have shaped my data science career in a huge way. They taught me the importance of structured thinking and how to apply it under tight deadlines, which is at the heart of what a successful data scientist does.
I get a lot of questions from aspiring data science professionals wondering how to stand out from the competition and land a role in this field. This is a multi-layered question, but there is one piece of advice I always give: start participating in hackathons and gauge where you stand.
And if you can climb up the leaderboard, even better!
In this article, I am excited to share the top three winning approaches (and code!) from the WNS Analytics Wizard 2019 hackathon. This was Analytics Vidhya’s biggest hackathon yet and there is a LOT to learn from these winners’ solutions.
So bring out a pen and paper, take notes and don’t miss out on any other hackathons! Head straight to the DataHack platform and enroll in the upcoming competitions today.
The WNS Analytics Wizard 2019 was the biggest hackathon ever hosted by Analytics Vidhya. Here’s a summary of the numbers behind this historic hackathon:
It was a memorable 9-day hackathon with a wide range of data scientists participating from all over the globe.
Let’s check out the problem statement for this hackathon.
Zbay is an e-commerce website that sells a variety of products on its online platform. Zbay records the user behavior of its customers and stores it in log form. However, most of the time users do not buy the products instantly and there is a time gap during which the customer might surf the internet and perhaps visit competitor websites.
Now, to improve the sales of products, Zbay has hired Adiza, an Adtech company that built a system where advertisements are shown for Zbay’s products on its partner websites.
If a user comes to Zbay’s website and searches for a product, and then visits these partner websites or apps, the items they previously viewed (or similar items) are shown as an advertisement (ad). If the user clicks this ad, they are redirected to Zbay’s website and might buy the product.
In this problem, the task is to predict click probability, i.e., the probability that a user clicks an ad shown to them on the partner websites over the next 7 days, based on historical view log data, ad impression data and user data.
The participants were provided with:
The training data contains the impression logs during 2018/11/15 – 2018/12/13 along with the label which specifies whether the ad is clicked or not. The final model was evaluated on the test data which has impression logs during 2018/12/12 – 2018/12/18 without the labels.
The contest is over but you can still make submissions to this contest and get your rank. Download the dataset from here and see if you can beat the top score.
The participants were provided with the below files:
train.csv:
| Variable | Definition |
| --- | --- |
| impression_id | AD impression id |
| impression_time | Time of the impression at partner website |
| user_id | user id |
| app_code | Application Code for a partner website where the ad was shown |
| os_version | Version of operating system |
| is_4G | 1-Using 4G, 0-No 4G |
| is_click | (target) Whether user clicked the AD (1-click, 0-no click) |
view_log.csv:
| Variable | Definition |
| --- | --- |
| server_time | Timestamp of the log |
| device_type | Device type of the user |
| session_id | Browser session id |
| user_id | user id |
| item_id | Item id |
item_data.csv:
| Variable | Definition |
| --- | --- |
| item_id | Item id |
| item_price | Price of the item |
| category_1 | Category depth 1 |
| category_2 | Category depth 2 |
| category_3 | Category depth 3 |
| product_type | anonymized item type |
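The three files link through user_id and item_id. As a rough sketch of how they might be combined, the snippet below attaches item attributes to each view and then finds each user’s most recent view before an impression. The toy rows, the `hours_since_last_view` name, and the choice of `pd.merge_asof` are assumptions for illustration, not taken from the winners’ code.

```python
import pandas as pd

# Tiny synthetic frames that reuse the column names from the tables above.
train = pd.DataFrame({
    "impression_id": ["a", "b", "c"],
    "impression_time": pd.to_datetime(
        ["2018-11-20 10:00", "2018-11-25 09:00", "2018-11-21 12:00"]
    ),
    "user_id": [1, 1, 2],
    "is_click": [0, 1, 0],
})
view_log = pd.DataFrame({
    "server_time": pd.to_datetime(
        ["2018-11-19 08:00", "2018-11-24 21:00", "2018-11-22 07:00"]
    ),
    "user_id": [1, 1, 2],
    "item_id": [10, 11, 10],
})
item_data = pd.DataFrame({"item_id": [10, 11], "item_price": [500, 1200]})

# Attach item attributes to every logged view.
views = view_log.merge(item_data, on="item_id", how="left")

# For each impression, pick the same user's latest view at or before the
# impression time. merge_asof needs both frames sorted by their time keys.
joined = pd.merge_asof(
    train.sort_values("impression_time"),
    views.sort_values("server_time"),
    left_on="impression_time",
    right_on="server_time",
    by="user_id",
    direction="backward",
)

# Hypothetical feature: hours between the last view and the impression.
# Impressions with no earlier view keep NaN here.
joined["hours_since_last_view"] = (
    (joined["impression_time"] - joined["server_time"]).dt.total_seconds() / 3600
)
```

Matching backward in time keeps views that happened after an impression out of that impression’s features.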
Winning a hackathon is a remarkably challenging task. There are a lot of obstacles to overcome, not to mention the sheer amount of competition from the top data scientists in the world.
I loved going through these top solutions and approaches provided by our winners. First, let’s look at who won and congratulate them:
You can check out the final rankings of all the participants on the Leaderboard.
The top three winners have shared their detailed approach from the competition. I am sure you are eager to know their secrets so let’s begin.
Here’s what Team AK shared with us.
Our final solution was an ensemble of LightGBM, Neural Networks and CatBoost models.
Taking a cursory look at the datasets, it seemed that the features generated from view_logs would play a significant role in improving the score. But soon we discovered that most of the features generated from view_logs were overfitting the training set.
This was because a higher percentage of the training data had recent view_logs than the test data did. So, we focused more on feature engineering on the training set.
The features that worked for us were percentage_clicks_tilldate per user/app_code/user_app_code. These features alone helped us reach 0.73xx on the public leaderboard. Along with these, we used some time-based features and a couple of features from view_logs.
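One plausible reading of percentage_clicks_tilldate is “the share of a group’s earlier impressions that were clicked”. The sketch below follows that reading. The helper name `add_click_rate_till_date`, the output column names, and the decision to exclude the current row are assumptions, so the team’s actual definition may differ.

```python
import pandas as pd

def add_click_rate_till_date(df, keys, name):
    """Add the share of earlier impressions in each group that were clicked."""
    # A stable sort keeps the original order for identical timestamps.
    df = df.sort_values("impression_time", kind="mergesort").copy()
    grouped = df.groupby(keys)["is_click"]
    prior_clicks = grouped.cumsum() - df["is_click"]  # clicks strictly before this row
    prior_impressions = grouped.cumcount()            # impressions strictly before this row
    # Rows with no history get NaN rather than a misleading 0.
    df[name] = prior_clicks / prior_impressions.where(prior_impressions > 0)
    return df

train = pd.DataFrame({
    "impression_time": pd.to_datetime(
        ["2018-11-15", "2018-11-16", "2018-11-17", "2018-11-18"]
    ),
    "user_id": [1, 1, 1, 2],
    "app_code": [7, 7, 8, 7],
    "is_click": [1, 0, 1, 0],
})

train = add_click_rate_till_date(train, ["user_id"], "pct_clicks_tilldate_user")
train = add_click_rate_till_date(train, ["app_code"], "pct_clicks_tilldate_app")
train = add_click_rate_till_date(
    train, ["user_id", "app_code"], "pct_clicks_tilldate_user_app"
)
```

Subtracting the current row’s label before dividing keeps the target out of its own feature, which is the usual leakage risk with running click rates.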
We each took a slightly different approach. Apart from the common features above, we used a few different techniques and features in our individual models, each scoring around 0.75xx on the public leaderboard.
Our final submission was a rank average of our best individual models.
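Rank averaging can be sketched in a few lines. The function name `rank_average` and the toy scores are made up for illustration, and the team’s exact blending code may differ.

```python
import numpy as np
import pandas as pd

def rank_average(predictions):
    """Average the per-model ranks of the predictions and rescale to (0, 1]."""
    ranks = np.vstack(
        [pd.Series(p).rank(method="average").to_numpy() for p in predictions]
    )
    mean_rank = ranks.mean(axis=0)
    return mean_rank / ranks.shape[1]

# Two hypothetical models scoring the same three impressions on different scales.
model_a = [0.10, 0.40, 0.35]
model_b = [0.02, 0.03, 0.90]
blended = rank_average([model_a, model_b])
```

Because ROC-AUC depends only on how predictions are ordered, blending ranks instead of raw probabilities stops a model with a wider score range from dominating the average.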
My validation strategy was a simple time-based split:
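A time-based split of this general kind can be sketched as below. The cutoff date, the helper name `time_based_split`, and the toy rows are hypothetical, not the team’s actual setting.

```python
import pandas as pd

def time_based_split(df, cutoff, time_col="impression_time"):
    """Rows before the cutoff go to training, the rest to validation."""
    cutoff = pd.Timestamp(cutoff)
    train_part = df[df[time_col] < cutoff]
    valid_part = df[df[time_col] >= cutoff]
    return train_part, valid_part

impressions = pd.DataFrame({
    "impression_time": pd.to_datetime(
        ["2018-11-15", "2018-11-30", "2018-12-07", "2018-12-13"]
    ),
    "is_click": [0, 1, 0, 1],
})

# Hypothetical cutoff: hold out roughly the final week of the training window.
train_part, valid_part = time_based_split(impressions, "2018-12-07")
```

Validating on the latest impressions mirrors the competition setup, where the test data covers a later window than most of the training data.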
You can check out the full code for this approach here.
Here’s what Arefev shared with us.
Overall Approach
Let’s look at Arefev’s step-by-step approach now.
Data-preprocessing and Feature Engineering
My final model
Key Takeaways
Things a participant must focus on while solving such problems
Here’s the full code for Arefev’s approach.
Here’s what Roman shared with us.
My solution consists mainly of feature engineering.
I generated features based on user id and app code characteristics. This is described in more detail below. I trained a LightGBM model on 10 different subsamples. As a final prediction, I averaged these models by rank.
For validation, I used StratifiedKFold (from sklearn.model_selection) with the below parameters:
For each user_id and impression_time, I calculated:
I used LightGBM as my final model with the below parameters:
I used a StratifiedKFold with 10 folds, so I got 10 models that were used to make predictions on the test data. Rank averaging among these 10 predictions was used as the final prediction. On local validation, I got the following average ROC-AUC value: 0.7383677000.
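The fold-and-blend loop can be outlined as follows. This is only a sketch: the data is synthetic, `LogisticRegression` stands in for the LightGBM model so the snippet runs with scikit-learn alone, and the `shuffle` and `random_state` settings are assumptions rather than Roman’s actual parameters.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data; the real features are not reproduced here.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9], random_state=0
)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

# shuffle and random_state are assumptions made for reproducibility.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_aucs, test_ranks = [], []
for fit_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)  # stand-in for the LightGBM model
    model.fit(X_train[fit_idx], y_train[fit_idx])

    # Score the held-out fold for a local ROC-AUC estimate.
    val_pred = model.predict_proba(X_train[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_train[val_idx], val_pred))

    # Store this fold model's test predictions as ranks.
    test_ranks.append(rankdata(model.predict_proba(X_test)[:, 1]))

mean_auc = float(np.mean(fold_aucs))
# Average the ranks over the 10 fold models and rescale to (0, 1].
final_prediction = np.mean(test_ranks, axis=0) / len(X_test)
```

Each of the 10 models is trained on a different 90% of the training data, so the loop gives both a local validation score and 10 sets of test predictions to blend.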
And here is the full code for Roman’s winning solution!
Phew – take a deep breath. Those were some mind-blowing frameworks that won this hackathon. As I mentioned earlier, it is quite a task winning a hackathon and our three winners really stood out with their thought process.
I encourage you to head over to the DataHack platform TODAY and participate in the ongoing and upcoming hackathons. It will be an invaluable learning experience (not to mention a good addition to your budding resume!).
IIT Bombay graduate with a Master’s and Bachelor’s in Electrical Engineering. I have previously worked as a lead decision scientist for the Indian National Congress, deploying statistical models (Segmentation, K-Nearest Neighbours) to help the party leadership and team make data-driven decisions. My interest lies in putting data at the heart of business for data-driven decision making.