Building Your Own Dataset: Benefits, Approach, and Tools
The importance of building your own dataset instead of using a pre-built solution
If we search on Google for how to become a data scientist, we’ll find a plethora of resources that urge us to learn the basics of linear algebra and calculus.
The motivation we are given is that these will be the basis of our understanding of machine and deep learning algorithms that we will study later. They are preparatory pieces and, in my opinion, the resources that say this are right. There’s no way around this.
However, I have found that although these are necessary for a data scientist’s growth and understanding of the subject, they are not sufficient. There are plenty of incredibly skilled analysts who struggle to find work and to prove their worth in a professional environment.
I am writing this post in January 2022, but I’m pretty sure this statement will hold true in the future as well. Just read any discussion on Kaggle (for example here), Reddit (in r/MachineLearning or r/learnmachinelearning) or in a Discord channel that deals with the topic.
These individuals are very, very skilled. Yet they struggle to interface with reality, a reality that requires knowledge that goes beyond data science or computer engineering. I don’t want to be arrogant and claim that I know the truth and that these people should follow me. This is not the case.
I faced this dilemma myself, and I often still do today, after 6 years of experience in this job. It was not easy, but I gradually developed heuristics and approaches that proved useful in increasing my value in the workplace and outside of it.
The advice I feel I can share today is rarely heard in the many data science courses available online: create a dataset from scratch to solve a problem that interests us.
Let’s get started.
What are the benefits of building your own dataset from scratch?
There are several reasons that lead me to recommend this approach. These range from pragmatic to more personal ones. Here is an overview:
- we take full responsibility for the project, from start to finish (accountability)
- we are full owners of the material we use (ownership)
- we develop a very deep understanding of the problem and our data (specific knowledge)
- our problem cannot be solved through known open datasets (leverage)
The topics of accountability, ownership, knowledge and leverage are, in my opinion, central to personal and professional development. Let’s explore them one by one to make sense of the business of creating a dataset from scratch.
Accountability
Creating a dataset from scratch forces us into a position of complete responsibility: any errors or biases in the data are attributable to us and us alone. It is therefore very important to retrieve the data correctly, either by respecting the validity and consistency of the measurement systems (if we use physical sensors, for instance) or by writing clean and well-structured code if the data is found online (for example with texts).
Ownership
When we get involved and take the time to do the dirty work of scraping, creating surveys or conducting interviews, we develop discipline, patience and skills with these tools.
This would not happen if the dataset were public, since other researchers would already have done this work for us. Owning our data also allows us to lead any team involved in the study with intention and purpose.
Specific knowledge
The path will be full of challenges that we will face alone or in company. In any case, we can be sure that we will cross the finish line with more specific and general knowledge about the problem. If we walk this path several times, we can also become experts on the subject (the so-called niche).
Leverage
Having leverage means that we are able to do or offer something that not everyone can. Having leverage gives us value. If we have a lot of leverage, we can offer solutions that serve others and that they are willing to pay well for.
When we build a dataset from scratch for a problem that a public dataset does not adequately cover, we are creating an asset with value. How valuable this asset is depends on the ambition of the project and the final product.
Basically, as with almost all development projects, what we get out of it matters for both personal and professional development. The entrepreneurial attitude emerges: if the project is ours at all the levels mentioned above, then we must be willing to make sacrifices to give it the opportunity to see the light. In data science, this sacrifice often begins with the creation of a training dataset.
Approach
Let’s say we want to model our state of health when we are at the PC. The hypothesis is that there are times of the day when our body gets more tired and this fatigue can be due to the work routine or to physiological fluctuations.
The right approach for anyone who wants to study this phenomenon is to draw up a small experimental design. In fact, the modeling phase follows the experimental phase in all respects.
You don’t model something you don’t know anything about, and the only way to know something accurately enough is to apply the scientific method.
The example I mentioned is interesting, so let’s see how to apply the scientific method to such a project. Here are the steps:
- Hypothesis definition. Research is almost always based on one or more experimental hypotheses. Sometimes it is purely exploratory, but most studies rest on statements that the researchers set out to test. In our case we have just stated it: there are moments of the day when we are more tired, and these moments are due to work stress or physiological fluctuations. Our goal will be to falsify this claim (Popper’s principle of falsifiability) by finding evidence to support the opposite. If this evidence is statistically significant, we can reject the null hypothesis (the one describing the world before the experiment) and accept an alternative hypothesis (the one for which we have gathered enough evidence through our experiment).
- Preparation. The preparatory phase allows us to organize the flow of activities and to gather the instrumentation we need. In our case we should collect the data during the work day. We will buy sensors to apply to the body to track heart rate, blood pressure and blood oxygen saturation. Furthermore, we can use software to track our activities on the PC. We must also take into account the time of day: the initial intuition is that the index of each activity will be a timestamp indicating the precise time.
- Data collection and experiment. At this stage we try to observe or cause the effect that falsifies our hypothesis. In our case, the indicators reported by the sensors we wear and by the software we use will inform us of what really happens. Maybe we observe a fall in energy levels only when we talk to a specific group of people at work, or when we do an activity that we don’t like that much. Each event that takes place will be measured and stored with the technical apparatus set up in the previous step.
- Result analysis. This is the final stage of the process (unless you publish a paper about it, in which case the method ends with the presentation stage). Here we will apply a large number of data analysis techniques to validate and clarify the research results.
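To make these steps more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a prescribed analysis: heart rate is assumed to be a stand-in for fatigue, the readings are made up, and the names `Reading`, `mean_by_hour` and `permutation_test` are hypothetical. It indexes each reading by timestamp, averages the readings by hour of the day, and uses a permutation test to ask how often a random relabeling of the data would produce a morning/afternoon difference at least as large as the one observed.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Reading:
    """One sensor sample, indexed by the moment it was taken."""
    timestamp: datetime
    heart_rate: float  # assumed proxy for fatigue, in beats per minute


def mean_by_hour(readings):
    """Average heart rate for each hour of the day that has data."""
    buckets = defaultdict(list)
    for r in readings:
        buckets[r.timestamp.hour].append(r.heart_rate)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the share of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations


# Made-up readings, for illustration only.
readings = [
    Reading(datetime(2022, 1, 10, 9, 0), 62.0),
    Reading(datetime(2022, 1, 10, 9, 30), 64.0),
    Reading(datetime(2022, 1, 10, 10, 0), 63.0),
    Reading(datetime(2022, 1, 10, 15, 0), 74.0),
    Reading(datetime(2022, 1, 10, 15, 30), 76.0),
    Reading(datetime(2022, 1, 10, 16, 0), 75.0),
]

hourly = mean_by_hour(readings)
morning = [r.heart_rate for r in readings if r.timestamp.hour < 12]
afternoon = [r.heart_rate for r in readings if r.timestamp.hour >= 12]
p_value = permutation_test(morning, afternoon)
```

With only six readings, the smallest p-value this test can produce is about 0.1, because only 2 of the 20 possible splits are as extreme as the observed one. This is a reminder that a real study needs far more data than a toy example.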
It is only after analyzing the results of our study that we can say that we understood the problem we faced at the beginning. I add that this is the only moment where we are ethically justified in modeling the observed phenomenon.
I use the word ethically because, although modeling can be done without following the approach described, it is only through this path that the author of the research gets what he wants from the beginning: a greater understanding of the phenomenon.
Any other attempt is driven by objectives other than a deeper understanding, such as making a good impression on someone (the so-called status game) or solving a problem for someone else who, in turn, does not need to understand it on a deeper level.
I want to specify that there is nothing wrong with not wanting to pursue a purely knowledge-oriented goal. We all play the status game, and we all enjoy doing a good job and getting paid for it. That’s not the point.
The point is, if I were to advise my own son on how to approach a personal project in a field like this, I would say these words to him. There is nothing wrong with not doing it… but I am of the opinion that if you do it you will come out as a better person, professionally and personally.
Useful tools for an analyst to create a dataset from scratch
Here’s a list of methods I have used throughout my career to collect data for my projects. Developing skills in applying these techniques is useful for any analyst who wants to have the opportunity to work independently on his or her project.
Web scraping
Definitely one of the most well-known digital techniques for data retrieval. There is nothing these days that cannot be scraped if you are skilled enough. It consists of collecting data from websites or other online resources.
The data must be visible from a web browser and must be designed to be viewed by other users.
There are rules, though: don’t scrape sites that explicitly ask you not to (it’s unethical), and don’t scrape aggressively (that is, don’t flood the server). When scraping, we send requests to the server to receive the data back and save it. If we do this too quickly and in an uncontrolled way, we could overload the server in question.
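Both rules can be enforced in code. The sketch below is one possible way to do it, using only Python’s standard library: `urllib.robotparser` checks a site’s robots.txt rules, and a small `Throttle` class (a hypothetical helper, not part of any library) enforces a minimum delay between requests. The robots.txt content, the bot name and the URLs are invented for the example, and no network request is made.

```python
import time
from urllib import robotparser


def build_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until min_delay_seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()


# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(ROBOTS_TXT)
allowed = rules.can_fetch("my-research-bot", "https://example.com/articles/1")
blocked = rules.can_fetch("my-research-bot", "https://example.com/private/data")
```

In a real scraper, we would call `wait()` on a `Throttle` instance before every request and skip any URL for which `can_fetch` returns `False`.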
The biggest barrier is social media, but there are still valid methods that can be found online.
The tools we can use in Python are
- for small projects -> combination of BeautifulSoup + Requests (here is a great template to start)
- for large projects -> Scrapy
- for any JavaScript rendering need -> Playwright
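To give an idea of what the small-project route involves, the example below extracts headings from an HTML string. To stay dependency-free it uses Python’s built-in `html.parser` as a stand-in for BeautifulSoup, which offers a more convenient API for the same task. The HTML snippet and the `HeadingExtractor` class are invented for illustration.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text of every <h2> element in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._inside_h2 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text when an <h2> opens.
        if tag == "h2":
            self._inside_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside an <h2>.
        if self._inside_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the collected text when the <h2> closes.
        if tag == "h2" and self._inside_h2:
            self.headings.append("".join(self._buffer).strip())
            self._inside_h2 = False


# Invented page content; a real scraper would download this first.
HTML = """
<html><body>
  <h1>Blog</h1>
  <h2>First post</h2><p>Some text.</p>
  <h2>Second <em>post</em></h2><p>More text.</p>
</body></html>
"""

parser = HeadingExtractor()
parser.feed(HTML)
headings = parser.headings
```

Here `headings` ends up holding the text of the two `<h2>` elements, including the text nested inside the `<em>` tag.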
Surveys
We often forget that surveys are very powerful data collection techniques. There are websites like Pollfish that allow you to run large-scale surveys and interview thousands of users over the web through very precise profiling.
Physical sensors
The use of physical sensors has grown a lot in the last decade thanks to the growth of the IoT sector and wearable devices such as Apple Watch and the like. Today we can easily set up physical instrumentation for our data collection, both for things and for people.
Scripts
If we own a website and are proficient in JavaScript or PHP, we can write tracking scripts which, with the user’s consent, can collect data on their usage behavior, a bit like Google Analytics and Hotjar do. I would add that collecting this type of information can fall under the GDPR, so we must be cautious and use the data responsibly.
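The consent requirement can be built into the collection logic itself. The sketch below shows one way to do that on the server side, written in Python rather than JavaScript or PHP to stay consistent with the rest of this post. The function names, the event fields and the salt value are all hypothetical. An event is stored only when the user has consented, and the raw user ID is replaced by a salted hash.

```python
import hashlib
from datetime import datetime, timezone


def pseudonymize(user_id, salt):
    """Replace a raw user ID with a salted SHA-256 digest."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def record_event(store, user_id, event_name, has_consent, salt="change-me"):
    """Append a usage event to `store` only if the user has consented.

    Returns True when the event was stored, False when it was dropped.
    """
    if not has_consent:
        return False
    store.append({
        "user": pseudonymize(user_id, salt),
        "event": event_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return True


# Hypothetical usage: one consenting user, one who declined.
events = []
stored = record_event(events, "alice", "page_view", has_consent=True)
dropped = record_event(events, "bob", "page_view", has_consent=False)
```

Note that a salted hash is pseudonymization, not anonymization, so data stored this way may still count as personal data and should be handled accordingly.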
Conclusion
In conclusion, I must mention that it is also possible to purchase data, which differ in price according to their quality.
Often these are data that a single individual would struggle to collect (such as data that come from deep regions of space or from the sea depths) and that are very difficult to put together precisely because of the phenomenon they describe.
Data scientists often purchase datasets to have a larger amount of data to join to a repository of data already in their possession.
I hope I have contributed to your education. Until next time! 👋