URL: https://towardsdatascience.com/what-do-data-scientists-do-13526f678129/


What Do Data Scientists Do?

Attempting to Shed Some Light on an Ambiguous Field and Profession

Photo by rawpixel.com from Pexels

I recently completed a data science bootcamp and have started working as a data scientist at a fintech startup. I am obviously happy and relieved to be gainfully employed again.

But one thing I’ve noticed since I started working is that a good number of people, including my dad (Hi Dad!), have asked me, "What’s data science and what is it exactly that you do?"

Normally, I would refer them to my blog but I realized I have never written on this before. I’ve written plenty of articles about specific data science and machine learning concepts but I never personally defined what the profession and industry mean to me. So let’s rectify that right now.


What is Data Science?

First a disclaimer – I am by no means a data science expert. While I do have a fair bit of statistics and quant research experience, I consider myself somewhat new to the data science field. But it’s a free country so I can still give my 2 cents!

Let’s start with what the world believes data science to be:

  • Using artificial intelligence to foretell the future.
  • Building an awesome machine learning model that takes in massive amounts of seemingly worthless data and produces insights worth their weight in gold.
  • Making cars drive themselves.
  • Autogenerating books, paintings, or techno music.

While many of these have an element of truth to them (along with a lot of exaggeration and a pinch of sarcasm), data science is not always so glamorous. More often than not, our job is to help folks run their businesses better. I define data science as:

Using quantitative data to generate business insights in order to help your company make more money.

I know what you are thinking – Tony is such a capitalist. But at the end of the day, the quest to sell more widgets or increase customer engagement or reduce user churn is what most data scientists (as well as other quantitative analysts) are employed to do. Even cool research projects are ultimately designed to increase the visibility and brand of the company or university that they are affiliated with (thought leadership is a phrase that gets thrown around way too much these days).


Photo by rawpixel.com from Pexels

Helping The Man Earn More Benjamins

So depending on your view of capitalistic society, you may or may not be happy to hear that data scientists are all about driving growth or optimizing the bottom line (profits).

I mean, unless you are a teacher or a firefighter or a social worker, chances are that your role is all about helping your boss earn more Benjamins too. I will say though that, in my opinion, good data scientists are on average able to have more impact on the companies they work for than people in many other job functions. Let me explain why (and also explain what data scientists do).


That "A-HA" Moment

Have you ever been up to your eyes in an Excel spreadsheet, scatter plotting your target variable (the thing you are trying to explain or predict) against every feature you can get your hands on? And just as you lose hope of ever finding something correlated with your target, you see something like the following relationship and yell, "A-HA!"

Strong correlations are the nectar that every data scientist needs in order to survive

This is an example of a strong, positive relationship (correlation) between two variables (a cube root transformation is applied to make the relationship linear) – the scatter plot shows that the more you spend on your film (in terms of both film budget and marketing), the more likely it is to earn big money at the box office. Granted, it would be foolish to just give your film crew carte blanche to spend as much as they want (and notice that the target variable is revenue, not profit); but still, after seeing this, we know that unless we hit the jackpot with an indie viral hit, the general trend is:

Go big or go home!

Finding these types of relationships in the data is the objective of any quantitative analyst, including data scientists. So why did I say that data scientists potentially have more impact?

If we were manually plotting each feature one by one versus our target or running simple linear regressions, it would take forever to go through a huge feature set. And if we get discouraged and give up, we may never reach that "A-HA" moment.

Data scientists take advantage of automation and versatile algorithms to comb through as much data as possible in search of interesting statistical relationships – this increases the probability of hitting that "A-HA" moment.


Photo by rawpixel.com from Pexels

Unstructured Data

Another advantage that data scientists have is an appreciation for the signals hidden in unstructured data (such as Reddit comments, tweets, images, or blog posts) and the ability to separate those signals from all the accompanying noise.

If you think about it, potentially useful data is everywhere. It would be a shame to just limit ourselves to what our company decided to store in its databases.

So a big part of data science is finding "off the beaten path" type features that give you an analytical edge. These can come in the form of proprietary datasets painstakingly collected over the years (think Facebook, Google, Amazon, Tencent, and Netflix). They can also come from using available data in a way that no one has before (for example, John Hollinger’s Player Efficiency Rating, now part of every quantitative basketball analyst’s toolkit, is calculated purely from readily available box score data).

Some common methods that data scientists employ in their quest for differentiating data include:

  • Web scraping.
  • Stitching together multiple datasets to form a custom and hopefully more powerful set of features.
  • Dimensionality reduction (PCA, topic modeling, etc.) to render all that unstructured data into a usable format.
  • Going onto Kaggle (just kidding).
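To make the "stitching together" idea a bit more concrete, here is a minimal sketch that joins two small datasets on a shared key to build a richer feature set. Everything in it is hypothetical: the film titles, the budget_musd and mentions fields, and the stitch helper are invented for illustration. It sticks to the standard library; in practice a dataframe merge would do the same job.

```python
def stitch(left, right, key):
    """Inner-join two lists of dict records on `key`.

    Only records whose key appears in both datasets are kept.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two records into one wider record.
            joined.append({**row, **match})
    return joined


# Hypothetical internal data: what the company already stores.
box_office = [
    {"title": "Film A", "budget_musd": 150},
    {"title": "Film B", "budget_musd": 12},
    {"title": "Film C", "budget_musd": 60},
]

# Hypothetical external data: for example, counts derived from scraped posts.
social = [
    {"title": "Film A", "mentions": 48000},
    {"title": "Film C", "mentions": 9100},
    {"title": "Film D", "mentions": 300},
]

combined = stitch(box_office, social, key="title")
for row in combined:
    print(row)
```

Only Film A and Film C appear in both sources, so only those two survive the join. Deciding what to do with the records that do not match (drop them, keep them with gaps, or go find the missing data) is usually where the real work is.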

A Word on Algorithms and Models

Notice that we have barely mentioned anything about machine learning algorithms or statistical models so far. While it is important to know how the algorithms (like neural nets or XGBoost) work, as well as their advantages and limitations, building and running models will only take up a small percentage of a data scientist’s time.

Much more time is spent defining the problem and approach, gathering data, and cleaning data (there will be tons of this). Yes, there are PhD scientists working on the latest and most cutting edge research but unfortunately, we can’t all be Andrew Ng. The rest of us stay employed by taking his and others’ research and finding useful, value generating applications for them.

So if you asked me – what’s more important, knowing the algorithms like the back of your hand or being intimately familiar with your company’s industry and product, I would say the latter.

Oftentimes the difference in performance between the best algorithm and the second best one is not huge. So it’s OK if we pick logistic regression when XGBoost would have yielded a better result.

But if we don’t know how to ask the right question and end up wasting significant time and resources trying to solve problems that don’t exist, then that is not OK (we would probably be fired). So know the algorithms but really know the business.


Conclusion

I hope you enjoyed my ramblings on what data scientists do. I’m honestly thrilled to be a part of this exciting field. If you have any thoughts on working as a data scientist that you would like to share, please leave a comment. Cheers!


More Data Science and Analytics Related Posts By Me:

How Much Do Data Scientists Earn?

Are Data Scientists at Risk of Automation

Got Data Science Jobs?

Understanding PCA

Understanding Random Forest

Understanding Neural Nets


Written By

Tony Yiu
