You understand the importance of data. It shapes machine learning (ML) models and drives decisions based on large-scale analytics applications. But what is your data really telling you?
Let me explain. I originally trained as a research biochemist. Many years ago, on a flight to a scientific conference, I had an interesting conversation with my seatmate who was heading for the same event. He worked for a pet food company, specifically on a project to develop a “new and improved” flavoring for dog food. He entertained me with an account of how difficult it is to determine if a new flavor really is better. To find out, his team did a simple test: They put down two bowls of dog food, one on the left with the new flavor and one on the right with the original version. To their disappointment, the test dog went to the original flavor, with the same result in each of multiple trials.
The team went back to the lab and weeks later tried again with a new flavor. In comes the dog, and again, he goes to the original flavor. Multiple trials yield the same disappointing result. But then, it occurred to one of the team to reverse the position of the bowls, now with the new flavor on the right. Turns out, the dog just preferred the bowl on the right, regardless of which food it contained.
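The confound in the bowl test can be made visible with a simple tally. The sketch below is only an illustration: the trial records are made up, and the `tally` helper is a hypothetical name, not something the pet food team used. It counts the same choices two ways, by flavor and by bowl position, once the new flavor has been tried on both sides.

```python
from collections import Counter

# Hypothetical trial records (invented for illustration):
# (side holding the new flavor, side the dog chose)
trials = [
    ("left", "right"), ("left", "right"), ("left", "right"),
    ("right", "right"), ("right", "right"), ("right", "right"),
]

def tally(trials):
    """Count the dog's choices by flavor and, separately, by bowl position."""
    by_flavor = Counter()
    by_side = Counter()
    for new_side, chosen_side in trials:
        flavor = "new" if chosen_side == new_side else "original"
        by_flavor[flavor] += 1
        by_side[chosen_side] += 1
    return by_flavor, by_side

by_flavor, by_side = tally(trials)
print("by flavor:", dict(by_flavor))
print("by side:", dict(by_side))
```

With the positions swapped for half the trials, the flavor counts come out even while every choice lands on the right-hand bowl, which is the pattern the team eventually noticed.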
When you mine data for insights or to automate a decision, whether through machine learning or data analytics, it pays to be careful about how you design each question. You need to know how the data was collected and keep asking yourself, “Does the data represent what I think it does?”
Take this real-world example from machine learning: A data scientist was tasked with building a recommendation system for an online video streaming service. This data scientist was experienced with developing recommenders and knew to look at what people do rather than what they say they like (behavior over reported ratings) to discover preferences. In this case, the training data for the recommender was the set of videos people clicked on; clicking was the behavior used to reveal preferences. Surprisingly, however, the results were poor: the recommendation system did not perform well even though it used approaches that had been successful in the past.
The solution was to re-examine a broader group of viewer behavior data through direct inspection and rethink the assumptions about what the data represented. It turned out that treating a click on a video title as the indicator of preference wasn’t a good idea. In many cases, people selected a title but quickly clicked away, often because the title did not match the content, either through error or spamming. Using a different target for training (watching the first 30 seconds of a video rather than just clicking on it) resulted in a video recommendation system that worked beautifully.
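To make the change of training target concrete, here is a minimal sketch of the two labeling rules. Everything in it is an assumption for illustration: the `ViewEvent` record, its field names, the helper functions and the sample events are hypothetical and are not taken from the streaming service's actual system. Only the 30-second threshold comes from the story above.

```python
from dataclasses import dataclass

@dataclass
class ViewEvent:
    user_id: str
    video_id: str
    watch_seconds: float  # how long the viewer stayed after clicking

MIN_WATCH_SECONDS = 30.0  # threshold described in the article

def click_positives(events):
    """Original target: every click counts as a sign of preference."""
    return {(e.user_id, e.video_id) for e in events}

def watch_positives(events, min_seconds=MIN_WATCH_SECONDS):
    """Revised target: only clicks followed by enough viewing time count."""
    return {
        (e.user_id, e.video_id)
        for e in events
        if e.watch_seconds >= min_seconds
    }

# Made-up events for illustration only.
events = [
    ViewEvent("u1", "cats_compilation", 212.0),
    ViewEvent("u1", "misleading_title", 4.0),
    ViewEvent("u2", "misleading_title", 6.5),
    ViewEvent("u2", "cooking_basics", 95.0),
]

clicks = click_positives(events)
watches = watch_positives(events)
dropped = clicks - watches  # clicks that no longer count as preferences
print(sorted(dropped))
```

In this toy data, the two quick click-aways on the mismatched title drop out of the positive examples, so a recommender trained on `watches` would no longer learn from them.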
The lesson here is not about video but about the importance of keeping an open mind about what your data tells you, trying different approaches and continually questioning your assumptions. It’s also an example of what newly trained data scientists discover: Data in the real world is not as clean and straightforward as the carefully selected data sets often used in machine learning classes.
Clearly, potential pitfalls exist in data selection and in framing the question you are addressing. So, what can you do about that?
No specific set of steps is guaranteed to avoid these problems. Much of the ability to avoid pitfalls comes through experience and being generally suspicious about your own assumptions. Just being alert to the potential for data to be misleading is already a step in the right direction. And a number of practices can help you better develop your skills and instincts on how to approach these issues. In addition to working on a system with efficient data management and data engineering, keep in mind these tips about data and decisions:
Whichever approaches you choose, it is helpful to have a comprehensive data strategy across an enterprise rather than isolating data science teams and the data they use. This shared data strategy makes it easier to explore different types of data for feature extraction and to use a wide range of machine learning or large-scale analytics tools without having to set up separate systems. A comprehensive data strategy, along with a unifying data infrastructure to support it, also encourages collaboration between data scientists and non-data scientists who hold valuable domain expertise. All of this helps you to keep questioning what your data is telling you and to keep testing your conclusions.
A few years ago, I was at a conference, signing and giving away books with my co-author, Ted Dunning. We gave people a choice between “Practical Machine Learning: Innovations in Recommendation” and “A New Look at Anomaly Detection,” but each person could take only one book. We were surprised that over 80% chose the book on recommendations over the one on anomaly detection. Then a thought occurred to me. I leaned over and whispered “dog food” to Ted. He swapped the positions of the books.
Turns out, data scientists prefer the book on the left.
If you’d like to read our latest short book, download a free PDF courtesy of HPE: AI and Analytics at Scale: Lessons from Real World Production Systems.