
URL: https://towardsdatascience.com/a-data-scientists-guide-to-stakeholders-ed81b573e6be/

A Data Scientist's Guide to Stakeholders | Towards Data Science



A Data Scientist’s Guide to Stakeholders

How data scientists can best communicate with non-DS people

8 min read


Photo by Campaign Creators on Unsplash

The stakeholder-data scientist relationship

A stakeholder is anyone who has a vested interest in the decision-making or activities of a project or business. Stakeholders can be internal coworkers or external clients.

Any time a data scientist works on a project at a company, they are not working in isolation. There is always a business purpose behind the project.

For example, if you as a data scientist are developing a model that can forecast energy usage in a specific commercial building, stakeholders may include:

  • Building owners/managers
  • Engineers working on projects on-site
  • Coworkers at your organization who will deliver results to the client (e.g., software engineers or BI analysts who will create dashboards)

Most stakeholders come from a non-data science background. They may be business analysts, salespeople, or engineers, and they have varying levels of understanding of ML, AI, and data science.

Despite this, whenever a data science solution is needed for a project, these "non-technical" (for lack of a better word) stakeholders play an important role in the development process and you must learn to work well with them.

This means mastering the art of effective communication with non-DS stakeholders.

Defining the problem & solution

Sometimes stakeholders will come to you and ask for your help in implementing an ML/AI/statistical solution to their problem. It is your job as the data scientist to take a good look at the data and the issue at hand, and to determine whether the problem warrants your intervention.

If it does, it is your job to clarify the business case with the stakeholder/customer, ask as many questions as you can about what their needs are and what they are looking for, and start to build a plan for implementing that solution.

Other times you and your data science team will look at something that is clearly not working in your software or product and determine that machine learning is necessary.

Regardless of how you came to the conclusion that a problem needs a data science solution, once you have laid out your desired goals and outcomes, you must devise a plan for implementing that solution. This includes:

  • Exploring the data
  • Diagnosing the problem
  • Selecting a model
  • Running tests

You’ll test various models and solutions on your own before concluding which model works best for the problem at hand.

Once you are confident in the plan for your ML solution, it’s time to present your findings and proposal to your stakeholders.

Always provide background

Put the problem into context. Explain the use case for ML/statistics/AI in the given scenario.

Why should they care? How can ML improve this problem where it wasn’t being used before?

Example presentation slide giving background for a sample problem. Image by author

When introducing the model, focus on its benefits rather than how it works.

Talk about the model you are using. Give a brief background of what it is and what it does, but don’t go too deep into the technical details.

Explain why you chose this model or method over other similar options. Does this type of model work better for the type of data you have? For example, if you have time series data, perhaps you are using ARIMA or an LSTM because these are designed to handle sequential data. If you’re dealing with a time series dataset with a lot of external variables, like temperature or humidity, you may want to consider a model like XGBoost, where it is easier to add in these kinds of variables.
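To see why external variables are easy to add to a model like XGBoost, it helps to look at the kind of table such a model trains on: each row holds a few past values of the target plus whatever other readings you have for that hour. The sketch below is only an illustration, not part of any library. The function name make_feature_rows, the column names, and the toy numbers are all hypothetical, and it assumes the readings are already aligned by hour and that a temperature and humidity reading (or forecast) exists for the hour being predicted.

```python
def make_feature_rows(usage, temperature, humidity, n_lags=2):
    """Turn aligned hourly series into (features, target) rows.

    Each row predicts usage[t] from the previous n_lags usage values
    plus the external readings at hour t.
    """
    if not (len(usage) == len(temperature) == len(humidity)):
        raise ValueError("all series must have the same length")

    rows = []
    for t in range(n_lags, len(usage)):
        # Past values of the target become ordinary columns...
        features = {f"usage_lag_{k}": usage[t - k] for k in range(1, n_lags + 1)}
        # ...and external variables are simply more columns.
        features["temperature"] = temperature[t]
        features["humidity"] = humidity[t]
        rows.append((features, usage[t]))
    return rows


# Toy readings (hypothetical values, for illustration only)
usage = [120.0, 118.0, 131.0, 140.0, 137.0]
temperature = [18.0, 18.5, 21.0, 24.0, 23.5]
humidity = [55.0, 54.0, 50.0, 47.0, 48.0]

rows = make_feature_rows(usage, temperature, humidity, n_lags=2)
for features, target in rows:
    print(features, "->", target)
```

Adding another variable, such as occupancy, means adding one more column to each row. That is the point you can make to stakeholders without showing them any of this.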

Discuss limitations + future improvements

Is your model high on performance but low on explainability (e.g., a CNN, LSTM, or other neural net)?

Is your model higher on explainability (e.g., linear regression) but simpler, with a greater risk of underfitting and reduced accuracy?

Clients may also be confused by terms like under/overfitting, so you’ll have to clarify these quickly and efficiently with them in addition to explaining the model.

Here’s a quick example of how I would explain overfitting and underfitting to a customer:

  • When we train a model, we want to make sure that it not only learns the patterns of the data that it was trained with, but that it is also able to predict future patterns.
  • Training data often has noise in it. Not every data point has direct relevance to the model nor can it be used to predict future events.
  • When a model learns the patterns of the training data too closely, it loses its ability to generalize to new data and simply predicts exactly what happened in the past (including noise and potentially outliers). This is what is meant by overfitting.
  • Underfitting is the opposite of overfitting. The model barely learns the patterns of the training data, so when it tries to predict the future, it is too loose and general. As a result, it cannot use information from the past to predict the future.

Notice how I tried to keep my language and concepts extremely basic. I did not mention the bias-variance tradeoff or train/test splits.
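For your own preparation (not for the customer), it can help to have a tiny numerical version of the same story in your back pocket. The sketch below is a toy, built on assumptions: the data is made up (a straight-line pattern with a fixed +1/-1 wobble standing in for noise), "underfit" is a model that always predicts the average, "balanced" is a simple straight-line fit, and "overfit" is a model that memorizes the training points and repeats the nearest one.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (assumption): the real pattern is y = 2x, and the "noise" is a
# fixed +1 / -1 wobble so the numbers are reproducible without a random seed.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
# New, unseen points: same pattern, but the wobble falls the other way.
test_x = [i + 0.4 for i in range(10)]
test_y = [2 * x + (-1 if i % 2 == 0 else 1) for i, x in enumerate(test_x)]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """Too simple: ignores x and always predicts the training average."""
    return mean_y


# Ordinary least-squares straight line fitted to the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def balanced(x):
    """Learns the overall trend but not the wobble."""
    return intercept + slope * x


def overfit(x):
    """Memorizes the training data: repeats the nearest training value."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


results = {}
for name, model in [("underfit", underfit), ("balanced", balanced), ("overfit", overfit)]:
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:<9} train error = {results[name][0]:6.2f}   new-data error = {results[name][1]:6.2f}")
```

On this toy data, the memorizing model has zero error on the data it has seen but a much larger error on new points than the straight line (roughly 4.6 versus 1.1), while the always-average model is far off on both. Those three rows are the whole explanation above in numbers.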

Diagrams and images can also be extremely helpful. Visual representations of data science concepts can solidify those concepts in people’s minds and bring them to an understanding much more quickly than words alone.

Image by author

It’s also important to loop stakeholders into a conversation about the direction of the project and improvements. The first iteration of the model may be basic but functional, just to get something out there and working as a proof of concept.

However, start discussing how you could improve it in the future. Could you collect more data? Add more features? Involve the stakeholders in this conversation. Ask about the barriers to getting more data, what kinds of other variables they think may influence predictions and what else they would like this model to be able to do in its final form.

Know your audience

How much depth you go into will vary greatly depending on who you’re presenting to: their qualifications and their level of involvement in the project.

It may not be necessary to explain overfitting to someone who is more on the business/sales side of things, but an engineer or analyst might be more interested in knowing the underlying mechanisms of the models and what issues may arise.

Adjust your level of detail to your target audience. If the room has a mix of backgrounds, try to keep things balanced, diving into more technical concepts only when necessary to move the conversation forward.

Photo by Artem Maltsev on Unsplash

Give them space for questions and feedback

Stakeholders and non-data scientists are often very curious about your projects. They may ask questions that seem very obvious to you and your data science colleagues. Be patient with them and give them space to ask these questions without judgment.

Try to answer their questions at a surface level so as not to cause more confusion. Often an explanation such as this will suffice:

"A random forest model is capable of telling us which features, or variables, were most impactful to its ability to make predictions. When the model is done training, it returns a ranked list of features, with the most important ones at the top. This can give us a better idea of which variables are most important and which ones we may not need to include in our final model."

Notice how I didn’t talk about splitting, information gain, or any of the other details that they don’t really need to know about. That way, when they later ask why we did not include "Humidity" as a feature in our model, we can explain that it consistently ranked low on the importance list and that removing it did not significantly affect the model’s performance.
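Behind that plain-language answer sits a very small amount of code. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data I made up for illustration: "temperature" and "occupancy" drive the target, while "humidity" is generated to be unrelated to it. In scikit-learn, a fitted random forest exposes one score per feature through its feature_importances_ attribute; the ranked list is simply those scores sorted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic building data (assumption: invented for this example).
temperature = rng.uniform(0, 35, n)
occupancy = rng.uniform(0, 200, n)
humidity = rng.uniform(20, 90, n)  # deliberately unrelated to the target

# Energy usage depends on temperature and occupancy, plus a little noise.
energy = 3.0 * temperature + 0.5 * occupancy + rng.normal(0, 2.0, n)

feature_names = ["temperature", "occupancy", "humidity"]
X = np.column_stack([temperature, occupancy, humidity])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, energy)

# Sort the per-feature scores so the most important feature comes first.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:<12} {score:.3f}")
```

On data like this, humidity lands at the bottom of the list, which is the evidence you would summarize for stakeholders in one sentence rather than show as a table of scores.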

Conclusion

It’s normal and common to get stuck in your technical bubble, with your own ways of thinking and jargon, as a data scientist. After all, you are working on these problems day in and day out, so many of the terms and definitions seem obvious to you.

However, it’s important to stay well-rounded and be able to explain these concepts simply to other people so that you can succeed in working together and building effective solutions.


Thanks for reading



Written By

Haden Pelletier
