Regardless of the industry you’re in, if your company is growing, it’s almost guaranteed that someone will ask you what you’re planning to do with data science. It could be your investors and shareholders. But it could also be that friend with the MBA telling you about deep learning, or an article in Wired that convinces you that, without a data science team well-versed in artificial intelligence, you might as well declare bankruptcy and go home right now.
There are plenty of examples of data science making big differences for all kinds of organizations. For example, the City of Boston used data science to reduce the number of calls about rat populations. In the commercial space, data science is used for a broad range of things, from helping to deliver groceries optimally, to improving search engine results, to improving healthcare.
But data science is still a relatively new field. It is still ill-defined, and the three-dimensional, fuzzy nature of the skill set means that it can be hard to figure out where to place data scientists in a company, and how they can be a value-add to your business. For all the glossy press, as a manager or executive, it’s not immediately obvious how data science might benefit you, or how to start implementing it.
What Data Science Really Does
Suppose you lure a group of data scientists into a room (presumably with the promise of free cloud computing credits) and ask them to tell you what they do. Inevitably, the conversation will turn to The Venn Diagram of Data Science, originally conceived by Drew Conway:
Conway created this diagram as a way of understanding all the skills a data science job could require. Since then, it’s become both a good way of level-setting a discussion about what data science means and an object of countless hours of debate as practitioners and managers try to figure out where their teams should fit.
At its core, data science is the process of answering questions that are either predictive or explanatory in nature. The real marker of whether you need a data scientist is: do you know what you don’t know?
For example, you probably already know how many downloads you served last week, or how many repeat customers you have. These questions are in the realm of data analysts, who query your back-end databases and come up with aggregated, descriptive counts of events that happened historically.
But the patterns that drive those numbers, and questions like “Why aren’t people logging in as much as we thought they would be?”, are the realm of data science.
Data science is the process of using that aggregated data to answer questions and move the needle on business decisions. Let’s say you run a food delivery service. A data analysis question would be, “how many people used our app in New York City this year?” The data science equivalent question would be, “if the number of users in New York City is decreasing, to what can we attribute the decline, and what can we expect our sales to be next year if we adjust for that decline?”
At a high level, the process of doing data science involves asking a question, formulating a hypothesis about that question, and performing data analysis that either confirms or rejects the hypothesis. All of the questions should be directly related to business strategy, and they usually involve improving customer experience.
Here are some real examples of these questions being answered every day at companies with large data science teams:
- Predicting a customer’s next order.
- Predicting which content format Netflix should serve you to improve your streaming quality.
- Helping people who list rentals on Airbnb price their rentals correctly.
The largest challenge is understanding whether you’ve reached the point where you need data science. And if you have, do you have enough data for the team to use?
Your First Data Scientist Won’t Be a Data Scientist
When do you need a data scientist? When you have questions about the business that you can’t easily answer with just a gut feel or high-level domain knowledge of your business. Let’s continue our example of the food delivery service. If you are just starting to make and track deliveries, chances are that you don’t yet need a data scientist. You’ll need a good analyst who’ll help you set baseline patterns and do weekly reporting. You can then work with a data scientist to come up with some hypotheses: “People order delivery more when it rains. Can we predict how many drivers we’ll need for the next rainy day? Can we see what happens when we send out more drivers? Was the change in orders due to an increase in drivers, or due to something else entirely?” Here’s where predictive modeling comes into play.
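To make the rainy-day question concrete, here is a minimal sketch of the simplest kind of predictive model: a straight-line fit of daily orders against rainfall. Everything in it is an assumption made for illustration: the numbers are invented, and the names `fit_line`, `drivers_needed`, and `orders_per_driver` are hypothetical. A real team would use far more history and would check whether a straight line is even a reasonable fit.

```python
import math

# Hypothetical daily history (invented for illustration only):
# rainfall in millimetres and the number of orders delivered that day.
rain_mm = [0.0, 2.0, 5.0, 0.0, 10.0, 8.0, 1.0]
orders = [98, 112, 124, 101, 152, 139, 106]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def drivers_needed(rain, intercept, slope, orders_per_driver=10):
    """Convert a predicted order count into a driver count, rounding up.

    orders_per_driver is an assumed capacity, not a measured one.
    """
    predicted_orders = intercept + slope * rain
    return math.ceil(predicted_orders / orders_per_driver)


intercept, slope = fit_line(rain_mm, orders)
drivers = drivers_needed(12.0, intercept, slope)
print(f"Each extra mm of rain adds about {slope:.1f} orders")
print(f"Drivers to schedule for a 12 mm day: {drivers}")
```

Note that a fit like this only describes how rain and orders moved together in the past. Answering whether sending out more drivers caused a change in orders takes more than a trend line, which is part of why the question belongs to a data scientist.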
So, let’s now assume that you have your data in place and you’re ready to hire. What kind of person do you need, and what kinds of skills should they have? Although Conway’s diagram specifies three areas of overlap, in my own career I’ve seen the most necessary skills fall along a spectrum that blends engineering and statistics.
Here’s one way to visualize it:
In other words, the earlier you are in the process, the more you’ll need data engineers and data analysts. Once you start asking questions that require predictive modeling, you’ll move towards needing data scientists and, if you have data at scale, machine learning engineers who can put large models into production environments.
Early in your data science team’s work, you always first need good, clean data, whether the goal is to model some of the more complicated questions described above or just to do simple reporting. Any data professional will tell you that getting data clean, and keeping it clean over time, takes up the majority of their working hours, so the less mature your data infrastructure is, the more cleaning and data movement you’ll need to do.
So, you’ll need to set expectations with early analytics employees that they’ll be doing a significant amount of data engineering. All of that ingested data needs to be cleaned and analyzed to set baseline patterns. This is where the analyst comes in. Of course, it’s possible that both these roles will be filled by one person. In that case, make sure to set expectations about responsibilities again. Not everyone needs an AI expert right away, and many data scientists who take promising positions to answer specific business questions end up leaving when the data engineering and analyst work the job actually requires turns out to be a mismatch with their expectations.
The more advanced your data ingest pipeline becomes, the more your hires’ roles will shift towards the top and to the right of the chart. Of course, in any successful data science team, you will need a diverse combination of skill sets as the demand for data continues to increase across the organization.
What Else Do You Need?
To fuel your ingest pipeline, you need clean, stable data that is secure. For many small businesses, data pipelines often start as Google Docs or customers submitting information in web forms. The data science process looks something like a pipeline. First, you start with the data you have available. It needs to be cleaned and structured into a format that data scientists can use for modeling. That data then needs to be placed somewhere secure where your data scientists can access it.
The majority of time spent is in the data cleaning phase. If you don’t have good inputs, you’ll never be able to get to predictive modeling, which is what will ultimately answer your questions. For example, if you’re a food delivery service, and you don’t have a standard list of restaurant names, it can be impossible to figure out where orders are coming from.
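As a small illustration of what that cleaning looks like in practice, the sketch below collapses several spellings of the same restaurant into one standard name before counting orders. The sample names and the `normalize_name` helper are hypothetical, and the rule is deliberately simple: it assumes names written in plain English letters and would need more care for accented characters or genuinely different restaurants with similar names.

```python
import re
from collections import Counter


def normalize_name(raw):
    """Lowercase a name, drop apostrophes, and collapse punctuation and spaces."""
    name = raw.lower().replace("'", "").replace("’", "")
    # Any run of characters that is not a letter or digit becomes one space.
    name = re.sub(r"[^a-z0-9]+", " ", name)
    return name.strip()


# Hypothetical raw values as they might arrive from forms or spreadsheets.
raw_orders = [
    "Joe's Pizza",
    "JOES PIZZA ",
    "joe's  pizza",
    "Thai Garden",
    "Thai-Garden",
]

# Count orders per restaurant using the standardized name as the key.
orders_per_restaurant = Counter(normalize_name(name) for name in raw_orders)
for restaurant, count in orders_per_restaurant.most_common():
    print(restaurant, count)
```

Without the normalization step, the same five orders would appear to come from five different restaurants.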
Then you’ll have to think through your ongoing processes. Once you’re able to answer your first set of questions, you want that work to be reproducible, which involves setting up orchestration and infrastructure around that individual data flow.
As an example of this workflow, let’s use our earlier food delivery company. It might be that the first time you answer how many users you have, you’re pulling SQL queries that you then keep on your own laptop. But if you want to have the user number constantly available to you for later downstream processes, like analysis in Jupyter notebooks or R or any other statistical tool, you’ll need a process that pulls the number of users every day and puts it into some kind of platform that data scientists can use.
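Here is a rough sketch of what that daily process could look like, using Python’s built-in `sqlite3` module with an in-memory database so that it runs anywhere. The table names (`orders`, `daily_user_counts`), the column names, and the `record_daily_user_count` function are all assumptions for the sake of the example. In a real setup, the query would run against your production database and a scheduler would call the function once a day.

```python
import sqlite3
from datetime import date


def record_daily_user_count(conn, snapshot_date):
    """Count distinct users in the orders table and store it under a date."""
    (user_count,) = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM orders"
    ).fetchone()
    # INSERT OR REPLACE keeps one row per date, so re-running is safe.
    conn.execute(
        "INSERT OR REPLACE INTO daily_user_counts (snapshot_date, user_count) "
        "VALUES (?, ?)",
        (snapshot_date.isoformat(), user_count),
    )
    conn.commit()
    return user_count


# Hypothetical schema and sample rows, standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)"
)
conn.execute(
    "CREATE TABLE daily_user_counts "
    "(snapshot_date TEXT PRIMARY KEY, user_count INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (user_id) VALUES (?)", [(1,), (2,), (2,), (3,)]
)

count = record_daily_user_count(conn, date(2024, 1, 1))
print(f"Users recorded for 2024-01-01: {count}")
```

The point of storing the result in its own table is that a notebook or any other downstream tool can read the history of daily counts without anyone rerunning a query from a laptop.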
It’s OK If You’re Still “Behind”
Data science is still a relatively new field, which means lots of buzz, hype, and confusion. At its heart, though, data science is about answering business questions that need the support of numbers, so it’s worth understanding and evaluating if it makes sense for your business to start down that path.
But don’t let the popular press tell you otherwise: data science shouldn’t be viewed as the be-all and end-all. It’s also worth keeping in mind that you’ll need to spend some time and money setting up infrastructure and structuring your company so that you can do data science.
Once you do, though, it can serve as a good pulse check: “Is my company headed where it needs to go?”
Vicki Boykis is a senior manager at CapTech Consulting, working on projects in machine learning and data engineering across many industries. She has a BS in Economics from Penn State and an MBA from Temple University. In her free time, she enjoys blogging, reading, chasing her toddler, and purchasing Adidas footwear. Follow her on Twitter @vboykis or check out her blog.