Crawl, Walk, Run: Bringing Data Science into your Organization
In this three-part series, we’re exploring a tiered approach to introducing and incorporating data science into your organization. In Part One: Crawl, we discussed how to get started from scratch. Today in Part Two: Walk, we’ll address issues that may emerge and how to overcome them, how to build out a dedicated data science team, and more.
Part two: We’re walking!
Where do we go from here?
You now have your toolsets identified for development. You developed your first model. You built some confidence that machine learning can solve some of your organization’s problems, and you’re starting to get more requirements for new model development. Now what? It’s time to walk.
A few issues start to crop up at this point.
You’ve got to figure out how to build out a team to start handling all the new requests. We’ll talk briefly about some of the options you have for building a team, but that could be a whole topic of discussion in itself.
Once you build a team, you must figure out the larger workflow process. You’ll have multiple people working on the code base, so you’ll need to figure out code management strategies. You’ll also likely need to figure out a review and approval workflow, as well as some quality gates to ensure the models you’re deploying meet expectations.
As you start producing more models, you also need to start thinking about how you’re going to get the results of all these models into your stakeholders’ hands. It would be a shame for all the hard work your team does to get stuck in the “data science lab” and never get used. Unfortunately, that is another barrier that many data science teams face. Ideally, the results will be available in an interactive format that responds to changes, but that may not always be possible. Let’s aim for that goal though.
Building a data science team
How you build out your data science team depends a lot on the structure of your company and how much open communication there is between teams. Some suggested configurations are listed below.
|Centralized|Matrixed|Embedded|
|---|---|---|
|Dedicated team of data scientists compartmentalized from the rest of the company.|Dedicated team of data scientists where individuals are temporarily assigned to other teams, but return to work on special projects.|Data scientists are permanently part of cross-functional teams.|
Keeping your data science team centralized allows for greater cohesion within the data science team and more sharing of best practices. It is also easier to enforce consistent coding styles, code management, and quality requirements.
Embedding data scientists, on the other hand, allows for greater cohesion with the team focused on a specific project, and can help provide the contextual knowledge necessary for successful feature engineering and the building of quality models.
I like having a team that is somewhere in between centralized and matrixed where the data scientists are partially assigned to another team during the project lifetime but do not completely “leave” the centralized data science team. This helps ensure continued collaboration and discussion with fellow data scientists.
Model and code management
Inevitably, once you have more people working on code, you need to have some sort of model and code management system. If you’re not familiar with code repositories like Bitbucket or GitHub, they allow you to store your code in a central location and help guide your team through the development workflow.
These repositories feature version control, which lets your team track the changes made to the code base over time (and easily revert to previous versions if necessary). They also provide access control to restrict who can view and change the code, and workflow tools such as pull requests to guide the code review process.
Make quality a priority
As your team picks up momentum and expands the development of models, it’s incredibly important to have a well-defined code review and quality assurance process. This is especially true if your organization comes to rely heavily on a “citizen data science” model, where those creating the models may not have a data science background.
The codebase needs to be maintainable and understandable. One way to achieve this is to ensure the code is fully documented using a well-defined documentation style. The codebase also needs to be reliable. That comes from peer reviews that look for both technical and logical correctness, and from unit tests that must pass before the code is submitted for review.
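To make that concrete, here is a minimal sketch of what a documented, unit-tested piece of code could look like. The function `scale_to_unit_range`, its docstring layout, and the test class are all hypothetical examples chosen for illustration; your team would pick its own documentation style and test framework.

```python
import unittest


def scale_to_unit_range(values):
    """Rescale numeric values linearly onto the interval [0, 1].

    Args:
        values: A non-empty list of numbers.

    Returns:
        A list of floats the same length as ``values``. If every value is
        identical, all outputs are 0.0 because there is no range to scale by.

    Raises:
        ValueError: If ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


class ScaleToUnitRangeTest(unittest.TestCase):
    """Checks a reviewer can expect to pass before a pull request is opened."""

    def test_endpoints_map_to_zero_and_one(self):
        self.assertEqual(scale_to_unit_range([10, 20, 30]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zeros(self):
        self.assertEqual(scale_to_unit_range([5, 5]), [0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            scale_to_unit_range([])
```

Saved to a file, the tests can be run with `python -m unittest` followed by the file name. The docstring tells a reviewer what the function promises, and the tests check that the promise holds, including the edge cases.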
In addition to having a well-documented, reliable codebase, you also want to make sure your models perform well. In the Crawl phase, we talked about how to define whether a model is successful or not. As the team grows, that definition needs to become a well-defined evaluation process, because standardizing the evaluation ensures all models are assessed consistently.
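One way to standardize evaluation is to route every model through the same scoring function and the same pass/fail thresholds. The sketch below assumes a binary classifier with labels 0 and 1. The names `EvaluationReport` and `evaluate_binary_classifier`, and the threshold values, are hypothetical placeholders rather than recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationReport:
    accuracy: float
    precision: float
    recall: float
    passed: bool  # True only if every quality gate is met


def evaluate_binary_classifier(y_true, y_pred, min_accuracy=0.8, min_recall=0.7):
    """Score predictions against labels and apply the shared quality gates.

    The thresholds are illustrative defaults; each organization would set
    its own based on what "successful" means for the problem at hand.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    passed = accuracy >= min_accuracy and recall >= min_recall
    return EvaluationReport(accuracy, precision, recall, passed)
```

Because every model gets the same report, reviewers can compare candidates side by side, and the `passed` flag can act as a quality gate in the approval workflow.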
There are many ways to operationalize your models. Which one works best depends a lot on how your organization is set up (self-hosted vs. cloud-hosted) and exactly what you’re trying to do with the results of the model (sending the results to another service vs. end-user directly accessing the results).
For algorithms that have pre-determined slicing/dicing possibilities, you may want to integrate the models directly into the ETL (extract, transform, load) process and make the results available alongside the source data to use in reports and dashboards. For models where it would be beneficial to have the ability to change input variables on the fly, you may want to use microservices to be able to make requests in real-time and see different results in your report/dashboard platform.
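As a rough sketch of the two paths, the example below exposes the same model through a batch function (for an ETL step) and through a tiny HTTP service (for on-the-fly requests). It uses only the Python standard library. The `predict` function is a hypothetical stand-in for a trained model, and the feature names, weights, `/predict` path, and port are all assumptions made for illustration. A production service would more likely use a dedicated web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: a hypothetical weighted sum of inputs."""
    weights = {"price": -0.5, "ad_spend": 0.25}
    return sum(weights[name] * float(features.get(name, 0.0)) for name in weights)


def score_batch(rows):
    """Batch path: attach a prediction to each record during an ETL load step."""
    return [{**row, "prediction": predict(row)} for row in rows]


class PredictionHandler(BaseHTTPRequestHandler):
    """Real-time path: answer POST /predict with a JSON prediction."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        except (ValueError, TypeError, AttributeError):
            self.send_error(400, "body must be a JSON object of numeric features")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch


def make_server(host="127.0.0.1", port=8000):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), PredictionHandler)
```

Calling `make_server().serve_forever()` starts the service, and a dashboard can then send a JSON body such as `{"price": 10, "ad_spend": 40}` to `/predict` whenever a user changes an input. Both paths call the same `predict` function, so the batch results in your reports and the real-time results stay consistent.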
We’re well into the process now, but there’s one more phase to go. Stay tuned for the third blog post to learn about the final phase of bringing data science into your organization.