DevOps for Data Scientists: Taming the Unicorn

Stow do we version control the model and add it to an app? How will people interact with our website based on the outcome? How will it scale!?

When most data scientists start working, they are equipped with all the neat math concepts they learned from school textbooks. However, pretty soon, they realize that the majority of data science work involve getting data into the format needed for the model to use. Even beyond that, the model being developed is part of an application for the end user. Now a proper thing a data scientist would do is have their model codes version controlled on Git. VSTS would then download the codes from Git. VSTS would then be wrapped in a Docker Image, which would then be put on a Docker container registry. Once on the registry, it would be orchestrated using Kubernetes. Now, say all that to the average data scientist and his mind will completely shut down. Most data scientists know how to provide a static report or CSV file with predictions. However, how do we version control the model and add it to an app? How will people interact with our website based on the outcome? How will it scale!? All this would involve confidence testing, checking if nothing is below a set threshold, sign off from different parties and orchestration between different cloud servers (with all its ugly firewall rules). This is where some basic DevOps knowledge would come in handy.

What is DevOps?

 
Long story short, DevOps are the people who help the developers (e.g. data scientists) and IT work together.
Usual battle between Developers and IT

Developers have their own chain of command (i.e. project managers) who want to get features out for their products as soon as possible. For data scientists, this would mean changing model structure and variables. They couldn’t care less what happens to the machinery. Smoke coming out of a data center? As long as they get their data to finish the end product, they couldn’t care less. On the other end of the spectrum is IT. Their job is to ensure that all the servers, networks and pretty firewall rules are maintained. Cybersecurity is also a huge concern for them. They couldn’t care less about the company’s clients, as long as the machines are working perfectly. DevOps is the middleman between developers and IT. Some common DevOps functionalities involve:

  • Integration
  • Testing
  • Packaging
  • Deployment

The rest of the blog will explain the entire Continuous Integration and Deployment process in detail (or atleast what is relevant to a Data Scientist). An important note before reading the rest of the blog. Understand the business problem and do not get married to the tools. The tools mentioned in the blog will change, but the underlying problem will remain roughly the same (for the foreseeable future atleast).

[Read More]

Leave a Reply

Your email address will not be published. Required fields are marked *