What constitutes an efficient development environment for data scientists?
The goal of building an efficient development environment for data scientists goes beyond helping them conduct PoCs; it is also about making them more productive.
Background
At first glance, the title of this article may sound like a very simple question. The answer seems quite straightforward as well: data scientists just need a notebook environment to start developing ML models. This might have been true 5 or 10 years ago, but we have long passed the ML PoC stage. Organizations invest resources and effort in AI projects with the expectation that the solution will eventually be deployed into a production environment so that the business value of AI can be realized.
Therefore, the goal of building an efficient development environment for data scientists is not just to help them conduct PoCs, but to make them more productive and involved in the end-to-end ML workflow: data exploration → feature engineering → model experimentation → model deployment → continuous ML pipelines.
Therefore, in my opinion, a discussion of what constitutes an efficient development environment for data scientists is definitely worthwhile. For organizations where data scientists still mainly rely on notebooks on their local laptops for ML experiments, I hope this article can serve as a trigger point for thinking about building a more functional environment for your data scientists.
Data exploration
We will start with data exploration, as it is the most fundamental task data scientists perform before they move on to feature engineering and ML experiments. Broadly speaking, data exploration here refers to helping data scientists understand the data deeply, so it includes data visualization, data profiling and data quality checking. To perform these tasks, data scientists mainly leverage Pandas and popular Python visualization frameworks such as Matplotlib, Seaborn and Plotly (for interactive data visualization).
In order to fully understand the data they work with, data scientists need to write code to transform the data (groupby, pivot tables, merge/join and so on) so that they can explore it from different angles. The biggest pain point of this coding process is remembering the right pandas syntax, or finding it through Google and Stack Overflow. The same applies to data visualization: regardless of which visualization library they use, data scientists need to know the exact syntax to build graphs and charts. As a result, data scientists spend quite a bit of time remembering and searching for syntax during the data exploration stage.
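To make this concrete, a typical exploration snippet might look like the following. This is only a minimal sketch: the file name and column names (sales.csv, order_date, revenue, region, product) are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset and column names, purely for illustration.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Profile the data: shape, missing values, basic statistics.
print(df.shape)
print(df.isna().sum())
print(df.describe())

# Reshape the data to explore it from different angles.
monthly = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
by_region = df.pivot_table(index="region", columns="product",
                           values="revenue", aggfunc="sum")

# Quick visual check of the monthly trend.
monthly.plot(kind="line")
plt.show()
```

Each of these steps requires knowing the exact pandas or Matplotlib call, which is precisely the friction described above.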
Since we are talking about how to help data scientists be more efficient with their work, we need a better way to help them conduct data exploration more quickly and easily, meaning less coding and less remembering and searching for pandas syntax. Bamboolib is such a tool: it enables anyone to analyze data in Python without having to write code. You can think of bamboolib as a GUI for pandas data frames. With this GUI, data scientists can perform common data transformations (filter, sort, groupby, join/merge, change data types, string manipulation and extracting datetime attributes) and common data visualizations (histogram, bar plot, line plot, box plot, scatter plot, density heatmap and so on).
You can install the bamboolib open-source edition for Jupyter Notebook or JupyterLab on your local computer. Last year, bamboolib was acquired by Databricks, and the best parts of bamboolib have been integrated into the Databricks platform. If you are already using Databricks, you can experience no-code data analysis and transformations in a Databricks notebook with the support of bamboolib.
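As a rough illustration, here is how bamboolib is typically used inside a Jupyter notebook, assuming the open-source edition has been installed with pip install bamboolib; the CSV file name is a placeholder.

```python
# Run once per environment: pip install bamboolib
import bamboolib as bam
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

# After importing bamboolib, displaying a DataFrame in the notebook
# exposes the bamboolib GUI, where transformations and plots can be
# built without writing pandas code (the tool also exports the
# equivalent Python code for reuse).
df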
A shared and standardized notebook environment
Most of the time, data scientists start coding for a new project in a notebook inside a Python virtual environment on their local laptop. Within this local virtual environment, they install all the necessary libraries for their work, and their notebooks are stored on their laptops. A few problems can arise from this setup:
Each data scientist has his/her own development environment, which makes it very difficult for other team members to reuse and reproduce their work due to environment inconsistency, and limits the room for collaboration within the team.
Each data scientist has to install their own libraries, and the same effort is repeated for everyone. Installing libraries and preparing environments can be quite time consuming. Furthermore, data scientists are generally not specialists in managing environments and infrastructure, so they have to google a lot to make things work. This is definitely not the best use of a data scientist's time.
Hence, an efficient notebook environment should encompass the following key components:
A shared/hosted notebook environment — Instead of data scientists running notebooks on their own laptops, data scientists can use a shared instance to run notebooks. For example, both AWS SageMaker and Databricks provide managed ML notebook instances where data scientists can select different types of compute and use pre-installed runtime, saving data scientists a significant amount of time from setting up the notebook environment themselves on their own laptops.
A standardized runtime — A standardized runtime means a set of pre-installed common ML and data analysis libraries. This standard runtime is used by all data scientists, which eliminates the errors caused by runtime inconsistency when one data scientist tries to reuse the notebooks and artifacts produced by a peer. Additionally, as the common ML and data analysis libraries are pre-installed, data scientists save the time of manually installing the libraries themselves.
Integration with code version control tools — When several data scientists work together on the same project, it is inevitable that they need to share their work. With code version control integrated into their development environment, data scientists can check their code into a remote repository instead of saving different versions of notebooks locally. Ideally, a remote repository is in place before the ML project officially kicks off, so that data scientists work with version control from the very start.
ML experiment tracking and logging
Building ML models is an extremely iterative and experiment-driven process. Data scientists need to perform a huge number of experiment runs to try combinations of different feature engineering techniques, model architecture definitions and hyperparameters before the model's performance meets the business requirements. Manually recording these metrics and parameters and selecting the best-performing run is very challenging, particularly as the number of experiments grows quickly.
To make data scientists more efficient with their work, a mechanism that allows them to easily log every experiment run with its details, and then analyze and compare these runs to select the best-performing one, is very useful. There are open-source libraries such as MLflow and Weights & Biases that provide experiment tracking capabilities for logging parameters, code versions, metrics and output files for further analysis and visualization.
When designing a development environment for data scientists, it is necessary to build a centralized tracking server where all data scientists can not only log their own experiments but also discover and reuse the experiments logged by other team members. This reduces manual and redundant experiment tracking work, promotes sharing and improves collaboration among data scientists.
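As an illustration, with MLflow a single experiment run can be logged against a shared tracking server roughly as follows. This is a minimal sketch: the tracking server URL, experiment name, parameters, metric values and artifact file are all placeholders.

```python
import mlflow

# Point to the shared/centralized tracking server (placeholder URL).
mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    # Log the knobs of this run...
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("max_depth", 8)
    # ...and the resulting metrics, so runs can be compared in the MLflow UI.
    mlflow.log_metric("val_auc", 0.87)
    # Any output file can be attached as an artifact (assumes the file exists).
    mlflow.log_artifact("feature_importance.png")
```

Because every data scientist logs to the same tracking URI, all runs land in one place where they can be filtered, compared and reused.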
For MLflow, there are 2 deployment options:
If you are already using Databricks, you can leverage the Databricks-hosted MLflow tracking server.
If you are using open-source MLflow, you can refer to this tutorial for building your own MLflow tracking server.
For Weights & Biases, there are 3 deployment options:
W&B SaaS cloud — A multi-tenant SaaS offering;
W&B dedicated cloud — A managed, dedicated deployment on W&B’s single-tenant infrastructure, in your choice of cloud region;
W&B customer-managed — Customers can deploy and manage Weights & Biases on their own managed cloud or on-premises servers.
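Whichever deployment option is chosen, the logging code on the data scientist's side looks roughly the same. Below is a minimal sketch; the project name, config values and metrics are placeholders.

```python
import wandb

# Start a run in a shared project (placeholder project name and config).
run = wandb.init(project="churn-model",
                 config={"model_type": "xgboost", "max_depth": 8})

# Log metrics as training progresses so runs can be compared in the W&B UI.
for epoch in range(3):
    wandb.log({"epoch": epoch, "val_auc": 0.80 + 0.02 * epoch})

run.finish()
```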
Building ML pipelines
In my previous article, Why Data Scientists Should Adopt Machine Learning (ML) Pipelines, I shared my opinions on why it is necessary for data scientists to start adopting ML pipelines. Automating a machine learning pipeline can cut deployment time from months to days, even minutes. However, developing ML pipelines is fundamentally different from developing ML models in a notebook, so it is necessary to consider what can be done to make it easier for data scientists to author ML pipelines. ML pipelines generally consist of YAML files and .py scripts. At the very least, the development environment should allow data scientists to easily switch between notebooks, YAML files and .py scripts.
There are 2 options to build such a development environment:
The first option is to embed notebook capabilities in an Integrated Development Environment (IDE). The popular IDEs, particularly for Python developers, are Visual Studio Code (VSC) and PyCharm, and both support notebooks. Although this option has some learning curve for data scientists, as they need to spend time getting familiar with an IDE, it is the more popular option from what I have seen.
The second option is to allow data scientists to develop non-notebook artifacts, such as configuration files, YAML files and .py scripts, in a notebook environment. JupyterLab is such an example. JupyterLab is positioned by the Jupyter project as the next-generation notebook interface, and it enables code in any text file (Markdown, Python, R, LaTeX, etc.) to be run interactively in any Jupyter kernel.
Other than selecting the development interface for authoring ML pipelines, another important decision is selecting a pipeline/workflow tool that assists data scientists in developing ML pipelines. There are a few open-source ML pipeline tools, such as Kubeflow Pipelines, Metaflow and ZenML. Instead of explaining how each tool works here, I think it is more important to share the selection criteria. The two most important aspects when selecting an ML pipeline tool are:
The supported workflow orchestrator engine;
The supported deployment infrastructure;
Let’s dive into each one of them.
The supported workflow orchestrator engine
Most organizations already have at least one workflow orchestrator tool in place; the popular ones include Apache Airflow, Argo Workflows, Dagster and Prefect. Therefore, when you select an ML pipeline tool, you need to understand whether the one you have selected natively integrates with the existing workflow orchestrator tool in your organization. If not, you will have the extra work of managing another orchestrator. I would therefore recommend selecting one that already integrates with your orchestrator. For example:
Kubeflow Pipelines runs on Argo Workflows as the workflow orchestrator engine;
Metaflow supports Argo Workflows, Airflow and AWS Step Functions as the workflow orchestrator engine (see the minimal Metaflow sketch after this list).
ZenML has native integration with workflow orchestrator engines such as Airflow. What is special about ZenML is that it provides integration not only with workflow orchestrators; it can also push user-defined ML workflows to other ML pipeline tools such as Kubeflow Pipelines, to cloud-managed ML services such as Google Vertex AI and AWS SageMaker, and to CI/CD pipeline tools such as GitHub Actions and Tekton Pipelines.
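To make the Metaflow example more concrete, below is a minimal flow sketch; the step bodies and values are placeholders, not a real training pipeline. Running python train_flow.py run executes it locally, and Metaflow provides commands (such as step-functions create and argo-workflows create) to deploy the same flow, unchanged, to an existing orchestrator.

```python
# train_flow.py -- a minimal Metaflow pipeline sketch with placeholder steps.
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Load or reference training data here.
        self.n_rows = 1000  # placeholder
        self.next(self.train)

    @step
    def train(self):
        # Fit the model; attributes assigned to self are versioned as artifacts.
        self.val_auc = 0.87  # placeholder metric
        self.next(self.end)

    @step
    def end(self):
        print(f"Finished training on {self.n_rows} rows, val_auc={self.val_auc}")


if __name__ == "__main__":
    TrainFlow()
```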
The supported deployment infrastructure
For deployment infrastructure, most organizations generally focus on one platform. For example, some organizations are AWS-focused (most of their infrastructure leverages AWS services), while others are Kubernetes-focused. When you select ML pipeline tools, you also need to take your organization's existing deployment infrastructure into consideration:
If your organization is very Kubernetes-focused, it is quite straightforward to select Kubeflow Pipelines, as it is a component of Kubeflow, which is dedicated to making deployments of machine learning (ML) workflows on Kubernetes extremely simple.
If your organization is AWS-focused, Metaflow could be a good option, as it natively supports AWS managed services such as AWS Batch for compute and AWS Step Functions for orchestrating workflows in production.
Of course, beyond these 2 aspects, there are other considerations when selecting tools for ML pipelines. Each organization is unique in its own way, and you can make the decision based on what works best for your organization.
Secure access to data in the production environment
Data scientists generally start their development in a dev environment. Since ML is extremely data-dependent, data scientists will need access to data in a production environment. There are a few ways to do this:
The first way is to copy a portion of production data from the production environment into the ML dev environment. However, there are a few issues with this approach. First, it duplicates data, and some manual effort is required to carry out the copying activities. Secondly, data scientists will not have the most recent data, and by the time they finish model training, the underlying data may have already drifted to a different distribution. Thirdly, the dev environment is relatively loose on access control, so copying production data into it could cause data leakage and potential violations of data security, privacy and regulatory requirements.
The second way is to give data scientists access to the production environment. This could solve the problems arising from the first approach, but it has its own concerns. First, production environments are generally very stringent on access control; unless specifically required, the production environment should not be touched by a human. Additionally, data scientists have little experience operating a production environment, so giving a data scientist access to it is quite risky: they could break the production environment, and the consequences could be huge. Therefore the second option is not ideal either.
The third option is to turn the ML development environment into a sandbox environment. In this sandbox, data scientists are given read-only access to the production data. They can only work with production data inside the sandbox and are not allowed to copy or move the data outside of it. This sandbox requires some initial setup and configuration effort, but it gives data scientists access to the most recent and up-to-date data without compromising the security of the production environment.
Summary
To summarize, an efficient development environment for data scientists should:
Assist data scientists to do data exploration quickly.
Provide a standardized and pre-installed runtime.
Allow data scientists to easily conduct ML experiment tracking and logging.
Make it easy for data scientists to switch from developing notebooks to developing ML pipelines.
Provide secure access to the most updated production data.
I hope you have enjoyed this article. Please let me know if you have any comments or questions on this topic! I generally publish one article related to building an efficient data and AI stack every week. Feel free to subscribe so that you get notified when these articles are published.
Thanks!