How Is an ML-Driven System Unique? — Understanding Why MLOps Is Necessary and What Components of ML Infrastructure Are Required
A deeper explanation of why MLOps is required, and what problems MLOps is trying to solve to make AI work in a real world production environment
👋 Hi folks, thanks for reading my newsletter! My name is Yunna Wei, and I write about the modern data stack, MLOps implementation patterns and data engineering best practices. The purpose of this newsletter is to help organizations select and build an efficient Data+AI stack with real-world experience and insights, and to help individuals build their data and AI careers with deep-dive tutorials and guidance. Please consider subscribing if you haven’t already, and reach out on LinkedIn if you ever want to connect!
For any organization that has already built ML solutions, or is even just at the PoC stage of developing an ML model, I am sure MLOps has come up as a topic. Indeed, in order to develop and deploy an ML solution in a production environment in a reliable, secure and highly available manner, MLOps is required.
In today’s blog, I would love to take one step back to explain:
Why is MLOps necessary and required?
More importantly, what problems is MLOps trying to solve?
What components of ML infrastructure are required in order to solve the problem?
Hopefully, by understanding the why, you can make more informed decisions about whether MLOps is needed for your organization and, if so, which components of MLOps are necessary in order to solve the problems and challenges that stop AI from working in your organization.
I would like to start by explaining how an ML-driven system is unique and why this uniqueness requires an MLOps solution.
How Is an ML-Driven System Unique?
To understand why MLOps is necessary and assess what capabilities your organization needs to build with regards to MLOps, we first need to understand how an ML-driven system is unique. I have summarized the “uniqueness” of an ML-driven system into the following 4 aspects:
Data Centric — Data-centric not only means that the success of an ML-driven solution depends on the quality and quantity of the data, but also that the intrinsic characteristics of the data itself determine which MLOps capabilities are required.
Multi-team and Multi-skillset Collaboration — In most organizations, building and operating an ML-driven system generally requires the effort of multiple teams, including data scientists, data engineers, ML engineers, DevOps engineers and so on. I will shortly explain how the team structure might impact the design of an MLOps workflow and process.
Dynamic Ecosystem — There is an extremely rich and fast-growing ecosystem for ML and AI. For example, there are quite a few open-source libraries for ML and Deep Learning (DL) algorithms, like Scikit-learn, TensorFlow, Keras, PyTorch, fastai, and Hugging Face. Additionally, there are various AI applications that can be built with ML, including supervised, semi-supervised, self-supervised, unsupervised and reinforcement learning, and so on. Depending on which algorithms and applications your ML systems are built on, the required MLOps capabilities also vary.
Continuous Change — ML-driven systems are built from and learn from data. When the underlying data changes, the ML model and algorithm also need to be updated accordingly. ML-driven systems are always in a state of continuous change; there is really no end state. Therefore, monitoring both data changes and model performance changes becomes very necessary. The faster the business environment and the data change, the more important it is that the ML model is re-trained and re-learns. So you really need to understand your business requirements and how quickly your data landscape changes, and then determine how important it is to build a monitoring solution into your MLOps workflow.
Let’s dive deeper into how each of these 4 aspects influences your design choices for an overall MLOps solution.
Data Centric
When we talk about an ML system being data centric, the obvious interpretation is that the performance of an ML model significantly depends on the quality and quantity of data. Indeed, this is true. However, I think “data centric” has more implications when it comes to MLOps:
First, data always changes. This requires a monitoring solution to detect data drift, including changes in key statistics such as the minimum, maximum and standard deviation, as well as changes in the data distribution. Once the underlying data changes, the performance of the ML models trained on this data is also very likely to have changed (generally for the worse), so the model performance needs to be monitored as well. Therefore, if your data changes very frequently, it is absolutely necessary to build monitoring capability into your overall MLOps infrastructure stack.
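To make this concrete, below is a minimal sketch of what a basic data drift check could look like, using pandas summary statistics and a Kolmogorov-Smirnov test from SciPy. The `reference_df` / `current_df` names and the p-value threshold are assumptions for illustration; dedicated monitoring tools (listed later in this article) handle this far more thoroughly.

```python
# A minimal data drift check: compare a reference dataset (what the model was
# trained on) against recent production data, column by column.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(reference_df: pd.DataFrame, current_df: pd.DataFrame,
                 p_value_threshold: float = 0.05) -> dict:
    report = {}
    for col in reference_df.select_dtypes(include="number").columns:
        ref, cur = reference_df[col].dropna(), current_df[col].dropna()
        # Kolmogorov-Smirnov test: a small p-value suggests the column's
        # distribution has shifted between the two datasets.
        _, p_value = ks_2samp(ref, cur)
        report[col] = {
            "ref_mean": ref.mean(), "cur_mean": cur.mean(),
            "ref_std": ref.std(), "cur_std": cur.std(),
            "ks_p_value": p_value,
            "drift_detected": p_value < p_value_threshold,
        }
    return report
```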
Second, data is almost always dirty. It is well known that data scientists spend quite a bit of their time cleaning and transforming the data into a usable state. It is worthwhile understanding how much time your own data scientists spend on cleansing dirty data. If the amount of time is high, it is probably useful to understand how your overall data pipeline is running right now, and to include some data quality checks and schema enforcement to improve overall data quality and reliability.
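As a rough illustration, here is a minimal sketch of the kind of data quality checks and schema enforcement that can be added to a pipeline. The expected columns, dtypes and rules are purely illustrative assumptions:

```python
# Minimal data quality checks on an incoming batch, applied before the data
# reaches feature engineering or model training.
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "age": "int64", "income": "float64"}

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of data quality issues; an empty list means the batch is clean."""
    issues = []
    # Schema enforcement: required columns and dtypes.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"wrong dtype for {col}: {df[col].dtype}, expected {dtype}")
    if issues:
        return issues  # skip row-level checks if the schema is already broken
    # Basic quality rules: duplicates, out-of-range values, nulls.
    if df["customer_id"].duplicated().any():
        issues.append("duplicate customer_id values")
    if not df["age"].between(0, 120).all():
        issues.append("age values outside the 0-120 range")
    if df.isna().any().any():
        issues.append("null values present")
    return issues
```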
Third, data is not always ML-usable. Rarely can raw data be fed into ML algorithms for learning immediately. Data scientists very often have to convert raw data into features to help the ML algorithms learn more quickly and accurately. This conversion process is called feature engineering. Feature engineering generally requires a lot of time and effort from data scientists. Therefore, features are also assets to an organization, just like the data itself. If your organization has multiple teams of data scientists and they work on similar raw data, there is a good chance that they are producing similar or even identical features. In this case, building a central feature store where data scientists can publish and share their features, and discover and reuse features created by other data scientists, can add significant value and speed up ML model development. Of course, beyond feature discovery and reuse, a feature store has other functions as well. I will have another deep-dive blog on feature stores soon. Please feel free to follow me on Medium if you want a notification when the blog on feature stores is published.
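As an example of what this conversion looks like in practice, here is a minimal feature engineering sketch that turns a raw transactions table into per-customer features. The column names (`customer_id`, `amount`, `timestamp`) are assumptions for illustration:

```python
# Turn raw transaction records into per-customer features an ML algorithm
# can learn from: count, spend and recency signals.
import pandas as pd

def build_customer_features(transactions: pd.DataFrame) -> pd.DataFrame:
    transactions = transactions.copy()
    transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])
    features = transactions.groupby("customer_id").agg(
        txn_count=("amount", "count"),   # how often the customer transacts
        total_spend=("amount", "sum"),   # overall spend
        avg_spend=("amount", "mean"),    # typical transaction size
        last_txn=("timestamp", "max"),   # most recent transaction
    )
    # Days since the most recent transaction, a common "recency" feature.
    features["days_since_last_txn"] = (pd.Timestamp.now() - features["last_txn"]).dt.days
    return features.drop(columns=["last_txn"]).reset_index()
```

Features like these are exactly the kind of reusable assets that a central feature store can publish, version and share across teams.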
These are the implications of the fact that ML is data centric. Tackling them requires different components of your ML infrastructure, such as data and model monitoring, and feature stores.
Multi-team and Multi-Skillset Collaboration
Building an end-to-end ML solution requires the different skillsets of multiple teams. In my previous article — Learn the Core of MLOps — Building Machine Learning (ML) Pipelines, I mentioned that building an end-to-end ML solution requires at least the following 3 key pipelines (a simplified sketch of how they hand off to each other follows this list):
Data and Feature Engineering Pipelines, which are generally developed and owned by the data engineering team.
ML Model Training and Re-training Pipelines, which are generally developed and owned by the data science team.
ML Model Inference/Serving Pipelines, which are generally developed and owned by the ML engineering or production engineering team.
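Below is a highly simplified sketch of how these three pipelines hand off to each other. The function boundaries and the scikit-learn model are assumptions for illustration only:

```python
# The three core pipelines of an end-to-end ML solution, reduced to functions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def feature_engineering_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Data engineering team: clean raw data and produce model-ready features."""
    return raw.dropna()

def training_pipeline(features: pd.DataFrame) -> RandomForestClassifier:
    """Data science team: train (or retrain) a model on the latest features."""
    model = RandomForestClassifier()
    model.fit(features.drop(columns=["label"]), features["label"])
    return model

def inference_pipeline(model: RandomForestClassifier, new_data: pd.DataFrame):
    """ML engineering team: serve predictions on incoming data."""
    return model.predict(new_data)
```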
In order to automate these 3 pipelines in a reliable manner, there are also other key components required — Continuous Integration / Continuous Training / Continuous Deployment (CI / CT / CD) and Infrastructure as Code (IaC). These components are generally owned by the infrastructure team or the DevOps team.
Generally, these teams use very different services and toolsets. For example, data scientists mainly use Jupyter notebooks for ML model training and experimentation, data engineers use Python, SQL and Spark for developing data pipelines, while ML engineers not only need to understand how an ML model works in order to convert the data scientists' notebooks into modular and testable code, they also must have knowledge of the underlying infrastructure (like containers and Kubernetes) and DevOps pipelines (like GitHub Actions and Azure DevOps pipelines).
Using different tools, services and frameworks generally creates more gaps between these teams. Teams have to spend more time integrating different services and frameworks, which potentially duplicates work and can create inconsistencies. If these scenarios happen a lot in your organization, it will be quite valuable to consider a single platform that can enable the various workloads (data engineering, ML training and deployment, ML pipelines, feature store, workflow orchestration, DevOps integration, data and model monitoring) so that these various teams can work on the same platform. I know Databricks has such a unified data and AI platform that supports the various workloads mentioned above.
Having such a unified platform will make it much easier to manage your ML infrastructure and set up your end-to-end MLOps workflow, as you avoid “stitching together” many different services from different providers.
If your teams spend quite a bit of time integrating different platforms, with inconsistencies (or even errors) resulting from these integrations, or if team members spend time communicating (or even arguing) because of these platform differences, maybe it is time to consider moving to a unified platform and solving these problems at the infrastructure layer.
Dynamic Ecosystem
All aspects of the ML ecosystem have been growing fast, including ML model algorithms, distributed training, data and model monitoring, ML lifecycle management, ML pipeline frameworks, ML infrastructure abstraction, workflow orchestration, as well as web applications for sharing data insights and ML model results. Additionally, the variety of business use cases that AI and ML can be applied to has been flourishing as well — such as computer vision, Natural Language Processing (NLP), audio, tabular data, reinforcement learning and robotics, as well as multi-modal tasks.
Below, I have summarized the top few open source libraries for each of these ML aspects:
ML Model Algorithms — Scikit-learn, TensorFlow, Keras, PyTorch, fastai, Hugging Face Transformers, XGBoost
ML Distributed Training — Ray, Horovod, Dask
Data and Model Monitoring — Evidently, Arize, WhyLabs, Deepchecks
ML Lifecycle / Pipeline Management — MLflow, ZenML
ML Infrastructure Abstraction — Metaflow, Kubeflow
Web Applications for Sharing Data and ML Results — Streamlit, Gradio
Workflow Orchestration — Metaflow, Kubeflow, ZenML, Argo, Luigi, Dagster, Prefect
Use Cases — Computer vision (image classification, image segmentation, object detection), Natural Language Processing (conversational AI, text classification, question answering, summarization), Audio (audio classification, speech detection), Tabular (classification, regression), Reinforcement learning and robotics
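To give a feel for how a couple of these pieces fit together, here is a minimal sketch that trains a Scikit-learn model and tracks the run with MLflow; the dataset and parameter choices are illustrative assumptions:

```python
# Train a Scikit-learn classifier and record parameters, metrics and the model
# artifact with MLflow for experiment tracking and lifecycle management.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # Everything logged here becomes searchable and comparable across runs.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")
```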
This dynamic ecosystem has definitely encouraged and created a lot of innovation within the AI and ML community. But it also poses challenges to building a standardized and consistent ML workflow across an organization, if each team has its own data and ML stacks and libraries.
If your organization has various data science and ML teams decentralized across different functions, it will be useful to evaluate and recommend a consistent set of open-source stacks and libraries across the teams and include them as part of your overall ML infrastructure. This way, teams' work can be reused, knowledge can be shared, and ML workflow / MLOps practices can be standardized.
Continuous Change State
Unlike normal software, ML-driven systems can iterate very quickly, because ML models are learnt from data rather than programmed with defined and fixed rules. We know for a fact that data can change pretty quickly. As a result, the performance of ML-driven systems can change significantly (and generally deteriorates). In order to address these challenges and make sure ML-driven systems always stay performant and reliable, two aspects are critical. One is monitoring, and the other is automation.
Monitoring — Generally speaking, ML monitoring includes data monitoring, feature monitoring and model monitoring. Data monitoring refers to understanding whether the data's summary statistics, distribution and trends have changed. Similar to data monitoring, feature monitoring is about understanding how the features used to train the models have changed over time. Model monitoring focuses on detecting performance degradation of key metrics and surfacing unknown issues before they cause real damage to your products and business. If we take a classification model as an example, these key metrics would include classification accuracy, precision, recall, F1 and a confusion matrix. For ML-driven systems that could potentially generate bias and unfairness against a particular group, gaining insights into how models arrive at outcomes across specific cohorts is also critical, so that the impact of potential model bias on marginalized groups can be uprooted and mitigated with multidimensional comparisons.
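For the classification example above, computing these key metrics on a batch of recent, labelled production predictions might look like the following minimal sketch (the labels here are dummy values for illustration):

```python
# Model monitoring for a classification model: accuracy, precision, recall,
# F1 and the confusion matrix on a batch of recent predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels collected after the fact
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the deployed model predicted

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
print(metrics)
print(confusion_matrix(y_true, y_pred))
# In practice these metrics are logged over time (per model and per cohort)
# and alerts are raised when they fall below an agreed threshold.
```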
Automation — Since ML-driven systems can iterate very quickly, it is unrealistic to ask anyone to manually respond to the changes by re-developing new features, retraining the ML models and re-deploying the models, particularly when you have hundreds of ML models running in production. First, this manual approach is very labor intensive, and you would have to constantly grow your data science and ML team to respond to the changes. Second, the approach is not testable, and if the ML-driven systems do not go through sufficient testing and validation, more errors could be incurred. Third, this manual approach can take days to complete and may never be quick enough to handle the changes and get a newly updated version of the ML models deployed. Therefore, it is necessary to automate this model re-training and re-deployment process. Strong monitoring plus rigorous Continuous Integration / Continuous Training / Continuous Deployment (CI / CT / CD) can make this automation happen. This is why MLOps is very necessary, particularly for organizations that have many ML models in production or have very complex and critical systems that rely on ML models as an essential part.
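Conceptually, the monitor / retrain / redeploy loop that CI / CT / CD automates could be sketched as below. Every function is a simplified placeholder; the names, thresholds and return values are assumptions for illustration, not any specific tool's API:

```python
# A conceptual sketch of the monitor -> retrain -> redeploy loop.
ACCURACY_THRESHOLD = 0.85

def evaluate_production_model() -> dict:
    return {"accuracy": 0.82}                       # placeholder: read from model monitoring

def check_data_drift() -> dict:
    return {"income": {"drift_detected": True}}     # placeholder: read from data monitoring

def retrain_and_validate() -> bool:
    print("retraining on fresh features and running validation tests...")
    return True                                     # placeholder: training pipeline + tests

def deploy_new_model() -> None:
    print("promoting validated model to production")  # placeholder: CD step

def run_continuous_training_cycle() -> None:
    metrics = evaluate_production_model()
    drift = check_data_drift()
    needs_retraining = (
        metrics["accuracy"] < ACCURACY_THRESHOLD
        or any(col["drift_detected"] for col in drift.values())
    )
    if needs_retraining and retrain_and_validate():
        deploy_new_model()

run_continuous_training_cycle()
```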
Summary
For lots of ML-driven systems, MLOps is necessary. Hopefully, by explaining how an ML system is unique — data centric, multi-team and multi-skillset collaboration, dynamic ecosystem, and continuous change — you now have a better understanding of why MLOps is necessary. More importantly, you can understand what problems MLOps is trying to solve, so that you can make more informed decisions about whether MLOps is needed for your organization and, if the answer is “yes”, which components of MLOps are necessary and how you should build your data and AI infrastructure to truly reap the value of AI.
If you want to see more guides, deep dives, and insights around modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack, thanks! I generally publish 1 or 2 articles on data and AI every week.