An Introduction to MLOps

1. MLOps

1.1 Definition

MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

MLOps = Machine Learning + DevOps

1.2 Goals

Connect every stage of machine learning from experimentation to production, shorten the cycle from experiment to production, automate the manual steps in the ML production process, and ultimately build a fully automated, end-to-end ML production pipeline.
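
As a toy illustration of what "end to end" means here, the sketch below (stdlib only; all stage names and the trivial "model" are hypothetical stand-ins, not any real framework's API) chains data preparation, training, evaluation, and a deployment quality gate into one automated flow:

```python
def prepare_data(raw):
    # Stand-in for cleaning / feature engineering.
    return [x for x in raw if x is not None]

def train(dataset):
    # Stand-in for model training: the "model" is just the mean.
    return sum(dataset) / len(dataset)

def evaluate(model, dataset):
    # Stand-in for validation: mean absolute error of the "model".
    return sum(abs(x - model) for x in dataset) / len(dataset)

def deploy(model, error, threshold=2.0):
    # Quality gate: only promote models that pass evaluation.
    return {"model": model, "status": "deployed" if error <= threshold else "rejected"}

raw = [1.0, None, 2.0, 3.0]
dataset = prepare_data(raw)
model = train(dataset)                           # 2.0
release = deploy(model, evaluate(model, dataset))
print(release["status"])                         # deployed
```

In a real MLOps setup each function would be a separate, independently scheduled pipeline step, but the shape of the flow is the same: every manual hand-off between stages is replaced by an automated one.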

1.3 Principles

  • CI/CD automation: It carries out the build, test, delivery, and deploy steps, and provides fast feedback to developers.
  • Workflow orchestration: It coordinates the tasks of the ML workflow pipeline according to DAGs (directed acyclic graphs).
  • Reproducibility: It is the ability to reproduce an ML experiment and obtain exactly the same results.
  • Versioning: It ensures the versioning of data, model, and code to enable not only reproducibility but also traceability.
  • Collaboration: It ensures the possibility to work collaboratively on data, model, and code (across different roles).
  • Continuous ML training & evaluation: It means periodic retraining of the ML model based on new feature data.
  • ML metadata tracking/logging: Metadata is tracked and logged for each orchestrated ML workflow task.
  • Continuous monitoring: It implies the periodic assessment of data, model, code, infrastructure resources, and model-serving performance (e.g., prediction accuracy) to detect potential errors or changes that influence product quality.
  • Feedback loops: Multiple feedback loops are required to integrate insights from the quality-assessment step back into the development or engineering process.
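
To make the versioning and reproducibility principles concrete, the stdlib-only sketch below (the function name and inputs are hypothetical, not from any real tool) derives a deterministic run ID from the three things that must be versioned together: data, code, and configuration. Identical inputs always yield the identical ID, which is exactly what makes a run reproducible and traceable.

```python
import hashlib
import json

def run_fingerprint(data: bytes, code: str, config: dict) -> str:
    """Derive a deterministic run ID from data, code, and config."""
    h = hashlib.sha256()
    h.update(data)                                          # training-data snapshot
    h.update(code.encode())                                 # training-script source
    h.update(json.dumps(config, sort_keys=True).encode())   # hyperparameters
    return h.hexdigest()[:12]

config = {"lr": 0.01, "epochs": 5, "seed": 42}
rid_a = run_fingerprint(b"rows-v1", "def train(): ...", config)
rid_b = run_fingerprint(b"rows-v1", "def train(): ...", config)
rid_c = run_fingerprint(b"rows-v2", "def train(): ...", config)  # new data version
print(rid_a == rid_b, rid_a == rid_c)  # True False
```

Real systems (DVC, MLflow, Git) track the three inputs separately, but the underlying idea is the same: a run is only reproducible if every input that influenced it is pinned to a version.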

1.4 Components

  • CI/CD Component (P1, P6, P9): It ensures continuous integration, continuous delivery, and continuous deployment. E.g., Jenkins, GitLab CI, SCM, and so on.
  • Source Code Repository (P4, P5): It ensures code storage and versioning. E.g., Bitbucket, GitLab, GitHub, and so on.
  • Workflow Orchestration Component (P2, P3, P6): It coordinates the tasks of the ML workflow pipeline according to DAGs. E.g., Apache Airflow, Kubeflow Pipelines, Luigi, AWS SageMaker Pipelines, Byteflow, and so on.
  • Feature Store System (P3, P4): It ensures central storage of commonly used features. E.g., Feast, AWS Feature Store, Tecton.ai, Hopsworks.ai, and so on.
  • Model Training Infrastructure (P6): It provides the underlying compute resources. E.g., local machines, cloud compute, Arnold, and so on.
  • Model Registry (P3, P4): It centrally stores trained ML models together with their metadata. E.g., MLflow, AWS SageMaker Model Registry, Microsoft Azure ML Model Registry, Neptune.ai, and so on.
  • ML Metadata Store (P4, P7): It allows for the tracking of various kinds of metadata.
  • Model Serving Component (P1): It helps to serve models. E.g., KFServing (Kubeflow), TensorFlow Serving, Triton, Ivory, and so on.
  • Monitoring Component (P8, P9): It takes care of continuous monitoring of model-serving performance. Additionally, monitoring of the ML infrastructure, CI/CD, and orchestration is required. E.g., Prometheus with Grafana, the ELK stack, TensorBoard, CloudWatch, Argos, and so on.
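
To show what the Model Registry and ML Metadata Store roles look like in practice, here is a minimal in-memory sketch (a hypothetical API for illustration, not MLflow's or SageMaker's): every registered model version carries the metadata needed for traceability, and consumers resolve the latest version by name.

```python
import time

class ModelRegistry:
    """Toy model registry: central store of models plus their metadata."""

    def __init__(self):
        self._models = {}  # name -> list of version records

    def register(self, name, artifact_uri, metrics, data_version):
        versions = self._models.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,   # where the trained model lives
            "metrics": metrics,             # evaluation results
            "data_version": data_version,   # lineage back to the training data
            "registered_at": time.time(),
        }
        versions.append(record)
        return record["version"]

    def latest(self, name):
        return self._models[name][-1]

registry = ModelRegistry()
registry.register("churn", "s3://models/churn/1", {"auc": 0.81}, "data-v1")
registry.register("churn", "s3://models/churn/2", {"auc": 0.84}, "data-v2")
print(registry.latest("churn")["version"])  # 2
```

The key design point is that the registry stores not just the artifact but its lineage (metrics plus data version), which is what lets a serving component or an auditor trace any deployed model back to the run that produced it.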

1.5 Roles

1.6 Workflow

2. Industry Practice

| Category | Solution | CI/CD Component | Source Code Repository | Workflow Orchestration | Feature Store | Model Training Infrastructure | Model Registry | ML Metadata Store | Model Serving | Monitoring |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open source | Kubeflow | / | / | k8s / Kubeflow Pipelines | Feast | PyTorch / TensorFlow / ... | / | Yes | KFServing / PyTorch Serving / Seldon Core / Triton / BentoML | Prometheus (no performance monitoring) |
| Open source | MLflow | / | / | MLflow Projects | / | / | MLflow Model Registry | MLflow Tracking | MLflow Models | / |
| Commercial | WanDB | / | / | / | / | Hyperparameter Tuning | Artifacts | Experiment Tracking | / | Yes |
| Commercial | SageMaker | Yes | Yes | SageMaker Pipelines | Yes | Yes | Yes | Yes | Yes | Amazon CloudWatch |
| ByteDance | Reckon::MLOps | / | / | / | FeatureStore | / | Yes | Yes | PilotLite / Groot / Laplace / Marine / ServiceHub | Javis |
| ByteDance | DLSpace | GitLab CI | GitLab | byteflow | / | Arnold | In development | Supported | Ivory | ByteDance ecosystem |

2.1 Kubeflow

Overview

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.

Kubeflow is Google's open-source, Kubernetes-based MLOps solution. It integrates a wide range of tools and mainly provides the following features:

  • Notebooks: a web-based interactive development environment built on Jupyter notebooks.
  • Kubeflow Pipelines: an ML-oriented workflow system built on Argo, providing creation, orchestration/scheduling, and management of ML workflows.
  • Model Training: integrates common ML frameworks via the Kubernetes operator mechanism, providing scheduling and management of model-training jobs.
  • Model Serving: online model deployment built on TF Serving, PyTorch Serving, and Seldon.
  • AutoML: supports parallel search, distributed training, and more.
  • Feature Store: ML feature management based on Feast.
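
Under the hood, Kubeflow Pipelines executes a DAG of steps, where each step runs only after its upstream dependencies finish. The stdlib-only sketch below (hypothetical step names; not the real Kubeflow/Argo API) shows that core idea: topologically order the graph, then run each step on its parents' outputs.

```python
from graphlib import TopologicalSorter

# A toy ML pipeline DAG: step -> set of upstream dependencies.
dag = {
    "ingest":   set(),
    "features": {"ingest"},
    "train":    {"features"},
    "evaluate": {"train", "features"},
}

def run_step(name, upstream_results):
    # Stand-in for launching a containerized pipeline step.
    return f"{name}({', '.join(sorted(upstream_results))})"

results = {}
order = list(TopologicalSorter(dag).static_order())
for step in order:
    results[step] = run_step(step, [results[d] for d in dag[step]])

print(" -> ".join(order))  # ingest -> features -> train -> evaluate
```

In the real system each `run_step` is a container scheduled on Kubernetes and independent branches run in parallel, but the dependency-driven execution order is the same.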

Architecture

2.2 WanDB

Overview

WanDB is an excellent tool for tracking and analyzing machine-learning model training. Its main features include:

  • Experiment management.
  • Hyperparameter tuning.
  • Experiment visualization.
  • Versioning of models, data, and code.
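
Whatever the tool, the core of experiment tracking is recording each run's hyperparameters and metrics so runs can be compared later. Here is a stdlib-only sketch of that idea (a hypothetical API for illustration, not WanDB's actual interface):

```python
class ExperimentTracker:
    """Toy experiment tracker: records params and metrics per run."""

    def __init__(self):
        self.runs = []

    def start_run(self, **params):
        run = {"params": params, "metrics": {}}
        self.runs.append(run)
        return run

    def log(self, run, **metrics):
        run["metrics"].update(metrics)

    def best_run(self, metric):
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
for lr in (0.1, 0.01):
    run = tracker.start_run(lr=lr)
    # Stand-in for a real training loop; pretend the smaller lr scores higher.
    tracker.log(run, accuracy=0.9 if lr == 0.01 else 0.8)

print(tracker.best_run("accuracy")["params"]["lr"])  # 0.01
```

A hosted tracker adds persistence, dashboards, and artifact versioning on top, but the comparison query at the end, "which hyperparameters produced the best metric?", is the feature everything else is built around.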

Demo

wandb.ai/wandb/wandb…

2.3 SageMaker

Overview

SageMaker is Amazon's fully managed machine-learning solution. Its main features are listed at:

docs.amazonaws.cn/sagemaker/l…

Demo

aws.amazon.com/cn/getting-…

2.4 Other Commercial Solutions

2.5 Summary

Industry practice around MLOps is broadly similar: although each solution has its own emphasis, all of them build around the following themes:

  • Automation of the ML workflow.
  • Collaboration among the many roles involved in MLOps.
  • An immersive, one-stop, and maximally simple and efficient user experience.
  • Reproducibility, traceability, explainability, and monitorability of the ML workflow.
  • Reuse and asset-ization of ML workflows, tools, data, models, and services.

3. Conclusion

MLOps is a methodology for ML production. By defining and providing a set of concepts, methods, and tools, it offers best practices for building and maintaining ML systems and guides how such systems are constructed in practice. Its immediate goal is full automation of the ML production workflow; its ultimate goals are higher production quality, shorter delivery cycles, and lower production costs.