Saving and Loading Stacked Ensemble Classifiers in ONNX Format in Python

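A stacked ensemble is a learner that combines the predictions of two or more machine learning models and passes them through a meta-learner, which usually improves predictive performance over any of the individual models.

The models in a stack can be of different types, unlike bagging, which typically combines many models of the same type (usually decision trees). And unlike boosting, each model in a stack does not correct the predictions of the one before it. To learn how to build such an ensemble model, read this article by Adhinga Fredrick.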
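To make the idea concrete, here is a minimal sketch of what a stacking ensemble does internally, assuming scikit-learn is installed. The synthetic data from make_classification and the choice of two base models are illustrative assumptions, not part of this tutorial's dataset. Each base model produces out-of-fold class probabilities, and a logistic regression meta-learner is trained on those probabilities rather than on the raw features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=10, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold probability of the positive class from every base model.
# Cross-validation keeps the meta-learner from seeing predictions that a
# base model made on rows it was trained on.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-learner is trained on the base models' predictions.
meta_learner = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)  # one column per base model
```

scikit-learn's StackingClassifier, used later in this tutorial, automates a similar procedure, so this manual version is only meant to show where the meta-learner's inputs come from.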

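Open Neural Network Exchange (ONNX) is an open-source format for deep learning and traditional machine learning models, originally developed by Microsoft and Facebook. It provides a unified schema for saving models regardless of the library they were developed in.

Launched in 2017, it gives data scientists and machine learning engineers a durable way to save models without worrying about platform inconsistencies or library versions becoming obsolete. It also helps avoid vendor lock-in, because an ONNX model can be deployed on any platform with a compatible runtime, not just the one where it was trained.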

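Suppose you trained an image recognition model on an NVIDIA GPU but, for operational reasons, decide to deploy it to a production environment on Google TPUs. ONNX is a good tool for moving the model between the two.

For machine learning engineers who would otherwise ship models across platforms by containerizing them, for example by pushing them to production with Docker, ONNX models can make that step unnecessary.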

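Prerequisites

Basic knowledge of Python.
Building, evaluating, and validating machine learning models in Scikit-Learn.
Basic data manipulation skills.
Python (with pip, numpy, pandas, and sklearn) installed on your computer, or access to an online environment such as Google Colab or Kaggle.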

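Objectives

In this article, you will learn how to:

Install ONNX and onnxruntime.
Determine the ONNX input initial types.
Serialize a stacked ensemble and save it in ONNX format.
Load it into production using an ONNX runtime inference session.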

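Setting up the environment

To install ONNX, onnxruntime, and the skl2onnx converter used later in this tutorial in your local environment, run the following commands.

If you use pip, in your terminal:

pip install onnx
pip install onnxruntime
pip install skl2onnx

If you use Anaconda, in the Anaconda terminal:

conda install -c conda-forge onnx
conda install -c conda-forge onnxruntime
conda install -c conda-forge skl2onnx

Note: ONNX is not pre-installed in the Google Colab and Kaggle notebook runtimes.

To install the packages on Google Colab or Kaggle:

!pip install onnx
!pip install onnxruntime
!pip install skl2onnx

Note: Online editors such as repl.it may fail to run the code because of insufficient memory allocation.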

Importing and preparing data

Let's start by importing the pandas library and loading the dataset.

import pandas as pd
path='https://raw.githubusercontent.com/iannjari/datasets/main/diabetes.csv'
df=pd.read_csv(path,engine='python')
print(df)

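Dataset

We will use a diabetes dataset, loaded from the diabetes.csv file referenced above.

The dataset contains 8 feature columns. The target to classify is Outcome, which is 1 if the patient has diabetes and 0 otherwise.

Output:

DataFrame (screenshot of the dataset, provided by the author)

We separate the target column Outcome from the feature columns as shown.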

target_name = "Outcome"
target = df[target_name]

data = df.drop(columns=[target_name])
print(data)

Split the data into training and test partitions with a split of roughly 67-33, as shown.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data,target, test_size=0.33, random_state=42)
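As a quick sanity check on the split, the partition sizes can be worked out by hand. The sketch below assumes the dataset has 768 rows; that figure is an assumption consistent with the 254 predictions printed later in this tutorial, so check len(df) on your own copy.

```python
import math

n_samples = 768      # assumed row count; check len(df) on your own copy
test_size = 0.33

# scikit-learn rounds the test partition up and gives the rest to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(round(n_test / n_samples, 3))
```

Under that assumption the test partition holds 254 rows and the training partition 514.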

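Training and evaluating the stacking classifier

We will use a stack of a random forest classifier, a kNN classifier, and a gradient boosting classifier, with logistic regression as the final estimator.

A random forest classifier trains a number of decision trees on randomly selected subsets of the data and makes its decision by voting among those trees. A k-nearest neighbors classifier labels a data point according to the labels of the training points closest to it.

A gradient boosting classifier combines many weak learners into one strong predictive model. Logistic regression fits a linear model to the data, as linear regression does, but maps the result to class probabilities instead of leaving it as a continuous value.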
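Of the four algorithms described above, k-nearest neighbors is the simplest to write out by hand. The following is a minimal pure-Python sketch of the idea, not scikit-learn's implementation. The function name knn_predict and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.1, 5.0)))
```

A query near the first cluster is labeled 0 and one near the second cluster is labeled 1, because all three of its nearest neighbours carry that label.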

Let's import all the necessary packages.

from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

Then, we initialize the stack.

clf=StackingClassifier(estimators=[
            ("rf",RandomForestClassifier(n_estimators=10,random_state=42)),
            ("gb",GradientBoostingClassifier(n_estimators=10,random_state=42)),
            ("knn",KNeighborsClassifier(n_neighbors=5))],final_estimator=LogisticRegression())

Now, let's build a pipeline around the stack, fit it on the training data, and score it on the test data.

# Wrap the stacking classifier defined above in a pipeline
pipeline = make_pipeline(clf)

pipeline.fit(x_train,y_train)
print(pipeline.score(x_test,y_test))

Output:

0.7716535433070866

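We can also evaluate the model with a confusion matrix, from which we derive the precision, misclassification rate, and F1 score, as shown.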

from sklearn.metrics import confusion_matrix
preds=pipeline.predict(x_test)
c_matrix=confusion_matrix(y_test, preds)
tn, fp, fn, tp = c_matrix.ravel()
precision= tp/(tp+fp)
misclassification= (fp+fn)/(tn+fn+tp+fp)
f_one=tp/(tp+0.5*(fp+fn))

print('Precision=',precision)
print('Misclassification=',misclassification)
print('F1 score=',f_one)

Output:

Precision= 0.6842105263157895
Misclassification= 0.2283464566929134
F1 score= 0.6419753086419753
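Recall is not computed in the snippet above, but it follows from the same four confusion-matrix counts. The worked example below uses counts inferred from the printed accuracy, precision, misclassification rate, and F1 score. They are an assumption reconstructed from those numbers, not values reported by the tutorial itself.

```python
# Counts inferred from the metrics printed above (an assumption, not
# values reported by the original tutorial).
tn, fp, fn, tp = 144, 24, 34, 52

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # share of actual positives that were found
f_one = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f_one:.4f}")
```

The harmonic-mean form of F1 used here is algebraically the same as tp/(tp+0.5*(fp+fn)) in the snippet above.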

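Now that the model is trained and scores reasonably well, let's save it and run inference from it.

Saving the model

To serialize (save) the model, we need to import convert_sklearn from the skl2onnx package, along with FloatTensorType from skl2onnx.common.data_types, which we use to declare our feature types in the initial_types argument.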

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

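The convert_sklearn function requires an initial_types argument to convert the model. Every data column must be covered by a type declaration in this argument, and each declared input needs its own name. For example, if the data contained 3 float columns, followed by 2 string columns and 1 int64 column, the declaration would be:

from skl2onnx.common.data_types import Int64TensorType, StringTensorType

initial_types = [('float_input', FloatTensorType([None, 3])),
                 ('string_input', StringTensorType([None, 2])),
                 ('int_input', Int64TensorType([None, 1]))]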
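The grouping logic behind such a declaration can be sketched in plain Python. This is only an illustration: the column kinds are hypothetical, and the kind strings stand in for the corresponding skl2onnx tensor type classes, which you would substitute when building the real initial_types list.

```python
from itertools import groupby

# Hypothetical column kinds, in the order they appear in the data:
# three floats, two strings, one 64-bit integer.
column_kinds = ["float", "float", "float", "string", "string", "int64"]

# Group consecutive columns of the same kind and record how many there are.
# Each (name, kind, width) tuple corresponds to one entry of initial_types,
# where `kind` would be replaced by the matching skl2onnx tensor type and
# `width` by the second dimension of its shape.
declarations = [
    (f"{kind}_input_{i}", kind, len(list(run)))
    for i, (kind, run) in enumerate(groupby(column_kinds))
]

for name, kind, width in declarations:
    print(name, kind, [None, width])
```

The leading None in each shape leaves the number of rows unspecified, so the saved model can accept batches of any size.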

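In our case, the dataset has 8 features, all of which we treat as float.

Note: Integer columns can be treated as float because they can be cast to it.

So, we define the initial_types variable as: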

initial_types =  [('feature_input', FloatTensorType([None, 8]))]
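Casting integer columns to float32 is safe for values of the size found in typical tabular data, but it is not lossless in general. The short check below, using made-up rows and assuming numpy is installed, shows both the cast and its limit.

```python
import numpy as np

# Made-up rows mixing integer-like and fractional values (8 features each).
rows = [[2, 120, 70, 30, 0, 31.5, 0.45, 40],
        [1, 90, 64, 25, 0, 27.1, 0.30, 29]]

x = np.asarray(rows, dtype=np.float32)
print(x.dtype, x.shape)

# Small integers survive the cast exactly ...
print(np.float32(50) == 50)
# ... but float32 has a 24-bit significand, so integers above 2**24
# may be rounded to a neighbouring value.
print(int(np.float32(2**24 + 1)) == 2**24 + 1)
```

If a column holds very large integers, such as identifiers, it may be better to declare it with an integer tensor type than to cast it.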

Now, we convert the model by passing pipeline and initial_types to the convert_sklearn function, and then write the result to a file, as shown.

onx = convert_sklearn(pipeline,
                      initial_types=
                      initial_types)

with open("stacked_clf.onnx", "wb") as f:
    f.write(onx.SerializeToString())

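The model is now saved.

Note: If declaring the initial types by hand is too tedious, for example because the data has many features, you can use the to_onnx function instead.

You only need to pass a sample of the input data (a single row is enough) as an argument, and skl2onnx will infer the input types from it.

from skl2onnx import to_onnx

# Use one row of data as a 2-D float32 array so the input type
# and the number of features can be inferred from it
x = data.iloc[:1].to_numpy(dtype='float32')
# Pass x as the keyword argument X (note the upper case)
onx = to_onnx(pipeline, X=x)

with open("stacked_clf.onnx", "wb") as f:
    f.write(onx.SerializeToString())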

Loading the model using an ONNX runtime inference session

To make predictions with the model, import onnxruntime and create an InferenceSession.

import onnxruntime as rt
sess = rt.InferenceSession("stacked_clf.onnx")
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], 
                 {input_name:
                   x_test.to_numpy(dtype='float32')})

x_test can be replaced by any array with the same number of columns and the same float32 type as the test data.
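When the input comes from somewhere other than x_test, for instance a single record received by an API, it helps to validate its shape and type before handing it to the session. The helper below is a hypothetical sketch (to_model_input is not part of onnxruntime) that assumes numpy is installed and that the model expects 8 float features.

```python
import numpy as np

N_FEATURES = 8  # matches FloatTensorType([None, 8]) used when saving the model

def to_model_input(samples, n_features=N_FEATURES):
    """Return `samples` as a 2-D float32 array of shape (n_rows, n_features)."""
    arr = np.asarray(samples, dtype=np.float32)
    if arr.ndim == 1:
        # A single sample: add the leading batch dimension.
        arr = arr.reshape(1, -1)
    if arr.ndim != 2 or arr.shape[1] != n_features:
        raise ValueError(
            f"expected shape (n_rows, {n_features}), got {arr.shape}"
        )
    return arr

# A made-up single record with 8 feature values.
single = to_model_input([2, 120, 70, 30, 0, 31.5, 0.45, 40])
print(single.shape, single.dtype)
```

The returned array can be passed as the value for input_name in sess.run in place of x_test.to_numpy(dtype='float32').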

Let's look at our predictions.

print(pred_onx)

Output:

[array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
        0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
        0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
        1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0], dtype=int64)]

As you can see, we saved the model in ONNX format and then loaded it to make predictions.

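Conclusion

In this tutorial, we learned how to install ONNX and onnxruntime, determine the initial types of the ONNX input, serialize a stacked ensemble, save it in ONNX format, and load it into production using an ONNX runtime inference session.

The model can now be served through a web application framework such as Streamlit or Dash, or through an API built with Django or Flask.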
