How to Build a C++ Model in a Python Machine Learning Project
Python is remarkably versatile for building machine learning models, thanks to its large community, wealth of libraries, and short, readable code. It has one drawback, however: execution speed. This is where high-speed languages such as C++ come in.
Although we can build a fast ML model in C++, it cannot match Python in the number of machine learning libraries available. We can, however, use Python libraries such as NumPy and Pandas for data preprocessing and then build a model that runs in C++.
Python's ctypes module allows us to call C++ code and use it in our programs. In this article, we will use ctypes to create an ML model: a logistic regression model optimized with gradient descent. The main goal of this article is to show you how to build your own custom model in C++.
Prerequisites
This is fairly advanced material, so you will need a solid grasp of the following:
- C++ – you should understand pointers, data structures such as vectors, and object-oriented programming.
- Python – you should be familiar with its tooling and ecosystem.
- Machine learning concepts.
You should also approach this tutorial with a research-oriented mindset, an essential skill for any data scientist.
Overview
We will first take a brief look at logistic regression, then discuss the gradient descent optimization algorithm.
After that, we will write the C++ code. Finally, we will build the C++ file into a shared library and use it from Python via the ctypes module.
Let's get started!
Logistic Regression
Logistic regression is a classification algorithm used in supervised learning. Its main purpose is to estimate the probability that an instance belongs to a given class. It does this by computing a weighted sum of the features plus a bias term.
To make a prediction, this sum is passed through a sigmoid function, shown in the equation below.
$$\hat{p} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}$$
We need a cost function (*log loss*) that rewards the model for outputting a very high probability for a positive instance and a very low probability for a negative instance.
The cost over the whole training set is the average cost over all instances. The cost of a single instance is based on its prediction error, i.e., the predicted value minus the actual value.
$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - \hat{p}^{(i)}\right)\right]$$
Because the log loss is convex, we can minimize it with any optimization algorithm, such as gradient descent. To do so, we need its derivative with respect to each parameter, which is obtained with partial derivatives.
$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) - y^{(i)}\right)x_j^{(i)}$$
We will study this function in detail in the C++ code later.
Gradient Descent (GD) Algorithm
Gradient descent minimizes the cost function by repeatedly updating the parameters (the weights and the bias) until convergence is reached.
GD computes the gradient of the error function and moves down the slope until it reaches a minimum. See the pseudocode below.
weight = 0
bias = 0
update until minimum:
    weight = weight - (learning_rate × weight_gradient)
    bias = bias - (learning_rate × bias_gradient)
For logistic regression, the gradient for the bias is simply the derivative of the log loss (the average prediction error), while the gradient for each weight is that derivative multiplied by the corresponding feature value.
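In symbols, the bias gradient is just the average prediction error:

$$\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) - y^{(i)}\right)$$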
The learning rate controls the size of each update step. For example, with a learning rate of 0.1 and a weight gradient of 2.0, the weight moves by -0.2 in a single step.
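Before moving on to the C++ implementation, here is a minimal NumPy sketch of the same batch gradient descent loop that the C++ code below implements. The toy feature matrix and labels are placeholders; the learning rate of 0.1 and the 50 iterations mirror the settings used later in the article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

#toy data: m instances with n features each (placeholder values)
X = np.array([[57.0, 0.0, 140.0],
              [45.0, 1.0, 110.0]])
y = np.array([0.0, 1.0])
m, n = X.shape
w = np.zeros(n)
b = 0.0
for _ in range(50):
    #prediction error σ(X.w + b) - y for every instance at once
    errors = sigmoid(X @ w + b) - y
    #average the per-feature gradients over the m instances (the 1/m term)
    w -= 0.1 * (X.T @ errors) / m
    b -= 0.1 * errors.mean()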
Now let's look at the C++ code.
C++ Code
Before showing the full code, let's break the project into smaller parts.
The first step is to include the required headers and bring in the std namespace.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
Create a class with the method signatures shown below.
class CPPLogisticRegression{
public:
    //method for updating the weights and bias
    vector<double> updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns);
    //method for making a prediction
    double predict(vector<double> vW, double* X_train_test);
};
Updating the Weights and the Bias Term
Next, let's dissect the method that updates the weights and the bias term.
vector<double> CPPLogisticRegression::updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns){
    double row_pred_diff = 0.0;
    double total_diff = 0.0;
    //zero-initialized working vectors (variable-length C arrays cannot be initialized in standard C++)
    vector<double> feature_weight(noOfColumns, 0.0);
    vector<double> total_feature_weight(noOfColumns, 0.0);
    vector<double> weight_derivative(noOfColumns, 0.0);
    double bias_derivative = 0.0;
    vector<double> W(noOfColumns, 0.0);
    double bias = 0.0;
    vector<double> vWB;
    //train set
    vector<vector<double>> X_train = {
        {57.0,0.0,0.0,140.0,241.0,0.0,1.0,123.0,1.0,0.2,1.0,0.0,3.0},
        {45.0,1.0,3.0,110.0,264.0,0.0,1.0,2.0,0.0,1.2,1.0,0.0,3.0},
        {68.0,1.0,0.0,144.0,13.0,1.0,1.0,141.0,0.0,3.4,1.0,2.0,3.0},
        {57.0,1.0,0.0,80.0,1.0,0.0,1.0,115.0,1.0,1.2,1.0,1.0,3.0},
        {57.0,0.0,1.0,0.0,236.0,0.0,0.0,174.0,0.0,0.0,1.0,1.0,2.0},
        {61.0,1.0,0.0,140.0,207.0,0.0,0.0,8.0,1.0,1.4,2.0,1.0,3.0},
        {46.0,1.0,0.0,140.0,311.0,0.0,1.0,120.0,1.0,1.8,1.0,2.0,3.0},
        {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0},
        {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0}};
    //labels
    vector<double> Y = {0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0};
    for (int l = 0; l < noOfIterations; l++){
        //reset the gradient accumulators and the output vector for this iteration
        total_feature_weight.assign(noOfColumns, 0.0);
        total_diff = 0.0;
        vWB.clear();
        for (int i = 0; i < noOfRows; i++){
            double Wx = 0.0;
            //computing W.x
            for (int j = 0; j < noOfColumns; j++){
                Wx += W[j] * X_train[i][j];
            }
            //computing σ(W.x + b) - Y
            row_pred_diff = (1/(1 + exp(-(Wx + bias)))) - Y[i];
            for (int k = 0; k < noOfColumns; k++){
                //computing (σ(W.x + b) - Y) × x(i)
                feature_weight[k] = row_pred_diff * X_train[i][k];
                //summation(Σ) of each feature's gradient contribution
                total_feature_weight[k] += feature_weight[k];
            }
            //summation(Σ) of the prediction errors
            total_diff += row_pred_diff;
        }
        //updating the weights for each feature
        for (int z = 0; z < noOfColumns; z++){
            //averaging over the instances (the 1/m term)
            weight_derivative[z] = total_feature_weight[z]/noOfRows;
            W[z] = W[z] - 0.1 * weight_derivative[z];
            //storing the values in a vector
            vWB.push_back(W[z]);
        }
        //updating the bias term
        bias_derivative = total_diff/noOfRows;
        bias = bias - 0.1 * bias_derivative;
        vWB.push_back(bias);
    }
    return vWB;
}
We first declare and zero-initialize the working vectors (std::vector is used because variable-length C arrays cannot be initialized in standard C++). We then create the outer training loop; at the start of each iteration the gradient accumulators and the output vector are reset, followed by two inner passes.
In the first pass over the training rows, one nested loop computes the weighted sum (W.x) and another accumulates each feature's gradient contribution.
Finally, we accumulate the sum (Σ) of the prediction errors over all instances.
//reset the gradient accumulators and the output vector for this iteration
total_feature_weight.assign(noOfColumns, 0.0);
total_diff = 0.0;
vWB.clear();
for (int i = 0; i < noOfRows; i++){
    double Wx = 0.0;
    //computing W.x
    for (int j = 0; j < noOfColumns; j++){
        Wx += W[j] * X_train[i][j];
    }
    //computing σ(W.x + b) - Y
    row_pred_diff = (1/(1 + exp(-(Wx + bias)))) - Y[i];
    for (int k = 0; k < noOfColumns; k++){
        //computing (σ(W.x + b) - Y) × x(i)
        feature_weight[k] = row_pred_diff * X_train[i][k];
        //summation(Σ) of each feature's gradient contribution
        total_feature_weight[k] += feature_weight[k];
    }
    //summation(Σ) of the prediction errors
    total_diff += row_pred_diff;
}
In the second pass, we compute each feature's gradient by averaging the accumulated totals over the number of instances, and then update the weights (0.1 is the learning rate).
The updated weights are then stored in a vector.
for (int z = 0; z < noOfColumns; z++){
    //averaging over the instances (the 1/m term)
    weight_derivative[z] = total_feature_weight[z]/noOfRows;
    W[z] = W[z] - 0.1 * weight_derivative[z];
    //storing the values in a vector
    vWB.push_back(W[z]);
}
The last step of the outer loop updates the bias term and stores it as the final entry of the vector.
We pack the weights and the bias into a single vector because a C++ function returns a single object, unlike a Python function, which can return several values at once (a struct or std::pair would also work).
//updating the bias term
bias_derivative = total_diff/noOfRows;
bias = bias - 0.1 * bias_derivative;
vWB.push_back(bias);
The function returns the vector containing the weights and the bias term.
Prediction
The vector returned by the previous function is passed into this function, together with the array of test features.
We compute the weighted sum exactly as in the previous function, then apply the sigmoid to obtain a probability.
Since we trained on only a handful of instances, the model's accuracy is quite low.
double CPPLogisticRegression::predict(vector<double> vW, double* X_train_test){
    double prediction = 0.0;
    double Wx_test = 0.0;
    //computing W.x over the 13 features
    for (int j = 0; j < 13; j++){
        Wx_test += vW[j] * X_train_test[j];
    }
    //adding the bias term (the last entry of the vector) and applying the sigmoid
    prediction = 1/(1 + exp(-(Wx_test + vW.back())));
    //thresholding the probability at 0.5 to make the prediction
    if (prediction > 0.5){
        prediction = 1.0;
    } else {
        prediction = 0.0;
    }
    return prediction;
}
We use an extern "C" block so that the functions inside it can be accessed from outside the C++ code.
These are the functions we will call from the Python code. On Windows, you would additionally prepend the literal __declspec(dllexport) to each function, i.e.:
__declspec(dllexport) CPPLogisticRegression* LogisticRegression(){
    //......
}
You can read more about ctypes in the official documentation.
extern "C"{
//vector to store the weights and bias gotten from the updateWeightsAndBias() function
vector<double> vX;
CPPLogisticRegression* LogisticRegression(){
CPPLogisticRegression* log_reg = new CPPLogisticRegression();
return log_reg;
}
void fit(CPPLogisticRegression* log_reg) {
vX = log_reg->updateWeightsAndBias(50,9,13);
}
double predict(CPPLogisticRegression* log_reg, double* array){
return log_reg->predict(vX,array);
}
}
In the code above, the LogisticRegression() function instantiates the class we created and returns a pointer to it.
The fit() function calls the method that updates the weights and the bias term; the resulting vector is then passed to the class's predict() method inside the exported predict() function.
Note the difference between the two similarly named predict functions. The array given to the exported predict() will be passed in from Python.
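One caveat: LogisticRegression() allocates the model object with new, and nothing in this demo ever deletes it. That is acceptable for a short-lived script, but a long-running application should also export a matching cleanup function that frees the object.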
Here is the full C++ code.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;

class CPPLogisticRegression{
public:
    //method for updating the weights and bias
    vector<double> updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns);
    //method for making a prediction
    double predict(vector<double> vW, double* X_train_test);
};

vector<double> CPPLogisticRegression::updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns){
    double row_pred_diff = 0.0;
    double total_diff = 0.0;
    //zero-initialized working vectors
    vector<double> feature_weight(noOfColumns, 0.0);
    vector<double> total_feature_weight(noOfColumns, 0.0);
    vector<double> weight_derivative(noOfColumns, 0.0);
    double bias_derivative = 0.0;
    vector<double> W(noOfColumns, 0.0);
    double bias = 0.0;
    vector<double> vWB;
    //train set
    vector<vector<double>> X_train = {
        {57.0,0.0,0.0,140.0,241.0,0.0,1.0,123.0,1.0,0.2,1.0,0.0,3.0},
        {45.0,1.0,3.0,110.0,264.0,0.0,1.0,2.0,0.0,1.2,1.0,0.0,3.0},
        {68.0,1.0,0.0,144.0,13.0,1.0,1.0,141.0,0.0,3.4,1.0,2.0,3.0},
        {57.0,1.0,0.0,80.0,1.0,0.0,1.0,115.0,1.0,1.2,1.0,1.0,3.0},
        {57.0,0.0,1.0,0.0,236.0,0.0,0.0,174.0,0.0,0.0,1.0,1.0,2.0},
        {61.0,1.0,0.0,140.0,207.0,0.0,0.0,8.0,1.0,1.4,2.0,1.0,3.0},
        {46.0,1.0,0.0,140.0,311.0,0.0,1.0,120.0,1.0,1.8,1.0,2.0,3.0},
        {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0},
        {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0}};
    //labels
    vector<double> Y = {0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0};
    for (int l = 0; l < noOfIterations; l++){
        //reset the gradient accumulators and the output vector for this iteration
        total_feature_weight.assign(noOfColumns, 0.0);
        total_diff = 0.0;
        vWB.clear();
        for (int i = 0; i < noOfRows; i++){
            double Wx = 0.0;
            //computing W.x
            for (int j = 0; j < noOfColumns; j++){
                Wx += W[j] * X_train[i][j];
            }
            //computing σ(W.x + b) - Y
            row_pred_diff = (1/(1 + exp(-(Wx + bias)))) - Y[i];
            for (int k = 0; k < noOfColumns; k++){
                //computing (σ(W.x + b) - Y) × x(i)
                feature_weight[k] = row_pred_diff * X_train[i][k];
                //summation(Σ) of each feature's gradient contribution
                total_feature_weight[k] += feature_weight[k];
            }
            //summation(Σ) of the prediction errors
            total_diff += row_pred_diff;
        }
        //updating the weights for each feature
        for (int z = 0; z < noOfColumns; z++){
            //averaging over the instances (the 1/m term)
            weight_derivative[z] = total_feature_weight[z]/noOfRows;
            W[z] = W[z] - 0.1 * weight_derivative[z];
            //storing the values in a vector
            vWB.push_back(W[z]);
        }
        //updating the bias term
        bias_derivative = total_diff/noOfRows;
        bias = bias - 0.1 * bias_derivative;
        vWB.push_back(bias);
    }
    return vWB;
}

double CPPLogisticRegression::predict(vector<double> vW, double* X_train_test){
    double prediction = 0.0;
    double Wx_test = 0.0;
    //computing W.x over the 13 features
    for (int j = 0; j < 13; j++){
        Wx_test += vW[j] * X_train_test[j];
    }
    //adding the bias term (the last entry of the vector) and applying the sigmoid
    prediction = 1/(1 + exp(-(Wx_test + vW.back())));
    //thresholding the probability at 0.5 to make the prediction
    if (prediction > 0.5){
        prediction = 1.0;
    } else {
        prediction = 0.0;
    }
    return prediction;
}

extern "C"{
    //vector to store the weights and bias returned by the updateWeightsAndBias() function
    vector<double> vX;
    CPPLogisticRegression* LogisticRegression(){
        CPPLogisticRegression* log_reg = new CPPLogisticRegression();
        return log_reg;
    }
    void fit(CPPLogisticRegression* log_reg){
        vX = log_reg->updateWeightsAndBias(50, 9, 13);
    }
    double predict(CPPLogisticRegression* log_reg, double* array){
        return log_reg->predict(vX, array);
    }
}
Before we look at the Python code, let's create the shared library.
Creating a Shared Library
Create a Python file named setup.py and add the following code.
from setuptools import setup, Extension

module1 = Extension('logistic',
                    sources = ['logistic.cpp'])

setup(name = 'Logistic Regression Model',
      version = '1.0',
      description = 'This is a Logistic Regression Model written in C++',
      ext_modules = [module1])
The code above builds a shared library named logistic from the logistic.cpp file. The file is created inside the build directory.
Note that the output is platform-dependent: Linux produces a .so file, while Windows produces a .pyd file.
I ran the build on Linux and it produced a file named logistic.cpython-310-x86_64-linux-gnu.so. Be sure to check the name of your own file.
Run the build with the following command in your terminal.
python setup.py build
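Alternatively, since we only load the library through ctypes rather than importing it as a Python module, you can skip setuptools and compile the shared library directly. On Linux, a typical g++ invocation would look like the following; if you go this route, point libfile in the Python code at the resulting logistic.so.
g++ -O2 -shared -fPIC logistic.cpp -o logistic.so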
Python Code
Just as with the C++ code, we start by importing the required modules.
import ctypes as ct
import numpy as np
Next, we load the shared library we created.
#the build file location
libfile = r"build/lib.linux-x86_64-3.10/logistic.cpython-310-x86_64-linux-gnu.so"
#loading it for use
our_lib = ct.CDLL(libfile)
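The exact file name varies with your platform and Python version, so hard-coding the path is brittle. As a small sketch (assuming the default build/ layout that setup.py produces), you could locate the artifact with glob instead:
import glob

#pick up the built library regardless of the platform/version suffix
libfile = glob.glob("build/**/logistic*", recursive=True)[0]
our_lib = ct.CDLL(libfile)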
Then we declare the argument and return types of the functions defined in the extern "C" section of the C++ file. In particular, LogisticRegression() returns a pointer, so its restype must be set to ct.c_void_p; by default ctypes assumes a C int return value, which can truncate a 64-bit pointer.
#setting the argument and return types for the exported C++ functions
our_lib.LogisticRegression.restype = ct.c_void_p
our_lib.fit.argtypes = [ct.c_void_p]
our_lib.fit.restype = None
our_lib.predict.argtypes = [ct.c_void_p, np.ctypeslib.ndpointer(dtype=np.float64)]
our_lib.predict.restype = ct.c_double
The remaining code instantiates the class, creates the array to pass to the predict() method, and prints the predicted value.
#instantiating the model class
log_reg_obj = our_lib.LogisticRegression()
#the array used to test the model
test_features = np.array((62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0))
test_features = test_features.astype(np.double)
#calling the fit method to train the model
our_lib.fit(log_reg_obj)
#predicting
pred = our_lib.predict(log_reg_obj, test_features)
print("Predicted value:", pred)
The full Python code is shown below.
import ctypes as ct
import numpy as np

#the build file location
libfile = r"build/lib.linux-x86_64-3.10/logistic.cpython-310-x86_64-linux-gnu.so"
#loading it for use
our_lib = ct.CDLL(libfile)

#setting the argument and return types for the exported C++ functions
our_lib.LogisticRegression.restype = ct.c_void_p
our_lib.fit.argtypes = [ct.c_void_p]
our_lib.fit.restype = None
our_lib.predict.argtypes = [ct.c_void_p, np.ctypeslib.ndpointer(dtype=np.float64)]
our_lib.predict.restype = ct.c_double

#instantiating the model class
log_reg_obj = our_lib.LogisticRegression()
#the array used to test the model
test_features = np.array((62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0))
test_features = test_features.astype(np.double)
#calling the fit method to train the model
our_lib.fit(log_reg_obj)
#predicting
pred = our_lib.predict(log_reg_obj, test_features)
print("Predicted value:", pred)
Conclusion
In this tutorial, we discussed logistic regression and the gradient descent optimization algorithm. We then wrote the C++ code and built it into a shared library for use from Python.
You can now use this knowledge to create your own custom C++ models.
Besides ctypes, there are other wrapping tools, such as CFFI and PyBind11.