How to Build a C++ Model in a Python Machine Learning Project



Python is quite versatile for building machine learning models, thanks to its large community, numerous libraries, and short, readable code. However, it has one drawback: execution speed. This is where fast languages like C++ come in.

Although we can build a fast ML model in C++, it cannot match Python in the number of machine learning libraries available. However, we can use Python libraries such as NumPy and Pandas for data preprocessing, and then build a model that runs in C++.

Python's ctypes module lets us call C++ code and use it in our programs. In this article, we will use the power of ctypes to create an ML model. We will build a logistic regression model and optimize it with gradient descent. The main goal of this article is to show you how to build your own custom model in C++.

Prerequisites

This is fairly advanced material, so you will need a solid grasp of the following:

  1. C++ - you should have some knowledge of pointers, data structures such as vectors, and object-oriented programming.
  2. Python - you should be familiar with its tooling and ecosystem.
  3. Machine learning concepts.

You should also approach this tutorial with a research-oriented mindset, an essential skill for any data scientist.

Overview

We will first take a brief look at logistic regression. Next, we will discuss the gradient descent optimization algorithm.

After that, we will write the C++ code. Finally, we will build the C++ file as a shared library and use it in Python via the ctypes module.

Let's get started!

Logistic Regression

This is a classification algorithm used in supervised learning. Its purpose is to estimate the probability that an instance belongs to a given class. It does this by computing the sum of the features multiplied by their weights, plus a bias term.

To make a prediction, this sum is passed through a sigmoid function, shown below.

σ(t) = 1 / (1 + e^(-t))

The training objective is for the model to output a very high probability for a positive instance and a very low probability for a negative instance. This is measured with a cost function called the *log loss*.

The cost over the whole training set is the average of the costs of all instances. The cost of a single instance is based on its prediction error, i.e. the predicted value minus the actual value.

J(W, b) = -(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 - y⁽ⁱ⁾) log(1 - p̂⁽ⁱ⁾) ],  where p̂⁽ⁱ⁾ = σ(W·x⁽ⁱ⁾ + b)

Because the log loss is convex, we can minimize it with any optimization algorithm, such as gradient descent. To do so, we need the derivative of the log loss, which is obtained with partial derivatives.

∂J/∂wⱼ = (1/m) Σᵢ (σ(W·x⁽ⁱ⁾ + b) - y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

We will study this function in detail later in the C++ code.
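Before moving to C++, the three formulas above can be sketched in a few lines of NumPy (an illustrative aside; the function and variable names here are my own, not part of the article's code):

```python
import numpy as np

def sigmoid(t):
    # σ(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(y_true, y_prob):
    # average cost over all m instances
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

def gradients(X, y, W, bias):
    # partial derivatives of the log loss:
    # (1/m) Σ (σ(W·x + b) - y) x for the weights, (1/m) Σ (σ(W·x + b) - y) for the bias
    error = sigmoid(X @ W + bias) - y
    return X.T @ error / len(y), error.mean()
```

These are exactly the quantities the C++ method will later compute with explicit loops.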

The Gradient Descent (GD) Algorithm

It minimizes the cost function by repeatedly updating the parameters (weights and bias) until it reaches convergence.

GD computes the gradient of the error function and moves down the slope until it reaches a minimum. Consider the pseudocode below.

weight = 0
bias = 0
update until minimum:
    weight = weight - (learning rate × (weight gradient))
    bias = bias - (learning rate × (bias gradient))

For logistic regression, the gradient for the bias is computed by simply taking the derivative of the log loss, while the gradient for each weight is the derivative of the log loss multiplied by the corresponding feature value.

The learning rate controls the size of each update step, and therefore how many iterations are needed to converge.
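Putting the pseudocode together with the gradients gives a compact Python sketch of the whole training loop (a hedged sketch: learning_rate=0.1 mirrors the constant used later in the C++ code, but the helper names are my own):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_descent(X, y, n_iterations=50, learning_rate=0.1):
    # weight = 0, bias = 0, as in the pseudocode
    W = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(n_iterations):
        # prediction error for every instance: σ(W·x + b) - y
        error = sigmoid(X @ W + bias) - y
        # weight gradient = derivative of the log loss × feature values
        W -= learning_rate * (X.T @ error) / len(y)
        # bias gradient = derivative of the log loss
        bias -= learning_rate * error.mean()
    return W, bias
```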

Now let's look at the C++ code.

The C++ Code

Before showing the full code, we will break the project down into smaller parts.

The first step is to include the required headers and the std namespace.

#include <iostream>
#include <cmath>
#include <vector>

using namespace std;

Create a class with the method signatures shown below.

class CPPLogisticRegression{
    public:
        //method for updating the weights and bias
        vector<double> updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns);
        //method for the prediction
        double predict(vector<double> vW, double* X_train_test);
};

Updating the weights and bias term

Next, let's dissect the method that updates the weights and the bias term.

vector<double> CPPLogisticRegression::updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns){
            double row_pred_diff = 0.0;
            double total_diff = 0.0;
            //use vectors rather than variable-length arrays, which standard C++ does not allow
            vector<double> feature_weight(noOfColumns, 0.0);
            vector<double> total_feature_weight(noOfColumns, 0.0);
            vector<double> weight_derivative(noOfColumns, 0.0);
            double bias_derivative = 0.0;
            vector<double> W(noOfColumns, 0.0);
            double bias = 0.0;
            vector<double> vWB;

            //train set
            vector<vector<double>> X_train = {
            {57.0,0.0,0.0,140.0,241.0,0.0,1.0,123.0,1.0,0.2,1.0,0.0,3.0},
            {45.0,1.0,3.0,110.0,264.0,0.0,1.0,2.0,0.0,1.2,1.0,0.0,3.0},
            {68.0,1.0,0.0,144.0,13.0,1.0,1.0,141.0,0.0,3.4,1.0,2.0,3.0},
            {57.0,1.0,0.0,80.0,1.0,0.0,1.0,115.0,1.0,1.2,1.0,1.0,3.0},
            {57.0,0.0,1.0,0.0,236.0,0.0,0.0,174.0,0.0,0.0,1.0,1.0,2.0},
            {61.0,1.0,0.0,140.0,207.0,0.0,0.0,8.0,1.0,1.4,2.0,1.0,3.0},
            {46.0,1.0,0.0,140.0,311.0,0.0,1.0,120.0,1.0,1.8,1.0,2.0,3.0},
            {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0},
            {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0}};

            //labels
            vector<double> Y = {0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0};
                
            for (int l = 0; l < noOfIterations; l++){
                    //reset the accumulators and the output vector for this pass
                    for (int z = 0; z < noOfColumns; z++){
                        total_feature_weight[z] = 0.0;
                    }
                    total_diff = 0.0;
                    vWB.clear();
                    for (int i = 0; i < noOfRows; i++){
                        double Wx = 0.0;
                        //computing W.x
                        for (int j = 0; j < noOfColumns; j++){
                            Wx += W[j] * X_train[i][j];
                        }
                        //computing σ(W.x + b) - Y
                        row_pred_diff = (1/(1 + exp(-(Wx+bias))))-Y[i];
                        for (int k = 0; k < noOfColumns; k++){
                            //computing (σ(W.x + b) - Y) × x(i)
                            feature_weight[k] = row_pred_diff * X_train[i][k];
                            //summation(Σ) of each feature weight
                            total_feature_weight[k] += feature_weight[k];
                        }
                        //summation(Σ) of predictions
                        total_diff += row_pred_diff;
                            
                    }
                //updating the weights for each feature    
                for (int z = 0; z < noOfColumns; z++){
                        //computing the average of the weights(1/m)
                        weight_derivative[z] = total_feature_weight[z]/noOfRows;
                        W[z] = W[z] - 0.1 * weight_derivative[z];
                        //storing the values in a vector
                        vWB.push_back(W[z]);
                    }
                        
                    //calculating the bias
                    bias_derivative = total_diff/noOfRows;
                    bias = bias - 0.1 * bias_derivative;
                    vWB.push_back(bias);
            }
        return vWB;

}

We need to initialize these variables appropriately. Next, we create an outer for-loop containing two inner sections.

In the first section, one inner loop computes the weighted sum (W.x), while another accumulates the running total of each feature's gradient.

Finally, we add each instance's prediction error to the running sum (Σ).

    for (int i = 0; i < noOfRows; i++){
        double Wx = 0.0;
        //computing W.x
            for (int j = 0; j < noOfColumns; j++){
                Wx += W[j] * X_train[i][j];
            }
        //computing σ(W.x + b) - Y
            row_pred_diff = (1/(1 + exp(-(Wx+bias))))-Y[i];
            for (int k = 0; k < noOfColumns; k++){
                //computing (σ(W.x + b) - Y) × x(i)
                feature_weight[k] = row_pred_diff * X_train[i][k];
                //summation(Σ) of each feature weight
                total_feature_weight[k] += feature_weight[k];
            }
            //summation(Σ) of predictions
            total_diff += row_pred_diff;
                        
    }
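As a cross-check, the loop above is equivalent to this vectorized NumPy sketch (illustrative only; the names mirror the C++ variables, but the function itself is mine):

```python
import numpy as np

def accumulate_gradients(X_train, Y, W, bias):
    # σ(W.x + b) for every row at once, replacing the per-row j-loop
    probs = 1.0 / (1.0 + np.exp(-(X_train @ W + bias)))
    # σ(W.x + b) - Y per instance: row_pred_diff in the C++ code
    row_pred_diff = probs - Y
    # Σ of each feature's contribution: total_feature_weight in the C++ code
    total_feature_weight = X_train.T @ row_pred_diff
    # Σ of the prediction errors: total_diff in the C++ code
    total_diff = row_pred_diff.sum()
    return total_feature_weight, total_diff
```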

In the second section, we compute the derivative for each feature weight by averaging the accumulated totals, and then update the weights.

These weights are then stored in a vector (0.1 is the learning rate).

    for (int z = 0; z < noOfColumns; z++){
        //computing the average of the weights(1/m)
        weight_derivative[z] = total_feature_weight[z]/noOfRows;
        W[z] = W[z] - 0.1 * weight_derivative[z];
        //storing the values in a vector
        vWB.push_back(W[z]);
    }

The last step of the outer loop is to update the bias term and store it as the final element of the vector.

We store the weights and bias together in one vector because a C++ function cannot return multiple values directly the way a Python function can.

//calculating the bias
bias_derivative = total_diff/noOfRows;
bias = bias - 0.1 * bias_derivative;
vWB.push_back(bias);

The function returns the vector containing the weights and the bias term.

Prediction

The vector returned by the previous function is passed into this function together with the test feature array.

We compute the weighted sum as in the previous function, then apply the sigmoid to obtain a probability.

Since we trained on only a handful of instances, the accuracy will be quite low.

    double CPPLogisticRegression::predict(vector<double> vW, double* X_train_test){
        double predictions = 0.0;
        double Wx_test = 0.0;
            //calculating the σ(W.x)
            for (int j = 0; j < 13; j++){
                Wx_test += (vW[j] * X_train_test[j]);
            }
            //adding the bias term
            predictions = 1/(1 + exp(-(Wx_test + vW.back()))); 
            //making the prediction
            if(predictions>0.5){
                predictions = 1.0;
            }else{
                predictions = 0.0;
            }
        return predictions;
    }
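A handy way to sanity-check this method is a pure-Python reference with the same logic (an assumed helper I wrote for testing; it is not part of the C++ build):

```python
import math

def predict_reference(weights_and_bias, features):
    # the input mirrors the C++ vector: weights first, bias as the last element
    *weights, bias = weights_and_bias
    wx = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-(wx + bias)))
    # threshold the probability at 0.5, as the C++ code does
    return 1.0 if prob > 0.5 else 0.0
```

Feeding both implementations the same weights and features should give identical class labels.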

We use an extern "C" block to expose functions that can be called from outside the C++ code.

These are the functions we will call from the Python code. On Windows, you would additionally prefix each function with __declspec(dllexport), i.e.:

__declspec(dllexport) CPPLogisticRegression* LogisticRegression(){
    //......
}

You can read more about ctypes in the official documentation.

    extern "C"{
        //vector to store the weights and bias gotten from the updateWeightsAndBias() function
        vector<double> vX;
        CPPLogisticRegression* LogisticRegression(){
            CPPLogisticRegression* log_reg = new CPPLogisticRegression();
            return log_reg;
        }

        void fit(CPPLogisticRegression* log_reg) {
            vX = log_reg->updateWeightsAndBias(50,9,13); 
        }

        double predict(CPPLogisticRegression* log_reg, double* array){
            return log_reg->predict(vX,array);
        }
    }

In the code above, the LogisticRegression() function instantiates the class we created and returns it.

The fit() function calls the method that updates the weights and bias term and stores the returned vector. That vector is later passed to the class's predict() method inside the exported predict() function.

Note the difference between the two similarly named predict functions. The array passed to the exported predict function will come from Python.

The full C++ code is shown below.

#include <iostream>
#include <cmath>
#include <vector>

using namespace std;

class CPPLogisticRegression{
    public:
        //method for updating the weights and bias
        vector<double> updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns);
        //method for the prediction
        double predict(vector<double> vW, double* X_train_test);
};

        vector<double> CPPLogisticRegression::updateWeightsAndBias(int noOfIterations, int noOfRows, int noOfColumns){
                double row_pred_diff = 0.0;
                double total_diff = 0.0;
                //use vectors rather than variable-length arrays, which standard C++ does not allow
                vector<double> feature_weight(noOfColumns, 0.0);
                vector<double> total_feature_weight(noOfColumns, 0.0);
                vector<double> weight_derivative(noOfColumns, 0.0);
                double bias_derivative = 0.0;
                vector<double> W(noOfColumns, 0.0);
                double bias = 0.0;
                vector<double> vWB;

                //train set
                vector<vector<double>> X_train = {
                {57.0,0.0,0.0,140.0,241.0,0.0,1.0,123.0,1.0,0.2,1.0,0.0,3.0},
                {45.0,1.0,3.0,110.0,264.0,0.0,1.0,2.0,0.0,1.2,1.0,0.0,3.0},
                {68.0,1.0,0.0,144.0,13.0,1.0,1.0,141.0,0.0,3.4,1.0,2.0,3.0},
                {57.0,1.0,0.0,80.0,1.0,0.0,1.0,115.0,1.0,1.2,1.0,1.0,3.0},
                {57.0,0.0,1.0,0.0,236.0,0.0,0.0,174.0,0.0,0.0,1.0,1.0,2.0},
                {61.0,1.0,0.0,140.0,207.0,0.0,0.0,8.0,1.0,1.4,2.0,1.0,3.0},
                {46.0,1.0,0.0,140.0,311.0,0.0,1.0,120.0,1.0,1.8,1.0,2.0,3.0},
                {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0},
                {62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0}};

                //labels
                vector<double> Y = {0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0};
                

                for (int l = 0; l < noOfIterations; l++){
                        //reset the accumulators and the output vector for this pass
                        for (int z = 0; z < noOfColumns; z++){
                            total_feature_weight[z] = 0.0;
                        }
                        total_diff = 0.0;
                        vWB.clear();
                        for (int i = 0; i < noOfRows; i++){
                            double Wx = 0.0;
                            //computing W.x
                            for (int j = 0; j < noOfColumns; j++){
                                Wx += W[j] * X_train[i][j];
                            }
                            //computing σ(W.x + b) - Y
                            row_pred_diff = (1/(1 + exp(-(Wx+bias))))-Y[i];
                            for (int k = 0; k < noOfColumns; k++){
                                //computing (σ(W.x + b) - Y) × x(i)
                                feature_weight[k] = row_pred_diff * X_train[i][k];
                                //summation(Σ) of each feature weight
                                total_feature_weight[k] += feature_weight[k];
                            }
                            //summation(Σ) of predictions
                            total_diff += row_pred_diff;
                            
                        }
                    //updating the weights for each feature    
                    for (int z = 0; z < noOfColumns; z++){
                            //computing the average of the weights(1/m)
                            weight_derivative[z] = total_feature_weight[z]/noOfRows;
                            W[z] = W[z] - 0.1 * weight_derivative[z];
                            //storing the values in a vector
                            vWB.push_back(W[z]);
                    }
                        
                        //calculating the bias
                        bias_derivative = total_diff/noOfRows;
                        bias = bias - 0.1 * bias_derivative;
                        vWB.push_back(bias);
                }
            return vWB;

        }

        double CPPLogisticRegression::predict(vector<double> vW, double* X_train_test){
            double predictions = 0.0;
            double Wx_test = 0.0;
                //computing σ(W.x)
                for (int j = 0; j < 13; j++){
                    Wx_test += (vW[j] * X_train_test[j]);
                }
                //adding the bias term
                predictions = 1/(1 + exp(-(Wx_test + vW.back()))); 
                //making the prediction
                if(predictions>0.5){
                    predictions = 1.0;
                }else{
                    predictions = 0.0;
                }
            return predictions;
        }

extern "C"{
    //vector to store the weights and bias gotten from the updateWeightsAndBias() function
    vector<double> vX;
    CPPLogisticRegression* LogisticRegression(){
        CPPLogisticRegression* log_reg = new CPPLogisticRegression();
        return log_reg;
    }

    void fit(CPPLogisticRegression* log_reg) {
        vX = log_reg->updateWeightsAndBias(50,9,13); 
    }

    double predict(CPPLogisticRegression* log_reg, double* array){
        return log_reg->predict(vX,array);
    }
}

Before we look at the Python code, let's create a shared library.

Creating a shared library

Create a Python file named setup.py and add the following code.

from setuptools import setup, Extension

module1 = Extension('logistic',
                    sources = ['logistic.cpp'])

setup (name = 'Logistic Regression Model',
    version = '1.0',
    description = 'This is a Logistic Regression Model written in C++',
    ext_modules = [module1])

The code above builds a shared library named logistic from the logistic.cpp file. The file is created inside the build directory.

Note that the build process itself is cross-platform, but the output differs by platform: Linux creates a .so file, while Windows produces a .pyd file.

I ran my build on Linux, which produced a file named logistic.cpython-310-x86_64-linux-gnu.so. Be sure to check the exact name of the file on your machine.

Run the build with the following command in your terminal.

python setup.py build
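Because the artifact's exact name varies by platform and Python version, a small stdlib snippet can locate whatever the build produced (assuming the logistic name and the default build/lib.* layout from the setup above):

```python
import glob
import os

def find_shared_library(build_dir="build"):
    # match .so (Linux/macOS) or .pyd (Windows) under any build/lib.* folder
    matches = glob.glob(os.path.join(build_dir, "lib.*", "logistic*"))
    return matches[0] if matches else None
```

The returned path can be passed straight to ct.CDLL().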

The Python Code

As with the C++ code, we start by importing the required modules.

import ctypes as ct
import numpy as np
import pandas as pd

Next, we load the shared library we created.

#the build file location
libfile = r"build/lib.linux-x86_64-3.10/logistic.cpython-310-x86_64-linux-gnu.so"
#loading it for use
our_lib = ct.CDLL(libfile)

Then we declare the argument and return types of the functions from the extern "C" section of the C++ file.

#setting the argument and return types for our C++ functions
our_lib.LogisticRegression.restype = ct.c_void_p
our_lib.fit.argtypes = [ct.c_void_p]
our_lib.predict.argtypes = [ct.c_void_p, np.ctypeslib.ndpointer(dtype=np.float64)]
our_lib.predict.restype = ct.c_double

The rest of the code initializes the class, creates the array to pass to the predict() method, and displays the predicted value.

#initializing the class
tree_obj = our_lib.LogisticRegression()

#the array to test the model
test_features = np.array((62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0))
test_features = test_features.astype(np.double)

#calling the fit method
our_lib.fit(tree_obj)

#predicting
pred = our_lib.predict(tree_obj,test_features)
print("Predicted value:",pred)

The full Python code is shown below.

import ctypes as ct
import numpy as np
import pandas as pd

#the build file location
libfile = r"build/lib.linux-x86_64-3.10/logistic.cpython-310-x86_64-linux-gnu.so"
#loading it for use
our_lib = ct.CDLL(libfile)

#setting the argument and return types for our C++ functions
our_lib.LogisticRegression.restype = ct.c_void_p
our_lib.fit.argtypes = [ct.c_void_p]
our_lib.predict.argtypes = [ct.c_void_p, np.ctypeslib.ndpointer(dtype=np.float64)]
our_lib.predict.restype = ct.c_double

#initializing the class
tree_obj = our_lib.LogisticRegression()

#the array to test the model
test_features = np.array((62.0,1.0,1.0,128.0,208.0,1.0,0.0,140.0,0.0,0.0,2.0,0.0,2.0))
test_features = test_features.astype(np.double)

#calling the fit method
our_lib.fit(tree_obj)

#predicting
pred = our_lib.predict(tree_obj,test_features)
print("Predicted value:",pred)

Conclusion

In this tutorial, we covered logistic regression and the gradient descent optimization algorithm. We then wrote the C++ code and built it as a shared library for use from Python.

You can now apply this knowledge to create your own custom C++ models.

Besides ctypes, there are other wrapping tools such as CFFI and PyBind11.