# Key Points of 10 Machine Learning Algorithms (with Python and R Code)


1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting

## 1. Linear Regression

Python code

```
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model

#Identify feature and response variable(s) and values must be numeric and numpy arrays
x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

#Predict Output
predicted = linear.predict(x_test)
```

R code

```
#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)

#Predict Output
predicted <- predict(linear, x_test)
```
## 2. Logistic Regression

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
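The odds/logit relationship above is easy to check numerically. A minimal sketch (the probability value is made up for illustration):

```python
import math

def logit(p):
    """ln(odds): maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps a real-valued score back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
odds = p / (1 - p)           # 4.0: the event is 4 times as likely as its complement
z = logit(p)                 # ln(4), the linear score b0 + b1X1 + ...
print(odds, z, sigmoid(z))   # sigmoid recovers the original probability p
```

This is why the right-hand side b0 + b1X1 + ... + bkXk can range over all real numbers while the predicted probability stays in (0, 1).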

Python code

```
#Import Library
from sklearn.linear_model import LogisticRegression

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
x <- cbind(x_train, y_train)

# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)

#Predict Output
predicted <- predict(logistic, x_test)
```

## 3. Decision Tree

Python code

```
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create tree object for classification
# The criterion can be 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')

# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
library(rpart)
x <- cbind(x_train, y_train)

# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)

#Predict Output
predicted <- predict(fit, x_test)
```
## 4. SVM (Support Vector Machine)

In the example above, the black line splits the data into two groups optimally: the nearest points in each group (points A and B in the figure) are as far from the black line as possible. This line is our separating boundary. A test point is then classified according to which side of the line it falls on.
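As a sketch of this idea (the 2-D points below are made up for illustration), scikit-learn's `SVC` with a linear kernel learns such a separating line and classifies test points by which side they land on:

```python
import numpy as np
from sklearn import svm

# Two linearly separable groups of 2-D points (toy data)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel corresponds to a straight separating line in 2-D
model = svm.SVC(kernel='linear')
model.fit(X, y)

# Each test point is assigned to whichever side of the line it falls on
print(model.predict([[2, 2], [6, 5]]))  # [0 1]
```

The points closest to the boundary (the support vectors, like A and B in the figure) are the only ones that determine where the line lies.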

Python code

```
#Import Library
from sklearn import svm

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create SVM classification object
model = svm.SVC()  # there are various options associated with it; this is simple classification
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)

#Predict Output
predicted <- predict(fit, x_test)
```
## 5. Naive Bayes

- P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
- P(c) is the prior probability of the class
- P(x|c) is the likelihood, i.e. the probability of the predictor given the class
- P(x) is the prior probability of the predictor
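Bayes' theorem combines these terms as P(c|x) = P(x|c) · P(c) / P(x). A small worked example with made-up counts (a classic "play given sunny weather" toy setup):

```python
# Hypothetical counts: 9 of 14 days were "play", 3 of those 9 were sunny,
# and 5 of the 14 days overall were sunny.
p_c = 9 / 14          # prior P(play)
p_x_given_c = 3 / 9   # likelihood P(sunny | play)
p_x = 5 / 14          # evidence P(sunny)

# Posterior via Bayes' theorem: P(play | sunny)
p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)  # 0.6
```

The "naive" part of Naive Bayes is that, with several predictors, the likelihood is factored as a product of per-attribute probabilities, assuming the attributes are independent given the class.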

Python code

```
#Import Library
from sklearn.naive_bayes import GaussianNB

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create Gaussian Naive Bayes object
# There are other distributions for multinomial classes, such as Bernoulli Naive Bayes
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)

#Predict Output
predicted <- predict(fit, x_test)
```
## 6. KNN (K-Nearest Neighbors)

KNN is computationally expensive.
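The cost comes at prediction time: plain KNN stores the whole training set and, for every query, computes the distance to every training point. A brute-force sketch (toy data, names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Distance from the query to ALL n training points: O(n * d) per query
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()   # majority vote among the k neighbors

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))  # 0
```

This per-query scan over the full training set is why KNN becomes slow on large data; libraries mitigate it with index structures such as KD-trees or ball trees.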

Python code

```
#Import Library
from sklearn.neighbors import KNeighborsClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
# knn() lives in the 'class' package; it fits and predicts in one call
library(class)
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
```
## 7. K-Means

K-means is an unsupervised learning algorithm that solves clustering problems. Its procedure for partitioning a data set into a given number of clusters (say, k clusters) is simple: data points within a cluster are homogeneous, and heterogeneous to points in other clusters.

How K-means forms clusters:

K-means picks k points, called centroids; each cluster has its own centroid.

Within a cluster, the sum of squared distances between the centroid and the data points is that cluster's within-cluster sum of squares. Summing this quantity over all clusters gives the total within-cluster sum of squares of the clustering solution.
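The within-cluster sum of squares described above can be computed directly. A sketch with made-up points, labels, and centroids:

```python
import numpy as np

def wcss(points, centroids, labels):
    """Total within-cluster sum of squares: for each cluster, sum the squared
    distances from its member points to its centroid, then add across clusters."""
    total = 0.0
    for c in range(len(centroids)):
        members = points[labels == c]
        total += ((members - centroids[c]) ** 2).sum()
    return total

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])  # each cluster's mean
print(wcss(points, centroids, labels))  # 4.0
```

K-means iteratively reassigns points to the nearest centroid and recomputes centroids as cluster means, which can only decrease this quantity; that is the objective it minimizes.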

Python code

```
#Import Library
from sklearn.cluster import KMeans

#Assumed you have X (attributes) for the training data set and x_test (attributes) for the test dataset
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)

# Train the model using the training sets and check score
k_means.fit(X)

#Predict Output
predicted = k_means.predict(x_test)
```
R code

```
library(cluster)
fit <- kmeans(X, 3)  # 3 cluster solution
```
## 8. Random Forest

Python code

```
#Import Library
from sklearn.ensemble import RandomForestClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create Random Forest object
model = RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted = model.predict(x_test)
```
R code

```
library(randomForest)
x <- cbind(x_train, y_train)

# Fitting model
fit <- randomForest(Species ~ ., x, ntree = 500)
summary(fit)

#Predict Output
predicted <- predict(fit, x_test)
```
## 9. Dimensionality Reduction Algorithms

Python code

```
#Import Library
from sklearn import decomposition

#Assumed you have training and test data sets as train and test
# Create PCA object
pca = decomposition.PCA(n_components=k)  # default value of k = min(n_samples, n_features)
# For Factor analysis
#fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)

#Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
```

R code

```
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
```

## 10. Gradient Boosting

Python code

```
#Import Library
from sklearn.ensemble import GradientBoostingClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) for the test dataset
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted = model.predict(x_test)
```