Random Forest Case Study: Otto Product Classification

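[Figure: number of samples per target class]
The figure shows that the classes are imbalanced. Because the dataset is large, the imbalance is handled with random undersampling.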

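4.2 Basic data preprocessing

(1) Separate the features and the labels

# Random undersampling needs the features and the labels as separate objects
y=data["target"]
x=data.drop(["id","target"],axis=1)

(2) Random undersampling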

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
x_resampled,y_resampled = rus.fit_resample(x,y)

Check the data shapes before and after undersampling:

x.shape,y.shape
# ((61878, 93), (61878,))
x_resampled.shape,y_resampled.shape
# ((17361, 93), (17361,))
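
To make the step above concrete, here is a minimal sketch of what random undersampling does, written with the standard library only. It is not imblearn's implementation; the helper name `random_undersample` and the toy labels are assumptions made for illustration. Every class is cut down to the size of the rarest class, which is consistent with the resampled shape above: 17361 = 9 × 1929, i.e. nine equally sized classes.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, labels, seed=0):
    """Randomly drop samples so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # The rarest class decides how many samples each class may keep
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n_keep))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data (hypothetical): 6 samples of class "a", 3 of "b", 2 of "c"
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
rows = [[i] for i in range(len(labels))]
rows_res, labels_res = random_undersample(rows, labels)
print(Counter(labels_res))
```

Which rows survive depends on the seed, but the class counts do not: each class ends up with two samples here.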

Check whether the classes are balanced after undersampling:

sns.countplot(x=y_resampled)
plt.show()

[Figure: number of samples per class after undersampling]

(3) Convert the labels to integers

y_resampled

[Output: y_resampled before encoding, with string class labels]

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_resampled = le.fit_transform(y_resampled)
y_resampled
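
As a rough illustration of the encoding step, the sketch below maps each distinct label to its position in sorted order, which is the behaviour `LabelEncoder` is documented to have. The helper `encode_labels` is hypothetical, and the toy labels assume the `Class_N` naming of the Otto `target` column.

```python
def encode_labels(labels):
    """Map each distinct label to its index in sorted order (a sketch of label encoding)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

encoded, classes = encode_labels(["Class_2", "Class_9", "Class_1", "Class_2"])
print(encoded)   # [1, 2, 0, 1]
print(classes)   # ['Class_1', 'Class_2', 'Class_9']
```

With scikit-learn itself, `le.classes_` holds the sorted original labels and `le.inverse_transform` maps the integers back.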

[Output: y_resampled after encoding, as an integer array]

(4) Split the data

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x_resampled,y_resampled,test_size=0.2)

4.3 Model training

from sklearn.ensemble import RandomForestClassifier

estimator = RandomForestClassifier(oob_score=True)
estimator.fit(x_train,y_train)

4.4 Model evaluation

This task is evaluated with log loss.
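
For reference, the usual definition of multiclass log loss is given below. The notation is an assumption of this note, not taken from the original post: N is the number of samples, M the number of classes, y_ij is 1 when sample i belongs to class j (0 otherwise), and p_ij is the predicted probability that sample i belongs to class j.

```latex
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \, \log\left(p_{ij}\right)
```

Only the probability assigned to the true class contributes for each sample, so a confident wrong prediction (true-class probability near 0) is penalised heavily.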

y_pre = estimator.predict(x_test)
y_test,y_pre

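[Output: y_test and y_pre as integer label arrays]

Note that log_loss expects y_pred as one column of scores per class, so the hard label predictions are one-hot encoded before being passed in:

from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
one_hot = OneHotEncoder(sparse_output=False)
# Fit once on all labels so both arrays get the same columns
one_hot.fit(y_resampled.reshape(-1,1))
y_pre = one_hot.transform(y_pre.reshape(-1,1))
y_test = one_hot.transform(y_test.reshape(-1,1))
y_test,y_pre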
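
The following plain-Python sketch shows what one-hot encoding produces for integer labels. The helper `one_hot_rows` is hypothetical; passing the number of classes explicitly is a design choice that keeps the column count fixed even if some class never appears in a given array.

```python
def one_hot_rows(labels, n_classes):
    """Turn integer labels into rows with a single 1.0 at the label's position."""
    return [[1.0 if j == label else 0.0 for j in range(n_classes)]
            for label in labels]

rows = one_hot_rows([0, 2, 1], n_classes=3)
print(rows)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```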

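[Output: y_test and y_pre after one-hot encoding]

from sklearn.metrics import log_loss

# The original run passed eps=1e-15 and normalize=True; eps is deprecated in
# recent scikit-learn releases, so both are left at their defaults here
log_loss(y_test,y_pre)
# 7.637713870225003 (value reported with eps=1e-15)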

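Switch to probability outputs, so that each prediction is a distribution over the classes, to lower the log loss:

y_pre_proba = estimator.predict_proba(x_test)
y_pre_proba

[Output: y_pre_proba, one row of class probabilities per test sample]

log_loss(y_test,y_pre_proba)
# 0.7611795612521034 (value reported with eps=1e-15)

The log loss drops substantially, from about 7.64 to about 0.76.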
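
A small sketch of why hard predictions score so badly: the function below computes the mean multiclass log loss in plain Python, clipping probabilities at 1e-15 as in the calls above. The name `log_loss_rows` and the toy arrays are assumptions for illustration, not scikit-learn's implementation.

```python
import math

def log_loss_rows(y_true_onehot, y_prob, eps=1e-15):
    """Mean multiclass log loss; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for truth, probs in zip(y_true_onehot, y_prob):
        for t, p in zip(truth, probs):
            p = min(max(p, eps), 1 - eps)
            total -= t * math.log(p)
    return total / len(y_true_onehot)

y_true = [[1, 0, 0], [0, 1, 0]]
hard = [[1, 0, 0], [0, 0, 1]]              # second sample: fully confident and wrong
soft = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]  # second sample: wrong, but hedged

hard_loss = log_loss_rows(y_true, hard)
soft_loss = log_loss_rows(y_true, soft)
print(hard_loss, soft_loss)
```

With this clipping, each wrong hard prediction costs about -log(1e-15) ≈ 34.5, so a hard-label log loss of 7.64 would correspond to roughly 22% of the test samples being misclassified. Probabilities avoid that penalty because a wrong answer still leaves some mass on the true class.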

4.5 Model tuning

(1) Choose n_estimators

# Candidate values for n_estimators
tuned_parameters = range(10,200,10)

# Array to hold the OOB accuracy for each candidate
accuracy_t = np.zeros(len(tuned_parameters))

# Array to hold the test-set log loss for each candidate
error_t = np.zeros(len(tuned_parameters))

# Tuning loop
for i,one_parameter in enumerate(tuned_parameters):
    estimator = RandomForestClassifier(n_estimators=one_parameter,
                                       max_depth=10,
                                       max_features=10,
                                       min_samples_leaf=10,
                                       oob_score=True,
                                       random_state=0,
                                       n_jobs=-1)
    estimator.fit(x_train,y_train)
    
    # Record the out-of-bag accuracy
    accuracy_t[i] = estimator.oob_score_

    # Record the log loss on the test set
    y_pre = estimator.predict_proba(x_test)
    error_t[i] = log_loss(y_test,y_pre)

# Plot the tuning results
fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(20,4),dpi=100)
axes[0].plot(tuned_parameters,accuracy_t)
axes[1].plot(tuned_parameters,error_t)

axes[0].set_xlabel("n_estimators")
axes[0].set_ylabel("accuracy_t")

axes[1].set_xlabel("n_estimators")
axes[1].set_ylabel("error_t")

axes[0].grid()
axes[1].grid()

[Figure: OOB accuracy (left) and log loss (right) versus n_estimators]
Based on the plots, n_estimators=175 is chosen as a value that performs well.
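
Reading the value off the chart can be complemented by picking the candidate with the lowest recorded log loss. The sketch below does that in plain Python; the helper `best_by_log_loss` is hypothetical, and the loss values are made up purely to show the call, not results from this experiment.

```python
def best_by_log_loss(candidates, losses):
    """Return the candidate whose recorded log loss is smallest, with that loss."""
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    return candidates[best_index], losses[best_index]

# Hypothetical losses, only to show the call
candidates = list(range(10, 60, 10))
losses = [0.95, 0.91, 0.89, 0.90, 0.92]
print(best_by_log_loss(candidates, losses))  # (30, 0.89)
```

In the tuning loop above, the equivalent call would be `best_by_log_loss(list(tuned_parameters), error_t)`.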

(2) Choose max_depth

# Candidate values for max_depth
tuned_parameters = range(10,100,10)

# Array to hold the OOB accuracy for each candidate
accuracy_t = np.zeros(len(tuned_parameters))

# Array to hold the test-set log loss for each candidate
error_t = np.zeros(len(tuned_parameters))

# Tuning loop
for i,one_parameter in enumerate(tuned_parameters):
    estimator = RandomForestClassifier(n_estimators=175,
                                       max_depth=one_parameter,
                                       max_features=10,
                                       min_samples_leaf=10,
                                       oob_score=True,
                                       random_state=0,
                                       n_jobs=-1)
    estimator.fit(x_train,y_train)
    
    # Record the out-of-bag accuracy
    accuracy_t[i] = estimator.oob_score_

    # Record the log loss on the test set
    y_pre = estimator.predict_proba(x_test)
    error_t[i] = log_loss(y_test,y_pre)

# Plot the tuning results
fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(20,4),dpi=100)
axes[0].plot(tuned_parameters,accuracy_t)
axes[1].plot(tuned_parameters,error_t)

axes[0].set_xlabel("max_depth")
axes[0].set_ylabel("accuracy_t")

axes[1].set_xlabel("max_depth")
axes[1].set_ylabel("error_t")

axes[0].grid()
axes[1].grid()

[Figure: OOB accuracy (left) and log loss (right) versus max_depth]
Based on the plots, max_depth=30 is chosen as a value that performs well.

(3) Choose max_features

# Candidate values for max_features
tuned_parameters = range(5,40,5)

# Array to hold the OOB accuracy for each candidate
accuracy_t = np.zeros(len(tuned_parameters))

# Array to hold the test-set log loss for each candidate
error_t = np.zeros(len(tuned_parameters))

# Tuning loop
for i,one_parameter in enumerate(tuned_parameters):
    estimator = RandomForestClassifier(n_estimators=175,
                                       max_depth=30,
                                       max_features=one_parameter,
                                       min_samples_leaf=10,
                                       oob_score=True,
                                       random_state=0,
                                       n_jobs=-1)
    estimator.fit(x_train,y_train)
    
    # Record the out-of-bag accuracy
    accuracy_t[i] = estimator.oob_score_

    # Record the log loss on the test set
    y_pre = estimator.predict_proba(x_test)
    error_t[i] = log_loss(y_test,y_pre)

# Plot the tuning results
fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(20,4),dpi=100)
axes[0].plot(tuned_parameters,accuracy_t)
axes[1].plot(tuned_parameters,error_t)

axes[0].set_xlabel("max_features")
axes[0].set_ylabel("accuracy_t")

axes[1].set_xlabel("max_features")
axes[1].set_ylabel("error_t")

axes[0].grid()
axes[1].grid()

[Figure: OOB accuracy (left) and log loss (right) versus max_features]
Based on the plots, max_features=15 is chosen as a value that performs well.

(4) Choose min_samples_leaf

# Candidate values for min_samples_leaf
tuned_parameters = range(1,10,2)

# Array to hold the OOB accuracy for each candidate
accuracy_t = np.zeros(len(tuned_parameters))

# Array to hold the test-set log loss for each candidate
error_t = np.zeros(len(tuned_parameters))

# Tuning loop
for i,one_parameter in enumerate(tuned_parameters):
    estimator = RandomForestClassifier(n_estimators=175,
                                       max_depth=30,
                                       max_features=15,
                                       min_samples_leaf=one_parameter,
                                       oob_score=True,
                                       random_state=0,
                                       n_jobs=-1)
    estimator.fit(x_train,y_train)
    
    # Record the out-of-bag accuracy
    accuracy_t[i] = estimator.oob_score_

    # Record the log loss on the test set
    y_pre = estimator.predict_proba(x_test)
    error_t[i] = log_loss(y_test,y_pre)

# Plot the tuning results
fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(20,4),dpi=100)
axes[0].plot(tuned_parameters,accuracy_t)
axes[1].plot(tuned_parameters,error_t)

axes[0].set_xlabel("min_samples_leaf")
axes[0].set_ylabel("accuracy_t")

axes[1].set_xlabel("min_samples_leaf")
axes[1].set_ylabel("error_t")

axes[0].grid()
axes[1].grid()

[Figure: OOB accuracy (left) and log loss (right) versus min_samples_leaf]
Based on the plots, min_samples_leaf=1 is chosen as a value that performs well.

(5) Train the final model

estimator = RandomForestClassifier(n_estimators=175,
                                   max_depth=30,
                                   max_features=15,
                                   min_samples_leaf=1,
                                   oob_score=True,
                                   random_state=0,
                                   n_jobs=-1)
estimator.fit(x_train,y_train)
y_pre_proba = estimator.predict_proba(x_test)
log_loss(y_test,y_pre_proba)
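
The four tuning steps above share one pattern: vary a single parameter, keep the best value, then move to the next parameter. That pattern can be sketched generically as below. The function `tune_one_at_a_time` and the stand-in `toy_loss` are hypothetical; with real data, the scoring callable would fit a `RandomForestClassifier` with the given parameters and return its log loss on held-out data.

```python
def tune_one_at_a_time(grids, defaults, score_fn):
    """Tune parameters in order, fixing each at its best value before moving on.

    grids: dict of parameter name -> candidate values (dict order is the tuning order)
    defaults: starting value for every parameter
    score_fn: callable taking a params dict and returning a loss (lower is better)
    """
    params = dict(defaults)
    for name, candidates in grids.items():
        scored = [(score_fn({**params, name: value}), value) for value in candidates]
        best_loss, best_value = min(scored, key=lambda pair: pair[0])
        params[name] = best_value
    return params

# Stand-in loss so the sketch runs without a dataset; it is not a real model score
def toy_loss(p):
    return (p["n_estimators"] - 170) ** 2 + (p["max_depth"] - 30) ** 2

grids = {"n_estimators": range(10, 200, 10), "max_depth": range(10, 100, 10)}
defaults = {"n_estimators": 10, "max_depth": 10}
best = tune_one_at_a_time(grids, defaults, toy_loss)
print(best)  # {'n_estimators': 170, 'max_depth': 30}
```

Tuning one parameter at a time is cheap but can miss interactions between parameters. scikit-learn's `GridSearchCV` is an exhaustive, cross-validated alternative when the extra compute is acceptable.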

