In this lecture, the instructor again uses the IMDB dataset as the running example.
The lecture opens with a question: since cross-validation also estimates how a model performs on unseen data, why not just run cross-validation on the entire dataset?
Answer: the cross-validation score can sometimes be overly optimistic, so holding out a separate test set is the only reliable check. The tricky part is that if the dataset you were given is small to begin with, the test set will also be small, and a small test set cannot give an accurate test error either. In short: the less data you have overall, the more careful you should be not to overestimate your model's ability; the more data you have, the milder the related overfitting problems become (whether overfitting on the validation set, or the ordinary overfitting we studied earlier).
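A rough way to quantify "a small test set cannot give an accurate test error" is the binomial standard error of an accuracy estimate. The sketch below is illustrative only; p = 0.80 is an assumed true accuracy, not a number from the lecture.

```python
import numpy as np

def acc_std_error(p, n):
    # Standard error of a measured accuracy when the true accuracy is p
    # and the test set has n examples (binomial approximation).
    return np.sqrt(p * (1 - p) / n)

p = 0.80  # assumed true accuracy (hypothetical)
for n in [100, 500, 49500]:
    print(f"n={n:6d}  std. error of test accuracy ~ {acc_std_error(p, n):.3f}")
```

With 100 examples the estimate easily moves by several percentage points, while with ~50,000 examples it is pinned down to roughly ±0.002, which is why we can trust the huge test set used later in this lecture.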
Optimization bias (also called overfitting on the validation set)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
imdb_df = pd.read_csv('data/imdb_master.csv', index_col=0, encoding="ISO-8859-1")
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]
# This time we don't subsample; we use all 50,000 rows
# test_size=0.99 is deliberate, to demonstrate overfitting on the validation set
imdb_train, imdb_test = train_test_split(imdb_df, random_state=123, test_size=0.99)
# Here, we'll simulate having a tiny dataset by putting most of the data in test.
# Having a huge test set will be useful for pedagogical reasons as we can really trust the test error.
# (If I showed you a tiny dataset then I couldn't get a good test error at the end)
imdb_train.shape # (500, 4)
imdb_test.shape # (49500, 4)
X_train_imdb_raw = imdb_train['review']
y_train_imdb = imdb_train['label']
X_test_imdb_raw = imdb_test['review']
y_test_imdb = imdb_test['label']
countvec_fixed_params = {"min_df" : 0.01}
lr_fixed_params = {"max_iter" : 1000}
countvec = CountVectorizer(**countvec_fixed_params)
lr = LogisticRegression(**lr_fixed_params)
pipe = Pipeline([
    ('countvec', countvec),
    ('lr', lr)])
param_grid = {
    "countvec__max_df" : np.arange(0.02, 0.42, 0.01),  # 40 values
    "lr__C" : [0.01, 0.1, 1, 10, 100]                  # 5 values
}
grid_search = GridSearchCV(pipe, param_grid, verbose=2, n_jobs=-1, return_train_score=True)
grid_search.fit(X_train_imdb_raw, y_train_imdb);
# Fitting 5 folds for each of 200 candidates, totalling 1000 fits
We put 99% of the IMDB dataset into the test set. No one would do this in practice; it is purely for teaching purposes. We then run a grid search. Because the grid contains 200 different hyperparameter combinations, we are effectively evaluating on the validation data over and over again. Even though each cross-validation round trains on the train folds and evaluates on the validation fold, once the number of evaluations gets this large, the validation folds are, in a sense, no longer truly unseen data.
grid_results_df = pd.DataFrame(grid_search.cv_results_)[['mean_test_score', 'mean_train_score', 'param_countvec__max_df', 'param_lr__C', 'mean_fit_time', 'rank_test_score']]
grid_results_df = grid_results_df.sort_values(by="mean_test_score", ascending=False)
grid_results_df
Printing grid_results_df: because the whole training set is tiny, only 500 rows (after the cross-validation split, 400 for training and 100 for validation, and 100 really isn't much), the highest mean_test_score here is not trustworthy. With so little data there is a lot of randomness; the "good" hyperparameters we found may simply happen to do well on this small amount of validation data. It's like an exam with only a few questions: a weak student can get lucky and outscore a strong one.
We have 49,500 rows of test data (more than enough to give a reliable score), so we can easily verify this: the test score comes out at 0.78, not the 0.84 that topped the ranking. This phenomenon is called optimization bias.
grid_search.score(X_test_imdb_raw, y_test_imdb)
# Output:
# 0.7853131313131313
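The effect can be reproduced in isolation. In the toy simulation below (everything is made up), 200 "candidates" are all coin-flip classifiers with a true accuracy of exactly 0.5, each scored on the same 100 validation labels, mirroring the 200 grid combinations and ~100-row validation folds above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_val = 100         # size of one validation fold above
n_candidates = 200  # number of hyperparameter combinations in the grid above

# Every candidate is pure chance: true accuracy is exactly 0.5.
sim_scores = rng.binomial(n_val, 0.5, size=n_candidates) / n_val

print("mean validation score:", sim_scores.mean())  # close to the true 0.5
print("best validation score:", sim_scores.max())   # clearly above 0.5
```

The best score lands well above 0.5 even though no candidate is better than chance; taking the maximum over many noisy scores is exactly the optimization bias we just observed.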
The instructor stressed repeatedly that this problem is genuinely hard. What do we do in practice? There is no silver bullet: we still pick the highest-scoring hyperparameters from grid_results_df, but because of optimization bias we don't put too much faith in that score (don't tell your boss or your users that the model is this accurate, only to be embarrassed once it goes live). And since the scores in grid_results_df are unreliable, the top-ranked hyperparameters won't necessarily beat lower-ranked ones on the test set. The instructor emphasized once more that one goal of this course is to keep teaching us: do not oversell your model's results.
- Remember, we can trust this test score because it's a huge test set.
- So, what we have here is a VERY optimistic cross-validation score.
- This is why we need a test set.
- The frustrating part is that if our dataset is small then our test set is also small.
- So, you never know.
In class the instructor compared the hyperparameters ranked 1st and 6th in grid_results_df on the test set, and indeed the 6th-ranked combination did better. (In practice we recommend using the test set only once, but using it an extra time or two is not the end of the world.)
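A toy simulation (all numbers invented, not from the lecture) shows why the candidate ranked 1st by CV score need not be the truly best setting: when every candidate's score is estimated on only ~100 examples, the noise is comparable to the real differences between candidates.

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 candidate hyperparameter settings whose TRUE accuracies lie
# between 0.70 and 0.80, each estimated on only 100 validation examples.
true_acc = rng.uniform(0.70, 0.80, size=200)
cv_score = rng.binomial(100, true_acc) / 100  # noisy 100-example estimate

best_by_cv = int(np.argmax(cv_score))   # what grid search would pick
best_truly = int(np.argmax(true_acc))   # what we'd pick with perfect information

print("picked by CV:", best_by_cv, "true accuracy:", true_acc[best_by_cv])
print("actually best:", best_truly, "true accuracy:", true_acc[best_truly])
```

The winning CV score also overshoots every candidate's true accuracy, for the same reason as before: it is a maximum over many noisy estimates.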
If we fix the value of C, we can plot max_df against the various scores:
# WARNING: this cell is very slow to run (because the test set is very large here)
C = 0.01
test_scores = []
for max_df in param_grid["countvec__max_df"]:
    pipe = Pipeline([
        ('countvec', CountVectorizer(max_df=max_df, **countvec_fixed_params)),
        ('lr', LogisticRegression(C=C, **lr_fixed_params))])
    pipe.fit(X_train_imdb_raw, y_train_imdb)
    test_scores.append(pipe.score(X_test_imdb_raw, y_test_imdb))
r = grid_results_df[grid_results_df["param_lr__C"] == C].sort_values(by=["param_countvec__max_df"])
plt.plot(r["param_countvec__max_df"], r["mean_test_score"], label="cv")
plt.plot(r["param_countvec__max_df"], r["mean_train_score"], label="train")
best_param_cv = r.sort_values(by=["rank_test_score"]).iloc[0]["param_countvec__max_df"]
best_val_cv = r.sort_values(by=["rank_test_score"]).iloc[0]["mean_test_score"]
plt.plot(best_param_cv, best_val_cv, 'b*', markersize=20)
plt.plot(param_grid["countvec__max_df"], test_scores, label="test")
best_param_test = param_grid["countvec__max_df"][np.argmax(test_scores)]
best_val_test = np.max(test_scores)
plt.plot(best_param_test, best_val_test, 'g*', markersize=20)
plt.legend();
plt.title("C = %s" % C);
The size of your dataset
- So, a very important factor here is how much data you have.
- With infinite amounts of training data, overfitting would not be a problem and you could have your test score = your train score.
- This is a subtle point that takes time to digest.
- But think of it this way: overfitting happens because you only see a bit of data and you learn patterns that are overly specific to your sample.
- If you saw "all" the data, then the notion of "overly specific" would not apply.
- So, more data will make your test score better.
- But furthermore, it will make your score estimates less noisy.
- Remember, train_test_split is a random split.
- What if you split differently?
- You don't want all your insights to be specific to your split - be careful!
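To see how much the split alone matters, here is a small sketch on synthetic data (make_classification and all parameters below are illustrative, not from the lecture): the same model, the same dataset, only the random_state of the split changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 500-row binary classification problem (hypothetical settings).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print("scores across 10 splits:", [round(s, 2) for s in split_scores])
print("spread:", max(split_scores) - min(split_scores))
```

The spread across seeds is the noise your single split hides from you; with 500 rows it is large enough to change which model "looks" best.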
What to do if your test score is much lower than your cross-validation score:
- For one thing, you can more realistically report this to others.
- You can try a few different things; using the test set a couple of times is not the end of the world.
- I suggest trying simpler models and being conservative when communicating results.