Motivation: why use pipelines?
In the last lecture, the code did not use cross-validation. The reason: cross-validation involves splitting the training data into folds, and with our pre-vectorized features that splitting would violate the golden rule.
Golden Rule of ML: the test data should not influence the training process in any way. If we violate the Golden Rule, our test score will be overly optimistic!
The test data simulates future, never-before-seen data; because it lies in the future, it cannot possibly influence a training process that happened before it.
Let's revisit last lecture's code to understand why using cross_val_score there would be a problem:
Note: always call train_test_split immediately after reading in the data, and only then do EDA and other steps. Never do EDA before train_test_split (otherwise the test set is no longer truly unseen data and has, in a sense, influenced the training process).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

imdb_df = pd.read_csv('data/imdb_master.csv', index_col=0, encoding="ISO-8859-1")
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]
imdb_df = imdb_df.sample(frac=0.2, random_state=999) # Take a subsample of the dataset for speed
# We want to split right away - better not even look at summary stats of the test data, or even eyeball it.
# so that we do not break the golden rule
imdb_train, imdb_test = train_test_split(imdb_df, random_state=123)
X_train_imdb_raw = imdb_train['review']
y_train_imdb = imdb_train['label']
X_test_imdb_raw = imdb_test['review']
y_test_imdb = imdb_test['label']
vec = CountVectorizer(min_df=50, binary=True)
# !!! NOTE: this line is the root cause of why cross_val_score cannot be used below
X_train_imdb = vec.fit_transform(X_train_imdb_raw)
X_test_imdb = vec.transform(X_test_imdb_raw);
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_imdb, y_train_imdb);
lr.score(X_train_imdb, y_train_imdb) # 0.9833333333333333
lr.score(X_test_imdb, y_test_imdb) # 0.8256
# Last time, we avoided cross-validation. Why?
cross_val_score(lr, X_train_imdb, y_train_imdb) # defaults to 5-fold
# During cross-validation, X_train_imdb and y_train_imdb are split into k folds.
# Say the first iteration splits X_train_imdb into X_train_fold_1 and X_valid_fold_1,
# trains the model on X_train_fold_1, and then evaluates it on X_valid_fold_1.
# Here is the problem: the golden rule has already been broken.
# X_train_imdb is a numeric matrix produced by vec.fit_transform(X_train_imdb_raw),
# which means the reviews behind X_valid_fold_1 and those behind X_train_fold_1
# together built the CountVectorizer vocabulary before X_train_imdb was generated.
# In other words, X_valid_fold_1 influenced the training process inside
# cross-validation, violating the golden rule.
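To see the leak concretely, here is a minimal sketch on a hypothetical toy corpus (the strings and variable names are made up for illustration, not the IMDB data): fitting the CountVectorizer on training and validation text together puts validation-only words into the vocabulary, while fitting on the training fold alone does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_fold = ["good movie", "bad movie"]   # simulated training fold
valid_fold = ["wonderful acting"]          # simulated validation fold

# Leaky: the vocabulary is built on training AND validation folds together,
# which is exactly what happens when we vectorize before cross_val_score.
leaky_vec = CountVectorizer().fit(train_fold + valid_fold)
print("wonderful" in leaky_vec.vocabulary_)  # True: validation text shaped the features

# Leak-free: the vocabulary is built on the training fold only.
clean_vec = CountVectorizer().fit(train_fold)
print("wonderful" in clean_vec.vocabulary_)  # False
```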
A diagram makes this easier to see: X_valid_fold_1 acts as test data within cross-validation, yet it clearly participated in and influenced the training process:
How do we fix this? This is where pipelines come in. This lecture again uses the IMDB dataset for the demonstration.
Introduction to pipelines
Basic pipeline usage
The basic sklearn pipeline code looks like this; we put the preprocessing and the model into a single pipeline:
from sklearn.pipeline import Pipeline
countvec = CountVectorizer(min_df=50, binary=True)
lr = LogisticRegression(max_iter=1000)
# The steps of the Pipeline go into a list.
# The last step must be a model/classifier; all earlier steps are transformers.
pipe = Pipeline([
('countvec', countvec),
('lr', lr)])
# Calling .fit() on the whole pipe runs the pipeline's steps on the data in order
pipe.fit(X_train_imdb_raw, y_train_imdb); # Note that we passed in the raw text data, not the vectorized word counts
# When we call `predict` (or `score`), we also feed in the raw data:
pipe.predict(X_test_imdb_raw)
# Output:
# array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)
After calling pipe.fit(), the following actually happens behind the scenes:
- Fitting the CountVectorizer.
- Transforming the data using the fit CountVectorizer.
- Fitting the LogisticRegression on the transformed data.
Andreas Mueller, one of sklearn's core developers, shared a figure in his course COMS4995 that shows the pipeline mechanics more intuitively; the figure below assumes the first two pipeline steps are transformers:
The figure clearly shows that pipe.fit() calls, in order:
T1.fit() -> T1.transform() -> T2.fit() -> T2.transform(), and finally classifier.fit()
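The steps above can be sketched as follows. This is a hedged toy example (made-up strings and variable names) showing that pipe.fit() is equivalent to manually calling fit_transform on the transformer and then fit on the classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_raw = ["good movie", "bad movie", "great film", "awful film"]
y = ["pos", "neg", "pos", "neg"]

# Route 1: let the pipeline orchestrate the steps
pipe = Pipeline([("countvec", CountVectorizer()), ("lr", LogisticRegression())])
pipe.fit(X_raw, y)

# Route 2: equivalent manual steps: T1.fit + T1.transform, then classifier.fit
vec = CountVectorizer()
X = vec.fit_transform(X_raw)          # fit, then transform, the transformer
lr = LogisticRegression().fit(X, y)   # fit the final estimator on transformed data

# Both routes produce the same predictions on new raw text
print(pipe.predict(["good film"]))
print(lr.predict(vec.transform(["good film"])))
```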
Using pipelines for cross-validation
With pipelines in hand, we can use cross-validation without violating the golden rule.
The problematic usage from the <Motivation: why use pipelines?> section above was:
cross_val_score(lr, X_train_imdb, y_train_imdb)
# Output: the code runs, but it violates the golden rule
# array([0.82866667, 0.836 , 0.83733333, 0.83266667, 0.834 ])
Now, with a pipeline, the correct usage is:
# cross_val_score calls the pipe's `fit` and `score` methods.
# Every validation fold in the k-fold cross-validation is now genuinely unseen data
cross_val_score(pipe, X_train_imdb_raw, y_train_imdb)
# Output:
# array([0.82666667, 0.824 , 0.83133333, 0.83066667, 0.83533333])
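As a sketch of what cross_val_score does with a pipeline under the hood, the loop below (toy data and illustrative names, not the IMDB variables) clones the unfitted pipeline for each fold and fits it on that fold's training portion only, so the vectorizer never sees the validation fold:

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

X_raw = np.array(["good movie", "bad movie", "great film", "awful film",
                  "nice plot", "boring plot"])
y = np.array(["pos", "neg", "pos", "neg", "pos", "neg"])

pipe = Pipeline([("countvec", CountVectorizer()), ("lr", LogisticRegression())])

scores = []
for train_idx, valid_idx in KFold(n_splits=3).split(X_raw):
    fold_pipe = clone(pipe)                        # fresh, unfitted copy per fold
    fold_pipe.fit(X_raw[train_idx], y[train_idx])  # vectorizer fit on this fold's training text only
    scores.append(fold_pipe.score(X_raw[valid_idx], y[valid_idx]))
print(scores)  # one accuracy per fold, just like cross_val_score(pipe, ...)
```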
In practice, we should build good habits and use sklearn and similar tools correctly.