Motivation: why use pipelines?
In the last lecture, the code did not use cross-validation. The reason: cross-validation involves splitting the training data into folds, and with our pre-vectorized features that splitting would violate the golden rule.
Golden Rule of ML: the test data should not influence the training process in any way. If we violate the Golden Rule, our test score will be overly optimistic!
The test data simulates future, never-before-seen data; because it lies in the future, it cannot possibly influence a training process that happened before it.
Let's revisit last lecture's code to understand why using cross_val_score there would be a problem:
Note: always call train_test_split immediately after reading in the data, and only then do EDA and other steps. Never do EDA before train_test_split (otherwise the test set is no longer truly unseen data and has, in a sense, influenced the training process).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

imdb_df = pd.read_csv('data/imdb_master.csv', index_col=0, encoding="ISO-8859-1")
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]
imdb_df = imdb_df.sample(frac=0.2, random_state=999) # Take a subsample of the dataset for speed
# We want to split right away - better not even look at summary stats of the test data, or even eyeball it.
# so that we do not break the golden rule
imdb_train, imdb_test = train_test_split(imdb_df, random_state=123)
X_train_imdb_raw = imdb_train['review']
y_train_imdb = imdb_train['label']
X_test_imdb_raw = imdb_test['review']
y_test_imdb = imdb_test['label']
vec = CountVectorizer(min_df=50, binary=True)
# !!! NOTE: this line is the root cause of why cross_val_score cannot be used below
X_train_imdb = vec.fit_transform(X_train_imdb_raw)
X_test_imdb = vec.transform(X_test_imdb_raw);
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_imdb, y_train_imdb);
lr.score(X_train_imdb, y_train_imdb) # 0.9833333333333333
lr.score(X_test_imdb, y_test_imdb) # 0.8256
# Last time, we avoided cross-validation. Why?
cross_val_score(lr, X_train_imdb, y_train_imdb) # defaults to 5-fold
# During cross-validation, X_train_imdb and y_train_imdb are split into k folds.
# Say the first iteration splits X_train_imdb into X_train_fold_1 and X_valid_fold_1,
# trains the model on X_train_fold_1, and then evaluates it on X_valid_fold_1.
# Here is the problem: the golden rule has already been broken.
# X_train_imdb is a numeric matrix produced by vec.fit_transform(X_train_imdb_raw),
# which means the reviews behind X_valid_fold_1 and those behind X_train_fold_1
# together built the CountVectorizer vocabulary before X_train_imdb was generated.
# In other words, X_valid_fold_1 influenced the training process inside
# cross-validation, violating the golden rule.
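To see the leak concretely, here is a minimal sketch on a hypothetical toy corpus (the strings and variable names are made up for illustration, not the IMDB data): fitting the CountVectorizer on training and validation text together puts validation-only words into the vocabulary, while fitting on the training fold alone does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_fold = ["good movie", "bad movie"]   # simulated training fold
valid_fold = ["wonderful acting"]          # simulated validation fold

# Leaky: the vocabulary is built on training AND validation folds together,
# which is exactly what happens when we vectorize before cross_val_score.
leaky_vec = CountVectorizer().fit(train_fold + valid_fold)
print("wonderful" in leaky_vec.vocabulary_)  # True: validation text shaped the features

# Leak-free: the vocabulary is built on the training fold only.
clean_vec = CountVectorizer().fit(train_fold)
print("wonderful" in clean_vec.vocabulary_)  # False
```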
A diagram makes this easier to see: X_valid_fold_1 acts as test data within cross-validation, yet it clearly participated in and influenced the training process:
How do we fix this? This is where pipelines come in. This lecture again uses the IMDB dataset for the demonstration.
Introduction to pipelines
Basic pipeline usage
The basic sklearn pipeline code looks like this; we put the preprocessing and the model into a single pipeline:
from sklearn.pipeline import Pipeline
countvec = CountVectorizer(min_df=50, binary=True)
lr = LogisticRegression(max_iter=1000)
# The steps of the Pipeline go into a list.
# The last step must be a model/classifier; all earlier steps are transformers.
pipe = Pipeline([
('countvec', countvec),
('lr', lr)])
# Calling .fit() on the whole pipe runs the pipeline's steps on the data in order
pipe.fit(X_train_imdb_raw, y_train_imdb); # Note that we passed in the raw text data, not the vectorized word counts
# When we call `predict` (or `score`), we also feed in the raw data:
pipe.predict(X_test_imdb_raw)
# Output:
# array(['pos', 'pos', 'pos', ..., 'pos', 'pos', 'pos'], dtype=object)
After calling pipe.fit(), the following actually happens behind the scenes:
- Fitting the CountVectorizer.
- Transforming the data using the fit CountVectorizer.
- Fitting the LogisticRegression on the transformed data.
Andreas Mueller, one of sklearn's core developers, shared a figure in his course COMS4995 that shows the pipeline mechanics more intuitively; the figure below assumes the first two pipeline steps are transformers:
The figure clearly shows that pipe.fit() calls, in order:
T1.fit() -> T1.transform() -> T2.fit() -> T2.transform(), and finally classifier.fit()
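The steps above can be sketched as follows. This is a hedged toy example (made-up strings and variable names) showing that pipe.fit() is equivalent to manually calling fit_transform on the transformer and then fit on the classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_raw = ["good movie", "bad movie", "great film", "awful film"]
y = ["pos", "neg", "pos", "neg"]

# Route 1: let the pipeline orchestrate the steps
pipe = Pipeline([("countvec", CountVectorizer()), ("lr", LogisticRegression())])
pipe.fit(X_raw, y)

# Route 2: equivalent manual steps: T1.fit + T1.transform, then classifier.fit
vec = CountVectorizer()
X = vec.fit_transform(X_raw)          # fit, then transform, the transformer
lr = LogisticRegression().fit(X, y)   # fit the final estimator on transformed data

# Both routes produce the same predictions on new raw text
print(pipe.predict(["good film"]))
print(lr.predict(vec.transform(["good film"])))
```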
Using pipelines for cross-validation
With pipelines in hand, we can use cross-validation without violating the golden rule.
The problematic usage from the <Motivation: why use pipelines?> section above was:
cross_val_score(lr, X_train_imdb, y_train_imdb)
# Output: the code runs, but it violates the golden rule
# array([0.82866667, 0.836 , 0.83733333, 0.83266667, 0.834 ])
Now, with a pipeline, the correct usage is:
# cross_val_score calls the pipe's `fit` and `score` methods.
# Every validation fold in the k-fold cross-validation is now genuinely unseen data
cross_val_score(pipe, X_train_imdb_raw, y_train_imdb)
# Output:
# array([0.82666667, 0.824 , 0.83133333, 0.83066667, 0.83533333])
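As a sketch of what cross_val_score does with a pipeline under the hood, the loop below (toy data and illustrative names, not the IMDB variables) clones the unfitted pipeline for each fold and fits it on that fold's training portion only, so the vectorizer never sees the validation fold:

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

X_raw = np.array(["good movie", "bad movie", "great film", "awful film",
                  "nice plot", "boring plot"])
y = np.array(["pos", "neg", "pos", "neg", "pos", "neg"])

pipe = Pipeline([("countvec", CountVectorizer()), ("lr", LogisticRegression())])

scores = []
for train_idx, valid_idx in KFold(n_splits=3).split(X_raw):
    fold_pipe = clone(pipe)                        # fresh, unfitted copy per fold
    fold_pipe.fit(X_raw[train_idx], y[train_idx])  # vectorizer fit on this fold's training text only
    scores.append(fold_pipe.score(X_raw[valid_idx], y[valid_idx]))
print(scores)  # one accuracy per fold, just like cross_val_score(pipe, ...)
```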
In practice, we should build good habits and use sklearn and similar tools correctly.