Many datasets contain categorical variables, i.e. category-valued feature columns. sklearn models cannot consume such data directly, so it must be preprocessed first.
Dataset (adult census data, predicting income)
This lecture uses another classic dataset: adult census data, used to predict income.
census = pd.read_csv('data/adult.csv')
# Get into the habit: train_test_split right at the start
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)
EDA (Exploratory Data Analysis)
In practice, EDA is an important first step after receiving a new dataset: it gives us a feel for the data. We can inspect the dataset from various angles manually with plain pandas, or use newer tools such as pandas_profiling to make EDA more convenient.
# Then do EDA on the train set only
census_train.head()
census_train.shape
(26048, 15)
census_train.columns
# Output:
Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
'marital.status', 'occupation', 'relationship', 'race', 'sex',
'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
'income'],
dtype='object')
# Target column to predict: "income"
census_train["income"].value_counts()
# Output:
<=50K 19810
>50K 6238
Name: income, dtype: int64
Use DummyClassifier to get a baseline:
dc = DummyClassifier(strategy='prior')
dc.fit(census_train.drop(columns=["income"]), census_train["income"])
dc.score(census_train.drop(columns=["income"]), census_train["income"])
# 0.7605190417690417
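The baseline score can be sanity-checked by hand: with strategy='prior', DummyClassifier always predicts the majority class, so its accuracy is just the majority-class fraction from the value_counts output above. A quick check:

```python
# Majority-class fraction from the value_counts above:
# "<=50K" appears 19810 times out of 19810 + 6238 training rows.
baseline = 19810 / (19810 + 6238)
print(round(baseline, 4))  # 0.7605
```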
Doing EDA with pandas_profiling
Installation:
pip install pandas-profiling
or
conda install -c conda-forge pandas-profiling
Usage:
from pandas_profiling import ProfileReport
profile = ProfileReport(census_train, title='Pandas Profiling Report') #, minimal=True)
# profile.to_file('profile_report.html')
# This next line can take a while...
profile.to_notebook_iframe();
Feel free to explore the report exported by pandas_profiling on your own.
Encoding categorical variables
Because the data contains non-numeric columns, we cannot fit a logistic regression on it directly.
X_train_census = census_train.drop(columns=["income"])
y_train_census = census_train["income"]
This is where preprocessing comes in.
numeric_features = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
target_columns = 'income'
Let's go from easy to hard. For a first version of the model, we can simply drop all the categorical columns (not recommended in real work, because categorical columns usually carry useful information).
census_train_numeric = X_train_census.drop(columns=categorical_features)
census_train_numeric.head()
lr = LogisticRegression()
lr.fit(census_train_numeric, y_train_census)
Ordinal encoding (occasionally recommended)
We can use sklearn's OrdinalEncoder, which assigns an integer to each category of a categorical variable.
from sklearn.preprocessing import OrdinalEncoder
# like CountVectorizer, this is also a transformer
oe = OrdinalEncoder(dtype=int)
transformed = oe.fit_transform(X_train_census[categorical_features])
transformed = pd.DataFrame(data=transformed, columns=categorical_features, index=X_train_census.index)
# we can see the mapping from strings to integers here:
oe.categories_ # the workclass column contains '?' values; we'll cover how to handle them next lecture
# merge back with the numeric columns
census_train_ord = pd.concat((census_train_numeric, transformed), axis=1)
census_train_ord
Now every column is in a format sklearn accepts; let's run logistic regression again.
lr = LogisticRegression()
lr.fit(census_train_ord, y_train_census)
At this point we can already see the problem: ordinal encoding forces an ordering onto each column, and for many columns that ordering makes no sense. Take the country column: sklearn assigns integers to countries in lexicographic order. Logistic regression then learns one weight per column, including a weight for country, but there is no reason a country later in alphabetical order should imply a higher income.
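The alphabetical assignment is easy to see on a toy example (the country values below are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder sorts the unique values lexicographically and numbers them,
# imposing an ordering that has no real meaning for countries.
toy = pd.DataFrame({"country": ["Canada", "India", "Canada", "Mexico"]})
oe = OrdinalEncoder(dtype=int)
codes = oe.fit_transform(toy).ravel()
print(codes)  # [0 1 0 2] -- Canada=0, India=1, Mexico=2
```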
OHE (one-hot encoding)
One-hot encoding expands a single column into multiple columns; sklearn provides OneHotEncoder for this.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, dtype='int') # newer sklearn versions use sparse_output=False instead
transformed_ohe = ohe.fit_transform(X_train_census[categorical_features])
transformed_ohe = pd.DataFrame(data=transformed_ohe, columns=ohe.get_feature_names(categorical_features), index=X_train_census.index) # newer sklearn: get_feature_names_out
transformed_ohe.head()
As you can see, the original workclass column has become 9 new columns.
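A toy example makes the column expansion concrete (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Each distinct category becomes its own 0/1 indicator column,
# so a column with k categories turns into k columns.
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
ohe = OneHotEncoder(dtype=int)
dense = ohe.fit_transform(toy).toarray()  # output is sparse by default
print(dense)
# [[1 0]
#  [0 1]
#  [1 0]]
```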
census_train_ohe = pd.concat((census_train_numeric, transformed_ohe), axis=1)
lr = LogisticRegression()
lr.fit(census_train_ohe, y_train_census) # sklearn does not require y_train_census to be numeric, so we can use the raw y-values directly
In theory, decision trees can handle categorical variables directly, e.g. "is the treatment equal to C?" can be a splitting question; however, sklearn's implementation does not support this, so when using sklearn's decision trees we still need to convert to numeric.
Since sklearn offers both OneHotEncoder and OrdinalEncoder, there must be situations where OrdinalEncoder is the better choice. Consider the education column in this dataset: education levels really do carry an inherent order. With infinitely much data, a model could eventually learn this ordering even from a one-hot encoding, but real datasets are not that large. If we can hand the model some of our human intuition and understanding about a column, we give it a head start, and it can do better with limited data.
So let's supply the ordering for the education column:
education_levels = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th',
'10th', '11th', '12th', 'HS-grad', 'Prof-school',
'Assoc-voc', 'Assoc-acdm', 'Some-college',
'Bachelors', 'Masters', 'Doctorate']
# make sure we didn't miss anything:
assert set(education_levels) == set(census_train['education'].unique())
# use ordinal encoding:
oe = OrdinalEncoder(categories=[education_levels], dtype=int)  # categories expects one list per column
transformed = oe.fit_transform(X_train_census[['education']])
transformed = pd.DataFrame(data=transformed, columns=['education_int'], index=X_train_census.index)
pd.concat((X_train_census[['education']], transformed), axis=1)
After the transformation, as shown above, a higher education level maps to a larger number.
Here higher education means a larger number, but we could just as well assign the values in reverse with education_levels[::-1]. The only difference is that where logistic regression previously learned a positive coefficient, it would now learn a negative one: a larger number (lower education) would predict lower income. All the encoding does is tell the model that this column carries an ordering; the model learns the rest on its own.
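This point can be sketched with an abbreviated (hypothetical) list of levels:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Reversing the category list reverses the codes; a linear model would
# simply learn a coefficient of the opposite sign.
levels = ["HS-grad", "Bachelors", "Masters"]  # abbreviated for illustration
toy = pd.DataFrame({"education": ["Masters", "HS-grad", "Bachelors"]})
fwd = OrdinalEncoder(categories=[levels], dtype=int).fit_transform(toy).ravel()
rev = OrdinalEncoder(categories=[levels[::-1]], dtype=int).fit_transform(toy).ravel()
print(fwd, rev)  # [2 0 1] [0 2 1]
```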
Combining the two encodings
Putting it all together, we want to split the original columns into the following four groups:
numeric_features = ['age', 'fnlwgt','education.num',
'capital.gain', 'capital.loss', 'hours.per.week']
categorical_features = ['workclass','marital.status', 'occupation',
'relationship', 'race', 'sex', 'native.country']
ordinal_features = ['education']
target_column = 'income'
The raw data actually already contains an education.num column; to avoid duplication, we can drop it.
OHE in depth
Handling unseen categories: handle_unknown='ignore'
- We haven't yet scored our model, let's try that now.
- For now, to keep it simple, we'll just use the categorical features only.
- Next class we'll put everything together with some new sklearn syntax.
X_train = census_train[categorical_features]
y_train = census_train[target_column]
pipe = Pipeline([
('ohe', OneHotEncoder()),
('lr', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)
pipe.score(X_train, y_train) # 0.8214
cross_validate(pipe, X_train, y_train) # this raises an error
# The reason: Holand-Netherlands appears only once in the whole country column,
# so in one of the cross-validation splits it lands in the validation fold,
# and transform then encounters an unknown category
X_train['native.country'].value_counts()
This is very similar to what we saw with CountVectorizer: what to do when an unseen word appears in the validation/test set.
- By default, CountVectorizer ignores unseen words (because many words get ignored in general).
- By default, OneHotEncoder raises an error, because the user probably needs to know.
So we can fix the code:
pipe = Pipeline([
('ohe', OneHotEncoder(handle_unknown='ignore')), # an unseen country becomes a row of all zeros
('lr', LogisticRegression(max_iter=1000))
])
cross_validate(pipe, X_train, y_train)
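The all-zeros behaviour can be reproduced on a toy example (made-up values):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on two countries, then transform a country never seen at fit time.
train = pd.DataFrame({"country": ["Canada", "India"]})
test = pd.DataFrame({"country": ["Holland"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)
print(ohe.transform(test).toarray())  # [[0. 0.]] -- a row of all zeros
```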
- Question: do we want this behaviour?
- Answer: it depends.
- The row of all 0s is sort of like an "other" category.
- In that case, "Holland" or "Mars" or "Hogwarts" would all be treated the same.
- Important question to ask yourself: could you get unseen values of a category during deployment?
- E.g. if the categories are provinces/territories of Canada, we know the possible values and we can just specify them.
- If we know the categories, this might be a reasonable time to "violate the Golden Rule"(look at the test set) and just hard-code all the categories:
all_countries = census['native.country'].unique() # technically violates the golden rule
ohe_cat = OneHotEncoder(categories=[all_countries])  # categories expects one list per column
- This syntax allows you to pre-define the categories.
- It's a little more complicated if you only want to do this for some of the categorical variables.
- Next class we'll talk about a tool to make that easier.
The drop='first' strategy
Another usage that is much discussed in the sklearn community:
ohe_drop = OneHotEncoder(sparse=False, dtype=int, drop='first') # the drop parameter
The idea behind drop: if a row is all zeros, we can infer that it belongs to the dropped category, so dropping one column loses no information; the value can always be reconstructed.
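A sketch of how the dropped category can be inferred (toy values):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# With drop='first', the first category ("Female" here, alphabetically)
# gets no column of its own; it shows up as an all-zeros row instead.
toy = pd.DataFrame({"sex": ["Female", "Male", "Female"]})
ohe = OneHotEncoder(drop="first", dtype=int)
out = ohe.fit_transform(toy).toarray()
print(out)  # only a "Male" column remains: [[0], [1], [0]]
```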
People disagree about the pros and cons of drop:
Pros:
- In certain cases, like LinearRegression (the instructor suggests never using it; later lectures cover alternatives such as ridge regression), this is really important.
- Without dropping a column, the information is technically redundant.
Cons:
- It prevents you from using handle_unknown='ignore', which is very often useful.
- It makes the interpretation of feature importances more confusing (covered in a later lecture).
- Occasionally the choice of which feature gets dropped does actually matter, e.g. feature selection after OHE (covered in a later lecture).
All things considered, the instructor recommends almost never using drop='first' (the costs outweigh the benefits; for binary columns like sex in the example later, drop='if_binary' can be used instead). It is covered here because you will often see it in other people's code.
Categorical variables with many categories
In this dataset, a total of 13 people come from Cambodia, which raises the question: are 13 rows enough to learn the effect of being from Cambodia on income? After all, for a fixed amount of data, more feature columns means easier overfitting.
One approach is to lump the rare countries into an "other" category:
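A minimal sketch of this idea, using a hypothetical helper lump_rare and an arbitrary threshold of 20 occurrences (newer sklearn versions can also do this inside OneHotEncoder via its min_frequency parameter):

```python
import pandas as pd

def lump_rare(series, min_count=20, other_label="other"):
    """Replace values appearing fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Toy data: 30 US rows, 13 Cambodia rows, 1 Netherlands row
s = pd.Series(["United-States"] * 30 + ["Cambodia"] * 13 + ["Holand-Netherlands"])
print(lump_rare(s).value_counts())  # United-States: 30, other: 14
```

The threshold would normally be computed from the training split only, so the same mapping can be reapplied to the test split.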