Lecture 06 Part 2: Encoding Categorical Variables


Many datasets contain categorical variables, i.e. feature columns whose values are categories. sklearn models cannot consume such columns directly, so we need to preprocess them first.

The dataset (adult census data, predicting income)

This lecture uses another classic dataset: the adult census data, where the task is to predict income.

import pandas as pd
from sklearn.model_selection import train_test_split

census = pd.read_csv('data/adult.csv')
# make it a habit: do the train_test_split right at the start
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)

EDA (exploratory data analysis)

In practice, after receiving a new dataset, EDA is an important first step that gives us a feel for the data. You can explore it manually from various angles with plain pandas, or use newer tools such as pandas_profiling to make EDA more convenient.

# then run EDA on the train set only
census_train.head()

(screenshot: census_train.head() output)

census_train.shape  
(26048, 15)


census_train.columns
# Output:
Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')
      
      
# the target column to predict: "income"
census_train["income"].value_counts()

# Output:
<=50K    19810
>50K      6238
Name: income, dtype: int64

Getting a baseline with DummyClassifier:

from sklearn.dummy import DummyClassifier

dc = DummyClassifier(strategy='prior')
dc.fit(None, census_train["income"])   # DummyClassifier ignores X, so None is accepted here
dc.score(None, census_train["income"]) 
# 0.7605190417690417
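To see why this baseline equals 0.76, note that `strategy='prior'` always predicts the majority class, so its accuracy is just the majority-class fraction from `value_counts()`. A quick sanity check on a tiny made-up label vector (not the census data):

```python
# DummyClassifier(strategy='prior') always predicts the majority class,
# so its accuracy equals the majority-class fraction.
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

y_toy = pd.Series(["<=50K"] * 3 + [">50K"] * 1)

dc = DummyClassifier(strategy="prior")
X_placeholder = np.zeros((len(y_toy), 1))  # DummyClassifier ignores X entirely
dc.fit(X_placeholder, y_toy)
baseline = dc.score(X_placeholder, y_toy)

majority_fraction = y_toy.value_counts(normalize=True).max()
print(baseline, majority_fraction)  # 0.75 0.75
```

On the census train set, the same reasoning gives 19810 / 26048 ≈ 0.7605.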

Doing EDA with pandas_profiling

Installation:

pip install pandas-profiling
or
conda install -c conda-forge pandas-profiling

Usage:

from pandas_profiling import ProfileReport

profile = ProfileReport(census_train, title='Pandas Profiling Report') #, minimal=True)

# profile.to_file('profile_report.html')

# This next line can take a while...
profile.to_notebook_iframe();

You can explore the report exported by pandas_profiling on your own.







Encoding categorical variables

Because the data contains non-numeric columns, we cannot fit a logistic regression on it directly:

X_train_census = census_train.drop(columns=["income"])
y_train_census = census_train["income"]

This calls for preprocessing:

numeric_features = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
target_columns = 'income'

Going from easy to hard: for a first model, we can simply drop all the categorical columns (not recommended in real work, since categorical columns usually carry useful information):

from sklearn.linear_model import LogisticRegression

census_train_numeric = X_train_census.drop(columns=categorical_features)
census_train_numeric.head()

lr = LogisticRegression()
lr.fit(census_train_numeric, y_train_census)

Ordinal encoding (occasionally recommended)

We can use sklearn's OrdinalEncoder, which assigns an integer to every category in a categorical column:

from sklearn.preprocessing import OrdinalEncoder

# like CountVectorizer, this is a transformer
oe = OrdinalEncoder(dtype=int)

transformed = oe.fit_transform(X_train_census[categorical_features])

(screenshot: the transformed integer array)

transformed = pd.DataFrame(data=transformed, columns=categorical_features, index=X_train_census.index)

# we can see the mapping from strings to integers here:
oe.categories_ # the workclass column contains '?' values; we'll learn how to handle those next lecture

# merge with the numeric columns
census_train_ord = pd.concat((census_train_numeric, transformed), axis=1)
census_train_ord

(screenshot: census_train_ord)

At this point every column is in a format sklearn accepts, so we can run logistic regression again:

lr = LogisticRegression()
lr.fit(census_train_ord, y_train_census)

We can already spot the problem: ordinal encoding forces an order onto the categorical columns, and for many of them that order makes no sense. Take native.country: sklearn assigns each country an integer in lexicographic order. Training a logistic regression means learning one weight per column, so the country column gets a single weight, but you cannot claim that countries later in the alphabet correspond to higher incomes.
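A toy example (made-up values, not the census data) makes the default lexicographic assignment visible:

```python
# OrdinalEncoder assigns integers by sorted (lexicographic) category order
# by default -- an arbitrary ordering for nominal variables.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Federal-gov"]})
oe = OrdinalEncoder(dtype=int)
encoded = oe.fit_transform(toy)

print(oe.categories_[0])  # ['Federal-gov' 'Private' 'State-gov'] -- sorted order
print(encoded.ravel())    # [1 2 1 0]
```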

One-hot encoding (OHE)

One-hot encoding turns a single column into multiple columns; sklearn provides OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, dtype='int') # in sklearn >= 1.2, the argument is sparse_output=False

transformed_ohe = ohe.fit_transform(X_train_census[categorical_features])

transformed_ohe = pd.DataFrame(data=transformed_ohe, columns=ohe.get_feature_names(categorical_features), index=X_train_census.index) # in newer sklearn, use ohe.get_feature_names_out()

transformed_ohe.head()

(screenshot: transformed_ohe.head() output)

As you can see, the original workclass column has become 9 new columns:

census_train_ohe = pd.concat((census_train_numeric, transformed_ohe), axis=1)

lr = LogisticRegression()
lr.fit(census_train_ohe, y_train_census) # sklearn does not require y_train_census to be numeric, so we can use the raw y-values directly

In theory, decision trees can handle categorical variables directly, e.g. "is the treatment equal to C?" is a valid split question. sklearn's implementation just does not support this, so when using sklearn's decision trees we still need to convert to numeric.




Since sklearn offers both OneHotEncoder and OrdinalEncoder, there must be situations where OrdinalEncoder is the better fit. Consider the education column in this dataset: education levels really do carry an inherent order. With infinitely much data, a model could learn that order even from one-hot encoded columns; in reality data is limited, so handing the model our human intuition and understanding about the ordering gives it a head start, letting it do better with the data it has.

So we supply the ordering for the education column:

education_levels = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th',
                    '10th', '11th', '12th', 'HS-grad', 'Prof-school',
                    'Assoc-voc', 'Assoc-acdm', 'Some-college',
                    'Bachelors', 'Masters', 'Doctorate']
                    
# make sure we didn't miss anything:
assert set(education_levels) == set(census_train['education'].unique())

# apply ordinal encoding; note that categories expects one list per feature
oe = OrdinalEncoder(categories=[education_levels], dtype=int)
transformed = oe.fit_transform(X_train_census[['education']])
transformed = pd.DataFrame(data=transformed, columns=['education_int'], index=X_train_census.index)

pd.concat((X_train_census[['education']], transformed), axis=1)

(screenshot: education column alongside its ordinal encoding)

After the transformation, as shown above, a higher education level maps to a larger integer.

But we could just as well assign the values in reverse with education_levels[::-1]. Where logistic regression previously learned a positive coefficient it would now learn a negative one, i.e. a larger number (lower education) would predict lower income.

So all our encoding does is tell the model that this column carries an ordering; the model learns the rest by itself.
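Both directions can be sketched with a made-up three-level subset of the education categories:

```python
# categories= fixes the integer order ourselves; reversing the list simply
# flips the encoding to (len(levels) - 1) - value.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = ["HS-grad", "Bachelors", "Masters"]          # low -> high (subset)
toy = pd.DataFrame({"education": ["Masters", "HS-grad", "Bachelors"]})

oe_up = OrdinalEncoder(categories=[levels], dtype=int)         # one list per feature
oe_down = OrdinalEncoder(categories=[levels[::-1]], dtype=int)

up = oe_up.fit_transform(toy).ravel()
down = oe_down.fit_transform(toy).ravel()
print(up, down)  # [2 0 1] [0 2 1]
```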


Combining the two encodings

Putting it all together, we want to split the raw columns into the following four groups:

numeric_features = ['age', 'fnlwgt','education.num', 
'capital.gain', 'capital.loss', 'hours.per.week']

categorical_features = ['workclass', 'marital.status', 'occupation', 
'relationship', 'race', 'sex', 'native.country']

ordinal_features = ['education']

target_columns = 'income'

The raw data actually already contains an education.num column; to avoid duplication, we can drop it.







A closer look at OHE

Handling unseen categories with handle_unknown='ignore'

  • We haven't yet scored our model, let's try that now.
  • For now, to keep it simple, we'll just use the categorical features only.
  • Next class we'll put everything together with some new sklearn syntax.
X_train = census_train[categorical_features]
y_train = census_train[target_columns]

from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ('ohe', OneHotEncoder()), 
        ('lr', LogisticRegression(max_iter=1000))
        ])
        
pipe.fit(X_train, y_train)
pipe.score(X_train, y_train) # 0.8214

from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train) # this raises an error

(screenshot: error traceback about unknown categories)

# The cause: Holand-Netherlands appears only once in the entire country column.
# In one of the cross-validation splits, that row lands in the validation fold,
# so transform() then encounters an unknown category.
X_train['native.country'].value_counts()

This is very similar to what we saw with CountVectorizer: what to do when unfamiliar words appear in the validation/test set.

  • By default, CountVectorizer ignores unfamiliar words (because many words get ignored in general).
  • By default, OneHotEncoder raises an error instead, since the user probably needs to know.
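The contrast can be demonstrated on a toy column (made-up countries, not the course code): fit on some categories, then transform a value never seen during fit.

```python
# Default handle_unknown='error' raises on unseen categories;
# handle_unknown='ignore' maps them to a row of all zeros.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"country": ["Canada", "India"]})
test = pd.DataFrame({"country": ["Holand-Netherlands"]})

strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
row = lenient.transform(test).toarray()
print(raised, row)  # True [[0. 0.]]
```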

So we can fix the code:

pipe = Pipeline([ 
    ('ohe', OneHotEncoder(handle_unknown='ignore')), # an unseen country becomes a row of all zeros
    ('lr', LogisticRegression(max_iter=1000)) 
    ])
    
cross_validate(pipe, X_train, y_train)

(screenshot: cross_validate output)


  • Question: do we want this behaviour?
  • Answer: it depends.
  • The row of all 0s is sort of like an "other" category.
  • In that case, "Holland" or "Mars" or "Hogwarts" would all be treated the same.
  • Important question to ask yourself: could you get unseen values of a category during deployment?
    • E.g. if the categories are provinces/territories of Canada, we know the possible values and can just specify them.
  • If we know the categories, this might be a reasonable time to "violate the Golden Rule"(look at the test set) and just hard-code all the categories:
all_countries = census['native.country'].unique() # this peeks at the full dataset, technically violating the golden rule

ohe_cat = OneHotEncoder(categories=[all_countries]) # categories takes one list/array per feature

  • This syntax allows you to pre-define the categories
  • It's a little more complicated if you only want to do this for some of the categorical variables.
  • Next class we'll talk about a tool to make that easier.
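A hedged sketch of the hard-coded categories idea, using the hypothetical provinces example from the bullets above (names assumed, not course code):

```python
# Pre-declaring the full category list means even values absent from the
# training data get their own column, and transform cleanly later.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

provinces = ["AB", "BC", "ON", "QC"]              # pretend this is the full known list
train = pd.DataFrame({"province": ["BC", "ON"]})  # "QC" never appears in training

ohe = OneHotEncoder(categories=[provinces])       # one list per feature
encoded = ohe.fit_transform(train).toarray()
print(encoded.shape)  # (2, 4): a column for every known province

# a previously unseen (but pre-declared) value now transforms without error
qc = ohe.transform(pd.DataFrame({"province": ["QC"]})).toarray()
print(qc)  # [[0. 0. 0. 1.]]
```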




The drop='first' strategy

Another usage that gets discussed a lot in the sklearn community:

ohe_drop = OneHotEncoder(sparse=False, dtype=int, drop='first') # note the drop parameter

(screenshot: output with drop='first')

The idea behind drop is that if a row is all zeros, we can infer it belongs to the dropped category. Dropping one column therefore loses no information; it remains recoverable.
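A toy illustration on a made-up column: the first category (after sorting) loses its column, and an all-zeros row identifies it.

```python
# With drop='first', 'blue' (first after sorting) gets no column;
# an all-zeros row therefore means 'blue'.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"colour": ["red", "green", "blue", "red"]})
ohe_drop = OneHotEncoder(drop="first", dtype=int)
encoded = ohe_drop.fit_transform(toy).toarray()

print(ohe_drop.categories_[0])  # ['blue' 'green' 'red']
print(encoded)
# [[0 1]    <- red
#  [1 0]    <- green
#  [0 0]    <- blue: all zeros, still recoverable as the dropped category
#  [0 1]]   <- red
```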

People disagree about the pros and cons of drop:

Pros:

  • In certain cases, like LinearRegression (which the instructor recommends never using; later lectures cover alternatives such as ridge regression), this is really important.
  • Without dropping a column, the information is technically redundant.

Cons:

  • It prevents you from doing the handle_unknown='ignore', which is very often useful.
  • It makes the interpretation of feature importances more confusing (covered in a later lecture)
  • Occasionally the choice of which feature gets dropped does actually matter, e.g. feature selection after OHE (covered in a later lecture)

All things considered, the instructor recommends almost never using the drop='first' strategy (the cons outweigh the pros; for the later sex example, drop='if_binary' can be used instead). It is introduced here because you will often see it in other people's code.




Categorical variables with many categories

(screenshots: native.country value counts)

In this dataset, 13 people in total come from Cambodia. It is natural to ask: are 13 rows enough to learn the effect of being from Cambodia on income? After all, for a fixed amount of data, more feature columns means more risk of overfitting.

One option is to group all the rare countries into "other":

(screenshots: grouping rare countries into "other")
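The course's actual code is in the screenshots; here is a hedged reimplementation sketch of the same idea (threshold and values are assumptions), using value_counts and a frequency cutoff:

```python
# Replace categories whose count falls below a threshold with "other".
import pandas as pd

countries = pd.Series(["USA"] * 5 + ["Canada"] * 3 + ["Cambodia"], name="native.country")

min_count = 2  # hypothetical frequency threshold
counts = countries.value_counts()
rare = counts[counts < min_count].index
grouped = countries.where(~countries.isin(rare), other="other")

print(grouped.value_counts())  # USA 5, Canada 3, other 1
```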



Binary features

(screenshots: encoding the binary sex column)