Lecture 06 Part 2: Encoding Categorical Variables


Many datasets contain categorical variables, i.e. feature columns whose values are categories. sklearn models cannot consume such columns directly, so we need to preprocess them first.

The dataset (adult census data, predicting income)

This lecture uses another classic dataset: the adult census data, where the task is to predict income.

import pandas as pd
from sklearn.model_selection import train_test_split

census = pd.read_csv('data/adult.csv')
# make it a habit: do the train_test_split right at the start
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)

EDA (exploratory data analysis)

In practice, after receiving a new dataset, EDA is an important first step that gives us a feel for the data. You can explore it manually from various angles with plain pandas, or use newer tools such as pandas_profiling to make EDA more convenient.

# then run EDA on the train set only
census_train.head()

(screenshot: census_train.head() output)

census_train.shape  
(26048, 15)


census_train.columns
# Output:
Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')
      
      
# the target column to predict: "income"
census_train["income"].value_counts()

# Output:
<=50K    19810
>50K      6238
Name: income, dtype: int64

Getting a baseline with DummyClassifier:

from sklearn.dummy import DummyClassifier

dc = DummyClassifier(strategy='prior')
dc.fit(None, census_train["income"])   # DummyClassifier ignores X, so None is accepted here
dc.score(None, census_train["income"]) 
# 0.7605190417690417
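To see why this baseline equals 0.76, note that `strategy='prior'` always predicts the majority class, so its accuracy is just the majority-class fraction from `value_counts()`. A quick sanity check on a tiny made-up label vector (not the census data):

```python
# DummyClassifier(strategy='prior') always predicts the majority class,
# so its accuracy equals the majority-class fraction.
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

y_toy = pd.Series(["<=50K"] * 3 + [">50K"] * 1)

dc = DummyClassifier(strategy="prior")
X_placeholder = np.zeros((len(y_toy), 1))  # DummyClassifier ignores X entirely
dc.fit(X_placeholder, y_toy)
baseline = dc.score(X_placeholder, y_toy)

majority_fraction = y_toy.value_counts(normalize=True).max()
print(baseline, majority_fraction)  # 0.75 0.75
```

On the census train set, the same reasoning gives 19810 / 26048 ≈ 0.7605.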

Doing EDA with pandas_profiling

Installation:

pip install pandas-profiling
or
conda install -c conda-forge pandas-profiling

Usage:

from pandas_profiling import ProfileReport

profile = ProfileReport(census_train, title='Pandas Profiling Report') #, minimal=True)

# profile.to_file('profile_report.html')

# This next line can take a while...
profile.to_notebook_iframe();

You can explore the report exported by pandas_profiling on your own.







Encoding categorical variables

Because the data contains non-numeric columns, we cannot fit a logistic regression on it directly:

X_train_census = census_train.drop(columns=["income"])
y_train_census = census_train["income"]

This calls for preprocessing:

numeric_features = ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
target_columns = 'income'

Going from easy to hard: for a first model, we can simply drop all the categorical columns (not recommended in real work, since categorical columns usually carry useful information):

from sklearn.linear_model import LogisticRegression

census_train_numeric = X_train_census.drop(columns=categorical_features)
census_train_numeric.head()

lr = LogisticRegression()
lr.fit(census_train_numeric, y_train_census)

Ordinal encoding (occasionally recommended)

We can use sklearn's OrdinalEncoder, which assigns an integer to every category in a categorical column:

from sklearn.preprocessing import OrdinalEncoder

# like CountVectorizer, this is a transformer
oe = OrdinalEncoder(dtype=int)

transformed = oe.fit_transform(X_train_census[categorical_features])

(screenshot: the transformed integer array)

transformed = pd.DataFrame(data=transformed, columns=categorical_features, index=X_train_census.index)

# we can see the mapping from strings to integers here:
oe.categories_ # the workclass column contains '?' values; we'll learn how to handle those next lecture

# merge with the numeric columns
census_train_ord = pd.concat((census_train_numeric, transformed), axis=1)
census_train_ord

(screenshot: census_train_ord)

At this point every column is in a format sklearn accepts, so we can run logistic regression again:

lr = LogisticRegression()
lr.fit(census_train_ord, y_train_census)

We can already spot the problem: ordinal encoding forces an order onto the categorical columns, and for many of them that order makes no sense. Take native.country: sklearn assigns each country an integer in lexicographic order. Training a logistic regression means learning one weight per column, so the country column gets a single weight, but you cannot claim that countries later in the alphabet correspond to higher incomes.
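A toy example (made-up values, not the census data) makes the default lexicographic assignment visible:

```python
# OrdinalEncoder assigns integers by sorted (lexicographic) category order
# by default -- an arbitrary ordering for nominal variables.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Federal-gov"]})
oe = OrdinalEncoder(dtype=int)
encoded = oe.fit_transform(toy)

print(oe.categories_[0])  # ['Federal-gov' 'Private' 'State-gov'] -- sorted order
print(encoded.ravel())    # [1 2 1 0]
```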

One-hot encoding (OHE)

One-hot encoding turns a single column into multiple columns; sklearn provides OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False, dtype='int') # in sklearn >= 1.2, the argument is sparse_output=False

transformed_ohe = ohe.fit_transform(X_train_census[categorical_features])

transformed_ohe = pd.DataFrame(data=transformed_ohe, columns=ohe.get_feature_names(categorical_features), index=X_train_census.index) # in newer sklearn, use ohe.get_feature_names_out()

transformed_ohe.head()

(screenshot: transformed_ohe.head() output)

As you can see, the original workclass column has become 9 new columns:

census_train_ohe = pd.concat((census_train_numeric, transformed_ohe), axis=1)

lr = LogisticRegression()
lr.fit(census_train_ohe, y_train_census) # sklearn does not require y_train_census to be numeric, so we can use the raw y-values directly

In theory, decision trees can handle categorical variables directly, e.g. "is the treatment equal to C?" is a valid split question. sklearn's implementation just does not support this, so when using sklearn's decision trees we still need to convert to numeric.




Since sklearn offers both OneHotEncoder and OrdinalEncoder, there must be situations where OrdinalEncoder is the better fit. Consider the education column in this dataset: education levels really do carry an inherent order. With infinitely much data, a model could learn that order even from one-hot encoded columns; in reality data is limited, so handing the model our human intuition and understanding about the ordering gives it a head start, letting it do better with the data it has.

So we supply the ordering for the education column:

education_levels = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th',
                    '10th', '11th', '12th', 'HS-grad', 'Prof-school',
                    'Assoc-voc', 'Assoc-acdm', 'Some-college',
                    'Bachelors', 'Masters', 'Doctorate']
                    
# make sure we didn't miss anything:
assert set(education_levels) == set(census_train['education'].unique())

# apply ordinal encoding; note that categories expects one list per feature
oe = OrdinalEncoder(categories=[education_levels], dtype=int)
transformed = oe.fit_transform(X_train_census[['education']])
transformed = pd.DataFrame(data=transformed, columns=['education_int'], index=X_train_census.index)

pd.concat((X_train_census[['education']], transformed), axis=1)

(screenshot: education column alongside its ordinal encoding)

After the transformation, as shown above, a higher education level maps to a larger integer.

But we could just as well assign the values in reverse with education_levels[::-1]. Where logistic regression previously learned a positive coefficient it would now learn a negative one, i.e. a larger number (lower education) would predict lower income.

So all our encoding does is tell the model that this column carries an ordering; the model learns the rest by itself.
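Both directions can be sketched with a made-up three-level subset of the education categories:

```python
# categories= fixes the integer order ourselves; reversing the list simply
# flips the encoding to (len(levels) - 1) - value.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = ["HS-grad", "Bachelors", "Masters"]          # low -> high (subset)
toy = pd.DataFrame({"education": ["Masters", "HS-grad", "Bachelors"]})

oe_up = OrdinalEncoder(categories=[levels], dtype=int)         # one list per feature
oe_down = OrdinalEncoder(categories=[levels[::-1]], dtype=int)

up = oe_up.fit_transform(toy).ravel()
down = oe_down.fit_transform(toy).ravel()
print(up, down)  # [2 0 1] [0 2 1]
```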


Combining the two encodings

Putting it all together, we want to split the raw columns into the following four groups:

numeric_features = ['age', 'fnlwgt','education.num', 
'capital.gain', 'capital.loss', 'hours.per.week']

categorical_features = ['workclass', 'marital.status', 'occupation', 
'relationship', 'race', 'sex', 'native.country']

ordinal_features = ['education']

target_columns = 'income'

The raw data actually already contains an education.num column; to avoid duplication, we can drop it.







A closer look at OHE

Handling unseen categories with handle_unknown='ignore'

  • We haven't yet scored our model, let's try that now.
  • For now, to keep it simple, we'll just use the categorical features only.
  • Next class we'll put everything together with some new sklearn syntax.
X_train = census_train[categorical_features]
y_train = census_train[target_columns]

from sklearn.pipeline import Pipeline

pipe = Pipeline([
        ('ohe', OneHotEncoder()), 
        ('lr', LogisticRegression(max_iter=1000))
        ])
        
pipe.fit(X_train, y_train)
pipe.score(X_train, y_train) # 0.8214

from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train) # this raises an error

(screenshot: error traceback about unknown categories)

# The cause: Holand-Netherlands appears only once in the entire country column.
# In one of the cross-validation splits, that row lands in the validation fold,
# so transform() then encounters an unknown category.
X_train['native.country'].value_counts()

This is very similar to what we saw with CountVectorizer: what to do when unfamiliar words appear in the validation/test set.

  • By default, CountVectorizer ignores unfamiliar words (because many words get ignored in general).
  • By default, OneHotEncoder raises an error instead, since the user probably needs to know.
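The contrast can be demonstrated on a toy column (made-up countries, not the course code): fit on some categories, then transform a value never seen during fit.

```python
# Default handle_unknown='error' raises on unseen categories;
# handle_unknown='ignore' maps them to a row of all zeros.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"country": ["Canada", "India"]})
test = pd.DataFrame({"country": ["Holand-Netherlands"]})

strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
row = lenient.transform(test).toarray()
print(raised, row)  # True [[0. 0.]]
```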

So we can fix the code:

pipe = Pipeline([ 
    ('ohe', OneHotEncoder(handle_unknown='ignore')), # an unseen country becomes a row of all zeros
    ('lr', LogisticRegression(max_iter=1000)) 
    ])
    
cross_validate(pipe, X_train, y_train)

(screenshot: cross_validate output)


  • Question: do we want this behaviour?
  • Answer: it depends.
  • The row of all 0s is sort of like an "other" category.
  • In that case, "Holland" or "Mars" or "Hogwarts" would all be treated the same.
  • Important question to ask yourself: could you get unseen values of a category during deployment?
    • E.g. if the categories are provinces/territories of Canada, we know the possible values and can just specify them.
  • If we know the categories, this might be a reasonable time to "violate the Golden Rule"(look at the test set) and just hard-code all the categories:
all_countries = census['native.country'].unique() # this peeks at the full dataset, technically violating the golden rule

ohe_cat = OneHotEncoder(categories=[all_countries]) # categories takes one list/array per feature

  • This syntax allows you to pre-define the categories
  • It's a little more complicated if you only want to do this for some of the categorical variables.
  • Next class we'll talk about a tool to make that easier.
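A hedged sketch of the hard-coded categories idea, using the hypothetical provinces example from the bullets above (names assumed, not course code):

```python
# Pre-declaring the full category list means even values absent from the
# training data get their own column, and transform cleanly later.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

provinces = ["AB", "BC", "ON", "QC"]              # pretend this is the full known list
train = pd.DataFrame({"province": ["BC", "ON"]})  # "QC" never appears in training

ohe = OneHotEncoder(categories=[provinces])       # one list per feature
encoded = ohe.fit_transform(train).toarray()
print(encoded.shape)  # (2, 4): a column for every known province

# a previously unseen (but pre-declared) value now transforms without error
qc = ohe.transform(pd.DataFrame({"province": ["QC"]})).toarray()
print(qc)  # [[0. 0. 0. 1.]]
```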




The drop='first' strategy

Another usage that gets discussed a lot in the sklearn community:

ohe_drop = OneHotEncoder(sparse=False, dtype=int, drop='first') # note the drop parameter

(screenshot: output with drop='first')

The idea behind drop is that if a row is all zeros, we can infer it belongs to the dropped category. Dropping one column therefore loses no information; it remains recoverable.
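A toy illustration on a made-up column: the first category (after sorting) loses its column, and an all-zeros row identifies it.

```python
# With drop='first', 'blue' (first after sorting) gets no column;
# an all-zeros row therefore means 'blue'.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"colour": ["red", "green", "blue", "red"]})
ohe_drop = OneHotEncoder(drop="first", dtype=int)
encoded = ohe_drop.fit_transform(toy).toarray()

print(ohe_drop.categories_[0])  # ['blue' 'green' 'red']
print(encoded)
# [[0 1]    <- red
#  [1 0]    <- green
#  [0 0]    <- blue: all zeros, still recoverable as the dropped category
#  [0 1]]   <- red
```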

People disagree about the pros and cons of drop:

Pros:

  • In certain cases, like LinearRegression (which the instructor recommends never using; later lectures cover alternatives such as ridge regression), this is really important.
  • Without dropping a column, the information is technically redundant.

Cons:

  • It prevents you from doing the handle_unknown='ignore', which is very often useful.
  • It makes the interpretation of feature importances more confusing (covered in a later lecture)
  • Occasionally the choice of which feature gets dropped does actually matter, e.g. feature selection after OHE (covered in a later lecture)

All things considered, the instructor recommends almost never using the drop='first' strategy (the cons outweigh the pros; for the later sex example, drop='if_binary' can be used instead). It is introduced here because you will often see it in other people's code.




Categorical variables with many categories

(screenshots: native.country value counts)

In this dataset, 13 people in total come from Cambodia. It is natural to ask: are 13 rows enough to learn the effect of being from Cambodia on income? After all, for a fixed amount of data, more feature columns means more risk of overfitting.

One option is to group all the rare countries into "other":

(screenshots: grouping rare countries into "other")
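The course's actual code is in the screenshots; here is a hedged reimplementation sketch of the same idea (threshold and values are assumptions), using value_counts and a frequency cutoff:

```python
# Replace categories whose count falls below a threshold with "other".
import pandas as pd

countries = pd.Series(["USA"] * 5 + ["Canada"] * 3 + ["Cambodia"], name="native.country")

min_count = 2  # hypothetical frequency threshold
counts = countries.value_counts()
rare = counts[counts < min_count].index
grouped = countries.where(~countries.isin(rare), other="other")

print(grouped.value_counts())  # USA 5, Canada 3, other 1
```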



Binary features

(screenshots: encoding the binary sex column)