Data Preprocessing

1 Static Continuous Variables

1.1 Discretization

1.1.1 Binarization

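from sklearn.preprocessing import Binarizer
# sample_columns is assumed to be a 1-D NumPy array of feature values
model = Binarizer(threshold=6) # values > 6 become 1, all others become 0
result = model.fit_transform(sample_columns.reshape(-1,1)).reshape(-1)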
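
To make the effect of the threshold concrete, here is a minimal sketch on a made-up array. The values in `sample_columns` below are illustrative assumptions, not data from this tutorial. `Binarizer` maps values strictly greater than the threshold to 1 and everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical feature values, chosen only for illustration
sample_columns = np.array([2.0, 6.0, 6.5, 10.0])

model = Binarizer(threshold=6)
# Binarizer expects a 2-D array, hence reshape(-1, 1); flatten back afterwards
result = model.fit_transform(sample_columns.reshape(-1, 1)).reshape(-1)

print(result)  # [0. 0. 1. 1.]
```

Note that the value exactly equal to the threshold (6.0) is mapped to 0.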

1.1.2 Binning

Equal-width binning

from sklearn.preprocessing import KBinsDiscretizer
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # use 5 bins
model.fit(train_set.reshape(-1,1)) # fit on the training set
transformed_train = model.transform(train_set.reshape(-1,1)).reshape(-1) # transform the training set
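
As a rough illustration of what `strategy='uniform'` learns, the sketch below inspects the fitted bin edges and then reuses them on unseen data. The arrays `train_set` and `test_set` are hypothetical values assumed for this example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical training and test data (assumed for illustration)
train_set = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 10.0])
test_set = np.array([0.5, 4.5, 9.9])

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
model.fit(train_set.reshape(-1, 1))

# Edges learned from the training set: 5 bins of equal width over [0, 10]
edges = model.bin_edges_[0]
print(edges)

# Reuse the fitted edges on unseen data instead of refitting
transformed_test = model.transform(test_set.reshape(-1, 1)).reshape(-1)
print(transformed_test)
```

The edges depend only on the training minimum and maximum, so the bins can hold very different numbers of samples when the data is skewed.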

Quantile binning

from sklearn.preprocessing import KBinsDiscretizer
model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # use 5 bins
model.fit(train_set.reshape(-1,1)) # fit on the training set
transformed_train = model.transform(train_set.reshape(-1,1)).reshape(-1) # transform the training set
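
The following sketch, on assumed right-skewed data, shows the main property of quantile binning: each bin receives roughly the same number of training samples, regardless of how the values are spread. The data (squares of 1 to 100) is a hypothetical choice for illustration only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical right-skewed data: squares of 1..100 (assumed for illustration)
train_set = np.arange(1, 101, dtype=float) ** 2

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
model.fit(train_set.reshape(-1, 1))
transformed_train = model.transform(train_set.reshape(-1, 1)).reshape(-1)

# Count how many samples land in each of the 5 bins
counts = np.bincount(transformed_train.astype(int), minlength=5)
print(counts)  # roughly equal counts per bin
```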

1.2 Scaling

1.2.1 Standard Scaling (Z-score standardization)

Formula: $X' = \frac{X - \mu}{\sigma}$
where $\mu$ is the mean of $X$ and $\sigma$ is the standard deviation of $X$.

from sklearn.preprocessing import StandardScaler
model = StandardScaler()
model.fit(train_set.reshape(-1,1)) # fit on the training set
transformed_train = model.transform(train_set.reshape(-1,1)).reshape(-1) # transform the training set
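
As a sanity check on the formula, the sketch below recomputes the Z-scores by hand and compares them with the `StandardScaler` output. The arrays are hypothetical, and `test_set` is an assumed hold-out sample added to show that unseen data is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data (assumed for illustration)
train_set = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
test_set = np.array([0.0, 5.0])

model = StandardScaler()
model.fit(train_set.reshape(-1, 1))

# Manual Z-score with the training mean and population standard deviation
mu = train_set.mean()
sigma = train_set.std()  # NumPy's default ddof=0 matches StandardScaler
manual_train = (train_set - mu) / sigma
manual_test = (test_set - mu) / sigma  # test data reuses the training statistics

transformed_train = model.transform(train_set.reshape(-1, 1)).reshape(-1)
transformed_test = model.transform(test_set.reshape(-1, 1)).reshape(-1)

print(np.allclose(manual_train, transformed_train))
print(np.allclose(manual_test, transformed_test))
```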

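1.2.2 MinMaxScaler: Min-Max Scaling (scaling to a value range)

Suppose we want to scale the feature values to the interval (a, b).
Formula: $X' = \frac{X - Min}{Max - Min} \cdot (b - a) + a$
where $Min$ is the smallest value in $X$ and $Max$ is the largest value in $X$.
This scaling method is also sensitive to outliers, because outliers can shift both $Min$ and $Max$.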

from sklearn.preprocessing import MinMaxScaler
model = MinMaxScaler(feature_range=(0,1)) # set the target interval to (0,1)
model.fit(train_set.reshape(-1,1)) # fit on the training set
transformed_train = model.transform(train_set.reshape(-1,1)).reshape(-1) # transform the training set
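
The sketch below checks the (a, b) formula against `MinMaxScaler` for an assumed target interval of (-1, 1), then adds a single made-up outlier to show how it compresses the remaining values. All data values here are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data and target interval (assumed for illustration)
train_set = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a, b = -1.0, 1.0

model = MinMaxScaler(feature_range=(a, b))
model.fit(train_set.reshape(-1, 1))
transformed_train = model.transform(train_set.reshape(-1, 1)).reshape(-1)

# Manual version of the formula above
x_min, x_max = train_set.min(), train_set.max()
manual_train = (train_set - x_min) / (x_max - x_min) * (b - a) + a
print(np.allclose(manual_train, transformed_train))

# Outlier effect: one extreme value stretches Max and squeezes everything else
with_outlier = np.append(train_set, 100.0)
squeezed = MinMaxScaler(feature_range=(0, 1)).fit_transform(
    with_outlier.reshape(-1, 1)).reshape(-1)
print(squeezed[:5].max())  # the original five values now sit near 0
```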

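1.2.3 RobustScaler: Robust Scaling (outlier-resistant scaling)

This scaler uses statistics that are robust to outliers (quantiles) to scale the feature.
Suppose the quantile range used for scaling is (a, b).
Formula: $X' = \frac{X - \mathrm{Median}(X)}{\mathrm{Quantile}_b(X) - \mathrm{Quantile}_a(X)}$
Because the median and the quantiles are barely affected by extreme values, this method is more robust to outliers than standard or min-max scaling.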

from sklearn.preprocessing import RobustScaler
model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
# with_centering=True => center the data: X becomes X - X.median()
# with_scaling=True => scale the data: X is divided by its quantile range (set by the user)

# Here the quantile range is set to (25%, 75%)

model.fit(train_set.reshape(-1,1)) # fit on the training set
# Transform the training set (a test set would be transformed with the same fitted model)
transformed_train = model.transform(train_set.reshape(-1,1)).reshape(-1)
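
To connect the formula to the scaler, the sketch below recomputes the result by hand with the median and the 25th and 75th percentiles. The data is hypothetical and includes one made-up outlier (100.0).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical data; 100.0 plays the role of an outlier
train_set = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
model.fit(train_set.reshape(-1, 1))
transformed_train = model.transform(train_set.reshape(-1, 1)).reshape(-1)

# Manual version: subtract the median, divide by the interquartile range
median = np.median(train_set)
q25, q75 = np.percentile(train_set, [25.0, 75.0])
manual_train = (train_set - median) / (q75 - q25)

print(np.allclose(manual_train, transformed_train))
print(transformed_train)
```

The outlier still maps to a large value, but it does not distort the scale applied to the other points, because the median and the interquartile range ignore it.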
