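Introducing and analyzing the Titanic problem
Use read_csv to load the data, and head to show the first five rows:
import os
import pandas as pd
import tensorflow as tf
from tensorflow import keras
train_file = "./data/titanic/train.csv"
eval_file = "./data/titanic/eval.csv"
train_df = pd.read_csv(train_file)
eval_df = pd.read_csv(eval_file)
print(train_df.head())
print(eval_df.head())
Output: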
survived sex age n_siblings_spouses parch fare class deck \
0 0 male 22.0 1 0 7.2500 Third unknown
1 1 female 38.0 1 0 71.2833 First C
2 1 female 26.0 0 0 7.9250 Third unknown
3 1 female 35.0 1 0 53.1000 First C
4 0 male 28.0 0 0 8.4583 Third unknown
embark_town alone
0 Southampton n
1 Cherbourg n
2 Southampton y
3 Southampton n
4 Queenstown y
survived sex age n_siblings_spouses parch fare class \
0 0 male 35.0 0 0 8.0500 Third
1 0 male 54.0 0 0 51.8625 First
2 1 female 58.0 0 0 26.5500 First
3 1 female 55.0 0 0 16.0000 Second
4 1 male 34.0 0 0 13.0000 Second
deck embark_town alone
0 unknown Southampton y
1 E Southampton y
2 C Southampton y
3 unknown Southampton y
4 D Southampton y
Use pop to separate the features (x) from the labels (y):
y_train = train_df.pop('survived')
y_eval = eval_df.pop('survived')
print(train_df.head())
print(eval_df.head())
print(y_train.head())
print(y_eval.head())
Output:
sex age n_siblings_spouses parch fare class deck \
0 male 22.0 1 0 7.2500 Third unknown
1 female 38.0 1 0 71.2833 First C
2 female 26.0 0 0 7.9250 Third unknown
3 female 35.0 1 0 53.1000 First C
4 male 28.0 0 0 8.4583 Third unknown
embark_town alone
0 Southampton n
1 Cherbourg n
2 Southampton y
3 Southampton n
4 Queenstown y
sex age n_siblings_spouses parch fare class deck \
0 male 35.0 0 0 8.0500 Third unknown
1 male 54.0 0 0 51.8625 First E
2 female 58.0 0 0 26.5500 First C
3 female 55.0 0 0 16.0000 Second unknown
4 male 34.0 0 0 13.0000 Second D
embark_town alone
0 Southampton y
1 Southampton y
2 Southampton y
3 Southampton y
4 Southampton y
0 0
1 1
2 1
3 1
4 0
Name: survived, dtype: int64
0 0
1 0
2 1
3 1
4 1
Name: survived, dtype: int64
As the output shows, train_df and eval_df no longer contain the 'survived' column.
y_train and y_eval hold the original 'survived' values.
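To make the behaviour of pop concrete, here is a minimal sketch on a tiny hand-made DataFrame. The toy values and the names toy_df and labels are assumptions for illustration only and are not read from the Titanic files.

```python
import pandas as pd

# Toy data for illustration only; not taken from train.csv or eval.csv.
toy_df = pd.DataFrame({
    "survived": [0, 1, 1],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

# pop removes the column from the DataFrame in place and returns it as a Series.
labels = toy_df.pop("survived")

print(list(toy_df.columns))  # ['sex', 'age']
print(labels.tolist())       # [0, 1, 1]
```

Because pop modifies the DataFrame in place, running it a second time on the same column raises a KeyError.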
Use describe to compute summary statistics:
train_df.describe()
Output:
| | age | n_siblings_spouses | parch | fare |
| ----- | ---------- | ------------------ | ---------- | ---------- |
| count | 627.000000 | 627.000000 | 627.000000 | 627.000000 |
| mean | 29.631308 | 0.545455 | 0.379585 | 34.385399 |
| std | 12.511818 | 1.151090 | 0.792999 | 54.597730 |
| min | 0.750000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 23.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 28.000000 | 0.000000 | 0.000000 | 15.045800 |
| 75% | 35.000000 | 1.000000 | 0.000000 | 31.387500 |
| max | 80.000000 | 8.000000 | 5.000000 | 512.329200 |
Use shape to check the size of the data:
print(train_df.shape, eval_df.shape)
Output:
(627, 9) (264, 9)
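train_df has 627 samples, each with 9 features.
eval_df has 264 samples, each with 9 features.
We can look at more statistics to analyze the data in detail.
First, the distribution of age:
# .age selects the column, hist draws a histogram, and bins sets how many intervals the data is split into
train_df.age.hist(bins = 20)
Output (histogram of passenger ages):
This shows that most passengers on the Titanic were roughly 20 to 30 years old, with the peak probably around 27 or 28.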
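As a rough illustration of what bins does, the sketch below counts a few made-up ages into equal-width intervals with numpy.histogram. The ages and the choice of 4 bins are assumptions chosen to keep the arithmetic easy to follow; they are not the Titanic data.

```python
import numpy as np

# Made-up ages for illustration only.
toy_ages = np.array([2.0, 19.0, 22.0, 24.0, 27.0, 28.0, 28.0, 35.0, 45.0, 80.0])

# Split the range [min, max] into 4 equal-width intervals and count the values in each.
counts, edges = np.histogram(toy_ages, bins=4)

print(edges.tolist())   # [2.0, 21.5, 41.0, 60.5, 80.0]
print(counts.tolist())  # [2, 6, 1, 1]
```

A larger bins value gives narrower intervals and a more detailed, but noisier, picture of the distribution.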
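Next, the distribution of sex on the Titanic:
# .sex selects the column, and .value_counts counts how often each value occurs
# barh draws a horizontal bar chart, bar a vertical one
train_df.sex.value_counts().plot(kind = 'barh')
Output (horizontal bar chart of passenger counts by sex):
Here we use ['class'] rather than .class, because class is a reserved keyword in Python and train_df.class would be a syntax error.
train_df['class'].value_counts().plot(kind = 'barh')
Output (horizontal bar chart of passenger counts by class):
So far we have only counted raw values. Next, let's look at what fraction of men and what fraction of women survived.
# First concatenate train_df and y_train, because y_train holds the survival labels
# groupby('sex') splits the rows into one group per sex
# .survived selects the label column, and mean is computed separately for each group;
# since the labels are 0/1, the mean is the survival rate of that group
pd.concat([train_df, y_train], axis = 1).groupby('sex').survived.mean().plot(kind='barh')
Output (horizontal bar chart of survival rate by sex):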
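The concat, groupby, and mean chain can be checked on a tiny example. This is a minimal sketch with made-up rows; the names toy_features and toy_labels are hypothetical. It shows that the mean of 0/1 labels within a group is that group's survival rate.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_features = pd.DataFrame({"sex": ["male", "female", "male", "female", "male"]})
toy_labels = pd.Series([0, 1, 0, 1, 1], name="survived")

# Put features and labels side by side, then average the labels per group.
combined = pd.concat([toy_features, toy_labels], axis=1)
rate_by_sex = combined.groupby("sex")["survived"].mean()

print(rate_by_sex)
# female: 2 of 2 survived -> 1.0
# male:   1 of 3 survived -> about 0.333
```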
Building a model for the Titanic data
Use feature_columns to wrap the data: discrete values can easily be one-hot encoded, and continuous values can easily be bucketized into discrete features.
# Categorical (discrete) features
categorical_columns = ['sex', 'n_siblings_spouses', 'parch', 'class',
                       'deck', 'embark_town', 'alone']
# Numeric (continuous) features
numeric_columns = ['age', 'fare']
feature_columns = []
for categorical_column in categorical_columns:
    # unique() returns every value that appears in this categorical column
    vocab = train_df[categorical_column].unique()
    print(categorical_column, vocab)
    feature_columns.append(
        tf.feature_column.indicator_column(  # convert to a one-hot encoding
            tf.feature_column.categorical_column_with_vocabulary_list(  # define a categorical feature column
                categorical_column, vocab)))
for numeric_column in numeric_columns:
    feature_columns.append(
        tf.feature_column.numeric_column(
            numeric_column, dtype=tf.float32))
Output:
sex ['male' 'female']
n_siblings_spouses [1 0 3 4 2 5 8]
parch [0 1 2 5 3 4]
class ['Third' 'First' 'Second']
deck ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone ['n' 'y']
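The code above only one-hot encodes the categorical columns; the bucketing of continuous values mentioned earlier is not shown. As a minimal sketch of the idea in plain Python (TensorFlow offers tf.feature_column.bucketized_column for this), the boundaries [18, 30, 50] and the helper name bucketize below are assumptions chosen for illustration.

```python
from bisect import bisect_right

def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    With boundaries [18, 30, 50] the buckets are
    (-inf, 18), [18, 30), [30, 50), [50, +inf).
    """
    return bisect_right(boundaries, value)

# Hypothetical age boundaries; not taken from the original article.
age_boundaries = [18, 30, 50]

for age in [0.75, 22.0, 30.0, 80.0]:
    print(age, "->", bucketize(age, age_boundaries))
# 0.75 -> 0, 22.0 -> 1, 30.0 -> 2, 80.0 -> 3
```

The bucket index is then a discrete value that can be one-hot encoded like any other categorical feature.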
Build a dataset:
def make_dataset(data_df, label_df, epochs = 10, shuffle = True,
                 batch_size = 32):
    # Each element is a (dict of feature name -> value, label) pair
    dataset = tf.data.Dataset.from_tensor_slices(
        (dict(data_df), label_df))
    if shuffle:
        dataset = dataset.shuffle(10000)
    # Repeat the data for the given number of epochs, then group it into batches
    dataset = dataset.repeat(epochs).batch(batch_size)
    return dataset
train_dataset = make_dataset(train_df, y_train, batch_size = 5)
# Inspect one batch of 5 samples
for x, y in train_dataset.take(1):
    print(x, y)
Output:
{'sex': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'male', b'male', b'female'], dtype=object)>, 'age': <tf.Tensor: shape=(5,), dtype=float64, numpy=array([43., 28., 49., 57., 14.])>, 'n_siblings_spouses': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 1, 0, 0])>, 'parch': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 0, 0, 0])>, 'fare': <tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 8.05 , 7.7292, 56.9292, 12.35 , 7.8542])>, 'class': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'Third', b'First', b'Second', b'Third'], dtype=object)>, 'deck': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'unknown', b'A', b'unknown', b'unknown'], dtype=object)>, 'embark_town': <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Queenstown', b'Cherbourg', b'Queenstown',
b'Southampton'], dtype=object)>, 'alone': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'y', b'y', b'n', b'y', b'y'], dtype=object)>} tf.Tensor([0 0 1 0 0], shape=(5,), dtype=int64)
keras.layers.DenseFeatures applies feature_columns to a batch from the dataset:
# keras.layers.DenseFeatures
for x, y in train_dataset.take(1):
    # Index 7 is 'age': the first numeric column, after the seven categorical ones
    age_column = feature_columns[7]
    # Index 0 is 'sex'
    gender_column = feature_columns[0]
    print(keras.layers.DenseFeatures(age_column)(x).numpy())
    print(keras.layers.DenseFeatures(gender_column)(x).numpy())
Output:
[[28.]
[16.]
[36.]
[45.]
[28.]]
[[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[1. 0.]]
The age values are passed through unchanged, while gender is output as a one-hot encoding.
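The one-hot rows printed for gender can be reproduced by hand. The sketch below is a plain-Python illustration of a vocabulary-based one-hot encoding, not TensorFlow's implementation; the helper name one_hot_from_vocab is hypothetical, and the vocabulary order is taken from the unique() output printed earlier.

```python
def one_hot_from_vocab(value, vocab):
    """Return a vector with 1.0 at the position of value in vocab, 0.0 elsewhere."""
    return [1.0 if value == entry else 0.0 for entry in vocab]

# Vocabulary order as printed by unique() for the 'sex' column.
sex_vocab = ["male", "female"]

print(one_hot_from_vocab("male", sex_vocab))    # [1.0, 0.0]
print(one_hot_from_vocab("female", sex_vocab))  # [0.0, 1.0]
```

In this sketch a value that is not in the vocabulary maps to an all-zero vector.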
# keras.layers.DenseFeatures applied to all feature columns at once
for x, y in train_dataset.take(1):
    print(keras.layers.DenseFeatures(feature_columns)(x).numpy())
Output:
[[23. 1. 0. 0. 1. 0. 0. 0. 0.
0. 0. 1. 0. 0. 0. 1. 0. 0.
63.3583 0. 1. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 1. 0. ]
[28. 0. 1. 1. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0.
7.75 0. 1. 0. 0. 0. 0. 0. 1.
0. 0. 0. 0. 0. 1. 0. ]
[52. 0. 1. 0. 1. 0. 0. 1. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0.
30.5 0. 1. 0. 0. 0. 0. 0. 1.
0. 0. 0. 0. 0. 1. 0. ]
[28. 0. 1. 1. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0.
7.8958 0. 1. 0. 0. 0. 0. 0. 1.
0. 0. 0. 0. 0. 1. 0. ]
[28. 1. 0. 1. 0. 0. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0.
24.15 1. 0. 0. 0. 0. 0. 0. 1.
0. 0. 0. 0. 0. 1. 0. ]]
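Each printed row has 34 values. A short calculation, using the vocabulary sizes printed by unique() earlier, shows where that number comes from. The printed rows are also consistent with the columns being concatenated in alphabetical order of their names (age first, fare at position 18), which the sketch below assumes when computing the offsets.

```python
# Vocabulary sizes taken from the unique() output printed earlier.
vocab_sizes = {
    "sex": 2, "n_siblings_spouses": 7, "parch": 6, "class": 3,
    "deck": 8, "embark_town": 4, "alone": 2,
}
numeric_columns = ["age", "fare"]

# One-hot columns contribute one slot per vocabulary entry, numeric columns one slot each.
widths = dict(vocab_sizes)
for name in numeric_columns:
    widths[name] = 1

# Assumption: columns are laid out in alphabetical order of their names.
layout = []
offset = 0
for name in sorted(widths):
    layout.append((name, offset, offset + widths[name]))
    offset += widths[name]

total_width = offset
print(total_width)  # 34
for name, start, end in layout:
    print(f"{name}: positions {start} to {end - 1}")
```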
Build the Keras model:
model = keras.models.Sequential([
    # Turn the feature dict into one dense input vector
    keras.layers.DenseFeatures(feature_columns),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    # Two output classes: did not survive / survived
    keras.layers.Dense(2, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer = keras.optimizers.SGD(learning_rate=0.01),
              metrics = ['accuracy'])
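To see what the softmax output layer and the sparse_categorical_crossentropy loss compute for a single sample, here is a minimal plain-Python sketch. The logits are made-up numbers and the helper names are hypothetical; Keras computes the same quantities on whole batches and averages the loss.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

# Equal scores: the model is undecided, so each class gets probability 0.5.
undecided = softmax([0.0, 0.0])
print(undecided)                                       # [0.5, 0.5]
print(sparse_categorical_crossentropy(1, undecided))   # log(2), about 0.693

# A higher score for class 0 lowers the loss when the true label is 0.
confident = softmax([2.0, 0.0])
print(sparse_categorical_crossentropy(0, confident))
```

The "sparse" variant takes the label as an integer class index (0 or 1 here), which is why y_train can be used directly without one-hot encoding it.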
Train the model:
# There are two ways to train:
# 1. model.fit (used here)
# 2. model -> estimator -> train
train_dataset = make_dataset(train_df, y_train, epochs = 100)
eval_dataset = make_dataset(eval_df, y_eval, epochs = 1, shuffle = False)
history = model.fit(train_dataset,
                    validation_data = eval_dataset,
                    steps_per_epoch = 19,
                    validation_steps = 8,
                    epochs = 100)
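The values steps_per_epoch = 19 and validation_steps = 8 appear to come from dividing the sample counts reported by shape by the default batch size of 32. The following sketch redoes that arithmetic and checks that the repeated training dataset holds enough batches for 100 epochs; the variable names are chosen here for illustration.

```python
import math

train_samples = 627   # from train_df.shape
eval_samples = 264    # from eval_df.shape
batch_size = 32       # default batch_size of make_dataset
epochs = 100

# Number of full batches in one pass over the data.
steps_per_epoch = train_samples // batch_size
validation_steps = eval_samples // batch_size
print(steps_per_epoch, validation_steps)  # 19 8

# make_dataset repeats the data first and batches afterwards,
# so the training dataset holds ceil(627 * 100 / 32) batches in total.
total_train_batches = math.ceil(train_samples * epochs / batch_size)
batches_needed = steps_per_epoch * epochs
print(total_train_batches, batches_needed)  # 1960 1900
```

Since 1900 is not more than 1960, the dataset does not run out of data before the 100 epochs finish.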
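Using premade estimators
Estimator is a high-level TensorFlow API that simplifies machine learning by wrapping the whole workflow of training, evaluation, prediction, and export for serving. tf.estimator provides the basic framework for training, evaluation, and prediction, so developers only need to define the model structure (model_fn) and the input pipeline (input_fn). Note that tf.estimator is deprecated in recent TensorFlow releases, so the code below needs a TensorFlow 2.x version that still includes it.
BaselineClassifier does not learn any pattern from the features. It counts how often each class occurs in the training labels, computes the proportion of each class, and uses those proportions as its prediction for every sample.
output_dir = 'baseline_model'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
baseline_estimator = tf.estimator.BaselineClassifier(
    model_dir=output_dir,
    n_classes=2)
# input_fn must be a callable that returns the dataset, hence the lambda
baseline_estimator.train(input_fn = lambda: make_dataset(
    train_df, y_train, epochs=100))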
Evaluate this estimator:
baseline_estimator.evaluate(input_fn = lambda: make_dataset(
    eval_df, y_eval, epochs=1, shuffle=False, batch_size = 20))
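The idea behind such a baseline can be sketched in a few lines of plain Python. This is an illustration of the concept under simple assumptions, not the BaselineClassifier implementation: it estimates class proportions from made-up training labels and then always predicts the most frequent class. The helper name fit_class_priors and the toy labels are hypothetical.

```python
from collections import Counter

def fit_class_priors(labels):
    """Return the proportion of each class among the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Made-up labels for illustration only.
toy_train_labels = [0, 0, 0, 1, 1]
toy_eval_labels = [0, 1, 0, 0]

priors = fit_class_priors(toy_train_labels)
print(priors)  # {0: 0.6, 1: 0.4}

# Ignore the features completely and always predict the most frequent class.
majority_class = max(priors, key=priors.get)
correct = sum(1 for label in toy_eval_labels if label == majority_class)
accuracy = correct / len(toy_eval_labels)
print(majority_class, accuracy)  # 0 0.75
```

Any real model should beat this number, which is why a baseline is a useful reference point before training LinearClassifier or DNNClassifier.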
Using LinearClassifier
linear_output_dir = 'linear_model'
if not os.path.exists(linear_output_dir):
    os.mkdir(linear_output_dir)
linear_estimator = tf.estimator.LinearClassifier(
    model_dir = linear_output_dir,
    n_classes = 2,
    feature_columns = feature_columns)
linear_estimator.train(input_fn = lambda : make_dataset(
    train_df, y_train, epochs = 100))
The files written to the model directory can be viewed with TensorBoard: tensorboard --logdir linear_model --bind_all
Using DNNClassifier
dnn_output_dir = './dnn_model'
if not os.path.exists(dnn_output_dir):
    os.mkdir(dnn_output_dir)
dnn_estimator = tf.estimator.DNNClassifier(
    model_dir = dnn_output_dir,
    n_classes = 2,
    feature_columns=feature_columns,
    # Two hidden layers with 128 units each
    hidden_units = [128, 128],
    activation_fn = tf.nn.relu,
    optimizer = 'Adam')
dnn_estimator.train(input_fn = lambda : make_dataset(
    train_df, y_train, epochs = 100))
Crossed features: given age: [1, 2, 3, 4, 5] and gender: [male, female],
the cross age_x_gender takes the values [(1, male), (2, male), ...., (5, female)]
# If a cross has 100000 possible values and hash_bucket_size is 100,
# each value is mapped to one of 100 buckets: hash(value) % 100
feature_columns.append(
    tf.feature_column.indicator_column(
        tf.feature_column.crossed_column(
            ['age', 'sex'], hash_bucket_size = 100)))
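The hashing step can be sketched in plain Python. This is only an illustration of the hash(value) % bucket_size idea: it uses CRC32 from the standard library as a stand-in hash, which is an assumption and not the hash function TensorFlow uses, so the bucket indices will differ from those of crossed_column. The helper name crossed_bucket is hypothetical.

```python
import zlib

def crossed_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a bucket index in [0, hash_bucket_size)."""
    # Join the values into one key, e.g. "22_X_male", then hash it.
    key = "_X_".join(str(v) for v in values)
    return zlib.crc32(key.encode("utf-8")) % hash_bucket_size

hash_bucket_size = 100

# Every (age, sex) combination lands in one of 100 buckets.
for age, sex in [(22, "male"), (22, "female"), (38, "female")]:
    print((age, sex), "->", crossed_bucket((age, sex), hash_bucket_size))
```

Because many combinations share only 100 buckets, different crosses can collide in the same bucket; a larger hash_bucket_size reduces collisions at the cost of a wider one-hot vector.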