Using TensorFlow Estimator



Introducing the Titanic problem

Titanic data download: train data, eval data

Read the data with read_csv:

head returns the first five rows:

train_file = "./data/titanic/train.csv"
eval_file = "./data/titanic/eval.csv"

train_df = pd.read_csv(train_file)
eval_df = pd.read_csv(eval_file)

print(train_df.head())
print(eval_df.head())

Output:

   survived     sex   age  n_siblings_spouses  parch     fare  class     deck  \
0         0    male  22.0                   1      0   7.2500  Third  unknown   
1         1  female  38.0                   1      0  71.2833  First        C   
2         1  female  26.0                   0      0   7.9250  Third  unknown   
3         1  female  35.0                   1      0  53.1000  First        C   
4         0    male  28.0                   0      0   8.4583  Third  unknown   

   embark_town alone  
0  Southampton     n  
1    Cherbourg     n  
2  Southampton     y  
3  Southampton     n  
4   Queenstown     y  
   survived     sex   age  n_siblings_spouses  parch     fare   class  \
0         0    male  35.0                   0      0   8.0500   Third   
1         0    male  54.0                   0      0  51.8625   First   
2         1  female  58.0                   0      0  26.5500   First   
3         1  female  55.0                   0      0  16.0000  Second   
4         1    male  34.0                   0      0  13.0000  Second   

      deck  embark_town alone  
0  unknown  Southampton     y  
1        E  Southampton     y  
2        C  Southampton     y  
3  unknown  Southampton     y  
4        D  Southampton     y 

Split off x and y with pop:

y_train = train_df.pop('survived')
y_eval = eval_df.pop('survived')

print(train_df.head())
print(eval_df.head())
print(y_train.head())
print(y_eval.head())

Output:

      sex   age  n_siblings_spouses  parch     fare  class     deck  \
0    male  22.0                   1      0   7.2500  Third  unknown   
1  female  38.0                   1      0  71.2833  First        C   
2  female  26.0                   0      0   7.9250  Third  unknown   
3  female  35.0                   1      0  53.1000  First        C   
4    male  28.0                   0      0   8.4583  Third  unknown   

   embark_town alone  
0  Southampton     n  
1    Cherbourg     n  
2  Southampton     y  
3  Southampton     n  
4   Queenstown     y  
      sex   age  n_siblings_spouses  parch     fare   class     deck  \
0    male  35.0                   0      0   8.0500   Third  unknown   
1    male  54.0                   0      0  51.8625   First        E   
2  female  58.0                   0      0  26.5500   First        C   
3  female  55.0                   0      0  16.0000  Second  unknown   
4    male  34.0                   0      0  13.0000  Second        D   

   embark_town alone  
0  Southampton     y  
1  Southampton     y  
2  Southampton     y  
3  Southampton     y  
4  Southampton     y  
0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64
0    0
1    0
2    1
3    1
4    1
Name: survived, dtype: int64

Notice that train_df and eval_df no longer contain the 'survived' column,

while y_train and y_eval are the original 'survived' column.

Use describe to compute summary statistics:

train_df.describe()

Output:

|       | age        | n_siblings_spouses | parch      | fare       |
| ----- | ---------- | ------------------ | ---------- | ---------- |
| count | 627.000000 | 627.000000         | 627.000000 | 627.000000 |
| mean  | 29.631308  | 0.545455           | 0.379585   | 34.385399  |
| std   | 12.511818  | 1.151090           | 0.792999   | 54.597730  |
| min   | 0.750000   | 0.000000           | 0.000000   | 0.000000   |
| 25%   | 23.000000  | 0.000000           | 0.000000   | 7.895800   |
| 50%   | 28.000000  | 0.000000           | 0.000000   | 15.045800  |
| 75%   | 35.000000  | 1.000000           | 0.000000   | 31.387500  |
| max   | 80.000000  | 8.000000           | 5.000000   | 512.329200 |

Check the data sizes with shape:

print(train_df.shape, eval_df.shape)

Output:

(627, 9) (264, 9)

train_df has 627 samples, each with 9 features.

eval_df has 264 samples, each with 9 features.

We can look at more statistics to analyze this data in detail.

First, the distribution of age:

# .age selects the column; hist draws a histogram; bins sets the number of bins
train_df.age.hist(bins = 20)

Output:

(figure: histogram of passenger ages)

From this we can see that most passengers on the Titanic were around 20-30 years old, with the peak at roughly 27 or 28.

Next, the distribution of sex on board:

# .sex selects the column; .value_counts counts each distinct value
# barh draws a horizontal bar chart; bar a vertical one
train_df.sex.value_counts().plot(kind = 'barh')

Output:

(figure: horizontal bar chart of sex counts)

Here we use ['class'] instead of .class, because class is a Python keyword, so attribute access would conflict:

train_df['class'].value_counts().plot(kind = 'barh')

Output:

(figure: horizontal bar chart of class counts)

So far we have only counted raw values; next, let's compute how many men survived and how many women survived.

# First concat train_df with y_train, since y_train holds the survival labels
# groupby('sex') splits the rows into a male group and a female group
# .survived selects the labels; mean computes the survival rate for each group
pd.concat([train_df, y_train], axis = 1).groupby('sex').survived.mean().plot(kind='barh')

Output:

(figure: survival rate by sex)

Building a model on the Titanic data

Use feature_columns to wrap the data: categorical features can easily be one-hot encoded, and continuous features can easily be bucketized into discrete features.
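The post never shows bucketization itself, so here is a minimal sketch using tf.feature_column.bucketized_column; the age boundaries below are hypothetical, chosen only for illustration:

```python
import tensorflow as tf

# Hypothetical boundaries; 5 boundaries produce 6 buckets:
# (-inf, 18), [18, 30), [30, 40), [40, 50), [50, 60), [60, +inf)
age_numeric = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(
    age_numeric, boundaries=[18, 30, 40, 50, 60])

# DenseFeatures turns each age into a one-hot vector over the 6 buckets.
example = {'age': tf.constant([[22.0], [38.0], [65.0]])}
one_hot = tf.keras.layers.DenseFeatures([age_buckets])(example)
print(one_hot.numpy())
```

Each row of the output has exactly one 1, marking which bucket the age fell into.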

# categorical features
categorical_columns = ['sex', 'n_siblings_spouses', 'parch', 'class',
                       'deck', 'embark_town', 'alone']
# numeric features
numeric_columns = ['age', 'fare']

feature_columns = []
for categorical_column in categorical_columns:
    # unique returns all distinct values in the categorical column
    vocab = train_df[categorical_column].unique()
    print(categorical_column, vocab)
    feature_columns.append(
        tf.feature_column.indicator_column(   # wrap as one-hot encoding
            tf.feature_column.categorical_column_with_vocabulary_list(  # define a feature_column from the vocabulary
                categorical_column, vocab)))

for numeric_column in numeric_columns:
    feature_columns.append(
        tf.feature_column.numeric_column(
            numeric_column, dtype=tf.float32))

Output:

sex ['male' 'female']
n_siblings_spouses [1 0 3 4 2 5 8]
parch [0 1 2 5 3 4]
class ['Third' 'First' 'Second']
deck ['unknown' 'C' 'G' 'A' 'B' 'D' 'F' 'E']
embark_town ['Southampton' 'Cherbourg' 'Queenstown' 'unknown']
alone ['n' 'y']

Build a dataset:

def make_dataset(data_df, label_df, epochs = 10, shuffle = True,
                 batch_size = 32):
    dataset = tf.data.Dataset.from_tensor_slices(
        (dict(data_df), label_df))
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.repeat(epochs).batch(batch_size)
    return dataset
train_dataset = make_dataset(train_df, y_train, batch_size = 5)
for x, y in train_dataset.take(1):
    print(x, y)

Output:

{'sex': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'male', b'male', b'female'], dtype=object)>, 'age': <tf.Tensor: shape=(5,), dtype=float64, numpy=array([43., 28., 49., 57., 14.])>, 'n_siblings_spouses': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 1, 0, 0])>, 'parch': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 0, 0, 0])>, 'fare': <tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 8.05  ,  7.7292, 56.9292, 12.35  ,  7.8542])>, 'class': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'Third', b'First', b'Second', b'Third'], dtype=object)>, 'deck': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'unknown', b'A', b'unknown', b'unknown'], dtype=object)>, 'embark_town': <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Queenstown', b'Cherbourg', b'Queenstown',
       b'Southampton'], dtype=object)>, 'alone': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'y', b'y', b'n', b'y', b'y'], dtype=object)>} tf.Tensor([0 0 1 0 0], shape=(5,), dtype=int64)

keras.layers.DenseFeatures applies feature_columns to a dataset:

# keras.layers.DenseFeatures
for x, y in train_dataset.take(1):
    age_column = feature_columns[7]
    gender_column = feature_columns[0]
    print(keras.layers.DenseFeatures(age_column)(x).numpy())
    print(keras.layers.DenseFeatures(gender_column)(x).numpy())

Output:

[[28.]
 [16.]
 [36.]
 [45.]
 [28.]]
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]]

age passes through unchanged, while gender is expanded into a one-hot encoding.

# keras.layers.DenseFeatures
for x, y in train_dataset.take(1):
    print(keras.layers.DenseFeatures(feature_columns)(x).numpy())

Output:

[[23.      1.      0.      0.      1.      0.      0.      0.      0.
   0.      0.      1.      0.      0.      0.      1.      0.      0.
  63.3583  0.      1.      0.      0.      0.      0.      0.      0.
   1.      0.      0.      0.      0.      1.      0.    ]
 [28.      0.      1.      1.      0.      0.      1.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      1.      0.
   7.75    0.      1.      0.      0.      0.      0.      0.      1.
   0.      0.      0.      0.      0.      1.      0.    ]
 [52.      0.      1.      0.      1.      0.      0.      1.      0.
   0.      0.      0.      0.      0.      1.      0.      0.      0.
  30.5     0.      1.      0.      0.      0.      0.      0.      1.
   0.      0.      0.      0.      0.      1.      0.    ]
 [28.      0.      1.      1.      0.      0.      1.      0.      0.
   0.      0.      0.      0.      0.      1.      0.      0.      0.
   7.8958  0.      1.      0.      0.      0.      0.      0.      1.
   0.      0.      0.      0.      0.      1.      0.    ]
 [28.      1.      0.      1.      0.      0.      1.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      1.      0.
  24.15    1.      0.      0.      0.      0.      0.      0.      1.
   0.      0.      0.      0.      0.      1.      0.    ]]

Build a Keras model:

model = keras.models.Sequential([
    keras.layers.DenseFeatures(feature_columns),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(2, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer = keras.optimizers.SGD(learning_rate=0.01),
              metrics = ['accuracy'])

Train the model:

# 1. model.fit 
# 2. model -> estimator -> train

train_dataset = make_dataset(train_df, y_train, epochs = 100)
eval_dataset = make_dataset(eval_df, y_eval, epochs = 1, shuffle = False)
history = model.fit(train_dataset,
                    validation_data = eval_dataset,
                    steps_per_epoch = 19,
                    validation_steps = 8,
                    epochs = 100)

Using premade Estimators

Estimator is a high-level API from TensorFlow that simplifies machine learning: it wraps the full workflow of training, evaluation, prediction, and export for serving. tf.estimator provides the basic framework for training, evaluation, and prediction; the developer only needs to define the model structure (model_fn) and how input data is obtained (input_fn).
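As a minimal sketch of the input_fn contract (the feature values below are made up, standing in for the Titanic columns): the Estimator calls input_fn with no arguments and expects a tf.data.Dataset yielding (features_dict, labels) pairs:

```python
import tensorflow as tf

def input_fn():
    # Hypothetical toy batch: a dict of feature columns plus labels.
    features = {'age': [22.0, 38.0], 'fare': [7.25, 71.5]}
    labels = [0, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

for x, y in input_fn().take(1):
    print(x['age'].numpy(), y.numpy())
```

The make_dataset function defined earlier follows exactly this shape, which is why it can be passed to an Estimator via a lambda.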

BaselineClassifier learns no patterns from the features; it makes naive guesses: it counts the samples of each class, computes each class's proportion, and uses those proportions to guess the label.
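The class-proportion idea can be sketched in plain Python, with a hypothetical label list standing in for y_train:

```python
from collections import Counter

# Hypothetical labels: 0 = died, 1 = survived.
labels = [0, 1, 1, 0, 0, 0, 1, 0]
counts = Counter(labels)

# BaselineClassifier-style priors: P(class) = class count / total.
priors = {cls: n / len(labels) for cls, n in counts.items()}
print(priors)  # {0: 0.625, 1: 0.375}
```

With these priors, the baseline would predict class 0 with probability 0.625 regardless of the input features, which gives a floor accuracy that real models should beat.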

output_dir = 'baseline_model'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    
baseline_estimator = tf.estimator.BaselineClassifier(
    model_dir=output_dir,
    n_classes=2)
baseline_estimator.train(input_fn = lambda: make_dataset(
    train_df, y_train, epochs=100))

Evaluate this estimator:

baseline_estimator.evaluate(input_fn = lambda: make_dataset(
    eval_df, y_eval, epochs=1, shuffle=False, batch_size = 20))

Using LinearClassifier

linear_output_dir = 'linear_model'
if not os.path.exists(linear_output_dir):
    os.mkdir(linear_output_dir)
linear_estimator = tf.estimator.LinearClassifier(
    model_dir = linear_output_dir,
    n_classes = 2,
    feature_columns = feature_columns)
linear_estimator.train(input_fn = lambda : make_dataset(
    train_df, y_train, epochs = 100))

The generated model can be inspected with TensorBoard: tensorboard --logdir linear_model --bind_all

Using DNNClassifier

dnn_output_dir = './dnn_model'
if not os.path.exists(dnn_output_dir):
    os.mkdir(dnn_output_dir)
dnn_estimator = tf.estimator.DNNClassifier(
    model_dir = dnn_output_dir,
    n_classes = 2,
    feature_columns=feature_columns,
    hidden_units = [128, 128],
    activation_fn = tf.nn.relu,
    optimizer = 'Adam')
dnn_estimator.train(input_fn = lambda : make_dataset(
    train_df, y_train, epochs = 100))

Cross (crossed) features combine two columns, e.g. age: [1, 2, 3, 4, 5], gender: [male, female]

age_x_gender: [(1, male), (2, male), ...., (5, female)]
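Conceptually, each crossed pair is hashed into one of hash_bucket_size buckets, so the full cross product never needs to be materialized. A plain-Python sketch of the idea (this is not TensorFlow's actual fingerprint hash):

```python
import hashlib

def hash_bucket(age, sex, num_buckets=100):
    # Join the pair into one key, hash it, and fold into num_buckets buckets.
    key = f"{age}_x_{sex}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets

# Every (age, sex) pair deterministically lands in one of 100 buckets.
print(hash_bucket(22, 'male'), hash_bucket(22, 'female'))
```

Different pairs may collide in the same bucket; that is the accepted trade-off for keeping the feature dimension fixed at hash_bucket_size.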

# e.g. 100000 possible crossed values -> fold into 100 buckets: hash(value) % 100
feature_columns.append(
    tf.feature_column.indicator_column(
        tf.feature_column.crossed_column(
            ['age', 'sex'], hash_bucket_size = 100)))