Basic usage of the Dataset API
Building a dataset from in-memory data (tf.data.Dataset.from_tensor_slices)
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
print(dataset)
Output:
<TensorSliceDataset shapes: (), types: tf.int64>
What can we do with a dataset once it is built?
The most basic operation is traversal: iterate over its elements with a for loop.
for item in dataset:
    print(item)
Output:
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
1. Read the dataset repeatedly: repeat, one pass per epoch.
2. Get batches: take a small group of elements from the dataset at a time.
# 1. repeat epoch
# 2. get batch
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)
Output:
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)
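Note the final batch only has 2 elements. If every batch must have exactly the same shape, batch also accepts a drop_remainder flag; a minimal sketch (drop_remainder is a standard tf.data option, not used in the original pipeline):
# drop_remainder=True discards the final short batch,
# so every emitted batch has exactly 7 elements
dataset_full = tf.data.Dataset.from_tensor_slices(np.arange(10))
dataset_full = dataset_full.repeat(3).batch(7, drop_remainder=True)
for item in dataset_full:
    print(item)  # four tensors of shape (7,); the trailing [8 9] is dropped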
interleave: transforms each element of an existing dataset into a new dataset, then merges the results into a single dataset.
Typical case: a dataset of file names -> the actual records inside those files.
dataset2 = dataset.interleave(
    lambda v: tf.data.Dataset.from_tensor_slices(v),  # map_fn: the transformation applied to each element
    cycle_length=5,  # cycle_length: how many input elements are processed concurrently
    block_length=5,  # block_length: how many items to take from each element per cycle
)
for item in dataset2:
    print(item)
Output:
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
Each cycle takes a small block from one element's result, then a block from the next element, and so on, which produces an evenly interleaved mix.
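To see how block_length controls the granularity of the mix, here is a small sketch with block_length=2 (illustrative parameters of my own, not from the original):
dataset2b = dataset.interleave(
    lambda v: tf.data.Dataset.from_tensor_slices(v),
    cycle_length=5,  # still cycle over 5 input elements
    block_length=2,  # but take only 2 items from each before moving on
)
print([item.numpy() for item in dataset2b])
# the output now switches between source batches every 2 elements,
# giving a finer-grained mix than block_length=5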
Tuples (and lists) are also supported, e.g. pairing features with labels:
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
# the first dimensions of x and y must match
dataset3 = tf.data.Dataset.from_tensor_slices((x, y))
print(dataset3)
for item_x, item_y in dataset3:
    print(item_x.numpy(), item_y.numpy())
Output:
<TensorSliceDataset shapes: ((2,), ()), types: (tf.int64, tf.string)>
[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'
Dictionaries are supported as well:
dataset4 = tf.data.Dataset.from_tensor_slices({"feature": x,
                                               "label": y})
for item in dataset4:
    print(item["feature"].numpy(), item["label"].numpy())
Output:
[1 2] b'cat'
[3 4] b'dog'
[5 6] b'fox'
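model.fit expects (features, label) tuples rather than dicts unless the model's inputs are named to match the keys, so a dict dataset is often mapped back into tuples; a minimal sketch (dataset5 is my own name, not from the original):
# each element is a dict of tensors; pull out the two entries as a tuple
dataset5 = dataset4.map(
    lambda d: (d["feature"], d["label"]))
for item_x, item_y in dataset5:
    print(item_x.numpy(), item_y.numpy())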
Generating CSV files
Use the California housing data from sklearn to generate CSV files.
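The code below assumes housing, x_train_scaled, y_train, and friends from the preceding chapters; a minimal sketch of how they could be prepared (standard sklearn calls, with the same variable names as the code that follows):
import os
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
# split off a test set, then a validation set from the remainder
x_train_all, x_test, y_train_all, y_test = train_test_split(
    housing.data, housing.target, random_state=7)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train_all, y_train_all, random_state=11)
# fit the scaler on the training set only, then reuse it
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_valid_scaled = scaler.transform(x_valid)
x_test_scaled = scaler.transform(x_test)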
output_dir = "generate_csv"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

def save_to_csv(output_dir, data, name_prefix,
                header=None, n_parts=10):
    path_format = os.path.join(output_dir, "{}_{:02d}.csv")
    filenames = []
    # enumerate attaches an index to each chunk of row indices
    for file_idx, row_indices in enumerate(
            np.array_split(np.arange(len(data)), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filenames.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header + "\n")
            for row_index in row_indices:
                f.write(",".join(
                    [repr(col) for col in data[row_index]]))
                f.write('\n')
    return filenames
# np.c_ concatenates two matrices along the column axis
train_data = np.c_[x_train_scaled, y_train]
valid_data = np.c_[x_valid_scaled, y_valid]
test_data = np.c_[x_test_scaled, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header_str = ",".join(header_cols)
train_filenames = save_to_csv(output_dir, train_data, "train",
header_str, n_parts=20)
valid_filenames = save_to_csv(output_dir, valid_data, "valid",
header_str, n_parts=10)
test_filenames = save_to_csv(output_dir, test_data, "test",
header_str, n_parts=10)
import pprint
print("train filenames:")
pprint.pprint(train_filenames)
print("valid filenames:")
pprint.pprint(valid_filenames)
print("test filenames:")
pprint.pprint(test_filenames)
Output:
train filenames:
['generate_csv/train_00.csv', 'generate_csv/train_01.csv', 'generate_csv/train_02.csv', 'generate_csv/train_03.csv', 'generate_csv/train_04.csv', 'generate_csv/train_05.csv', 'generate_csv/train_06.csv', 'generate_csv/train_07.csv', 'generate_csv/train_08.csv', 'generate_csv/train_09.csv', 'generate_csv/train_10.csv', 'generate_csv/train_11.csv', 'generate_csv/train_12.csv', 'generate_csv/train_13.csv', 'generate_csv/train_14.csv', 'generate_csv/train_15.csv', 'generate_csv/train_16.csv', 'generate_csv/train_17.csv', 'generate_csv/train_18.csv', 'generate_csv/train_19.csv']
valid filenames:
['generate_csv/valid_00.csv', 'generate_csv/valid_01.csv', 'generate_csv/valid_02.csv', 'generate_csv/valid_03.csv', 'generate_csv/valid_04.csv', 'generate_csv/valid_05.csv', 'generate_csv/valid_06.csv', 'generate_csv/valid_07.csv', 'generate_csv/valid_08.csv', 'generate_csv/valid_09.csv']
test filenames:
['generate_csv/test_00.csv', 'generate_csv/test_01.csv', 'generate_csv/test_02.csv', 'generate_csv/test_03.csv', 'generate_csv/test_04.csv', 'generate_csv/test_05.csv', 'generate_csv/test_06.csv', 'generate_csv/test_07.csv', 'generate_csv/test_08.csv', 'generate_csv/test_09.csv']
Using tf.io.decode_csv
list_files handles file names: it turns a list of file names into a dataset of filename strings (shuffled by default).
# 1. filename -> dataset
# 2. read file -> dataset -> datasets -> merge
# 3. parse csv
filename_dataset = tf.data.Dataset.list_files(train_filenames)
for filename in filename_dataset:
    print(filename)
Output:
tf.Tensor(b'generate_csv/train_16.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_05.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_04.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_00.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_11.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_13.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_19.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_01.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_18.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_17.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_09.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_03.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_06.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_07.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_10.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_08.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_12.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_15.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_02.csv', shape=(), dtype=string)
tf.Tensor(b'generate_csv/train_14.csv', shape=(), dtype=string)
Now merge the per-file datasets into one.
TextLineDataset is the API dedicated to reading text files line by line.
skip(1) skips the header line of each file.
n_readers = 5
dataset = filename_dataset.interleave(
    lambda filename: tf.data.TextLineDataset(filename).skip(1),
    cycle_length=n_readers
)
for line in dataset.take(1):
    print(line.numpy())
Output:
b'-0.09719300311107498,-1.249743071766074,0.36232962250170797,0.026906080250728295,1.033811814747154,0.045881586971778555,1.3418334617377423,-1.6353869745909178,1.832'
Parsing CSV records: tf.io.decode_csv
# tf.io.decode_csv(records, record_defaults)
sample_str = '1,2,3,4,5'
# record_defaults sets the type (and default value) of each field
record_defaults = [
    tf.constant(0, dtype=tf.int32),
    0,
    np.nan,
    "hello",
    tf.constant([])
]
parsed_fields = tf.io.decode_csv(sample_str, record_defaults)
print(parsed_fields)
Output:
[<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=float32, numpy=3.0>, <tf.Tensor: shape=(), dtype=string, numpy=b'4'>, <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
What error is raised if the input string has missing fields?
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Output:
Field 4 is required but missing in record 0! [Op:DecodeCSV]
And what if the input string has too many fields?
try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Output:
Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
Parsing the CSV files generated above
First, define a function that parses a single line of the CSV:
def parse_csv_line(line, n_fields=9):
    # every field defaults to float32 NaN
    defs = [tf.constant(np.nan)] * n_fields
    parsed_fields = tf.io.decode_csv(line, record_defaults=defs)
    # the first 8 fields are the features, the last one is the label
    x = tf.stack(parsed_fields[0:-1])
    y = tf.stack(parsed_fields[-1:])
    return x, y
parse_csv_line(b'-0.9868720801669367,0.832863080552588,-0.18684708416901633,-0.14888949288707784,-0.4532302419670616,-0.11504995754593579,1.6730974284189664,-0.7465496877362412,1.138',
n_fields=9)
Output:
(<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([-0.9868721 , 0.8328631 , -0.18684709, -0.1488895 , -0.45323023,
-0.11504996, 1.6730974 , -0.74654967], dtype=float32)>,
<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.138], dtype=float32)>)
Parsing the complete dataset:
# 1. filename -> dataset
# 2. read file -> dataset -> datasets -> merge
# 3. parse csv
def csv_reader_dataset(filenames, n_readers=5,
                       batch_size=32, n_parse_threads=5,
                       shuffle_buffer_size=10000):
    dataset = tf.data.Dataset.list_files(filenames)
    # repeat() with no argument repeats indefinitely;
    # steps_per_epoch will bound each epoch during fit
    dataset = dataset.repeat()
    dataset = dataset.interleave(
        lambda filename: tf.data.TextLineDataset(filename).skip(1),
        cycle_length=n_readers
    )
    # shuffle returns a new dataset, so the result must be reassigned
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(parse_csv_line,
                          num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset
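A common performance addition (my own, not part of the original function) is to prefetch, so the input pipeline prepares the next batch while the model trains on the current one:
# overlap preprocessing with training, e.g. inside
# csv_reader_dataset just before `return dataset`
dataset = dataset.prefetch(1)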
train_set = csv_reader_dataset(train_filenames, batch_size=3)
for x_batch, y_batch in train_set.take(2):
    print("x:")
    pprint.pprint(x_batch)
    print("y:")
    pprint.pprint(y_batch)
Output:
x:
<tf.Tensor: shape=(3, 8), dtype=float32, numpy=
array([[-1.0775077 , -0.4487407 , -0.5680568 , -0.14269263, -0.09666677,
0.12326469, -0.31448638, -0.4818959 ],
[ 0.09734604, 0.75276285, -0.20218964, -0.19547 , -0.40605137,
0.00678553, -0.81371516, 0.6566148 ],
[ 0.8015443 , 0.27216142, -0.11624393, -0.20231152, -0.5430516 ,
-0.02103962, -0.5897621 , -0.08241846]], dtype=float32)>
y:
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.978],
[1.119],
[3.226]], dtype=float32)>
x:
<tf.Tensor: shape=(3, 8), dtype=float32, numpy=
array([[ 0.63636464, -1.0895426 , 0.09260903, -0.20538124, 1.2025671 ,
-0.03630123, -0.6784102 , 0.18223535],
[-0.32652634, 0.4323619 , -0.09345459, -0.08402992, 0.8460036 ,
-0.02663165, -0.56176794, 0.1422876 ],
[-1.2310716 , 0.91296333, -0.19194563, 0.12851463, -0.1873954 ,
0.1460428 , -0.785721 , 0.6566148 ]], dtype=float32)>
y:
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[2.429],
[2.431],
[0.953]], dtype=float32)>
Read out the training, validation, and test sets:
batch_size = 32
train_set = csv_reader_dataset(train_filenames,
batch_size = batch_size)
valid_set = csv_reader_dataset(valid_filenames,
batch_size = batch_size)
test_set = csv_reader_dataset(test_filenames,
batch_size = batch_size)
Define and train the model:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(30, activation='relu',
                       input_shape=[8]),
    keras.layers.Dense(1),
])
model.compile(loss="mean_squared_error", optimizer="sgd")
callbacks = [keras.callbacks.EarlyStopping(
    patience=5, min_delta=1e-2)]
# the dataset repeats indefinitely, so steps_per_epoch and
# validation_steps are needed to delimit each epoch
history = model.fit(train_set,
                    validation_data=valid_set,
                    steps_per_epoch=11160 // batch_size,
                    validation_steps=3870 // batch_size,
                    epochs=100,
                    callbacks=callbacks)
Output:
Epoch 1/100
348/348 [==============================] - 1s 3ms/step - loss: 0.9557 - val_loss: 0.6975
Epoch 2/100
348/348 [==============================] - 1s 2ms/step - loss: 0.5594 - val_loss: 0.5255
Epoch 3/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4679 - val_loss: 0.4911
Epoch 4/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4554 - val_loss: 0.4552
Epoch 5/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4279 - val_loss: 0.4440
Epoch 6/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4230 - val_loss: 0.4401
Epoch 7/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4021 - val_loss: 0.4198
Epoch 8/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4115 - val_loss: 0.4123
Epoch 9/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3925 - val_loss: 0.4061
Epoch 10/100
348/348 [==============================] - 1s 2ms/step - loss: 0.4007 - val_loss: 0.3974
Epoch 11/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3872 - val_loss: 0.3949
Epoch 12/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3770 - val_loss: 0.3917
Epoch 13/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3837 - val_loss: 0.3902
Epoch 14/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3715 - val_loss: 0.3838
Epoch 15/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3742 - val_loss: 0.3800
Epoch 16/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3624 - val_loss: 0.3944
Epoch 17/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3803 - val_loss: 0.3786
Epoch 18/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3580 - val_loss: 0.3745
Epoch 19/100
348/348 [==============================] - 1s 2ms/step - loss: 0.3631 - val_loss: 0.3758
Evaluate the model on the test set:
model.evaluate(test_set, steps = 5160 // batch_size)
Output:
161/161 [==============================] - 0s 2ms/step - loss: 0.3854
0.3854424059391022