tensorflow-tfrecord

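1. Introduction to TFRecord

There is an efficient way to read data: serialize it and store it in a set of files (each 100-200 MB) that can be read linearly. This approach is well suited to streaming data over a network and to buffering preprocessed data.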

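Advantages

TFRecord is a simple format for storing a sequence of binary records. Its advantages are:

More efficient use of memory: the data does not have to be loaded all at once
Streaming reads, which suit network environments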

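2. Data structures related to TFRecords

1. protocol buffers

Protocol buffers are a platform-neutral, language-neutral, extensible mechanism (library) for serializing structured data. Messages are defined in .proto files, which play a role similar to .java source files in Java. A dedicated compiler (protoc) turns a .proto file into source code for the target language, generating the classes, getters, and setters used to serialize and deserialize the structured data.

The protocol buffers mechanism
- message: the counterpart of a class, and it may be compiled into one. A message is defined in a .proto file, much as a Java class is defined in a .java file.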

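2. tf.train.Feature

The tf.train.Example message (or protobuf) is a message type that represents a {"string": tf.train.Feature} mapping; it was designed specifically for TensorFlow. A tf.train.Feature holds one of the following three types:

tf.train.BytesList: string and byte values can be converted to this type
tf.train.FloatList: float and double values can be coerced to this type
tf.train.Int64List: integer types, bool, and enum values can be coerced to this type

A tf.train.Features object is equivalent to map<string, feature>.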
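As a minimal sketch of the three list types, the snippet below builds one tf.train.Feature of each kind and collects them in a tf.train.Features map. The variable names and the keys "name", "score", and "ids" are made up for illustration, and TensorFlow 2.x is assumed.

```python
import tensorflow as tf

# One Feature per supported list type (names and values are illustrative only).
bytes_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"hello"]))
float_feature = tf.train.Feature(float_list=tf.train.FloatList(value=[0.5]))
# A bool is converted to an integer before it is stored in an Int64List.
int64_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[int(True), 7]))

# tf.train.Features plays the role of map<string, feature>.
features = tf.train.Features(feature={
    "name": bytes_feature,
    "score": float_feature,
    "ids": int64_feature,
})

print(sorted(features.feature.keys()))
```

Each Feature carries exactly one of the three lists, so the list type has to match the kind of value being stored.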

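3. tf.train.Example

1. Relationship between Example and Feature

It can be understood as `example = trans(map<string, feature>)`: an Example wraps a map from feature names to features.

2. Creating a tf.Example

1. Basic steps for creating a tf.train.Example message

Create a tf.train.Features object with its constructor: features = tf.train.Features(feature=map<string, feature>)
Create a tf.train.Example object with its constructor: example = tf.train.Example(features=features)

2. Example code for creating an Example message

Example code for creating a tf.train.Example object: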

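import tensorflow as tf

# Helpers that wrap a single value in a tf.train.Feature. The snippet calls them
# without defining them, so these minimal definitions are an assumption.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(feature0, feature1, feature2, feature3):
    """
    Creates a tf.train.Example message ready to be written to a file.
    """
    # Create a dictionary mapping the feature name to the tf.train.Example-compatible
    # data type.
    feature_map = {
        'feature0': _int64_feature(feature0),  # convert an int to a type supported by tf.train.Features
        'feature1': _int64_feature(feature1),
        'feature2': _bytes_feature(feature2),
        'feature3': _float_feature(feature3),
    }

    # Create a Features message, then wrap it in a tf.train.Example.
    tf_train_features = tf.train.Features(feature=feature_map)  # map -> tf.train.Features
    example_proto = tf.train.Example(features=tf_train_features)  # tf.train.Features -> tf.train.Example message
    return example_proto.SerializeToString()  # serialize the message to a byte string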
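To check what a serialized message contains, it can be decoded back into a tf.train.Example. The sketch below is self-contained and uses a single made-up feature named "feature0"; it assumes TensorFlow 2.x.

```python
import tensorflow as tf

# Build a tiny Example with one int64 feature (name and value are illustrative).
example = tf.train.Example(features=tf.train.Features(feature={
    "feature0": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
}))

# Serialize to bytes, then decode the bytes back into a message.
serialized = example.SerializeToString()
decoded = tf.train.Example.FromString(serialized)

restored_value = decoded.features.feature["feature0"].int64_list.value[0]
print(type(serialized).__name__, restored_value)
```

The serialized form is an ordinary Python bytes object, which is what gets written to a .tfrecords file.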

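4. Relationship between tfrecords and these data structures

A .tfrecords file contains a sequence of Example objects and can only be read sequentially; it is equivalent to tfrecords = List<Example>.
The relationship between an Example object and Features: example = map<string, feature>, or in other words `example = trans(tf.train.Features)`.

On disk, each record has the following layout:

classDiagram
class Record
Record:uint64 length
Record:uint32 masked_crc32_of_length # masked CRC32 checksum of the length
Record:byte   data[length]
Record:uint32 masked_crc32_of_data # masked CRC32 checksum of the data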
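The record layout above can be walked with nothing but the standard library. The following is a rough sketch, not a replacement for TensorFlow's own reader: the function name read_tfrecord_payloads is hypothetical, the integers are read as little-endian, and both checksums are skipped rather than verified.

```python
import struct

def read_tfrecord_payloads(path):
    """Yield the raw data bytes of each record in a TFRecord file.

    The two CRC fields are skipped, so corrupted files are not detected.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)               # uint64 length
            if len(header) < 8:
                break                        # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)                        # uint32 masked_crc32_of_length (skipped)
            data = f.read(length)            # byte data[length]
            f.read(4)                        # uint32 masked_crc32_of_data (skipped)
            yield data
```

Each yielded payload is one serialized Example, so it could be passed to tf.train.Example.FromString to decode it.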

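3. Creating a tfrecords file

1. Basic steps for creating tfrecords

Define a writer object
Create a tf.train.Example message
Serialize the Example message and write it through the writer object

2. Key API

tf.data.Dataset.from_tensor_slices(feature)

return: a tf.data.Dataset whose elements are the slices of feature along its first dimension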
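A small sketch of what from_tensor_slices returns, assuming TensorFlow 2.x with eager execution; the input list is made up for illustration.

```python
import tensorflow as tf

# A 3x2 input: slicing along the first dimension gives three elements of shape (2,).
features = [[1, 2], [3, 4], [5, 6]]
dataset = tf.data.Dataset.from_tensor_slices(features)

for element in dataset:
    print(element.numpy())
# [1 2]
# [3 4]
# [5 6]
```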

3. Example code
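The three steps listed earlier (define a writer, build Example messages, serialize and write them) might look like the sketch below. The helper make_example, the sample rows, and the path example.tfrecords are all hypothetical, and TensorFlow 2.x is assumed.

```python
import tensorflow as tf

def make_example(label, name, score):
    """Build a tf.train.Example from one (int, bytes, float) row."""
    feature_map = {
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        "score": tf.train.Feature(float_list=tf.train.FloatList(value=[score])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature_map))

rows = [(0, b"cat", 0.5), (1, b"dog", 1.5)]  # illustrative data
output_path = "example.tfrecords"            # hypothetical output file

# Step 1: define the writer. Steps 2 and 3: build each Example, serialize it, write it.
with tf.io.TFRecordWriter(output_path) as writer:
    for label, name, score in rows:
        example = make_example(label, name, score)
        writer.write(example.SerializeToString())
```

Using the writer as a context manager closes the file when the block ends, so the records are flushed to disk.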

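4. Parsing a tfrecord file

1. Basic steps for parsing a TFRecord

Read the raw TFRecord file with tf.data.TFRecordDataset to obtain a tf.data.Dataset object
Use Dataset.map to apply a parsing function to each serialized tf.train.Example in the dataset
Inside that function, deserialize each tf.train.Example with tf.io.parse_single_example()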
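The parsing steps above can be sketched end to end as follows. To stay self-contained, the snippet first writes a small file; the path parse_demo.tfrecords, the feature names "id" and "value", and the function parse_fn are made up for illustration, and TensorFlow 2.x is assumed.

```python
import tensorflow as tf

path = "parse_demo.tfrecords"  # hypothetical file, written here so the sketch can run

with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
            "value": tf.train.Feature(float_list=tf.train.FloatList(value=[i * 0.5])),
        }))
        writer.write(example.SerializeToString())

# Describes the name, shape, and dtype of each feature to recover.
feature_description = {
    "id": tf.io.FixedLenFeature([], tf.int64),
    "value": tf.io.FixedLenFeature([], tf.float32),
}

def parse_fn(serialized):
    # Deserialize one tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_description)

raw_dataset = tf.data.TFRecordDataset([path])  # step 1: read the raw records
parsed_dataset = raw_dataset.map(parse_fn)     # steps 2 and 3: parse each record

ids = [int(record["id"].numpy()) for record in parsed_dataset]
print(ids)
```

The feature description has to match what was written: a wrong name or dtype makes parse_single_example fail when the dataset is iterated.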

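2. Key API

1. Reading a TFRecord

The tf.data.TFRecordDataset class: parameters and return value

filenames (list): a list of file paths; several tfrecord files can be read at once
return: a dataset object

Basic usage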

import tensorflow as tf
filenames = [filename]  # filename: path to a TFRecord file
dataset = tf.data.TFRecordDataset(filenames)

tf.data.Dataset.map(): the mapping function, based on