
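Since Google introduced BERT in 2018, pre-trained models have led the way in NLP. BERT is a landmark deep model in the history of NLP, and its strength is beyond doubt. In practice, a pre-trained BERT model is good enough for most scenarios. In this series I walk through a highly reusable text classification codebase built on pre-trained BERT, split across three posts. This first post covers the data loading part.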

Source code download: download link (extraction code: 2021)

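The dataset used in the code is IFLYTEK, the long-text classification task from the Chinese Language Understanding Evaluation benchmark (CLUE).
Dataset introduction link
The CLUE paper was accepted with high scores at COLING 2020, the International Conference on Computational Linguistics.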

Overview

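Data loading lives in data_loader.py. Below I go through each class and function in that file and sort out how they relate to each other. The file contains the following classes and functions: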

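Brief definitions:
InputExample: the sample object. One object is created per sample, and its methods can be overridden to suit the task.
InputFeatures: the feature object. One object is created per set of features, and its methods can be overridden to suit the task.
iflytekProcessor: the processor object. It reads the data file and returns InputExample objects. The class name is not fixed: you can adapt the reading methods to your own task. Here, using iflytek as the example, the processor is named iflytekProcessor.
convert_examples_to_features: the conversion function. It takes InputExample objects and returns InputFeatures objects.
load_and_cache_examples: the caching function. It caches the InputFeatures produced by convert_examples_to_features to local disk, so they do not have to be rebuilt on every training run.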

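Logic: load_and_cache_examples asks the processor for InputExample objects, then passes them to convert_examples_to_features to get InputFeatures.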

InputExample

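This class is simple. Its initializer defines the attributes of an input example:
guid: unique id of the example.
words: the example sentence.
label: the example's label.
No changes are needed. For text matching you can add a second sentence, for example self.wordspair = wordspair, depending on the task.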

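# Imports used throughout data_loader.py
import copy
import json
import logging
import os

import pandas as pd
import torch
from torch.utils.data import TensorDataset

logger = logging.getLogger(__name__)


class InputExample(object):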
    """
    A single training/test example for simple sequence classification.

    Args:
        guid: Unique id for the example.
        words: list. The words of the sequence.
        label: (Optional) string. The label of the example.
    """

    def __init__(self, guid, words, label=None):
        self.guid = guid
        self.words = words
        self.label = label

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
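
The text-matching extension mentioned above could look roughly like the sketch below. PairInputExample is a hypothetical name that is not part of the original code, only the wordspair attribute comes from the text, and the sample sentences are made up.

```python
import copy
import json


class PairInputExample(object):
    """Hypothetical two-sentence variant of InputExample for text matching."""

    def __init__(self, guid, words, wordspair, label=None):
        self.guid = guid
        self.words = words          # first sentence
        self.wordspair = wordspair  # second sentence
        self.label = label

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        return copy.deepcopy(self.__dict__)

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        # ensure_ascii=False keeps Chinese text readable in the output
        return json.dumps(self.to_dict(), indent=2, sort_keys=True,
                          ensure_ascii=False) + "\n"


example = PairInputExample(guid="train-0", words="今天天气如何",
                           wordspair="今天天气怎么样", label="1")
print(example.to_json_string())
```

With a change like this, convert_examples_to_features would also have to encode the second sentence, so treat it as a starting point only.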

InputFeatures

This class describes BERT's input format: input_ids, attention_mask and token_type_ids, plus a label_id.

If you are also using a BERT model, no changes are needed. For a different pre-trained model, adapt the class to that model's input format.

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.label_id = label_id

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

iflytekProcessor

Reads the data file for the task.

class iflytekProcessor(object):
    """Processor for the iflytek data set."""

    def __init__(self, args):
        self.args = args
        self.labels = get_labels(args)

        self.input_text_file = 'data.csv'

    @classmethod
    def _read_file(cls, input_file, quotechar=None):
        """Reads a CSV file into a pandas DataFrame."""
        df = pd.read_csv(input_file)
        return df

    def _create_examples(self, datas, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
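        for i, rows in datas.iterrows():
            try:
                guid = "%s-%s" % (set_type, i)
                # 1. input_text
                words = rows["text"]
                # 2. label
                label = rows["labels"]
            except KeyError:
                # Skip rows that lack the expected columns instead of
                # appending stale or undefined values.
                logger.warning("Skipping malformed row: %s", rows)
                continue

            examples.append(InputExample(guid=guid, words=words, label=label))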
        return examples

    def get_examples(self, mode):
        """
        Args:
            mode: train, dev, test
        """
        data_path = os.path.join(self.args.data_dir, self.args.task, mode)
        logger.info("LOOKING AT {}".format(data_path))
        return self._create_examples(datas=self._read_file(os.path.join(data_path, self.input_text_file)),
                                     set_type=mode)
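
To make the expected file layout concrete, here is a minimal sketch of what get_examples looks for: a data.csv under data_dir/task/mode with a text column and a labels column. The directory, the two rows and their label values are made up for illustration, and I use the standard csv module rather than pandas only so the snippet runs without extra dependencies.

```python
import csv
import os
import tempfile

# Hypothetical values standing in for args.data_dir and args.task
data_dir = tempfile.mkdtemp()
task = "iflytek"

# Write a tiny data.csv for the "train" split with the two columns
# that _create_examples reads: "text" and "labels".
split_dir = os.path.join(data_dir, task, "train")
os.makedirs(split_dir, exist_ok=True)
csv_path = os.path.join(split_dir, "data.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "labels"])
    writer.writeheader()
    writer.writerow({"text": "一款记账应用", "labels": "3"})
    writer.writerow({"text": "在线看视频", "labels": "7"})

# Read it back and build (guid, text, label) triples, using the same
# guid pattern as _create_examples: "<set_type>-<row index>".
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
examples = [("train-%s" % i, row["text"], row["labels"])
            for i, row in enumerate(rows)]
print(examples)
```

The dev and test splits would follow the same pattern, each with its own data.csv in its own mode directory.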

convert_examples_to_features

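This function encodes the examples following BERT's encoding scheme. It converts each one into input_ids, attention_mask and token_type_ids, all padded to max_seq_len, and builds the features from them.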

def convert_examples_to_features(examples, max_seq_len, tokenizer,
                                 cls_token_segment_id=0,
                                 pad_token_segment_id=0,
                                 sequence_a_segment_id=0,
                                 mask_padding_with_zero=True):
    # Setting based on the current model type
    cls_token = tokenizer.cls_token
    sep_token = tokenizer.sep_token
    unk_token = tokenizer.unk_token
    pad_token_id = tokenizer.pad_token_id

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        # Tokenize item by item. When example.words is a string (as produced
        # by iflytekProcessor), this iterates character by character.
        tokens = []
        for word in example.words:
            word_tokens = tokenizer.tokenize(word)
            if not word_tokens:
                word_tokens = [unk_token]  # For handling the bad-encoded word
            tokens.extend(word_tokens)

        # Account for [CLS] and [SEP]
        special_tokens_count = 2
        if len(tokens) > max_seq_len - special_tokens_count:
            tokens = tokens[:(max_seq_len - special_tokens_count)]

        # Add [SEP] token
        tokens += [sep_token]
        token_type_ids = [sequence_a_segment_id] * len(tokens)

        # Add [CLS] token
        tokens = [cls_token] + tokens
        token_type_ids = [cls_token_segment_id] + token_type_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token_id] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(len(attention_mask), max_seq_len)
        assert len(token_type_ids) == max_seq_len, "Error with token type length {} vs {}".format(len(token_type_ids), max_seq_len)

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % example.guid)
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("attention_mask: %s" % " ".join([str(x) for x in attention_mask]))
            logger.info("token_type_ids: %s" % " ".join([str(x) for x in token_type_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
            InputFeatures(input_ids=input_ids,
                          attention_mask=attention_mask,
                          token_type_ids=token_type_ids,
                          label_id=label_id,
                          ))

    return features
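
The truncate, add special tokens, then pad sequence above is easier to see on a toy input. The sketch below is a simplified re-implementation under stated assumptions: toy_vocab and the encode helper are hypothetical, and a per-character dictionary lookup stands in for the real BERT tokenizer.

```python
def encode(text, vocab, max_seq_len):
    """Toy version of the truncate / add special tokens / pad steps."""
    # Stand-in for tokenizer.tokenize: one token per character
    tokens = [ch if ch in vocab else "[UNK]" for ch in text]

    # Leave room for [CLS] and [SEP]
    tokens = tokens[:max_seq_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    token_type_ids = [0] * len(input_ids)   # single-sentence input

    # Pad all three lists up to max_seq_len
    padding_length = max_seq_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding_length
    attention_mask += [0] * padding_length  # 0 = padding
    token_type_ids += [0] * padding_length
    return input_ids, attention_mask, token_type_ids


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "你": 4, "好": 5}

short = encode("你好", toy_vocab, max_seq_len=6)        # padded
long = encode("你好你好你好", toy_vocab, max_seq_len=6)  # truncated
print(short)
print(long)
```

The short input ends with two padding positions whose attention_mask is 0, while the long input is cut to four characters so that [CLS] and [SEP] still fit.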

load_and_cache_examples

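A load-and-cache function: as the name suggests, it both loads and saves.

1. Build the cache file name cached_features_file from the arguments, in the form cached_{mode}_{task}_{model name}_{max_seq_len}.

2. Check whether that cache file already exists. If it does, load it directly; otherwise build the features:

First, use the entry in processors to get the examples.
Then, call convert_examples_to_features to get the features, and save them to the cache.

3. Convert the features to tensors and build the dataset with TensorDataset.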
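
The cache file name from step 1 can be reproduced in isolation. In the sketch below the argument values are hypothetical; only the attribute names (data_dir, task, model_name_or_path, max_seq_len) match what the function reads from args.

```python
import os
from types import SimpleNamespace

# Hypothetical argument values for illustration
args = SimpleNamespace(
    data_dir="./data",
    task="iflytek",
    model_name_or_path="./pretrained/bert-base-chinese/",
    max_seq_len=128,
)

# Take the last non-empty path component, so a trailing slash is harmless
model_name = list(filter(None, args.model_name_or_path.split("/"))).pop()

cached_features_file = os.path.join(
    args.data_dir,
    "cached_{}_{}_{}_{}".format("train", args.task, model_name, args.max_seq_len),
)
print(cached_features_file)
```

Because the mode, task, model name and max_seq_len are all part of the name, changing any of them leads to a fresh cache file. Changing the raw data does not, so a stale cache has to be deleted by hand.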

def load_and_cache_examples(args, tokenizer, mode):
    processor = processors[args.task](args)

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        'cached_{}_{}_{}_{}'.format(
            mode,
            args.task,
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.max_seq_len
        )
    )
    print(cached_features_file)

    if os.path.exists(cached_features_file):
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        # Load data features from dataset file
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if mode == "train":
            examples = processor.get_examples("train")
        elif mode == "dev":
            examples = processor.get_examples("dev")
        elif mode == "test":
            examples = processor.get_examples("test")
        else:
            raise Exception("For mode, Only train, dev, test is available")

        # Encode the examples as fixed-length BERT input features
        features = convert_examples_to_features(examples,
                                                args.max_seq_len,
                                                tokenizer,
                                                )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor(
        [f.input_ids for f in features],
        dtype=torch.long
    )
    all_attention_mask = torch.tensor(
        [f.attention_mask for f in features],
        dtype=torch.long
    )
    all_token_type_ids = torch.tensor(
        [f.token_type_ids for f in features],
        dtype=torch.long
    )
    all_label_ids = torch.tensor(
        [f.label_id for f in features],
        dtype=torch.long
    )

    dataset = TensorDataset(all_input_ids, all_attention_mask,
                            all_token_type_ids, all_label_ids)
    return dataset
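
The function looks up processors[args.task], but the definition of processors is not shown in this post. A plausible shape, and this is an assumption on my part, is a plain dict from task name to processor class. DemoProcessor below is a stand-in so the sketch runs on its own; in the real file the value would presumably be iflytekProcessor.

```python
from types import SimpleNamespace


class DemoProcessor(object):
    """Stand-in for iflytekProcessor so this sketch is self-contained."""

    def __init__(self, args):
        self.args = args

    def get_examples(self, mode):
        # The real processor returns InputExample objects read from disk
        return ["{}-{}".format(mode, i) for i in range(2)]


# Assumed registry: task name -> processor class
processors = {"iflytek": DemoProcessor}

args = SimpleNamespace(task="iflytek")
processor = processors[args.task](args)
print(processor.get_examples("dev"))
```

With a registry like this, supporting a new task means writing one more processor class and adding one more dict entry, while load_and_cache_examples stays unchanged.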

Sample output

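Coming next: the following posts cover building the model and training it. To be continued.
I'm new to NLP and still learning. If anything here is wrong or incomplete, corrections are very welcome!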
