R&W: "IEPILE: Unearthing Large-Scale Schema-Based Information Extraction Corpus"

Part1

Title:

IEPILE: unearthing a large-scale, schema-based information extraction (IE) corpus

Abstract:

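LLMs show a significant performance gap on information extraction (IE).

Existing IE datasets are (1) small in scale, (2) fragmented, and (3) lacking a standardized schema.

To address this, the paper introduces IEPILE, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens.

IEPILE is built by collecting and cleaning 33 existing IE datasets, and schema-based instruction generation is introduced to unearth a large-scale corpus.

Experiments on LLaMA, Baichuan, and Qwen show that training with IEPILE improves LLM performance on IE, especially zero-shot generalization.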

Introduction:

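Recent studies show a significant performance gap when LLMs are applied to IE tasks, and further suggest that the main cause may lie in the limited availability of high-quality, large-scale data corpora. Specifically, most IE datasets are limited in size, scattered in distribution, and lack standardization in schema.

(Here, schema refers to the pre-defined types of entities, relations, events (arguments and roles), etc.)

In response, the authors collect and clean a wide range of existing IE datasets to obtain a comprehensive bilingual IE instruction dataset named IEPILE.

During corpus construction, they find that existing methods for building IE instruction data suffer from two problems for generalizable IE: (1) Schema Query Disparity: the number of schema queries in an instruction may differ between training and evaluation, which can harm model generalization; (2) Semantic Confusion: the co-occurrence of semantically similar schemas within an instruction may confuse the model.

They therefore introduce a schema-based instruction generation strategy. First, a hard negative schema dictionary is built so that semantically similar schemas appear together in instructions more often. Next, batched instruction generation dynamically limits the number of schemas queried in each instruction to split_num. This addresses the performance degradation caused by inconsistent numbers of schema queries between training and evaluation, and also improves robustness when handling semantically confusing schemas. The result is IEPILE, containing approximately 0.32B tokens.

Fine-tuning Baichuan, LLaMA, and Qwen shows that LLMs trained with IEPILE achieve better zero-shot performance than the baselines. This result not only validates the IEPILE dataset but also provides a framework for creating IE datasets in other domains.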

Method:

data collection and cleaning

The corpus mainly involves bilingual data and focuses on three main categories of IE tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). Standardized procedures are applied to maintain data quality and format uniformity: format unification, instance deduplication, and the exclusion of low-quality data.
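The three cleaning steps can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the record layout (`text`, `labels`), the function names `unify` and `clean`, and the "empty text" low-quality rule are all assumptions made for the example.

```python
import json


def unify(record: dict, task: str) -> dict:
    """Map a raw record onto one shared layout (field names are hypothetical)."""
    return {
        "task": task,
        # Collapse runs of whitespace so near-identical texts compare equal.
        "text": " ".join(record["text"].split()),
        # Sort and de-duplicate labels so their order does not matter.
        "labels": sorted(set(map(tuple, record.get("labels", [])))),
    }


def clean(records: list, task: str) -> list:
    """Format unification, instance deduplication, low-quality exclusion."""
    seen = set()
    kept = []
    for raw in records:
        item = unify(raw, task)
        if not item["text"]:  # assumed low-quality rule: drop empty texts
            continue
        key = json.dumps(item, ensure_ascii=False, sort_keys=True)
        if key in seen:  # exact duplicate after unification
            continue
        seen.add(key)
        kept.append(item)
    return kept


raw = [
    {"text": "Paris  is in France.", "labels": [["Paris", "location"]]},
    {"text": "Paris is in France.", "labels": [["Paris", "location"]]},
    {"text": "   ", "labels": []},
]
cleaned = clean(raw, "NER")
```

Here the first two records collapse into one after whitespace normalization, and the blank record is dropped.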

schema-based instruction generation

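The focus is on instruction-based IE, where an instruction is composed of three elements: (1) Task Description: a template used to distinguish between different IE tasks; (2) Input Text: the source text from which the model is expected to extract information; (3) Schema Sequence: defines the information the model should extract, including entity types, relations, events, etc. Among these, the schema sequence is critical because it reflects the specific extraction requirements and changes dynamically. Constructing the schema sequence within an instruction is therefore extremely important.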
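As a rough illustration of how the three elements fit together, the sketch below serializes them into one JSON string. The template wording, the key names (`instruction`, `schema`, `input`), and the function `build_instruction` are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical task-description templates, one per IE task.
TASK_TEMPLATES = {
    "NER": "Extract entities of the listed types from the input text.",
    "RE": "Extract relation triples of the listed types from the input text.",
    "EE": "Extract events of the listed types and their arguments from the input text.",
}


def build_instruction(task: str, text: str, schemas: list) -> str:
    """Combine task description, schema sequence, and input text."""
    return json.dumps(
        {
            "instruction": TASK_TEMPLATES[task],  # (1) task description
            "schema": list(schemas),              # (3) schema sequence
            "input": text,                        # (2) input text
        },
        ensure_ascii=False,
    )


prompt = build_instruction(
    "RE", "Paris is in France.", ["location contains", "place of birth"]
)
```

Only the `schema` field changes when the same text is queried with a different schema sequence, which is why the notes single it out as the dynamic part.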

positive and negative schema mechanism in instructions

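First, schemas that actually appear in the input text are defined as positive schemas, and those that do not appear are negative schemas. In Figure 1, "location contains", which appears in the annotation, is a positive schema, while all other schemas from the predefined label set L are negative schemas.

Traditional IE frameworks, treated as sequence labeling tasks, take text as input and produce a label for each token as output; the model input involves no notion of positive or negative schemas. In the era of generative IE, however, models such as UIE integrate a schema sequence into the model input (called the Structural Schema Instructor, or SSI) to guide the model's output and restrict it to the scope of the SSI. This makes it necessary to include the entire predefined label set of a dataset as the SSI that guides model output during inference. Consequently, if the SSI contains only positive schemas during training, the model will tend to generate a corresponding answer for every label in the SSI at inference.

Therefore, for the model to explicitly refuse to generate output for negative schemas, negative schemas must be incorporated into the SSI.

In this paper, the schema sequence included in an instruction follows the SSI concept. However, existing studies tend to adopt a rather coarse schema processing strategy when constructing instructions: all schemas in a predefined label set are used to build the instructions. This approach potentially entails two significant issues: (1) Inconsistency in the number of schema queries within instructions between training and evaluation: for example, if the model is trained on about 20 schema queries but tested with either 10 or 30, its performance will decrease even if the training and evaluation schemas are similar in content. (2) Inadequate differentiation among schemas in the instructions: for example, semantically similar schemas such as "layoffs", "depart", and "dismissals" may present co-occurrence ambiguities that confuse LLMs. Such schemas should co-occur more frequently in instructions.

Hence, two components are introduced: (1) Hard Negative Schema Construction; (2) Batched Instruction Generation.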

hard negative schema construction

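As shown in Figure 1, assume dataset D possesses a predefined label set L. For a given text S, the schemas present in the annotation of S form the positive schema set Pos_L, while the remaining schemas form the negative schema set Neg_L. The analysis finds that the main cause of model errors is the semantic ambiguity of the schema. Traditional approaches simply define Neg_L as L − Pos_L. This overlooks a critical aspect: negative schemas that are semantically close to positive schemas deserve special attention.

Inspired by contrastive learning, a hard negative schema dictionary K is constructed, in which each key is a unique schema and the associated value is a collection of schemas semantically similar to the key schema. On this basis, the hard negative schema set is defined as Hard_L = K[Pos_L], and the other negative schema set as Other_L = L − Pos_L − Hard_L. The final Neg_L consists of Hard_L plus a small subset of Other_L. This strategy not only presents semantically similar schemas more frequently within instructions, but also reduces the number of training instances without sacrificing model performance.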
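The set definitions above translate almost directly into code. The sketch below is one possible reading, not the authors' implementation: the function name `build_negative_schemas`, the parameter `n_other` (how many schemas to draw from Other_L), and the seeded `rng` are assumptions, since the notes only say "a small subset".

```python
import random


def build_negative_schemas(labels, positives, hard_dict, n_other=2, rng=None):
    """Return (Hard_L, sampled subset of Other_L) for one text."""
    if rng is None:
        rng = random.Random(0)  # fixed seed, purely for reproducibility here
    label_set = set(labels)
    pos = set(positives)

    # Hard_L = K[Pos_L]: union of the schemas similar to each positive schema,
    # restricted to the label set and excluding the positives themselves.
    hard = set()
    for schema in pos:
        hard.update(hard_dict.get(schema, ()))
    hard = (hard & label_set) - pos

    # Other_L = L - Pos_L - Hard_L, from which only a few schemas are kept.
    other = sorted(label_set - pos - hard)
    sampled = rng.sample(other, min(n_other, len(other)))
    return sorted(hard), sampled


L = ["layoffs", "depart", "dismissals", "marriage", "birth", "award"]
K = {"layoffs": ["depart", "dismissals"]}  # hypothetical dictionary entry
hard_l, other_subset = build_negative_schemas(L, ["layoffs"], K)
neg_l = hard_l + other_subset  # final Neg_L
```

With "layoffs" as the only positive schema, both confusable schemas are always kept as negatives, while only two of the three unrelated schemas are queried.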

batched instruction generation

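The final schema set is then L' = Pos_L + Neg_L. A batched instruction generation method dynamically limits the number of schemas inquired in each instruction to split_num, which ranges from 4 to 6. L' is thus divided into |L'|/split_num batches for querying, each batch querying split_num schemas.

Consequently, even if the number of schemas inquired during evaluation differs from that during training, the batched mechanism distributes the inquiries across batches of split_num schemas, mitigating the drop in generalization performance.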
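The batching step can be sketched as a simple chunking of L'. The function name `batch_schemas` and the optional shuffle are assumptions; the sketch also assumes that when |L'| is not a multiple of split_num, the last batch is simply smaller, which the notes do not specify.

```python
import random


def batch_schemas(schemas, split_num, rng=None):
    """Split the final schema set L' into batches of at most split_num."""
    items = list(schemas)
    if rng is not None:
        rng.shuffle(items)  # optional: vary which schemas share a batch
    return [items[i:i + split_num] for i in range(0, len(items), split_num)]


final_schemas = [f"schema_{i}" for i in range(11)]  # stand-in for L'
batches = batch_schemas(final_schemas, split_num=4)
sizes = [len(b) for b in batches]
```

Each batch would then be paired with the same input text to form one instruction, so one text yields several instructions with a bounded number of schema queries each.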

Experiments:

Based on IEPILE, models such as Baichuan2, LLaMA2, and Qwen1.5 are fine-tuned with LoRA, and their zero-shot generalization is then compared against a range of baseline models.

experimental settings

Evaluation Metrics:

Span-based Micro-F1 is used as the metric for measuring model performance.
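For reference, a minimal sketch of span-based Micro-F1: true positives, false positives, and false negatives are pooled over all examples before precision and recall are computed. Representing each prediction as a `(span text, type)` tuple with exact matching is an assumption of this sketch; the notes do not give the matching rules.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example sets of (span, type) tuples."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g, p = set(gold_spans), set(pred_spans)
        tp += len(g & p)  # predicted and correct
        fp += len(p - g)  # predicted but not in gold
        fn += len(g - p)  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [
    {("Paris", "location"), ("France", "location")},
    {("Bob", "person")},
]
pred = [
    {("Paris", "location")},
    {("Bob", "person"), ("Acme", "organization")},
]
score = micro_f1(gold, pred)
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3.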

Baselines:

A range of strong models is selected for comparative analysis, including UIE, LLaMA2-13B-Chat, Baichuan2-13B-Chat, Mistral-7B-Instruct-v0.2, Qwen1.5-14B-Chat, ChatGPT, GPT-4, InstructUIE, and YAYI-UIE.

Zero-shot Benchmark:

13 datasets that do not appear in the training set are collected.

main results

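Overall, after training with IEPILE, the models achieve better results on the majority of tasks. The authors attribute this to the hard negative schema construction and batched instruction generation strategy, which can mitigate the train-eval mismatch and the semantic ambiguity of diverse schemas. They also observe that Baichuan2-IEPILE, LLaMA2-IEPILE, and Qwen1.5-IEPILE lag slightly behind GPT-4 on English NER, and hypothesize that this marginal gap may be due to GPT-4's exposure to a vast corpus of similar data during its training.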

inconsistency in the number of schema queries hurt generalization

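This experiment investigates the effect on model performance when different numbers of schema queries are used in training and evaluation. Baichuan2 is trained with full-schema instructions on three datasets: Ontonotes (18 schemas), DuIE2.0 (49 schemas), and ACE2005 (33 schemas). For evaluation, the model is tested with two strategies: one with the full set of schema queries and another with a fixed set of 10 schema queries. The results in Figure 3(a) indicate that a mismatch in the number of schema queries during evaluation significantly reduces model performance. Further analysis of the model outputs reveals that the model always tends to generate output for every query. The hypothesis is that the number of schema queries is one of the key factors affecting generalization ability: the model first needs to adapt to a number of schema inquiries rarely seen during training, and then adapt to the unseen schemas.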

analysis

inadequate differentiation among schemas lead to semantic similar confusion

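The impact of removing the "Hard Negative Schema Dictionary" on Baichuan2-IEPILE is also evaluated, with particular attention to schemas that are hard to differentiate. According to Figure 3(b), the hard negative schema dictionary plays a relatively limited role in the NER task, possibly because of the clear boundaries inherent to entity recognition. However, using the hard negative schema dictionary notably improves model performance on the DuIE2.0 and DuEE1.0 datasets. Semantically similar and easily confused schemas frequently appear in the model's outputs, such as predicting "dismissal" and "resignation" in the event of "layoff". Hence, processing instructions that are prone to semantic confusion poses a significant challenge, and the hard negative schema dictionary plays a crucial role in improving model robustness and prediction accuracy.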

Conclusion and Future Work:

Future work: try to integrate new resources, including open-domain IE and document-level IE.

limitations

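On the data side, the study mainly focuses on schema-based IE, which limits its ability to generalize to human instructions that do not follow the specific format requirements. In addition, the field of Open Information Extraction (OpenIE) is not explored; however, if the schema constraints are removed, the dataset would be suitable for the OpenIE scenario. IEPILE is also confined to English and Chinese data, and the authors hope to include data in more languages in the future.

On the model side, due to limited computational resources, the study only evaluates two models, Baichuan and LLaMA, against some baselines. IEPILE can be applied to any other LLMs, such as ChatGLM and Gemma.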