Introduction to the AMR 3.0 dataset (LDC Catalog No. LDC2020T02)


Abstract Meaning Representation (AMR) Annotation Release 3.0

Item Name: Abstract Meaning Representation (AMR) Annotation Release 3.0
Author(s): Kevin Knight, Bianca Badarau, Laura Baranescu, Claire Bonial, Madalina Bardocz, Kira Griffitt, Ulf Hermjakob, Daniel Marcu, Martha Palmer, Tim O'Gorman, Nathan Schneider
LDC Catalog No.: LDC2020T02
ISBN: 1-58563-915-X
ISLRN: 676-697-177-821-8
DOI:
Release Date: January 15, 2020
Member Year(s): 2020
DCMI Type(s): Text
Data Source(s): broadcast conversation, discussion forum, newswire, web collection, weblogs
Project(s): ACE, BOLT, DEFT, GALE, LORELEI
Application(s): coreference resolution, entity extraction, information extraction, semantic role labelling
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2020T02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Knight, Kevin, et al. Abstract Meaning Representation (AMR) Annotation Release 3.0 LDC2020T02. Web Download. Philadelphia: Linguistic Data Consortium, 2020.
Related Works:
- isVersionOf: LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0; LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
- isAnnotationOf: LDC2007T02 English Chinese Translation Treebank v 1.0; LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
- isSimilarWith: LDC2019T07 Chinese Abstract Meaning Representation 1.0; LDC2021T13 Chinese Abstract Meaning Representation 2.0

Introduction
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.
AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR uses PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax.
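To make this concrete, the classic example from the AMR guidelines annotates "The boy wants to go" as a rooted graph in PENMAN notation, where the boy is both the wanter and the goer (a re-entrancy). The sketch below is a minimal, stdlib-only illustration of reading such a graph into triples; it is not the official tooling (in practice the third-party `penman` Python library is commonly used), and it only handles the simplified notation shown:

```python
import re

def parse_penman(s):
    """Parse a (simplified) PENMAN-notation AMR string into
    (top_variable, triples), where each triple is a
    (source, relation, target) tuple. Re-entrancies appear as a
    bare variable target, e.g. :ARG0 b."""
    tokens = re.findall(r'\(|\)|/|:[^\s()]+|"[^"]*"|[^\s()/]+', s)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == '('
        pos += 1
        var = tokens[pos]; pos += 1          # graph variable, e.g. w
        triples = []
        if tokens[pos] == '/':               # concept: w / want-01
            pos += 1
            triples.append((var, 'instance', tokens[pos]))
            pos += 1
        while tokens[pos] != ')':
            rel = tokens[pos].lstrip(':'); pos += 1
            if tokens[pos] == '(':           # nested node
                child, child_triples = parse_node()
                triples.append((var, rel, child))
                triples.extend(child_triples)
            else:                            # re-entrant variable or constant
                triples.append((var, rel, tokens[pos])); pos += 1
        pos += 1                             # consume ')'
        return var, triples

    return parse_node()

# "The boy wants to go" -- the boy fills :ARG0 of both want-01 and go-02.
amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
top, triples = parse_penman(amr)
print(top)        # w
for t in triples:
    print(t)      # e.g. ('w', 'instance', 'want-01')
```

The PropBank-style frame `want-01` names the predicate sense, and the numbered `:ARG` roles are its core arguments; the same variable `b` appearing twice is what makes the annotation a graph rather than a plain tree.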
LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12), and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program. Source data new to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

| Dataset | Training | Dev | Test | Totals |
| --- | --- | --- | --- | --- |
| BOLT DF MT | 1,061 | 133 | 133 | 1,327 |
| Broadcast conversation | 214 | 0 | 0 | 214 |
| Weblog and WSJ | 0 | 100 | 100 | 200 |
| BOLT DF English | 7,379 | 210 | 229 | 7,818 |
| DEFT DF English | 32,915 | 0 | 0 | 32,915 |
| Aesop fables | 49 | 0 | 0 | 49 |
| Guidelines AMRs | 970 | 0 | 0 | 970 |
| LORELEI | 4,441 | 354 | 527 | 5,322 |
| 2009 Open MT | 204 | 0 | 0 | 204 |
| Proxy reports | 6,603 | 826 | 823 | 8,252 |
| Weblog | 866 | 0 | 0 | 866 |
| Wikipedia | 192 | 0 | 0 | 192 |
| Xinhua MT | 741 | 99 | 86 | 926 |
| Totals | 55,635 | 1,722 | 1,898 | 59,255 |

Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
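The AMR files themselves are plain text: AMRs are separated by blank lines, and each block carries `# ::` metadata comment lines (such as `# ::id` and `# ::snt` for the source sentence) followed by the PENMAN graph. A minimal stdlib-only reader sketch is below; the sample id string is illustrative, not copied from the corpus:

```python
def read_amr_blocks(text):
    """Split the contents of an AMR release file into records.
    Each record is (metadata_dict, graph_string). Blocks are
    separated by blank lines; '# ::key value' lines carry
    metadata and the remaining non-comment lines form the graph."""
    records = []
    for block in text.strip().split('\n\n'):
        meta, graph_lines = {}, []
        for line in block.splitlines():
            if line.startswith('# ::'):
                key, _, value = line[4:].partition(' ')
                meta[key] = value
            elif not line.startswith('#'):
                graph_lines.append(line)
        if graph_lines:
            records.append((meta, '\n'.join(graph_lines)))
    return records

# Illustrative file contents (hypothetical ::id value).
sample = """\
# ::id example_0001.1
# ::snt The boy wants to go.
(w / want-01
      :ARG0 (b / boy)
      :ARG1 (g / go-02
            :ARG0 b))
"""
for meta, graph in read_amr_blocks(sample):
    print(meta['id'], '->', meta['snt'])
```

Reading every file under either `split` or `unsplit` with such a loop should yield the same 59,255 AMRs, differing only in how they are grouped into partitions.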