Introduction to the AMR 3.0 dataset (LDC catalog number LDC2020T02)

Abstract Meaning Representation (AMR) Annotation Release 3.0

Item Name:	Abstract Meaning Representation (AMR) Annotation Release 3.0
Author(s):	Kevin Knight, Bianca Badarau, Laura Baranescu, Claire Bonial, Madalina Bardocz, Kira Griffitt, Ulf Hermjakob, Daniel Marcu, Martha Palmer, Tim O'Gorman, Nathan Schneider
LDC Catalog No.:	LDC2020T02
ISBN:	1-58563-915-X
ISLRN:	676-697-177-821-8
DOI:
Release Date:	January 15, 2020
Member Year(s):	2020
DCMI Type(s):	Text
Data Source(s):	broadcast conversation, discussion forum, newswire, web collection, weblogs
Project(s):	ACE, BOLT, DEFT, GALE, LORELEI
Application(s):	coreference resolution, entity extraction, information extraction, semantic role labelling
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2020T02 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Knight, Kevin, et al. Abstract Meaning Representation (AMR) Annotation Release 3.0 LDC2020T02. Web Download. Philadelphia: Linguistic Data Consortium, 2020.
Related Works:
	isVersionOf	LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
	isVersionOf	LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
	isAnnotationOf	LDC2007T02 English Chinese Translation Treebank v 1.0
	isAnnotationOf	LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
	isSimilarWith	LDC2019T07 Chinese Abstract Meaning Representation 1.0
	isSimilarWith	LDC2021T13 Chinese Abstract Meaning Representation 2.0

Introduction
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10). Specifically, it adds more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.
AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR uses PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax.
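As a rough illustration of "who is doing what to whom", the sketch below encodes a hand-written AMR for the sentence "The boy wants to go" as (source, role, target) triples and prints it in the bracketed PENMAN style commonly used for AMR. The sentence, the variable names, and the frame labels want-01 and go-02 are illustrative assumptions and are not taken from this release; the helper functions are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical AMR for "The boy wants to go".
# Concept labels are illustrative PropBank-style frames, not copied from the corpus.
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-02"}
EDGES = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted: the going event
    ("g", ":ARG0", "b"),  # the boy is also the one who goes
]


def reentrant_nodes(edges):
    """Return variables that have more than one incoming edge."""
    incoming = Counter(target for _, _, target in edges)
    return sorted(var for var, count in incoming.items() if count > 1)


def to_penman(var, instances, edges, seen=None):
    """Serialize the graph rooted at `var`; repeated variables are printed bare."""
    seen = set() if seen is None else seen
    if var in seen:
        return var
    seen.add(var)
    parts = [f"{var} / {instances[var]}"]
    for source, role, target in edges:
        if source == var:
            parts.append(f"{role} {to_penman(target, instances, edges, seen)}")
    return "(" + " ".join(parts) + ")"


print(to_penman("w", INSTANCES, EDGES))
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
print(reentrant_nodes(EDGES))
# ['b']
```

The variable `b` is reused rather than repeated as a new node, which is how within-sentence coreference shows up in the notation: the printed form looks like a tree, while the underlying structure lets one node fill several roles.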
LDC has also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. Source data new to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset	Training	Dev	Test	Totals
BOLT DF MT	1061	133	133	1327
Broadcast conversation	214	0	0	214
Weblog and WSJ	0	100	100	200
BOLT DF English	7379	210	229	7818
DEFT DF English	32915	0	0	32915
Aesop fables	49	0	0	49
Guidelines AMRs	970	0	0	970
LORELEI	4441	354	527	5322
2009 Open MT	204	0	0	204
Proxy reports	6603	826	823	8252
Weblog	866	0	0	866
Wikipedia	192	0	0	192
Xinhua MT	741	99	86	926
Totals	55635	1722	1898	59255
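
The totals row can be re-derived from the per-dataset counts. The following is a minimal sketch that copies the numbers from the table above into a dictionary (the names `COUNTS`, `partition_totals`, and `partition_shares` are made up for this example) and computes the column sums together with each partition's share of the whole release.

```python
# (training, dev, test) counts per dataset, copied from the table above.
COUNTS = {
    "BOLT DF MT": (1061, 133, 133),
    "Broadcast conversation": (214, 0, 0),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (7379, 210, 229),
    "DEFT DF English": (32915, 0, 0),
    "Aesop fables": (49, 0, 0),
    "Guidelines AMRs": (970, 0, 0),
    "LORELEI": (4441, 354, 527),
    "2009 Open MT": (204, 0, 0),
    "Proxy reports": (6603, 826, 823),
    "Weblog": (866, 0, 0),
    "Wikipedia": (192, 0, 0),
    "Xinhua MT": (741, 99, 86),
}


def partition_totals(counts):
    """Sum each of the training/dev/test columns across all datasets."""
    return tuple(sum(column) for column in zip(*counts.values()))


def partition_shares(totals):
    """Express each partition as a percentage of all AMRs, rounded to 0.1."""
    grand_total = sum(totals)
    return tuple(round(100 * t / grand_total, 1) for t in totals)


totals = partition_totals(COUNTS)
print(totals, sum(totals))
# (55635, 1722, 1898) 59255
print(partition_shares(totals))
# (93.9, 2.9, 3.2)
```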

Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
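
This page does not describe the file layout inside the "split" and "unsplit" directories. The sketch below assumes the plain-text convention widely used for AMR corpora: entries separated by blank lines, each with `# ::key value` comment lines (for example `::id` and `::snt`) followed by the graph. That layout, the sample text, and the function name `read_amr_blocks` are assumptions made for illustration and should be checked against the release documentation before use.

```python
import re

# Hypothetical sample in the assumed format; not copied from the corpus.
SAMPLE = """\
# ::id example.1
# ::snt The boy wants to go.
(w / want-01
    :ARG0 (b / boy)
    :ARG1 (g / go-02
        :ARG0 b))

# ::id example.2
# ::snt The boy left.
(l / leave-11
    :ARG0 (b / boy))
"""


def read_amr_blocks(text):
    """Split text into entries, each with a metadata dict and a graph string."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        meta, graph_lines = {}, []
        for line in block.splitlines():
            if line.startswith("#"):
                # Collect every "::key value" pair found on a comment line.
                for key, value in re.findall(r"::(\S+)\s+(.*?)(?=\s+::|$)", line):
                    meta[key] = value
            else:
                graph_lines.append(line)
        if graph_lines:  # skip blocks that contain only comments
            entries.append({"meta": meta, "graph": "\n".join(graph_lines)})
    return entries


entries = read_amr_blocks(SAMPLE)
print(len(entries))
# 2
print(entries[1]["meta"]["snt"])
# The boy left.
```

Under the same assumption, counting the entries returned for each file in a partition would give a quick check against the per-dataset numbers reported for this release.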