Abstract Meaning Representation (AMR) Annotation Release 3.0
| Item Name: | Abstract Meaning Representation (AMR) Annotation Release 3.0 |
|---|---|
| Author(s): | Kevin Knight, Bianca Badarau, Laura Baranescu, Claire Bonial, Madalina Bardocz, Kira Griffitt, Ulf Hermjakob, Daniel Marcu, Martha Palmer, Tim O'Gorman, Nathan Schneider |
| LDC Catalog No.: | LDC2020T02 |
| ISBN: | 1-58563-915-X |
| ISLRN: | 676-697-177-821-8 |
| DOI: | |
| Release Date: | January 15, 2020 |
| Member Year(s): | 2020 |
| DCMI Type(s): | Text |
| Data Source(s): | broadcast conversation, discussion forum, newswire, web collection, weblogs |
| Project(s): | ACE, BOLT, DEFT, GALE, LORELEI |
| Application(s): | coreference resolution, entity extraction, information extraction, semantic role labelling |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2020T02 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Knight, Kevin, et al. Abstract Meaning Representation (AMR) Annotation Release 3.0 LDC2020T02. Web Download. Philadelphia: Linguistic Data Consortium, 2020. |
| Related Works: | isVersionOf: LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0; LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0. isAnnotationOf: LDC2007T02 English Chinese Translation Treebank v 1.0; LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets. isSimilarWith: LDC2019T07 Chinese Abstract Meaning Representation 1.0; LDC2021T13 Chinese Abstract Meaning Representation 2.0. |
Introduction
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10). Specifically, it provides more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.
AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR uses PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax.
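To make the idea of a sentence-level meaning graph concrete, the following minimal Python sketch encodes a small AMR-style graph as (source, role, target) triples and prints it in a bracketed, PENMAN-like form. It is not taken from the release: the example sentence ("The boy wants to go"), the variable names `w`, `b` and `g`, the frame labels and sense numbers, and the helper functions `to_penman` and `reentrant_nodes` are all illustrative assumptions. Only the general ingredients (PropBank-style frames, core roles, within-sentence coreference) come from the description above.

```python
# Illustrative AMR-style graph for "The boy wants to go".
# Variable names and frame sense numbers are assumptions for this sketch.
instances = {"w": "want-01", "b": "boy", "g": "go-02"}

# Each edge is (source variable, role, target variable).
edges = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # the thing wanted is the going event
    ("g", ":ARG0", "b"),  # the same boy is the one who goes (coreference)
]


def reentrant_nodes(edges):
    """Return variables that are the target of more than one edge."""
    counts = {}
    for _src, _role, tgt in edges:
        counts[tgt] = counts.get(tgt, 0) + 1
    return sorted(node for node, n in counts.items() if n > 1)


def to_penman(node, instances, edges, seen=None):
    """Serialize the graph depth-first; a revisited variable is printed bare."""
    seen = set() if seen is None else seen
    if node in seen:
        return node  # refer back to an already-introduced variable
    seen.add(node)
    parts = [f"({node} / {instances[node]}"]
    for src, role, tgt in edges:
        if src == node:
            parts.append(f"{role} {to_penman(tgt, instances, edges, seen)}")
    return " ".join(parts) + ")"


print(to_penman("w", instances, edges))
print(reentrant_nodes(edges))
```

In this sketch the variable `b` is the target of two edges, which is how within-sentence coreference shows up: the bracketed form looks like a tree, but the repeated variable links back to a node that was already introduced.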
LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12), and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).
Data
The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. Source data new to AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.
The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:
| Dataset | Training | Dev | Test | Totals |
|---|---|---|---|---|
| BOLT DF MT | 1061 | 133 | 133 | 1327 |
| Broadcast conversation | 214 | 0 | 0 | 214 |
| Weblog and WSJ | 0 | 100 | 100 | 200 |
| BOLT DF English | 7379 | 210 | 229 | 7818 |
| DEFT DF English | 32915 | 0 | 0 | 32915 |
| Aesop fables | 49 | 0 | 0 | 49 |
| Guidelines AMRs | 970 | 0 | 0 | 970 |
| LORELEI | 4441 | 354 | 527 | 5322 |
| 2009 Open MT | 204 | 0 | 0 | 204 |
| Proxy reports | 6603 | 826 | 823 | 8252 |
| Weblog | 866 | 0 | 0 | 866 |
| Wikipedia | 192 | 0 | 0 | 192 |
| Xinhua MT | 741 | 99 | 86 | 926 |
| Totals | 55635 | 1722 | 1898 | 59255 |
Data in the "split" directory contains 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
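The totals and percentages quoted above can be re-derived from the table. The short Python sketch below copies the per-dataset counts from the table and recomputes the partition totals and shares; the names `counts`, `partition_totals` and `shares` are introduced here for illustration only and do not correspond to anything in the release.

```python
# Per-dataset AMR counts copied from the table: (training, dev, test).
counts = {
    "BOLT DF MT": (1061, 133, 133),
    "Broadcast conversation": (214, 0, 0),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (7379, 210, 229),
    "DEFT DF English": (32915, 0, 0),
    "Aesop fables": (49, 0, 0),
    "Guidelines AMRs": (970, 0, 0),
    "LORELEI": (4441, 354, 527),
    "2009 Open MT": (204, 0, 0),
    "Proxy reports": (6603, 826, 823),
    "Weblog": (866, 0, 0),
    "Wikipedia": (192, 0, 0),
    "Xinhua MT": (741, 99, 86),
}

# Sum each column to get the partition totals.
names = ("training", "dev", "test")
partition_totals = {
    name: sum(row[i] for row in counts.values()) for i, name in enumerate(names)
}
total = sum(partition_totals.values())

# Share of each partition, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in partition_totals.items()}

print(partition_totals)
print(total)
print(shares)
```

Run against the table as given, the column sums match the "Totals" row and the shares match the 93.9%/2.9%/3.2% figures stated above.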