Introduction to the Treebank-3 dataset, catalog number LDC99T42

Treebank-3

Item Name: Treebank-3
Author(s): Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
LDC Catalog No.: LDC99T42
ISBN: 1-58563-163-9
ISLRN: 141-282-691-413-2
Member Year(s): 1999
DCMI Type(s): Text
Data Source(s): telephone speech, newswire, microphone speech, transcribed speech, varied
Project(s): TIDES, GALE
Application(s): parsing, natural language processing, tagging
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC99T42 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999.
Related Works:
isAnnotationOf
  • LDC93T3A TIPSTER Complete
hasAnnotation
  • LDC2002T07 RST Discourse Treebank
  • LDC2004T25 Prague Czech-English Dependency Treebank 1.0
  • LDC2008T23 NomBank v 1.0
  • LDC2009T12 2008 CoNLL Shared Task Data
  • LDC2012T04 2009 CoNLL Shared Task Part 2
  • LDC2012T08 Prague Czech-English Dependency Treebank 2.0
  • LDC2014T27 Benchmarks for Open Relation Extraction
  • LDC2015T08 Coordination Annotation for the Penn Treebank
  • LDC2015T10 RST Signalling Corpus
  • LDC2018T08 2007 CoNLL Shared Task - Arabic & English
hasOutcome
  • LDC2008T24 COMNOM v 1.0
  • LDC2015T13 English News Text Treebank: Penn Treebank Revised
  • LDC2020S09 Global TIMIT Learner Treebank English
isContinuationOf
  • LDC93T1 ACL/DCI
  • LDC95T7 Treebank-2
isSimilarWith
  • LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
  • LDC2009T24 OntoNotes Release 3.0
  • LDC2011T03 OntoNotes Release 4.0
  • LDC2013T19 OntoNotes Release 5.0
  • LDC2018T12 Concretely Annotated New York Times
relatesTo
  • LDC2008T20 PennBioIE CYP 1.0
  • LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0

Introduction
This release contains the following Treebank-2 material:

  • One million words of 1989 Wall Street Journal material annotated in Treebank II style.
  • A small sample of ATIS-3 material annotated in Treebank II style.
  • A fully tagged version of the Brown Corpus.

and the following new material:

  • Switchboard tagged, dysfluency-annotated, and parsed text
  • Brown parsed text

The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
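To make the idea of extracting predicate/argument structure from bracketed text concrete, here is a minimal sketch in plain Python. It is not part of the corpus tooling. The sample tree is a made-up illustration in the general shape of Treebank II bracketing (it is not taken from the corpus), and the helper names `parse_bracketed`, `leaves`, and `find_children` are hypothetical. The sketch assumes a single tree whose root carries a label; it does not handle an extra unlabeled outer pair of parentheses around the tree.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Pad parentheses with spaces so a plain split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        label = tokens[pos]          # constituent label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing parenthesis
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def find_children(node, prefix):
    """Return direct child constituents whose label starts with prefix."""
    return [c for c in node[1]
            if not isinstance(c, str) and c[0].startswith(prefix)]


# Illustrative sentence only; an assumption, not corpus data.
SAMPLE = "(S (NP-SBJ (NNP John)) (VP (VBD loved) (NP (NNP Mary))) (. .))"

tree = parse_bracketed(SAMPLE)
subject = find_children(tree, "NP-SBJ")[0]   # subject argument
vp = find_children(tree, "VP")[0]
verb = find_children(vp, "VB")[0]            # predicate
obj = find_children(vp, "NP")[0]             # object argument

print(leaves(verb), leaves(subject), leaves(obj))
```

Run on the sample, this prints the verb, subject, and object word lists: `['loved'] ['John'] ['Mary']`. Real annotation also contains empty elements and richer function tags, which this sketch does not interpret.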
Data
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, packaged in a compressed file (pennTB_tipster_wsj_map.tar.gz), are available as an additional download for users who have licensed Treebank-2. They give the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
Samples
Please view the following samples:

  • Part-of-Speech Tags
  • Dysfluency Annotation
  • Dysfluency Annotation & Part-of-Speech Tags
  • Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
  • Syntactic Annotation
  • Syntactic Annotation & Part-of-Speech Tags

Updates
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, see the addenda, which list the files available.
As of October 5, 2016, 252 WSJ files from Treebank-2 that were previously missing were added.
As of February 2017, 2,499 "raw" WSJ files from Treebank-2 (LDC95T7) were added.
Corpus downloads after these dates include these previously missing files.