Treebank-3
| Item Name: | Treebank-3 |
|---|---|
| Author(s): | Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor |
| LDC Catalog No.: | LDC99T42 |
| ISBN: | 1-58563-163-9 |
| ISLRN: | 141-282-691-413-2 |
| Member Year(s): | 1999 |
| DCMI Type(s): | Text |
| Data Source(s): | telephone speech, newswire, microphone speech, transcribed speech, varied |
| Project(s): | TIDES, GALE |
| Application(s): | parsing, natural language processing, tagging |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC99T42 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999. |
| Related Works: | isAnnotationOf: LDC93T3A TIPSTER Complete. hasAnnotation: LDC2002T07 RST Discourse Treebank; LDC2004T25 Prague Czech-English Dependency Treebank 1.0; LDC2008T23 NomBank v 1.0; LDC2009T12 2008 CoNLL Shared Task Data; LDC2012T04 2009 CoNLL Shared Task Part 2; LDC2012T08 Prague Czech-English Dependency Treebank 2.0; LDC2014T27 Benchmarks for Open Relation Extraction; LDC2015T08 Coordination Annotation for the Penn Treebank; LDC2015T10 RST Signalling Corpus; LDC2018T08 2007 CoNLL Shared Task - Arabic & English. hasOutcome: LDC2008T24 COMNOM v 1.0; LDC2015T13 English News Text Treebank: Penn Treebank Revised; LDC2020S09 Global TIMIT Learner Treebank English. isContinuationOf: LDC93T1 ACL/DCI; LDC95T7 Treebank-2. isSimilarWith: LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1; LDC2009T24 OntoNotes Release 3.0; LDC2011T03 OntoNotes Release 4.0; LDC2013T19 OntoNotes Release 5.0; LDC2018T12 Concretely Annotated New York Times. relatesTo: LDC2008T20 PennBioIE CYP 1.0; LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0 |
Introduction
This release contains the following Treebank-2 material:
- One million words of 1989 Wall Street Journal material annotated in Treebank II style.
- A small sample of ATIS-3 material annotated in Treebank II style.
- A fully tagged version of the Brown Corpus.
and the following new material:
- Switchboard tagged, dysfluency-annotated, and parsed text.
- Brown parsed text.
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
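As a rough illustration of what extracting predicate/argument structure from the bracketing can look like, the sketch below parses one bracketed string and pulls out the constituent marked as subject. It is a minimal standard-library example, not the project's own tooling: the sample sentence is invented for illustration rather than taken from the corpus, and the helper names (`parse_bracketed`, `words`, `find_by_function`) are hypothetical. It assumes the Treebank II conventions of function tags such as `NP-SBJ` and the `-NONE-` label for null elements.

```python
import re

def parse_bracketed(text):
    """Parse one bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        # Trees in the files are often wrapped in an outer unlabeled bracket.
        if tokens[pos] == "(":
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under its POS tag
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()

def words(node):
    """Return the surface words of a subtree, skipping -NONE- null elements."""
    label, children = node
    out = []
    for child in children:
        if isinstance(child, str):
            if label != "-NONE-":
                out.append(child)
        else:
            out.extend(words(child))
    return out

def find_by_function(node, prefix):
    """Collect every subtree whose label starts with the given prefix."""
    label, children = node
    found = [node] if label.startswith(prefix) else []
    for child in children:
        if not isinstance(child, str):
            found.extend(find_by_function(child, prefix))
    return found

# Invented sample sentence, for illustration only.
SAMPLE = ("(S (NP-SBJ (DT The) (NN board)) "
          "(VP (VBD approved) (NP (DT the) (NN merger))) (. .))")

tree = parse_bracketed(SAMPLE)
subject = find_by_function(tree, "NP-SBJ")[0]
verb = find_by_function(tree, "VB")[0]
print("subject:", " ".join(words(subject)))
print("predicate:", " ".join(words(verb)))
```

Matching on a label prefix keeps the sketch short; a fuller treatment would also need to follow trace indices to recover arguments that have been moved or left empty.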
Data
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2.
Samples
Please view the following samples:
- Part-of-Speech Tags
- Dysfluency Annotation
- Dysfluency Annotation & Part-of-Speech Tags
- Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
- Syntactic Annotation
- Syntactic Annotation & Part-of-Speech Tags
Updates
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please see the addenda for a list of the files available.
As of October 5, 2016, 252 WSJ files from Treebank-2 that were previously missing were added.
As of February 2017, 2,499 "raw" WSJ files from Treebank-2 (LDC95T7) were added.
Corpus downloads after these dates will include these missing files.