Introduction to the Chinese Treebank 9.0 Dataset, Catalog No. LDC2016T13


Chinese Treebank 9.0

Item Name: Chinese Treebank 9.0
Author(s): Nianwen Xue, Xiuhong Zhang, Zixin Jiang, Martha Palmer, Fei Xia, Fu-Dong Chiou, Meiyu Chang
LDC Catalog No.: LDC2016T13
ISBN: 1-58563-757-2
ISLRN: 219-696-236-485-2
DOI:
Release Date: June 15, 2016
Member Year(s): 2016
DCMI Type(s): Text
Data Source(s): newswire, news magazine, broadcast conversation, broadcast news, weblogs, discussion forum, text chat conversations, telephone conversations
Project(s): BOLT, GALE
Application(s): syntactic parsing, information extraction, machine translation, linguistic analysis
Language(s): Chinese, Mandarin Chinese
Language ID(s): zho, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2016T13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Xue, Nianwen, et al. Chinese Treebank 9.0 LDC2016T13. Web Download. Philadelphia: Linguistic Data Consortium, 2016.
Related Works:
isVersionOf: LDC2001T11 Chinese Treebank 2.0; LDC2004T05 Chinese Treebank 4.0; LDC2005T01 Chinese Treebank 5.0; LDC2007T36 Chinese Treebank 6.0; LDC2010T07 Chinese Treebank 7.0
hasAnnotation: LDC2021T07 BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
relatesTo: LDC2020T09 BOLT English Translation Treebank - Chinese Discussion Forum; LDC2021T19 BOLT English Translation Treebank - Chinese SMS/Chat

Introduction
Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech.
The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project's goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11), which consisted of approximately 100,000 words. In 2004, LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words. A year later, LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T07), released in 2010, added new annotated newswire data, broadcast material and web text, bringing the total to approximately one million words. Chinese Treebank 8.0 (LDC2013T21) included new annotated data from newswire, magazine articles and government documents. Chinese Treebank 9.0 adds more annotated web data and two new genres: chat messages and transcribed conversational telephone speech.

Data
There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words, and 3,247,331 characters (hanzi or foreign). The data is encoded in UTF-8, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
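To make the relationship between the bracketed and POS-tagged formats concrete, the following is a minimal Python sketch of reading one Penn Treebank-style bracketed tree and recovering its (word, tag) pairs. The sample sentence is made up for illustration and is not taken from the corpus; the function names `tokenize`, `parse` and `tagged_words` are hypothetical helpers, not part of any LDC tooling. The sketch assumes that empty categories are labeled `-NONE-`, as in Penn Treebank-style annotation, and it does not handle any extra markup that real files may carry around the trees.

```python
import re

# Hypothetical example tree, written for illustration only (not corpus data).
SAMPLE = "( (IP (NP-SBJ (NR 上海)) (VP (VV 发展) (NP-OBJ (NN 经济))) (PU 。)) )"


def tokenize(text):
    """Split a bracketed string into '(', ')' and label/word tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = ""
        # The outermost wrapper bracket has no label of its own.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ')'
        return [label] + children

    return node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the leaves, left to right."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Skip empty categories such as traces (assumed label: -NONE-).
        return [] if label == "-NONE-" else [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs


if __name__ == "__main__":
    for word, tag in tagged_words(parse(tokenize(SAMPLE))):
        print(f"{word}_{tag}")
```

Joining only the words from the result gives the segmented form of the sentence, and concatenating them without spaces gives the raw text, which is how the four formats relate to one another in this simplified view.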