AMiner: Academic Social Network


Dataset URL: www.aminer.cn/aminernetwo…

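1. Basic dataset information

The data covers paper information, paper citation information, author information, and author collaboration information.

AMiner-Paper.rar contains 2,092,356 papers and 8,024,869 citations.
AMiner-Author.zip contains 1,712,433 author records.
AMiner-Coauthor.zip contains 4,258,615 collaboration relationships.
AMiner-Author2Paper.zip stores the relationships between author ids and paper ids.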

2. Data description

The dataset consists of four files:

2.1 AMiner-Paper.rar

2.1.1 Fields and meanings

This file stores the paper information and the citation network, in the following format:

Field    Meaning
#index   index id of this paper
#*       paper title
#@       authors (separated by semicolons)
#o       affiliations (separated by semicolons; each affiliation corresponds to an author, in order)
#t       year
#c       publication venue
#%       id of a paper referenced by this paper (one reference per line, so there may be multiple lines)
#!       abstract

2.1.2 Example

#index 1083734
#* ArnetMiner: extraction and mining of academic social networks
#@ Jie Tang;Jing Zhang;Limin Yao;Juanzi Li;Li Zhang;Zhong Su
#o Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;IBM, Beijing, China;IBM, Beijing, China
#t 2008
#c Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
#% 197394
#% 220708
#% 280819
#% 387427
#% 464434
#% 643007
#% 722904
#% 760866
#% 766409
#% 769881
#% 769906
#% 788094
#% 805885
#% 809459
#% 817555
#% 874510
#% 879570
#% 879587
#% 939393
#% 956501
#% 989621
#% 1117023
#% 1250184
#! This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper, we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.
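The record layout above can be read with a small line-oriented parser. The following is a minimal sketch, not an official loader: the function names `split_semicolons` and `parse_papers` and the output dictionary keys are hypothetical, and it assumes each record starts with a `#index` line and that every field sits on a single line, as in the example.

```python
from typing import Any, Dict, Iterable, Iterator, List


def split_semicolons(value: str) -> List[str]:
    """Split a semicolon-separated field, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


def parse_papers(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per paper record (hypothetical key names)."""
    record: Dict[str, Any] = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("#index"):
            # A new #index line closes the previous record, if any.
            if record:
                yield record
            record = {"id": line[len("#index"):].strip(), "references": []}
        elif not record:
            continue  # ignore anything before the first #index line
        elif line.startswith("#*"):
            record["title"] = line[2:].strip()
        elif line.startswith("#@"):
            record["authors"] = split_semicolons(line[2:])
        elif line.startswith("#o"):
            record["affiliations"] = split_semicolons(line[2:])
        elif line.startswith("#t"):
            text = line[2:].strip()
            record["year"] = int(text) if text.isdigit() else None
        elif line.startswith("#c"):
            record["venue"] = line[2:].strip()
        elif line.startswith("#%"):
            ref = line[2:].strip()
            if ref:  # each #% line carries one reference id
                record["references"].append(ref)
        elif line.startswith("#!"):
            record["abstract"] = line[2:].strip()
    if record:
        yield record
```

Because `parse_papers` is a generator, it can be fed an open file object directly, so the whole file never has to be held in memory.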

2.2 AMiner-Author.zip

2.2.1 Fields and meanings

This file stores the author information, in the following format:

Field    Meaning
#index   index id of this author
#n       name (separated by semicolons)
#a       affiliations (separated by semicolons)
#pc      number of papers published by this author
#cn      total number of citations of this author
#hi      H-index of this author
#pi      P-index with equal A-index of this author
#upi     P-index with unequal A-index of this author
#t       research interests of this author (separated by semicolons)

Note: for the concepts of P-index and A-index, see [J. Stallings et al., Determining scientific impact using a collaboration index].
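For reference, the `#hi` field holds the author's H-index, which in its standard definition is the largest h such that the author has h papers with at least h citations each. The sketch below illustrates that definition only; the function name `h_index` is hypothetical, and how the stored values in this dataset were actually computed is not stated in the source.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h
```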

2.2.2 Example

#index 1488277
#n Juanzi Li
#a Tsinghua University;Department of Computer Science & Technology, Tsinghua, University, Beijing, China 100084
#pc 70
#cn 370
#hi 9
#pi 76.3254
#upi 73.7573
#t semantic web;social network;Semantic Annotation;ontology caching;semantic information;knowledge base
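An author record like the one above can be parsed with a tag-to-field lookup table. This is a minimal sketch under stated assumptions: the names `AUTHOR_FIELDS` and `parse_authors` and the output keys are hypothetical, a space is assumed to separate each tag from its value (as in the example), and the count fields are assumed to be plain integers.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def _split(value: str) -> List[str]:
    """Split a semicolon-separated field, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Tag -> (hypothetical output key, converter for the raw string value).
AUTHOR_FIELDS: Dict[str, Tuple[str, Callable[[str], Any]]] = {
    "#index": ("id", str),
    "#n": ("names", _split),
    "#a": ("affiliations", _split),
    "#pc": ("paper_count", int),
    "#cn": ("citation_count", int),
    "#hi": ("h_index", int),
    "#pi": ("p_index_equal_a", float),
    "#upi": ("p_index_unequal_a", float),
    "#t": ("interests", _split),
}


def parse_authors(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per author record."""
    record: Dict[str, Any] = {}
    for raw in lines:
        # The tag is everything up to the first space; the rest is the value.
        tag, _, value = raw.strip().partition(" ")
        if tag not in AUTHOR_FIELDS:
            continue
        if tag == "#index" and record:
            yield record
            record = {}
        key, convert = AUTHOR_FIELDS[tag]
        value = value.strip()
        if value:  # leave the key out when the field is empty
            record[key] = convert(value)
    if record:
        yield record
```

Splitting on the first space, rather than testing prefixes, keeps tags such as `#pc` and `#pi` from being confused with one another.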

2.3 AMiner-Coauthor.zip

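This file stores the collaboration network among authors, in the following format:

2.3.1 Fields and meanings

For example: 00 11 22

  • 00 is the id of one author
  • 11 is the id of the other author
  • 22 is the number of times they have collaborated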

2.3.2 Example

#693708 1658058 2
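Lines in this format map naturally onto a weighted, undirected graph. The sketch below is one way to load them into a nested dictionary; the function name `load_coauthor_graph` is hypothetical, and it assumes whitespace-separated columns with an optional leading `#`, as in the example line.

```python
from collections import defaultdict
from typing import Dict, Iterable


def load_coauthor_graph(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Build a symmetric adjacency map: graph[a][b] = number of collaborations."""
    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for raw in lines:
        # Drop the leading '#' seen in the example, then split the three columns.
        parts = raw.strip().lstrip("#").split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip blank or malformed lines
        first, second, count = parts[0], parts[1], int(parts[2])
        # Collaboration is mutual, so store the edge in both directions.
        graph[first][second] = count
        graph[second][first] = count
    return graph
```

Storing each edge in both directions doubles the memory used but makes it cheap to look up all coauthors of a given author.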

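2.4 AMiner-Author2Paper.zip

This file stores the relationships between author ids and paper ids.

2.4.1 Fields and meanings

The first column is a row index, the second is the author id, the third is the paper id, and the fourth is the author's position in the paper's author list (first author, second author, third author, and so on).

2.4.2 Example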

1 381617 1 1
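The four columns make it possible to recover the ordered author list of each paper. The following is a minimal sketch: the function name `authors_by_paper` is hypothetical, and whitespace-separated columns are assumed, as in the example row.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def authors_by_paper(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each paper id to its author ids, ordered by author position."""
    rows: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for raw in lines:
        parts = raw.split()
        if len(parts) != 4 or not parts[3].isdigit():
            continue  # skip blank or malformed lines
        _row_index, author_id, paper_id, position = parts
        rows[paper_id].append((int(position), author_id))
    # Sort by position so the first author comes first.
    return {
        paper_id: [author_id for _, author_id in sorted(entries)]
        for paper_id, entries in rows.items()
    }
```

Ids are kept as strings here so they can be matched directly against ids parsed from the other files.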