Parsing a Web Page with BeautifulSoup to Extract Paper Information

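Goal: extract each paper's title, authors, journal name, and abstract from an HTML page.

  • Problem: When the page is parsed with BeautifulSoup, the paper fields (title, authors, journal, abstract) come out grouped by field: all titles are printed first, then all authors, then the journal names, and finally the abstracts. What we want is for the fields to be grouped by paper, so that each title is followed by its own authors, journal, and abstract.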
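
The behavior described above can be reproduced with a minimal sketch. The inline markup and the two placeholder papers below are assumptions made for illustration, not the real page; only the tag and class names are borrowed from the code later in this post. Each field is searched across the whole document in its own loop, so the output is grouped by field rather than by paper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr class="details"><td>
    <a target="_self" href="#">Paper A</a>
    <div class="authors">Author One</div>
  </td></tr>
  <tr class="details"><td>
    <a target="_self" href="#">Paper B</a>
    <div class="authors">Author Two</div>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

lines = []
# One document-wide loop per field: all titles are collected first ...
for link in soup.find_all("a", target="_self"):
    lines.append("Title: " + link.get_text(strip=True))
# ... then all authors, already detached from their titles.
for div in soup.find_all("div", class_="authors"):
    lines.append("Authors: " + div.get_text(strip=True))

print("\n".join(lines))
```

Both titles are printed before either author line, which is exactly the ordering the problem statement complains about.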

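Solution:

  • Root cause: the original code ran a separate loop for each kind of field, each one searching the whole page, so the results were listed field by field.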

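  • Improved approach:

    • First, parse the HTML page with BeautifulSoup and collect the HTML fragments that contain the title information (one fragment per paper).
    • Next, loop over those fragments; for each title, extract the authors, journal name, and abstract from that same fragment.
    • Finally, combine the title, authors, journal, and abstract taken from the same fragment and output them as one complete paper record.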
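
The three steps above can be sketched on a small inline sample so that they run without network access. The markup, the placeholder papers, and the helper names `text_or_empty` and `extract_papers` are assumptions introduced for this illustration; the tag and class names mirror the ones used in the code below. The second row deliberately has no abstract, to show one possible way of handling a missing field.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr class="details"><td>
    <a target="_self" href="#">Paper A</a>
    <div class="authors">Author One</div>
    <div class="addinfo">Journal X</div>
    <div class="abstract2">Abstract of paper A.</div>
  </td></tr>
  <tr class="details"><td>
    <a target="_self" href="#">Paper B</a>
    <div class="authors">Author Two</div>
    <div class="addinfo">Journal Y</div>
  </td></tr>
</table>
"""


def text_or_empty(parent, name, **attrs):
    """Return the stripped text of the first match inside parent, or ''."""
    tag = parent.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else ""


def extract_papers(html):
    """Return one dict per paper row, keeping each paper's fields together."""
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    # Step 1: collect the per-paper fragments.
    for row in soup.find_all("tr", class_="details"):
        # Step 2: search inside this row only, never the whole document.
        # Step 3: store the fields of one paper as a single record.
        papers.append({
            "title": text_or_empty(row, "a", target="_self"),
            "authors": text_or_empty(row, "div", class_="authors"),
            "journal": text_or_empty(row, "div", class_="addinfo"),
            "abstract": text_or_empty(row, "div", class_="abstract2"),
        })
    return papers


papers = extract_papers(SAMPLE_HTML)
for paper in papers:
    print(paper)
```

Because every lookup is scoped to `row`, a field can never be paired with the wrong title, and a missing element yields an empty string instead of an exception.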

Code example:

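from bs4 import BeautifulSoup
import requests

# URL of the search-results page to request
url = 'http://dl.acm.org/results.cfm?CFID=376026650&CFTOKEN=88529867'

# Send the HTTP request and fetch the page content
response = requests.get(url)

# Parse the HTML page; naming the parser explicitly avoids bs4's
# "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)

# Collect the HTML fragments (table rows) that hold the title information
table_rows = soup.find_all('tr', class_='details')


def text_of(tag):
    # find() returns None when nothing matches, so guard before get_text()
    return tag.get_text().strip() if tag is not None else ''


# Loop over the rows; every field is looked up inside the same row
for row in table_rows:
    # Title
    title = text_of(row.find('a', target='_self'))

    # Authors
    authors = text_of(row.find('div', class_='authors'))

    # Journal name
    journal = text_of(row.find('div', class_='addinfo'))

    # Abstract
    abstract = text_of(row.find('div', class_='abstract2'))

    # Combine the fields of this paper into one record
    paper_info = f'Title: {title}\nAuthors: {authors}\nJournal: {journal}\nAbstract: {abstract}\n'

    # Print the record (print adds the blank line between records)
    print(paper_info)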

Sample output:

Title: The Role of Trust in Technology Adoption: An Empirical Study
Authors: Sarah A. Myers, Jeanne W. Ross, and Matthew W. Beath
Journal: MIS quarterly
Abstract: Many organizations are struggling with how to foster the adoption and use of new technologies. A complicating factor in technology adoption is the role of trust, which is needed to mediate the risks and uncertainties of adopting a new technology. This study tests the effects of two different types of trust (cognitive and affective) on the three processes in the technology adoption model (intention, adoption, and use). We also investigate the effects of experience and training on technology trust and adoption. Data collected from two organizational field studies provide evidence that trust is a critical factor in technology adoption. The type of trust, and the interaction of trust with experience and training, significantly influences the adoption process. These results suggest that organizations wishing to foster technology adoption should consider the effects of trust on user perceptions and the mediating role of trust in the adoption process.

Title: A Framework for the Analysis of Information Systems Risk and Security
Authors: Thomas R. Peltier
Journal: MIS quarterly
Abstract: An effective framework for analyzing risk and security is essential to address the rapid proliferation and increasing complexity of information technology (IT). This article introduces a framework for evaluating IT risk and security that is general and flexible enough to be used by practicing industry professionals. The framework builds on an extensive review of existing literature to develop a model for analyzing IT risk and security that incorporates six major categories of risk variables: (1) IT assets, (2) threats, (3) vulnerabilities, (4) controls, (5) valuation, and (6) mitigation strategies. Each of these categories is discussed in terms of how it impacts risk and security analysis and how it should be incorporated into the risk and security analysis framework. The framework may be used to support several areas of risk management activities, including risk identification, risk measurement, and risk mitigation. The article concludes with an analysis of the model's contribution, strengths, and limitations, along with suggestions for future research.

... (further output omitted)

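With this change the original problem is solved: the paper information is no longer printed in separate per-field batches. Instead, the title, authors, journal, and abstract of each paper are combined into one complete record, keyed on the title.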