30 Days of Python 👨‍💻 - Day 23 - Web Scraping

Basic concepts

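Web scraping is a technique, or concept, of extracting data from a website by crawling it. It is mainly used to collect meaningful data from websites, especially when no API is available to provide that data. Today I explored the basics of web scraping with Python, and I would like to share my experience.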

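A scraper is a script that lets us automatically extract large amounts of unstructured data from websites and organize it in a structured form for many different purposes, such as collecting emails, product prices, stock prices, flight data, or other relevant information. Doing this by hand would take a lot of time and effort. Python has some amazing libraries that make web scraping quite simple and fun. I mainly explored the most basic and popular one, Beautiful Soup, to get familiar with the concept.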

Getting started

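Web scraping is very powerful, and there is a lot of debate about its use. Most websites have a robots.txt file that states which URLs may be crawled and which may not. The file is a set of instructions for search engine bots such as Googlebot, the Yahoo bot, and Bingbot, telling them which pages they should crawl for search engine optimization. Search engine crawlers, in effect, scrape data from websites and rank the pages by their relevant keywords. However, a website cannot completely stop a scraper from crawling its data even if it disallows crawling in robots.txt. It is good, ethical practice to read a site's robots.txt file, if one exists, and scrape only the URLs it permits, to avoid any data breach issues.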

Scraping with Beautiful Soup

For today's session, I decided to try scraping data from Hacker News, a very popular site in the developer community. These are the rules defined in its robots.txt file.

User-Agent: * 
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30

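So we are allowed to scrape data from the news page news.ycombinator.com/newest, which lists the latest articles from the development world. My goal is to scrape the first five pages and pull out the articles with at least 100 points, along with their links. This is handy for automatically fetching highly upvoted articles and reading them from the terminal, without visiting the Hacker News website and hunting for popular posts by hand.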
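
The check described above can also be done in code rather than by eye. The following is a minimal sketch using the standard library's urllib.robotparser, assuming the rules are pasted in as a string so that it runs without network access. The /vote?id=1 URL is a made-up example of a disallowed path, not one taken from the site.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, pasted in so the example needs no network access.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The page we want to scrape, and a hypothetical URL under a disallowed prefix.
newest_allowed = parser.can_fetch('*', 'https://news.ycombinator.com/newest')
vote_allowed = parser.can_fetch('*', 'https://news.ycombinator.com/vote?id=1')

# The delay (in seconds) the site asks crawlers to wait between requests.
delay = parser.crawl_delay('*')

print(newest_allowed, vote_allowed, delay)
```

Against a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().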

First, two libraries need to be installed: requests for making HTTP requests and beautifulsoup4 for parsing the website.

Install commands:
pip install requests
pip install beautifulsoup4

hacker_news_scraper.py

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
response = requests.get(BASE_URL)
# extract the text content of the web page
response_text = response.text
# parse HTML
soup = BeautifulSoup(response_text, 'html.parser')
print(soup.prettify()) # prints the html content in a readable format

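The page at www.crummy.com/software/Be… covers the various uses of Beautiful Soup. With the browser's element inspector, we can look up the selectors of elements and use them to fetch data. In this case, every article link has the class storylink, and each has an associated element with the class score. These selectors can now be used to fetch the respective data and combine it.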

# extract all the links using the class selector
links_list = soup.select('.storylink')

# extract all the points using the class selector
points_list = soup.select('.score')

After looping over the links, the associated title, link, and points can be combined into a dictionary object and appended to a list of popular posts.

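Note that the enumerate function is used to get the index of each element so that the corresponding points can be looked up, because the points are not contained in the links container.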

Only articles with at least 100 points are added to the popular list.

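# create an empty list to collect the popular posts
popular_posts = []

# loop through all the links
for idx, link in enumerate(links_list):
    # fetch the title of the post
    post_title = link.get_text()
    # fetch the link of the post
    post_href = link.get('href')
    # fetch the point text using the index of the link
    # and convert it to an integer
    post_points = int(points_list[idx].get_text().replace(' points', ''))
    # append to popular posts as a dictionary object if points is at least 100
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})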
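
One detail of the conversion step is worth a closer look: stripping ' points' and calling int() raises a ValueError for text such as '1 point', or for an empty string. As a possible alternative, here is a small sketch that pulls out the leading number with a regular expression. The parse_points helper and the sample strings are hypothetical and are not part of the original script.

```python
import re


def parse_points(score_text):
    """Return the leading integer in text such as '123 points', or 0 if there is none."""
    match = re.match(r'\s*(\d+)', score_text)
    return int(match.group(1)) if match else 0


# Hypothetical sample values standing in for the text of each .score element.
sample_scores = ['123 points', '1 point', '', '99 points', '100 points']

points = [parse_points(text) for text in sample_scores]

# Apply the same threshold as the loop above: keep scores of at least 100.
popular = [p for p in points if p >= 100]

print(points)
print(popular)
```

Falling back to 0 means a post with missing or unreadable points never passes the 100-point threshold.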

Python has a very useful built-in module, pprint, that prints data to the console in a more readable format.

import pprint

It can then be used to view the popular list.

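# create an empty list to collect the popular posts
popular_posts = []

# loop through all the links
for idx, link in enumerate(links_list):
    # fetch the title of the post
    post_title = link.get_text()
    # fetch the link of the post
    post_href = link.get('href')
    # fetch the point text using the index of the link
    # and convert it to an integer
    post_points = int(points_list[idx].get_text().replace(' points', ''))
    # append to popular posts as a dictionary object if points is at least 100
    if post_points >= 100:
        popular_posts.append(
            {'title': post_title, 'link': post_href, 'points': post_points})

pprint.pprint(popular_posts) # prints in a more readable format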
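
To see what pprint does without running the scraper, here is a small standalone sketch. The two entries are made-up sample data shaped like the dictionaries built above. Note that pprint sorts dictionary keys by default; passing sort_dicts=False (available from Python 3.8) keeps them in insertion order.

```python
import pprint

# Hypothetical sample entries shaped like the dictionaries built above.
popular_posts = [
    {'title': 'Example post A', 'link': 'https://example.com/a', 'points': 250},
    {'title': 'Example post B', 'link': 'https://example.com/b', 'points': 120},
]

# Default behaviour: keys are sorted alphabetically (link, points, title).
default_text = pprint.pformat(popular_posts, width=60)

# sort_dicts=False keeps the order in which the keys were inserted.
ordered_text = pprint.pformat(popular_posts, sort_dicts=False, width=60)

print(default_text)
print(ordered_text)
```

pformat returns the formatted string instead of printing it, which is convenient when the output should go to a log or a file.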

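The script above only fetches the popular posts from the first page of Hacker News. However, per the stated goal, we need the list from the first five pages, or from any given number of pages. So the script needs to be modified accordingly.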

Here is the final script for scraping the popular posts. The code can also be found in this GitHub repository: github.com/arindamdawn…

import requests
from bs4 import BeautifulSoup
import pprint
import time

BASE_URL = 'https://news.ycombinator.com'
# response = requests.get(BASE_URL)

def get_lists_and_points(soup):
    # extract all the links using the class selector
    links_list = soup.select('.storylink')

    # extract all the points using the class selector
    points_list = soup.select('.score')

    return (links_list, points_list)

def parse_response(response):
    # extract the text content of the web page
    response_text = response.text
    # parse HTML
    soup = BeautifulSoup(response_text, 'html.parser')
    return soup

def get_paginated_data(pages):
    total_links_list = []
    total_points_list = []
    for page in range(pages):
        URL = BASE_URL + f'?p={page+1}'
        response = requests.get(URL)
        soup = parse_response(response)
        links_list, points_list = get_lists_and_points(soup)
        for link in links_list:
            total_links_list.append(link)
        for point in points_list:
            total_points_list.append(point)
        # add 30 seconds delay as per hacker news robots.txt rules
        time.sleep(30)
    return (total_links_list, total_points_list)

def generate_popular_posts(links_list, points_list):
    # create an empty popular posts list
    popular_posts = []

    # loop through all links
    for idx, link in enumerate(links_list):
        # fetch the title of the post
        post_title = link.get_text()
        # fetch the link of the post
        post_href = link.get('href')
        # fetch the point text using the index of the link
        # convert the point to integer
        # if points data is not available, assign it a default of 0
        try:
            post_points = int(
                points_list[idx].get_text().replace(' points', ''))
        except (IndexError, ValueError):
            post_points = 0
        # append to popular posts as a dictionary object if points is at least 100
        if post_points >= 100:
            popular_posts.append(
                {'title': post_title, 'link': post_href, 'points': post_points})
    return popular_posts

def sort_posts_by_points(posts):
    return sorted(posts, key=lambda x: x['points'], reverse=True)

def main():
    total_links_list, total_points_list = get_paginated_data(5)
    popular_posts = generate_popular_posts(total_links_list, total_points_list)
    sorted_posts = sort_posts_by_points(popular_posts)
    # print posts sorted by highest to lowest
    pprint.pprint(sorted_posts)

if __name__ == '__main__':
    main()

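With this script, we no longer even need to visit Hacker News and search for popular posts. We can run the script from the console and get the latest news delivered. Feel free to tweak the script to your needs and experiment with it, or try scraping data from a website of your own choice.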

There is a lot we can do with the data above, for example:

Create an API to use the data in a web application
Analyze trends using keywords
Build a news aggregator website, and more
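
As a rough illustration of the keyword idea, the titles in the scraped list could be fed through a word counter. This is only a sketch: the titles, the stop-word list, and the keyword_counts helper are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical titles standing in for the 'title' values in the scraped list.
titles = [
    'Show HN: A tiny Python web scraper',
    'Why Rust is replacing C in our stack',
    'Python 3 packaging explained',
    'Scraper etiquette: respecting robots.txt',
]

# A deliberately small stop-word list; a real analysis would need a fuller one.
STOP_WORDS = {'a', 'is', 'in', 'our', 'the', 'why', 'hn', 'show'}


def keyword_counts(titles):
    """Count lowercase words across all titles, skipping the stop words."""
    words = []
    for title in titles:
        for word in re.findall(r'[a-z]+', title.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return Counter(words)


counts = keyword_counts(titles)
print(counts.most_common(5))
```

Running the scraper on a schedule and comparing the counts over time would give a simple picture of which topics are trending.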

Popular scraping libraries

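Beautiful Soup has its limitations when it comes to scraping data from websites. It is very simple to use, but for complex websites that are rendered on the client side (Angular, React), the HTML markup is not yet available when the page loads, so there is nothing for it to scrape. Here are some popular Python libraries and frameworks.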

lxml
Selenium
Scrapy - a complete framework for web scraping

References

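Web scraping is a vast field. With Beautiful Soup, we have probably just scratched the surface. There are a whole lot of possibilities in this domain, and I will explore more of them when looking into data analysis with Python. Hopefully, I have covered the basic concepts needed for further exploration.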

Tomorrow I will go over the concepts of web development with Python.

Original article

30 Days of Python 👨‍💻 - Day 23 - Web Scraping
