
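A few days ago a follower asked me to scrape news articles from the Southern Weekly (《南方周末》) website.

The requirements are not complicated and are similar to my earlier People's Daily and Jiefang Daily crawlers.

Without further ado, let's get started.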

1. Analyzing the website

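The Southern Weekly website is at: www.infzm.com/contents?te…

Looking at the home page, we can see a channel list on the left and a news list in the middle.

Clicking through the channels on the left changes the term_id value in the browser's address bar accordingly, so the term_id parameter is the channel id.

Scrolling down keeps loading more articles while the URL in the address bar never changes, which means the news list uses infinite-scroll (waterfall) loading and the data is fetched dynamically via Ajax.

With that quick look done, open the developer tools and switch to the Network tab to inspect the requests.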

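1.1 News list analysis

As the page scrolls down, new requests keep appearing.

The request URL looks like: www.infzm.com/contents?te…

The response content is shown in the figure:

At this point we know this is the data endpoint for the news list we are looking for.

Look at the endpoint URL: www.infzm.com/contents?te…

It has 3 parameters: term_id, page and format.

As analyzed above, term_id is the channel id. Going by their names, page is the page number and format is the data format.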
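
To make the three parameters concrete, here is a minimal sketch that assembles the list URL from them. The helper name build_list_url and the constant BASE_URL are made up for illustration; the path and parameter names are the ones seen in the captured request.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.infzm.com/contents"  # path as seen in the captured request


def build_list_url(term_id, page, fmt="json"):
    """Return the list-endpoint URL for one channel and one page."""
    query = urlencode({"term_id": term_id, "page": page, "format": fmt})
    return f"{BASE_URL}?{query}"


print(build_list_url(1, 2))
```

Building the query with urlencode rather than string concatenation keeps the URL valid if a parameter value ever contains special characters.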

The response is standard JSON. The article list is under data -> contents and includes each article's title, id, author name, publish time and other fields.
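
As a rough illustration of walking that structure, the sketch below parses a made-up payload. Only the data -> contents nesting and the id, subject and publish_time keys (the ones this article's parser reads) come from the real response; the sample values, the timestamp format and the helper name extract_items are assumptions.

```python
import json

# Hypothetical sample payload: the nesting and key names follow the article,
# the values are invented for demonstration.
sample = '''
{"data": {"contents": [
    {"id": 1001, "subject": "Sample headline A", "publish_time": "2021-11-01 08:00:00"},
    {"id": 1002, "subject": "Sample headline B", "publish_time": "2021-11-02 09:30:00"}
]}}
'''


def extract_items(text):
    """Return a list of (id, subject, publish_time) tuples from a list response."""
    payload = json.loads(text)
    # .get() with defaults tolerates a response that lacks data or contents
    contents = payload.get("data", {}).get("contents", [])
    return [(c["id"], c["subject"], c["publish_time"]) for c in contents]


for item in extract_items(sample):
    print(item)
```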

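1.2 News detail page analysis

Open the detail page of any article, for example: www.infzm.com/contents/21… .

The detail page link is built as http://www.infzm.com/contents/ + article id.

The developer tools show that the article body is rendered directly in the HTML source.

As shown in the figure, the article content sits in the <div class="nfzm-content__content"> tag. The lead-in quote is in the <blockquote class="nfzm-bq"> tag, and the body text is in the p tags under <div class="nfzm-content__fulltext">.

The page structure looks roughly like this: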

<div class="nfzm-content__content">
    <blockquote class="nfzm-bq">Lead-in quote</blockquote>
    <div class="nfzm-content__fulltext">
        <p>First paragraph</p>
        <p>Second paragraph</p>
        <p>Third paragraph</p>
    </div>
</div>
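
To see how that structure maps to extracted text, here is a dependency-free sketch that uses the standard library's html.parser instead of BeautifulSoup (a deliberate swap so the snippet runs without installing anything). The class name ArticleParser and the sample HTML are made up; only the tag and class names come from the page structure.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<div class="nfzm-content__content">
    <blockquote class="nfzm-bq">Lead-in quote</blockquote>
    <div class="nfzm-content__fulltext">
        <p>First paragraph</p>
        <p>Second paragraph</p>
    </div>
</div>
"""


class ArticleParser(HTMLParser):
    """Collect the blockquote text and each <p> text inside the fulltext div."""

    def __init__(self):
        super().__init__()
        self.quote = ""
        self.paragraphs = []
        self._in_quote = False
        self._fulltext_depth = 0  # > 0 while inside the fulltext div
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "blockquote" and "nfzm-bq" in classes:
            self._in_quote = True
        elif tag == "div":
            # Track nesting so the matching closing </div> can be recognised
            if self._fulltext_depth or "nfzm-content__fulltext" in classes:
                self._fulltext_depth += 1
        elif tag == "p" and self._fulltext_depth:
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self._in_quote = False
        elif tag == "div" and self._fulltext_depth:
            self._fulltext_depth -= 1
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_quote:
            self.quote += data
        elif self._in_p:
            self.paragraphs[-1] += data


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
print(parser.quote)
print(parser.paragraphs)
```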

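1.3 Anti-scraping analysis

Let's write a small piece of Python to test the site's anti-scraping measures.

1.3.1 News list

Set some basic browser-like headers and send the request.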

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents?term_id=1&page=2&format=json"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

The data comes back normally.

1.3.2 News body

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents/217973"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

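The article body can also be fetched successfully.

However, not every article can be scraped without obstacles: some show only part of the text, and the full text is visible only after logging in.

And after I registered an account and refreshed the page, it turned out that reading the full text also requires a paid membership.

I won't subscribe for now.

If you need those articles, you can subscribe yourself and then put your logged-in cookies into the headers in the code before crawling.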

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Cookie': "your own cookie"
}

You can copy the Cookie value from the request headers shown in the developer tools.
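
As an alternative to pasting the raw string into headers, the copied value can be split into a dict. The sketch below is only an illustration: parse_cookie_header is a made-up helper name, and the cookie names and values are placeholders, not the site's real cookies.

```python
def parse_cookie_header(raw):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, since values may contain '=' themselves
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical cookie string; real names and values come from your own session.
raw_cookie = "session_id=abc123; user_token=xyz=="
print(parse_cookie_header(raw_cookie))
```

The resulting dict can be passed to requests through the cookies argument, for example requests.get(url, headers=headers, cookies=cookies).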

2. Writing the code

Now let's write the actual crawler.

First, import the libraries this crawler needs.

import requests
import json
from bs4 import BeautifulSoup
import os

Next is the network request function, fetchUrl.

def fetchUrl(url):
    '''
    Purpose: request the page at url and return its content
    Parameter: url of the target page
    Returns: the response text (HTML or JSON), or None if the request fails
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    try:
        # The timeout keeps a stalled connection from hanging the crawler
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        return None

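The function that parses the news list, parseNewsList.

def parseNewsList(html):
    '''
    Purpose: parse the news list response and yield its entries one by one
    Parameter: the list data (JSON text)
    Yields: each article's id, title and publish time
    '''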
    try:
        jsObj = json.loads(html)
        contents = jsObj["data"]["contents"]
        for cnt in contents:
            pid = cnt["id"]
            subject = cnt["subject"]
            publish_time = cnt["publish_time"]
            yield pid, subject, publish_time

    except Exception as e:
        print("parseNewsList error!")
        print(e)

The function that parses the article body, parseNewsContent.

def parseNewsContent(html):
    '''
    Purpose: parse the news detail page and return the article body
    Parameter: page source (HTML)
    Returns: the article body as a string ("" if parsing fails)
    '''
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        cntDiv = bsObj.find("div", attrs={"class": "nfzm-content__content"})
        blockQuote = cntDiv.find("blockquote", attrs={"class": "nfzm-bq"})
        fulltextDiv = cntDiv.find("div", attrs={"class": "nfzm-content__fulltext"})
        pList = fulltextDiv.find_all("p")

        # Lead-in quote first (if present), then every non-empty paragraph
        ret = blockQuote.text + "\n" if blockQuote else ""
        ret += "\n".join([p.text for p in pList if len(p.text) > 1])
        return ret

    except Exception as e:
        print("parseNewsContent error!")
        print(e)
        # Return an empty string so callers can still concatenate the result
        return ""

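The file-saving function, saveFile.

def saveFile(path, filename, content):
    '''
    Purpose: save the article content to a local file
    Parameters: path (target directory), filename, content (text to save)
    '''
    # Create the directory if it does not exist yet
    os.makedirs(path, exist_ok=True)
    # os.path.join works whether or not path ends with a separator
    with open(os.path.join(path, filename), 'w', encoding='utf-8') as f:
        f.write(content)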

The crawler dispatcher, download_nfzm.

def download_nfzm(termId, page, savePath):
    '''
    Purpose: crawl every article on page `page` of channel termId and save them under savePath
    Parameters: termId   channel id
                page     page number
                savePath directory to save into
    '''
    url = f"http://www.infzm.com/contents?term_id={termId}&page={page}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            # Fall back to "" so a failed fetch or parse does not abort the whole page
            content = parseNewsContent(fetchUrl(pLink)) or ""
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"

            saveFile(savePath, filename, content)

    except Exception as e:
        print("download_nfzm Error")
        print(e)
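
The file name in the dispatcher is derived from the date part of publish_time. A small sketch of that step follows. It assumes publish_time looks like "2021-11-05 10:30:00" (the exact format is not shown in this article), and make_filename is a hypothetical helper name.

```python
def make_filename(pid, publish_time):
    """Build '<date>-<id>.txt' from a 'YYYY-MM-DD HH:MM:SS' style timestamp."""
    # Keep only the part before the first space, i.e. the date
    date = publish_time.split(" ")[0]
    # Replace characters that are awkward in file names on some systems
    date = date.replace("/", "-").replace(":", "-")
    return f"{date}-{pid}.txt"


print(make_filename(217973, "2021-11-05 10:30:00"))
```

Including the article id in the name keeps two articles published on the same day from overwriting each other.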

Finally, the main block, which starts the crawler.

if __name__ == '__main__':
    # Program entry point
    beginPage = 1
    endPage = 10
    term_id = 1

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, 'infzm_News/')

    print("Crawl finished")
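
The main block crawls a fixed range of pages. A variant that keeps going until a page comes back empty could be sketched as below. It assumes the endpoint returns an empty contents list past the last page, which this article does not verify; fetch_page is a hypothetical callable standing in for a fetch-and-parse step, and max_pages is a safety cap.

```python
import time


def crawl_until_empty(fetch_page, start_page=1, max_pages=50, delay=0.0):
    """Yield items page by page until a page is empty or max_pages is reached."""
    for page in range(start_page, start_page + max_pages):
        items = fetch_page(page)
        if not items:
            break  # assumed end-of-list signal
        yield from items
        if delay:
            time.sleep(delay)  # pause between pages to go easy on the server


# Stand-in data source for demonstration: three pages, then nothing.
fake_pages = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
result = list(crawl_until_empty(lambda p: fake_pages.get(p, [])))
print(result)
```

Passing the fetch step in as a function keeps the loop testable without touching the network.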

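3. Keyword filtering

Some readers may want to crawl only the articles matching a keyword rather than everything.

So I tried the site's search feature.

I captured the requests on the search results page the same way as before.

It turns out the keyword search request takes the same parameters as the earlier list endpoint plus one new parameter, k, and is sent to the /search path:

http://www.infzm.com/search?term_id=&page=2&k=%E7%BB%8F%E6%B5%8E&format=json

Here %E7%BB%8F%E6%B5%8E is the URL-encoded keyword 经济 ("economy").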
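
The encoding can be reproduced with the standard library, which is a quick way to check a keyword before putting it in a URL. This is a minimal sketch; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

keyword = "经济"
encoded = quote(keyword)   # percent-encode the keyword's UTF-8 bytes
print(encoded)             # %E7%BB%8F%E6%B5%8E
print(unquote(encoded))    # 经济
```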

So, with small changes to the main block and the download_nfzm function from the code above, the plain news crawler becomes one that filters by keyword.

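from urllib.parse import quote

def download_nfzm(termId, page, kw, savePath):
    '''
    Purpose: crawl the articles on page `page` of channel termId that match keyword kw, and save them under savePath
    Parameters: termId   channel id
                page     page number
                kw       search keyword
                savePath directory to save into
    '''
    # The /search path matches the captured search request; the keyword is URL-encoded.
    # This assumes the search response uses the same data -> contents layout as the list endpoint.
    url = f"http://www.infzm.com/search?term_id={termId}&page={page}&k={quote(kw)}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            # Fall back to "" so a failed fetch or parse does not abort the whole page
            content = parseNewsContent(fetchUrl(pLink)) or ""
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"

            saveFile(savePath, filename, content)

    except Exception as e:
        print("download_nfzm Error")
        print(e)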

if __name__ == '__main__':
    # Program entry point
    beginPage = 1
    endPage = 10
    term_id = 1
    kw = "经济"

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, kw, 'infzm_News/')

    print("Crawl finished")

4. Results

Run the code and crawl the first 10 pages as a test.

Run output

The saved article files

The scraped article content

If anything in this article is unclear or incorrect, feel free to point it out in the comments so we can learn and improve together.

Python Crawler in Practice: Scraping Southern Weekly News Articles