A Detailed Look at an Automatic Parsing Library

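In an earlier post I said I would write about an automatic parsing library, so here I use the 划水摸鱼 (huashuimoyu) site as a case study. There are plenty of automatic parsing libraries around these days. The one I use most is the GeneralNewsExtractor class from gne, a module that focuses on parsing news-type pages, and its extraction completeness on news data is fairly high.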

First, open the site and look at how it is structured. Site address:

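The first thing you see is a hot-list page. Start by capturing the traffic to see whether there is a data API behind it: open the developer tools with F12 and refresh the page.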

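The data is in the HTML itself, so the list page has to be parsed to get it. Next, look at the detail page by picking an article and opening it. Well... how to put this, haha: for this source the detail-page data is not in the HTML. So switch to another source by changing the list-page link to www.huashuimoyu.com/detail/52, then open an article page from there.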

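At first glance there seems to be no page data here, but if you copy the HTML out into a standalone .html file and open that in a browser, the data is there.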
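The check described above can be scripted rather than done by hand. The following is a minimal sketch, assuming the page source is already available as a string; `save_html` is a hypothetical helper name, and the demo uses a placeholder document instead of a real response.

```python
import tempfile
from pathlib import Path


def save_html(html, path):
    """Write the raw page source to a local file so it can be opened in a browser."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Placeholder HTML standing in for the fetched page source (assumption for the demo).
demo_html = "<html><body><p>hello</p></body></html>"
demo_dir = Path(tempfile.mkdtemp())
saved = save_html(demo_html, demo_dir / "detail.html")
print(saved)  # open this file in a browser to see what the raw source contains
```

In a real run you would pass the text of the fetched page instead of the placeholder string.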

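OK, this page is reasonably standard, so it can be scraped. First, request the list page and use XPath to pull the title and article link out of each row in the list.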

import requests
from lxml import etree

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

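# Request the list page and parse it with lxml
response = requests.get('https://www.huashuimoyu.com/detail/52', headers=headers)
res = etree.HTML(response.text)
# Each hot-list entry is an <a rel="nofollow"> inside the table
data_list = res.xpath("//table[@class='table']//tr/td/a[@rel='nofollow']")
for data in data_list:
    texts = data.xpath("./text()")
    hrefs = data.xpath("./@href")
    if not texts or not hrefs:
        continue  # skip anchors that have no title text or no link
    title, url = texts[0], hrefs[0]
    print(title, url)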
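Before requesting the detail pages, it can help to tidy up the extracted pairs. The sketch below is an assumption-laden illustration, not part of the original workflow: it assumes some `href` values might be relative or repeated, and `normalize_links` and the sample rows are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.huashuimoyu.com/detail/52"  # the list page requested above


def normalize_links(rows, base_url=BASE_URL):
    """Turn (title, href) pairs into unique (title, absolute_url) pairs."""
    seen = set()
    cleaned = []
    for title, href in rows:
        title = (title or "").strip()
        href = (href or "").strip()
        if not title or not href:
            continue  # nothing useful to request
        absolute = urljoin(base_url, href)  # leaves absolute URLs unchanged
        if absolute in seen:
            continue  # the same article listed twice
        seen.add(absolute)
        cleaned.append((title, absolute))
    return cleaned


# Hypothetical rows standing in for what the XPath loop prints.
sample = [
    (" Example headline ", "/article/1"),
    ("Another headline", "https://example.com/news/2"),
    ("", "/article/3"),
    ("Example headline again", "/article/1"),
]
links = normalize_links(sample)
for title, url in links:
    print(title, url)
```

Whether the real list page needs this step depends on what its `href` attributes actually contain.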

Then request the detail page to get its source.

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

# `url` is one of the article links collected from the list page
response = requests.get(url, headers=headers)
print(response.text)

OK, no problems there. The next step is to use the automatic parsing library to extract the article title and body.

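from gne import GeneralNewsExtractor

# noise_node_list drops the comment section so it is not mistaken for body text
extractor = GeneralNewsExtractor()
result = extractor.extract(response.text, noise_node_list=['//div[@class="comment-list"]'])
print(result)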
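To judge how well an extraction went, a quick look at the returned fields is usually enough. This is a minimal sketch, assuming the result is a dict with `title` and `content` keys (as gne's documentation describes); `summarize_result` and the sample dict are hypothetical and only stand in for a real result.

```python
def summarize_result(result, preview_chars=80):
    """Condense an extraction result into a few fields that are easy to eyeball."""
    title = (result.get("title") or "").strip()
    content = (result.get("content") or "").strip()
    return {
        "title": title,
        "length": len(content),
        "preview": content[:preview_chars],
        "ok": bool(title and content),  # both a title and a body were found
    }


# Hypothetical result standing in for the output of extractor.extract(...).
sample_result = {
    "title": " Example headline ",
    "author": "",
    "publish_time": "",
    "content": "First paragraph.\nSecond paragraph.",
    "images": [],
}
summary = summarize_result(sample_result)
print(summary)
```

An empty title or a very short body is a hint that the page layout did not suit the extractor.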

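Does it still work when an article's layout is less standard? Decide for yourselves whether it fits your case. (In my experience the results on other sites have been about the same, haha.)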

1. Later I will post a walkthrough of a 36Kr (36氪) crawler to see what its data really looks like.

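2. For sites where automatic parsing succeeds most of the time, building an automatic crawler is a reasonable option when you want quantity rather than quality. I will write that up too when I get the chance.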
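As a rough idea of what such a quantity-first crawler could look like, here is a minimal sketch under stated assumptions. `crawl` is a hypothetical function; the fetching and extraction steps are passed in as callables so that the loop can be tried with stand-ins, and the demo pages and URLs are made up.

```python
def crawl(links, fetch, extract, min_chars=50):
    """Fetch and extract each (title, url) pair, keeping whatever parses acceptably."""
    articles = []
    for list_title, url in links:
        try:
            html = fetch(url)
            result = extract(html)
        except Exception as exc:  # quantity over quality: log the failure and move on
            print(f"skip {url}: {exc!r}")
            continue
        content = (result.get("content") or "").strip()
        if len(content) < min_chars:
            continue  # too little body text, probably a failed extraction
        articles.append({
            "url": url,
            "title": result.get("title") or list_title,
            "content": content,
        })
    return articles


# Stand-ins for the real network and parser calls (assumptions for the demo).
fake_pages = {
    "https://example.com/a": "<html>long article</html>",
    "https://example.com/b": "<html>short</html>",
}


def fake_fetch(url):
    return fake_pages[url]  # raises KeyError for unknown URLs


def fake_extract(html):
    if "long" in html:
        return {"title": "Long article", "content": "x" * 120}
    return {"title": "Short", "content": "too short"}


demo_links = [
    ("A", "https://example.com/a"),
    ("B", "https://example.com/b"),
    ("C", "https://example.com/missing"),
]
articles = crawl(demo_links, fake_fetch, fake_extract)
print(len(articles))
```

With the real libraries, `fetch` could wrap `requests.get(url, headers=headers).text` and `extract` could be the `extract` method of a `GeneralNewsExtractor` instance. The `min_chars` threshold is an arbitrary value to tune per site.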