Scraping the top news headlines from China.com (中华网)

1. Without further ado, let's get straight to it. Site address: https://news.china.com/

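1.2 This is the block of data we want to collect: the top news list. First capture the network traffic and see where the data actually comes from.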

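1.3 The data sits right in the HTML. There is no API endpoint and there are no request parameters, so this is an easy one. First, copy the request as curl and convert it to requests code.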
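To make the curl-to-requests step concrete, here is a minimal sketch of what such a conversion does with the headers. The helper name `curl_to_headers` and the sample command are made up for illustration, and it only understands the URL and `-H` flags. It is not the converter used in this post.

```python
import shlex


def curl_to_headers(curl_command):
    """Return (url, headers) parsed from a simple `curl URL -H 'k: v'` command.

    Hypothetical helper: only -H/--header is understood; other flags that
    take a value (for example -X POST) are not handled by this sketch.
    """
    tokens = shlex.split(curl_command)
    url = None
    headers = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            # Split "name: value" at the first colon only
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip().lower()] = value.strip()
            i += 2
        elif not token.startswith("-") and url is None:
            url = token
            i += 1
        else:
            i += 1  # ignore flags this sketch does not understand
    return url, headers


# Shortened, hand-written stand-in for a command copied from the browser
sample = (
    "curl 'https://news.china.com/' "
    "-H 'accept-language: zh-CN,zh;q=0.9' "
    "-H 'cache-control: no-cache' --compressed"
)
url, headers = curl_to_headers(sample)
print(url)
print(headers)
```

The resulting dict is the kind of thing that gets passed as `headers=` to `requests.get`.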

1.4 There are no cookies either, so we can take the converted code as-is and send the request.

1.5 The page source comes back garbled because the encoding is wrong. Setting response.encoding = response.apparent_encoding fixes it.
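Why the text comes back garbled: one common cause is that requests falls back to ISO-8859-1 when a text response does not declare a charset, while `apparent_encoding` guesses the charset from the body bytes instead. The snippet below reproduces the effect with the standard library only. It assumes the bytes are UTF-8, which is an assumption about the page, not something verified here.

```python
# Bytes as they might arrive over the wire (assumed UTF-8 for this illustration)
raw = "中华网新闻要闻".encode("utf-8")

# Decoding with the wrong charset does not raise; it silently produces mojibake
garbled = raw.decode("iso-8859-1")

# Decoding with the charset that matches the bytes restores the text
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting `response.encoding` before reading `response.text` has the same effect as choosing the right charset in the second `decode` call.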

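2. The source now comes back correctly. The next step is to locate the top news data. Here we use xpath to grab the list of top news items first.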

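from lxml import etree
# Parse the decoded page and select every headline <h3> in the top news list
res = etree.HTML(response.text)
data_list = res.xpath("//ul[@class='item_list']/li/h3")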

2.1 Loop over the list to get the title and URL of every article.
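The loop can be sketched offline like this. To keep it runnable without the network, it swaps in `xml.etree.ElementTree` from the standard library and a hand-written snippet in place of lxml and the live page, so the markup, the relative link and the `extract_headlines` name are all assumptions. It also skips entries without a link and resolves relative links, which the real page may or may not need.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://news.china.com/"

# Hand-written stand-in for the list markup; the real page may differ
SAMPLE = """
<div>
  <ul class="item_list">
    <li><h3><a href="/domestic/1.html">First headline</a></h3></li>
    <li><h3><a href="https://news.china.com/world/2.html">Second headline</a></h3></li>
    <li><h3>No link here</h3></li>
  </ul>
</div>
"""


def extract_headlines(markup, base_url=BASE_URL):
    """Return a list of (title, absolute_url) pairs from the headline list."""
    root = ET.fromstring(markup.strip())
    headlines = []
    for h3 in root.findall(".//ul[@class='item_list']/li/h3"):
        link = h3.find("a")
        if link is None or not link.get("href"):
            continue  # skip entries without a usable link instead of crashing
        title = (link.text or "").strip()
        headlines.append((title, urljoin(base_url, link.get("href"))))
    return headlines


for title, url in extract_headlines(SAMPLE):
    print(title, url)
```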

2.2 Then add another function that requests the article detail page in the same way.

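        # called inside the loop from 2.1, once per article URL
        get_conent(url)

def get_conent(url):
    headers = {}  # request headers omitted here

    response = requests.get(url, headers=headers)
    # fix the encoding the same way as for the list page
    response.encoding = response.apparent_encoding
    res = etree.HTML(response.text)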

2.3 Next we pull out the article body, again with xpath. A quick test shows no problems.
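One thing worth knowing about a `//text()` query is that it returns every text node, including whitespace-only ones between tags. Below is a small sketch of tidying such a list before printing. The helper name `clean_text_nodes` and the sample fragments are made up for illustration.

```python
def clean_text_nodes(fragments):
    """Strip whitespace from text nodes and drop the ones that end up empty."""
    cleaned = (fragment.strip() for fragment in fragments)
    return [text for text in cleaned if text]


# Made-up fragments of the kind a //text() query tends to return
sample_fragments = [
    "\n    ",
    "First paragraph.",
    "\u3000\u3000Second paragraph.\n",  # full-width indent, common in Chinese pages
    "  ",
    "Third.",
]
print(clean_text_nodes(sample_fragments))
```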

3. Polish the code a bit: add a prompt asking whether to read each article, plus a way to stop reading partway through.
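The stop-reading idea can be sketched on its own, separate from the scraping. In this version the input and output functions are parameters, so the loop can be driven by scripted answers. The `read_article` name and its parameters are assumptions for illustration, not the code in this post.

```python
def read_article(paragraphs, ask=input, show=print):
    """Show paragraphs one by one; stop early when the reader enters '2'.

    Returns the number of paragraphs actually shown.
    """
    shown = 0
    for paragraph in paragraphs:
        if ask().strip() == "2":
            break
        show(paragraph)
        shown += 1
    return shown


# Scripted run: Enter, Enter, then "2" to quit before the third paragraph
answers = iter(["", "", "2"])
output = []
count = read_article(["p1", "p2", "p3"], ask=lambda: next(answers), show=output.append)
print(count, output)
```

Called with the defaults, `read_article(paragraphs)` waits for Enter before each paragraph, just like the interactive version.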

3.1 Run it and see how it looks.

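4. It works fine, but there is one drawback: the images in the article body are not shown. Parsing the images by hand is fairly tedious, and so is parsing the body text, so I'll post an automatic-parsing version another day. Finally, here is the main part of the code.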

import requests
from lxml import etree


def get_data():
    headers = {
        'authority': 'news.china.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    }

    response = requests.get('https://news.china.com/', headers=headers)
    response.encoding = response.apparent_encoding
    res = etree.HTML(response.text)
    data_list = res.xpath("//ul[@class='item_list']/li/h3")
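    for data in data_list:
        # each <h3> wraps one <a> holding the headline text and the article link
        title = data.xpath("./a/text()")[0]
        url = data.xpath("./a/@href")[0]
        print(title)
        # anything other than '1' skips to the next headline
        yes = input('Read this article? (1 = yes, 2 = no): ')
        if yes == '1':
            print('****** Enter 2 to stop reading this article ******')
            get_conent(url)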


def get_conent(url):
    headers = {
        'authority': 'news.china.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    }

    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    res = etree.HTML(response.text)
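    # every text node under the article's <p> tags
    conent_list = res.xpath("//div[@class='article_content']/p//text()")
    for con in conent_list:
        # press Enter for the next piece of text, or enter 2 to stop reading
        yes = input()
        if yes == '2':
            return
        print(con)


if __name__ == '__main__':
    get_data()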
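On the image drawback mentioned in step 4, one possible direction is to at least collect the image URLs from the article container so they can be printed or opened separately. The sketch below uses only the standard library's `html.parser`. The class name `ArticleImageCollector`, the sample markup and the sample URLs are assumptions, and the real pages may structure their images differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleImageCollector(HTMLParser):
    """Collect <img src> values that appear inside div.article_content."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0  # <div> nesting depth inside the article container; 0 = outside
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "article_content" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "img" and self.depth > 0:
            src = attrs.get("src")
            if src:
                # resolve relative and protocol-relative links against the page URL
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hand-written sample; only the first two images are inside the article container
sample_html = (
    '<div class="article_content"><p>text</p>'
    '<div><img src="/a.jpg"></div>'
    '<img src="//img.example.com/b.png"></div>'
    '<img src="/outside.jpg">'
)
collector = ArticleImageCollector("https://news.china.com/x/1.html")
collector.feed(sample_html)
print(collector.images)
```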