
Continued

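As mentioned in the previous post, a direct request sometimes fails to return the page source. Why? The server detects that the request does not come from a browser, so it returns no data. To get around this, we need to add request headers when making the request. You can see the headers a real browser sends by pressing F12 in Chrome and inspecting the request.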

It is best to include the headers boxed in red.

self.headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Host": "mapi.yiche.com",
    "Pragma": "no-cache",
    "Referer": "https://car.yiche.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
}

These are the request headers I added to simulate a browser request.
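One detail worth checking in a header set copied from the browser is that Host matches the site actually being requested. The following is a minimal sketch of that idea, not the original code: the names BASE_HEADERS and headers_for are hypothetical and introduced only for illustration, and the sketch assumes the Host value should simply be the host part of the target URL.

```python
from urllib.parse import urlsplit

# Browser-like headers shared by every request (values taken from the article).
# BASE_HEADERS is a hypothetical name used only in this sketch.
BASE_HEADERS = {
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://car.yiche.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
}


def headers_for(url, base=BASE_HEADERS):
    """Return a copy of `base` with Host set to the host part of `url`."""
    headers = dict(base)  # copy, so the shared dict is never mutated
    headers["Host"] = urlsplit(url).netloc
    return headers
```

With requests, the result could then be passed as requests.get(url, headers=headers_for(url)).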

Parsing

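Once we have the data, we parse it into an element tree and query it with XPath. Using XPath means working out where the elements we need sit in the document, and formatting the returned data first makes that analysis easier. (Note: the elements shown in the browser console sometimes differ from what the request actually returns, which is why we may fail to get the data we need.) Here I have put this site's retrieval in …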

init = (html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/div/text()'.format(i)))[0]
brandlist = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/a/div/text()'.format(i))
dataindexlist = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/a/div/@data-id'.format(i))
image = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/img/@data-original'.format(i))

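For example, div[7] selects the seventh div child of its parent element (XPath positions start at 1). Following the path one level at a time like this is how we reach the elements we need.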
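To see how positional steps resolve, here is a small sketch on made-up markup. Everything in SAMPLE is invented for illustration and is not the real page. To stay dependency-free it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so text and attributes are read with .text and .get() rather than text() and @attr as in the lxml calls above. The function name parse_group is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; NOT the real page structure.
SAMPLE = """
<html><body>
  <div id="nav">menu</div>
  <div id="brands">
    <div class="group"><div class="letter">A</div>
      <span><a href="#"><div data-id="9">Audi</div></a><img data-original="audi.png"/></span>
    </div>
    <div class="group"><div class="letter">B</div>
      <span><a href="#"><div data-id="3">BMW</div></a><img data-original="bmw.png"/></span>
      <span><a href="#"><div data-id="2">Benz</div></a><img data-original="benz.png"/></span>
    </div>
  </div>
</body></html>
"""

ROOT = ET.fromstring(SAMPLE.strip())


def parse_group(root, i):
    """Return (letter, names, ids, images) for the i-th group, or None."""
    # div[2] is the second div child of body; div[i] is the i-th div child of that.
    group = root.find("body/div[2]/div[{}]".format(i))
    if group is None:  # the position does not exist in this document
        return None
    letter = group.find("div").text                  # first div child: the letter
    brand_divs = group.findall("*/a/div")            # any child, then a, then div
    names = [d.text for d in brand_divs]
    ids = [d.get("data-id") for d in brand_divs]
    images = [img.get("data-original") for img in group.findall("*/img")]
    return letter, names, ids, images
```

Checking for None before reading from a group avoids an exception when the page has fewer groups than expected.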

Full source code

Here is the full source code for you to study.

import requests
from lxml import etree


class carBrand:
    def __init__(self):
        self.cf = 0

        # Headers copied from a browser request so the server treats us as a browser
        self.headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Host": "mapi.yiche.com",
            "Pragma": "no-cache",
            "Referer": "https://car.yiche.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        }

    def get_brand(self):
        print("Scraping car information")
        # Send the browser-like headers, minus "Host": the value above names
        # mapi.yiche.com, and requests fills in the correct Host from the URL.
        headers = {k: v for k, v in self.headers.items() if k != "Host"}
        resp = requests.get('https://car.yiche.com/xuanchegongju/', headers=headers)
        resp.encoding = 'utf-8'
        # Parse the response text into an element tree for XPath queries
        html = etree.HTML(resp.text)

        # Walk groups 1 to 26 of the brand list
        for i in range(1, 27):
            # [0] raises IndexError if the path matches nothing in the response
            init = (html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/div/text()'.format(i)))[0]
            brandlist = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/a/div/text()'.format(i))
            dataindexlist = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/a/div/@data-id'.format(i))
            image = html.xpath('/html/body/div[7]/div[1]/div[2]/div[1]/div[{}]/*/img/@data-original'.format(i))

        resp.close()


if __name__ == '__main__':
    c = carBrand()
    c.get_brand()
    print("YES")
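The loop in get_brand produces four values per group (init, brandlist, dataindexlist, image) but does not combine them. One possible way to pair them up is sketched below. The function name build_records and the dictionary keys are hypothetical choices for illustration, and the sketch assumes the three lists line up position by position, which the source does not confirm.

```python
def build_records(initial, names, ids, images):
    """Pair up the parallel lists from one group into one dict per brand."""
    records = []
    # zip stops at the shortest list, so a missing image or id silently
    # drops the trailing entries instead of raising an error.
    for name, brand_id, image in zip(names, ids, images):
        records.append({
            "initial": initial.strip(),
            "name": name.strip(),
            "id": brand_id,
            "image": image,
        })
    return records
```

Inside the loop this could be called as build_records(init, brandlist, dataindexlist, image), with the returned lists collected into one overall list.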
