Scraping a car website with Node and puppeteer


Core package: puppeteer-core (puppeteer-core does not ship with a browser, so Chrome must already be installed on the local machine)
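Because puppeteer-core needs an explicit path to the local Chrome binary, a small lookup can supply the executablePath used in the launch call. The following is only a sketch: the helper name resolveChromePath, the CHROME_PATH environment variable, and the listed install locations are assumptions for illustration and may not match a given machine.

```javascript
// Common default install locations. These are assumptions and may differ per machine.
const DEFAULT_CHROME_PATHS = {
  win32: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
  darwin: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
  linux: "/usr/bin/google-chrome",
};

// Hypothetical helper: returns a Chrome path for the given platform.
// CHROME_PATH is a made-up override variable for this sketch.
function resolveChromePath(platform = process.platform, env = process.env) {
  if (env.CHROME_PATH) return env.CHROME_PATH;
  const found = DEFAULT_CHROME_PATHS[platform];
  if (!found) {
    throw new Error(`No default Chrome path known for platform: ${platform}`);
  }
  return found;
}
```

The result could then be passed as executablePath to puppeteer.launch.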

Steps

1. Launch the browser automatically and navigate to the target site

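// puppeteer comes from "puppeteer-core"; executablePath and URL are defined elsewhere
const browser = await puppeteer.launch({
  // Path to the locally installed Chrome binary
  executablePath,
  // false shows the browser window; true runs without a visible window
  headless: false,
});

const page = await browser.newPage();
await page.goto(URL);
await page.setViewport({ width: 1080, height: 1024 });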

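2. Collect all vehicle categories from the DOM

Use F12 (DevTools) to find the class name of the outer DOM elements, then run a script in the page with puppeteer's page.evaluate to read each category's href and display name. The code is as follows: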

const carTypes = await page.evaluate(() => {
  // Each matching element is one category link
  const elements = document.getElementsByClassName(
    "type-list_whole__ErdRL"
  );
  const types = [];
  for (let i = 0; i < elements.length; i++) {
    const element = elements[i];
    const href = element.getAttribute("href");
    // The second span inside the link holds the category name
    const name = element.querySelector("span:nth-child(2)").textContent;
    types.push({
      href,
      name,
    });
  }
  return types;
});
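The href values collected here are later appended to the base URL with plain string concatenation, which only works if the base has no trailing slash and every href is a relative path. A slightly more forgiving sketch uses the standard URL constructor instead. The helper name resolveCategoryUrl and the example.com addresses are hypothetical.

```javascript
// Hypothetical helper: resolves a scraped href against the site's base URL.
// Handles relative paths, a trailing slash on the base, and absolute hrefs.
function resolveCategoryUrl(base, href) {
  return new URL(href, base).toString();
}

// Example with a made-up base URL:
const example = resolveCategoryUrl("https://example.com/", "/cars/suv");
console.log(example);
```

In the loop over carTypes this could be used as `currentPage.goto(resolveCategoryUrl(URL, car.href))`.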

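3. Open each category page and collect the data

Follow the links obtained in the previous step to each category page and collect the vehicle information. Note that the site lazy-loads its content, so puppeteer has to simulate scrolling until everything has loaded. The code is as follows: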

for (let i = 0; i < carTypes.length; i++) {
  const car = carTypes[i];
  spinner.startLoading(`Fetching data for category [${car.name}]...`);
  const currentPage = await browser.newPage();
  // Register the response listener before navigating so no response is missed
  currentPage.on("response", handleRes);
  await currentPage.goto(URL + car.href);
  await currentPage.setViewport({ width: 1080, height: 1024 });
  spinner.stopLoading(true, `Category [${car.name}] loaded`);
  let flag = true;
  // Keep scrolling until the page height stops growing
  while (flag) {
    const prevPageHeight = await currentPage.evaluate(() => {
      return document.documentElement.scrollHeight;
    });
    await currentPage.evaluate((height) => {
      window.scrollBy(0, height);
      // Wait briefly so the scroll and the lazy load can finish
      return new Promise((resolve) => {
        setTimeout(resolve, 1000);
      });
    }, prevPageHeight);
    const currentPageHeight = await currentPage.evaluate(() => {
      return document.documentElement.scrollHeight;
    });
    if (currentPageHeight === prevPageHeight) {
      flag = false;
    }
  }
  // Wait for the tab to close before opening the next one
  await currentPage.close();
}
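The scroll loop above runs until the page height stops changing, which could loop for a very long time on a page that keeps growing. One way to bound it is sketched below. The helper scrollUntilStable and its maxRounds parameter are hypothetical; the height getter and scroll step are passed in as functions so the loop can be tried without a browser.

```javascript
// Hypothetical helper: scrolls until the height stops growing or maxRounds is reached.
// getHeight returns the current page height; scrollOnce performs one scroll step.
async function scrollUntilStable(getHeight, scrollOnce, maxRounds = 50) {
  let rounds = 0;
  let prev = await getHeight();
  while (rounds < maxRounds) {
    await scrollOnce(prev);
    rounds++;
    const next = await getHeight();
    if (next === prev) break; // nothing new was lazy-loaded
    prev = next;
  }
  return rounds;
}

// With puppeteer, this might be wired up roughly as:
// await scrollUntilStable(
//   () => currentPage.evaluate(() => document.documentElement.scrollHeight),
//   (height) =>
//     currentPage.evaluate((h) => {
//       window.scrollBy(0, h);
//       return new Promise((resolve) => setTimeout(resolve, 1000));
//     }, height)
// );
```

The cap of 50 rounds is an arbitrary choice for the sketch; a real limit would depend on how many items a category holds.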

4. Listen to every page response and intercept the vehicle-data API

Register the listener with currentPage.on("response", handleRes);. The handler is as follows:

const handleRes = async (res) => {
  const url = res.url();
  // Only responses from the series-list API are of interest
  if (url.includes("select_series")) {
    const { data } = await res.json();
    const { series } = data;
    // for...of (instead of forEach with an async callback) processes the cars
    // of one response in order and lets errors propagate to the caller
    for (const car of series) {
      spinner.succeed(
        `Name: ${car.outter_name}, brand: ${car.brand_name}, official guide price: ${car.official_price}`
      );
      // Download the cover image and turn it into a Buffer for exceljs
      const imageUrl = car.cover_url;
      const response = await fetch(imageUrl);
      const bufferData = await response.arrayBuffer();
      const buffer = Buffer.from(bufferData);
      const imgId = workbook.addImage({
        buffer,
        extension: "png",
      });
      const row = {
        outter_name: car.outter_name,
        brand_name: car.brand_name,
        official_price: car.official_price,
      };
      // car.cover_url,
      worksheet.addRow(row);
      // Place the image over a single cell in column index 3
      worksheet.addImage(imgId, {
        tl: { col: 3, row: currentRow + 1 },
        br: {
          col: 3 + 1,
          row: currentRow + 2,
        },
      });
      currentRow++;
    }
  }
};
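Since the response handler is asynchronous and nothing awaits it, the page may be closed while image downloads are still in flight. One possible way to deal with this is to track the handler's promises and wait for them before closing. The createTracker helper below is hypothetical and not part of puppeteer.

```javascript
// Hypothetical helper: remembers in-flight handler calls so they can be awaited later.
function createTracker() {
  const pending = new Set();
  return {
    // Wraps an async handler so every invocation is tracked until it settles.
    wrap(handler) {
      return (...args) => {
        const p = Promise.resolve()
          .then(() => handler(...args))
          .catch((err) => console.error("handler failed:", err.message))
          .finally(() => pending.delete(p));
        pending.add(p);
        return p;
      };
    },
    // Resolves once every tracked invocation has settled.
    async drain() {
      while (pending.size > 0) {
        await Promise.all([...pending]);
      }
    },
  };
}

// Possible usage with the code from this article:
// const tracker = createTracker();
// currentPage.on("response", tracker.wrap(handleRes));
// ... scroll the page ...
// await tracker.drain();
// await currentPage.close();
```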

Here the response data is parsed and written to an xlsx file. The images are embedded as buffers rather than stored as links. The package used for xlsx is exceljs.
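The handler embeds every image with the extension "png", although cover images may well be served in another format. A small sketch for deriving the extension from the Content-Type header of the fetch response follows. The helper name imageExtension is hypothetical, and the set of supported values (jpeg, png, gif) reflects my understanding of what exceljs accepts, so check it against the exceljs documentation.

```javascript
// Image formats exceljs is assumed to accept for workbook.addImage
const SUPPORTED_EXTENSIONS = new Set(["jpeg", "png", "gif"]);

// Hypothetical helper: maps a Content-Type header value to an image extension,
// or returns null when the type is missing or not supported.
function imageExtension(contentType) {
  if (!contentType) return null;
  const mime = contentType.split(";")[0].trim().toLowerCase();
  if (!mime.startsWith("image/")) return null;
  let ext = mime.slice("image/".length);
  if (ext === "jpg") ext = "jpeg";
  return SUPPORTED_EXTENSIONS.has(ext) ? ext : null;
}
```

In the handler this could be called as `imageExtension(response.headers.get("content-type"))`, skipping the image when the result is null.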

For more details, see: link