Core package: puppeteer-core (puppeteer-core does not bundle a browser, so Chrome must already be installed on the machine)
Steps
1. Launch the browser automatically and navigate to the target site
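// assumes puppeteer was imported from "puppeteer-core" and that
// executablePath and URL are defined earlier in the script
const browser = await puppeteer.launch({
  executablePath, // path to the locally installed Chrome
  // whether to hide the browser window; false shows it
  headless: false,
});
const page = await browser.newPage();
await page.goto(URL);
await page.setViewport({ width: 1080, height: 1024 });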
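Because puppeteer-core relies on a locally installed Chrome, the script needs some way to fill in executablePath. The following is a minimal sketch of one approach, not the author's method: the helper names chromeCandidates and findChrome are hypothetical, and the listed paths are only typical default install locations that may differ on a given machine.

```javascript
const fs = require("node:fs");

// Typical default install locations per platform.
// These paths are assumptions; adjust them for your own machine.
function chromeCandidates(platform = process.platform) {
  switch (platform) {
    case "win32":
      return [
        "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
        "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
      ];
    case "darwin":
      return ["/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"];
    default:
      return [
        "/usr/bin/google-chrome",
        "/usr/bin/google-chrome-stable",
        "/usr/bin/chromium-browser",
      ];
  }
}

// Returns the first candidate path that exists, or throws if none does.
// The exists parameter is injectable so the lookup can be tested without a real disk.
function findChrome(platform = process.platform, exists = fs.existsSync) {
  const found = chromeCandidates(platform).find((candidate) => exists(candidate));
  if (!found) {
    throw new Error("Chrome not found; set executablePath manually");
  }
  return found;
}
```

The result could then be passed to the launch call as executablePath: findChrome().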
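2. Collect all categories from the DOM elements
Open DevTools (F12) to find the class name of the outer DOM element, then run a script with Puppeteer's page.evaluate to get each vehicle category's href and name. The code is as follows: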
const carTypes = await page.evaluate(() => {
  // every category link carries this class on its outer element
  const elements = document.getElementsByClassName("type-list_whole__ErdRL");
  const types = [];
  for (let i = 0; i < elements.length; i++) {
    const element = elements[i];
    const href = element.getAttribute("href");
    // the category name sits in a span that is the second child of its parent
    const name = element.querySelector("span:nth-child(2)").textContent;
    types.push({ href, name });
  }
  return types;
});
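The href values collected this way may be relative, missing, or duplicated, and joining them to the base address by plain string concatenation can produce malformed links. As a hedged sketch, the hypothetical helper normalizeCarTypes below resolves each href with the standard URL class. It takes the base address as a baseUrl parameter because a string constant named URL, as used in this post, would shadow the global URL class in the same scope.

```javascript
// Resolves each category href against the site's base address,
// skipping entries without an href and dropping duplicate links.
// normalizeCarTypes is a hypothetical helper, not part of Puppeteer.
function normalizeCarTypes(baseUrl, types) {
  const seen = new Set();
  const result = [];
  for (const { href, name } of types) {
    if (!href) continue; // element had no href attribute
    const url = new URL(href, baseUrl).href; // handles relative and absolute hrefs
    if (seen.has(url)) continue; // same category listed twice
    seen.add(url);
    result.push({ url, name: (name ?? "").trim() });
  }
  return result;
}
```

Each resulting url can be passed to goto directly, with no manual concatenation.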
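3. Visit each category page to get the data
Using the links from the previous step, open each category page to get the vehicle data. Note that the site lazy-loads its content, so Puppeteer has to simulate page scrolling until everything has loaded. The code is as follows: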
for (let i = 0; i < carTypes.length; i++) {
  const car = carTypes[i];
  spinner.startLoading(`Fetching data for category [${car.name}]...`);
  const currentPage = await browser.newPage();
  currentPage.on("response", handleRes);
  await currentPage.goto(URL + car.href);
  await currentPage.setViewport({ width: 1080, height: 1024 });
  spinner.stopLoading(true, `[${car.name}] data fetched successfully`);
  let flag = true;
  // keep scrolling until the page height stops growing
  while (flag) {
    const prevPageHeight = await currentPage.evaluate(() => {
      return document.documentElement.scrollHeight;
    });
    await currentPage.evaluate((height) => {
      window.scrollBy(0, height);
      // wait briefly so the scroll and the lazy load can finish
      return new Promise((resolve) => {
        setTimeout(resolve, 1000);
      });
    }, prevPageHeight);
    const currentPageHeight = await currentPage.evaluate(() => {
      return document.documentElement.scrollHeight;
    });
    if (currentPageHeight === prevPageHeight) {
      flag = false;
    }
  }
  // close() returns a promise, so await it before opening the next page
  await currentPage.close();
}
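The scroll-until-stable loop can also be pulled into a reusable helper with an upper bound on the number of rounds, so a page that never stops growing cannot hang the script. This is a sketch under assumptions: scrollToBottom, pauseMs, and maxRounds are hypothetical names, and the only Puppeteer call it relies on is page.evaluate, which waits for a returned promise to resolve.

```javascript
// Scrolls to the bottom repeatedly until the document height stops changing
// or maxRounds is reached. Returns the number of scroll rounds performed.
async function scrollToBottom(page, { pauseMs = 1000, maxRounds = 50 } = {}) {
  let previousHeight = await page.evaluate(
    () => document.documentElement.scrollHeight
  );
  for (let round = 1; round <= maxRounds; round++) {
    // Scroll, pause for lazy-loaded content, then report the new height.
    const currentHeight = await page.evaluate(async (pause) => {
      window.scrollTo(0, document.documentElement.scrollHeight);
      await new Promise((resolve) => setTimeout(resolve, pause));
      return document.documentElement.scrollHeight;
    }, pauseMs);
    if (currentHeight === previousHeight) return round; // nothing new loaded
    previousHeight = currentHeight;
  }
  return maxRounds; // gave up: the page was still growing
}
```

Inside the loop over categories, the while block could then be replaced by a single call such as await scrollToBottom(currentPage). A fixed pause is still a heuristic: a slow network may need a larger pauseMs.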
4. Listen to all page responses and intercept the vehicle data API
Register the listener with currentPage.on("response", handleRes); the handler is as follows:
const handleRes = async (res) => {
  const url = res.url();
  if (url.includes("select_series")) {
    const { data } = await res.json();
    const { series } = data;
    // for...of awaits each car in turn; forEach with an async callback
    // would return before any row had been written
    for (const car of series) {
      spinner.succeed(
        `Name: ${car.outter_name}, brand: ${car.brand_name}, official guide price: ${car.official_price}`
      );
      // download the cover image and embed it as a buffer
      const imageUrl = car.cover_url;
      const response = await fetch(imageUrl);
      const bufferData = await response.arrayBuffer();
      const buffer = Buffer.from(bufferData);
      const imgId = workbook.addImage({
        buffer,
        extension: "png",
      });
      const row = {
        outter_name: car.outter_name,
        brand_name: car.brand_name,
        official_price: car.official_price,
      };
      // car.cover_url is not written as a cell; the image is embedded instead
      worksheet.addRow(row);
      worksheet.addImage(imgId, {
        tl: { col: 3, row: currentRow + 1 },
        br: { col: 3 + 1, row: currentRow + 2 },
      });
      currentRow++;
    }
  }
};
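The handler registers every image as "png" regardless of its real type. As a hedged sketch, the hypothetical helper imageExtension below picks the extension from the image URL instead. It assumes that cover_url ends in a recognizable file extension and that exceljs accepts "jpeg", "png", and "gif" as image extensions; anything else falls back to "png", as in the original code.

```javascript
// Maps an image URL to an extension string for workbook.addImage.
// Assumption: the URL path ends in the real file extension.
function imageExtension(imageUrl) {
  // Use only the path so query strings like ?w=300 are ignored.
  const pathname = new URL(imageUrl).pathname.toLowerCase();
  if (pathname.endsWith(".jpg") || pathname.endsWith(".jpeg")) return "jpeg";
  if (pathname.endsWith(".gif")) return "gif";
  return "png"; // fallback, same as the hard-coded value in the handler
}
```

In the handler this would be used as extension: imageExtension(imageUrl). Formats outside those three, such as WebP, would still need converting before they can be embedded reliably.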
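This step extracts the response data and writes it to an xlsx file. Each image is converted to a buffer and embedded in the sheet rather than stored as a link. The xlsx file is produced with the exceljs package.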
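The handler uses workbook, worksheet, and currentRow without showing how they are created. The following sketch shows one way they might be set up with exceljs. The header labels, column widths, sheet name, and file name are all assumptions, as is the reading that currentRow starts at 0: exceljs image anchors are zero-based, so with a header in the first row, currentRow + 1 points at the first data row and col: 3 at the fourth column.

```javascript
// Column layout matching the keys used in the row objects.
// Header labels, widths, and the fourth "Image" column are assumptions.
function sheetColumns() {
  return [
    { header: "Name", key: "outter_name", width: 30 },
    { header: "Brand", key: "brand_name", width: 20 },
    { header: "Official price", key: "official_price", width: 20 },
    { header: "Image", key: "image", width: 20 },
  ];
}

// Zero-based anchor covering one cell in the fourth column of a data row.
// Row index 0 is the header, so data row n sits at index n + 1.
function imageAnchor(currentRow) {
  return {
    tl: { col: 3, row: currentRow + 1 },
    br: { col: 4, row: currentRow + 2 },
  };
}

// Creates the workbook and worksheet used by the response handler.
function createWorkbook() {
  const ExcelJS = require("exceljs"); // loaded only when a workbook is needed
  const workbook = new ExcelJS.Workbook();
  const worksheet = workbook.addWorksheet("cars"); // hypothetical sheet name
  worksheet.columns = sheetColumns();
  return { workbook, worksheet };
}

// After all category pages have been processed:
//   await workbook.xlsx.writeFile("cars.xlsx"); // hypothetical file name
```

With this layout, the anchor passed to worksheet.addImage in the handler is the same object that imageAnchor(currentRow) returns.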
For full details, click: link