5 Ways to Save Web Page Images Gracefully with a Puppeteer Crawler

Background

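I wanted to back up my old Sina Blog posts to my local machine, so I tried writing a crawler with Puppeteer.
I used to prefer Python 3 + Selenium, but Puppeteer turned out to be easy to install and quite powerful.
This was my first time using Puppeteer, so I looked up some references.
This post draws on the tutorial "SAVING IMAGES FROM A HEADLESS BROWSER".
That tutorial already covers plenty of approaches, but several of them hit obstacles when crawling Sina Blog, and in the end I found and tested an approach that worked better for me.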

Attempt 1: Requesting the file directly from Node.js

This was my own idea: the simplest approach is to grab the image's src and download it straight to disk.

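const fs = require("fs");
const axios = require("axios");

// Download url to filePath; resolves once the file is fully written.
const downloadFile = async (url, filePath) => {
  const response = await axios({
    method: "get",
    url: url,
    responseType: "stream",
  });
  return new Promise((resolve, reject) => {
    const writer = fs.createWriteStream(filePath);
    response.data.pipe(writer);
    writer.on("finish", resolve);
    writer.on("error", reject);
  });
};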

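Downloading and saving the image through axios failed: Sina returned its anti-hotlinking placeholder image instead. I don't yet know how the check works.

So I needed a way around it.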
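
One common form of hotlink protection checks the Referer request header, so one thing that might be worth trying is sending the blog page's URL as the Referer. The sketch below only builds the request config. Whether Sina actually checks this header is an assumption that was not verified here, and `buildImageRequestConfig` and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: builds an axios request config that sends a Referer header.
// Assumption: the image host rejects requests whose Referer is missing or foreign.
const buildImageRequestConfig = (imageUrl, pageUrl) => ({
  method: "get",
  url: imageUrl,
  responseType: "stream",
  headers: { Referer: pageUrl },
});

// Example with placeholder URLs.
const config = buildImageRequestConfig(
  "https://img.example.com/photo.jpg",
  "https://blog.example.com/post.html"
);
console.log(config.headers.Referer);
```

The resulting object could be passed to axios(config) in place of the inline config used in downloadFile.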

Attempt 2: Extracting the image from a new canvas

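Extract Image from a New Canvas: this approach creates a blank canvas element, draws the target image onto it, and then extracts the image data as a data URL.
If the image comes from a different origin, the canvas becomes tainted, and any attempt to read image data from it throws an error.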

const getDataUrlThroughCanvas = async (selector) => {
  // Create a new image element with unconstrained size.
  const originalImage = document.querySelector(selector);
  const image = document.createElement('img');
  image.src = originalImage.src;

  // Ensure the image is loaded before reading its dimensions.
  await new Promise((resolve, reject) => {
    if (image.complete && image.naturalWidth > 0) return resolve();
    image.addEventListener('load', () => resolve());
    image.addEventListener('error', () => reject(new Error('Image failed to load')));
  });

  // Create a canvas sized to the loaded image and get a context to draw onto.
  const canvas = document.createElement('canvas');
  const context = canvas.getContext('2d');
  canvas.width = image.naturalWidth;
  canvas.height = image.naturalHeight;

  context.drawImage(image, 0, 0);
  // Throws if the canvas has been tainted by a cross-origin image.
  return canvas.toDataURL();
};

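I rewrote this as a batch version that handles multiple selectors and ran it. Sure enough, it failed because of the cross-origin problem: Uncaught DOMException: Failed to execute 'toDataURL' on 'HTMLCanvasElement': Tainted canvases may not be exported. I then read that image.setAttribute("crossOrigin", 'Anonymous') can fix this, but that produced a different cross-origin error, so this approach failed too.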

Attempt 3: Having the browser fetch the image

(Re-)Fetch the Image from the Server: this approach injects JS code into the page and runs it there, so everything happens at the browser level.

const assert = require('assert');
const fs = require('fs');

// Runs in the page context: re-fetches the image and returns it as a data URL.
const getDataUrlThroughFetch = async (selector, options = {}) => {
  const image = document.querySelector(selector);
  const url = image.src;

  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Could not fetch image (status ${response.status})`);
  }
  const data = await response.blob();
  const reader = new FileReader();
  return new Promise((resolve) => {
    reader.addEventListener('loadend', () => resolve(reader.result));
    reader.readAsDataURL(data);
  });
};

// The code below must run inside an async function with a Puppeteer `page` in scope.
// parseDataUrl is not defined in this post; it is assumed to split a data URL
// into its MIME type and its decoded data.
try {
  const options = { cache: 'no-cache' };
  const dataUrl = await page.evaluate(getDataUrlThroughFetch, '#svg', options);
  const { mime, buffer } = parseDataUrl(dataUrl);
  assert.equal(mime, 'image/svg+xml');
  fs.writeFileSync('logo-fetch.svg', buffer, 'base64');
} catch (error) {
  console.log(error);
}

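This approach also failed in practice; the console message showed that it was again a cross-origin problem.
Presumably Sina restricts cross-origin access, so the fetch issued by the injected JS cannot load Sina's images.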
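
The snippet in this section calls parseDataUrl without defining it. Below is a minimal sketch of what such a helper might look like, assuming the input is always a base64-encoded data URL as produced by FileReader.readAsDataURL. This implementation is hypothetical and not necessarily what the referenced tutorial uses.

```javascript
// Hypothetical implementation of a parseDataUrl helper.
// Assumes a base64-encoded data URL such as "data:image/png;base64,....".
const parseDataUrl = (dataUrl) => {
  const matches = /^data:(.+?);base64,(.*)$/.exec(dataUrl);
  if (!matches) {
    throw new Error("Could not parse data URL.");
  }
  // matches[1] is the MIME type, matches[2] the base64 payload.
  return { mime: matches[1], buffer: Buffer.from(matches[2], "base64") };
};

const sample = parseDataUrl("data:text/plain;base64,aGVsbG8=");
console.log(sample.mime, sample.buffer.toString("utf8")); // prints "text/plain hello"
```

Since the helper returns a decoded Buffer, the result can be written to disk with fs.writeFileSync directly.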

Attempt 4: Getting the image through Chromium's DevTools protocol

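Extracting Images Using the DevTools Protocol: here we use Puppeteer's API to reach into the browser it controls. The idea is similar to pressing F12 to open the developer tools and finding the already-downloaded image under Sources.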

const assert = require('assert');
const fs = require('fs');

// Note: page._client and frame._id are private, undocumented Puppeteer
// properties and may change between versions.
const getImageContent = async (page, url) => {
  const { content, base64Encoded } = await page._client.send(
    'Page.getResourceContent',
    { frameId: String(page.mainFrame()._id), url },
  );
  assert.equal(base64Encoded, true);
  return content;
};

// The code below must run inside an async function with a Puppeteer `page` in scope.
try {
  const url = await page.evaluate(() => document.querySelector('#svg').src);
  const content = await getImageContent(page, url);
  const contentBuffer = Buffer.from(content, 'base64');
  fs.writeFileSync('logo-extracted.svg', contentBuffer);
} catch (e) {
  console.log(e);
}

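I got this approach running too, and it was a real relief to watch the images land on my disk one by one for the first time.
However, some of the images came out corrupted, with a green background. I suspect the image had not finished downloading when it was read, so I gave up on this approach as well.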
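
One rough way to test the "not fully downloaded" suspicion is to check whether the saved bytes look like a complete JPEG. The sketch below assumes the images are JPEGs (they are saved with a .jpg extension later in this post). `looksLikeCompleteJpeg` is a hypothetical helper, and the check is only a heuristic: a file with extra bytes after the end marker would be flagged as incomplete.

```javascript
// Hypothetical check: a complete JPEG starts with the SOI marker (FF D8)
// and normally ends with the EOI marker (FF D9).
const looksLikeCompleteJpeg = (buffer) =>
  buffer.length >= 4 &&
  buffer[0] === 0xff &&
  buffer[1] === 0xd8 &&
  buffer[buffer.length - 2] === 0xff &&
  buffer[buffer.length - 1] === 0xd9;

// Tiny synthetic buffers, not real images: one with both markers, one cut short.
const complete = Buffer.from([0xff, 0xd8, 0x00, 0x01, 0xff, 0xd9]);
const truncated = complete.subarray(0, 4);
console.log(looksLikeCompleteJpeg(complete), looksLikeCompleteJpeg(truncated)); // prints "true false"
```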

Attempt 5: [My own final solution] Getting the response directly with page.waitForResponse

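Thinking about the possibly incomplete images from attempt 4, I searched the Puppeteer documentation and found a method the tutorial does not mention: page.waitForResponse.
page.waitForResponse waits for a matching response before execution continues, and returns a Puppeteer Response instance.
Usage example: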

const firstResponse = await page.waitForResponse('https://example.com/resource');
const finalResponse = await page.waitForResponse(response => response.url() === 'https://example.com' && response.status() === 200);
return finalResponse.ok();

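So once we have the image URL, we can use this method to wait for the browser to finish downloading and loading the image, then read the image data from the returned Response object. This is simpler and more efficient than the approaches in the tutorial.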

// Wait (up to 10 s) for the response whose URL matches the image's real src.
const imgResp = await detailPage.waitForResponse(img.real_src, {
  timeout: 10000,
});
// response.buffer() resolves with the response body as a Buffer.
const buffer = await imgResp.buffer();
fs.writeFileSync(`data/images/${index}_${img.index}.jpg`, buffer);
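
One caveat, as far as I understand the API: waitForResponse only resolves for responses that arrive after it has been called, so the wait should be set up before the action that triggers the image request. A minimal sketch of that ordering follows. `saveImageResponse` and its `trigger` callback are hypothetical names, and `page` is assumed to be a Puppeteer Page.

```javascript
const fs = require("fs");

// Hypothetical helper: start waiting for the image response *before* running
// the action that triggers the request, then write the body to filePath.
const saveImageResponse = async (page, imageUrl, trigger, filePath) => {
  const [response] = await Promise.all([
    page.waitForResponse(imageUrl, { timeout: 10000 }),
    trigger(), // e.g. a navigation or a scroll that makes the browser request the image
  ]);
  const buffer = await response.buffer();
  fs.writeFileSync(filePath, buffer);
  return buffer.length; // number of bytes written
};
```

A call might look like saveImageResponse(detailPage, img.real_src, () => detailPage.goto(postUrl), 'data/images/0_0.jpg'), where postUrl is a placeholder for the blog post's address.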

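For me, this is the final solution and the one I recommend most; I found it myself and got it working.
I hit a lot of pitfalls along the way, so I am writing them down for reference.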