Node in Practice

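Course goal

From the command line, enter a keyword and a count, then download the matching images from a Baidu image search.

Course outline

Use Node to build a crawler that scrapes Baidu Images;
Package it as a CLI;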

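Node crawlers

What is a crawler?

A crawler is a probe that automatically imitates what a person does on a website: it clicks buttons, looks up data, and carries back whatever it sees, like a bug tirelessly crawling around a building.

Search engines: they crawl to refresh their own content and their index of other sites, which is benign;
Apps: ticket-grabbing software sends thousands of requests per second and overloads the 12306 servers, which is malicious;
Individuals: crawling a site to collect its content is best kept to learning and similar purposes;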

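Can a crawler scrape everything it likes without restraint?

No. Every page a crawler visits consumes the site's traffic, bandwidth, or other server resources;

Q: How can a small site keep crawlers from eating its bandwidth?

Server-side checks: rate limiting and verifying the client's identity (cookies, page metadata);
robots.txt;
meta tags;

robots.txt: a plain-text file stored in the site's root directory. It usually tells crawlers which parts of the site should not be crawled, or which search engines are allowed to crawl it;

The protocol is a convention rather than an enforced rule: some search engines honor it and some do not. Search engines that honor it read these directives and do not index the excluded pages;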

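User-agent:  # which crawler the following rules apply to
Disallow:    # empty: nothing is blocked; "/": the root directory is blocked, so nothing may be crawled
Allow:       # paths that may be crawled; "/": everything may be crawled

# Allow all robots
User-agent: *
Disallow:

User-agent: *
Allow: /

# Allow a specific robot
User-agent: name_spider  # replace with the crawler's name
Allow: /

# Block all crawlers
User-agent: *
Disallow: /

# Keep all robots out of specific directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

# Keep all robots away from specific file types
User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$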

Or in another form, as a meta tag:

<meta name='robots' content='noindex,nofollow' />
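
As a rough illustration of how a crawler might honor the robots.txt rules above, the sketch below checks a path against the Disallow lines of a robots.txt text. It is a deliberate simplification: isDisallowed is a hypothetical helper, it only understands User-agent and Disallow with plain prefix matching, and it ignores Allow, wildcards and the $ anchor.

```javascript
// Hypothetical helper: returns true when `path` is blocked for `agent`.
// Simplified on purpose: prefix matching only, no Allow or wildcard support.
function isDisallowed(robotsTxt, agent, path) {
  let applies = false; // does the current rule group apply to `agent`?
  let inAgentLines = false; // are we inside a run of consecutive User-agent lines?

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (!line || sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      if (!inAgentLines) applies = false;
      inAgentLines = true;
      if (value === '*' || value.toLowerCase() === agent.toLowerCase()) applies = true;
    } else {
      inAgentLines = false;
      // An empty Disallow value blocks nothing
      if (field === 'disallow' && applies && value && path.startsWith(value)) {
        return true;
      }
    }
  }
  return false;
}

const sample = ['User-agent: *', 'Disallow: /private/', 'Disallow: /tmp/'].join('\n');

console.log(isDisallowed(sample, 'my-spider', '/private/a.html')); // true
console.log(isDisallowed(sample, 'my-spider', '/images/a.png')); // false
```

A real crawler would more likely rely on an existing robots.txt parser; the point here is only to show what the directives mean in practice.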

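How do you build a crawler application?

Decide which site and pages to crawl;
Analyze the site's data and DOM;
Choose the tools:
1. Sending browser-like requests:
  1. request: no longer maintained, not recommended;
  2. superagent: a module for making HTTP requests from Node (the server side); it supports the full set of ajax methods and lets you set headers and more (used here);
2. Parsing the DOM:
  1. cheerio: jQuery-like API (used here);
  2. jsdom: parses DOM text;
3. Simulating user actions:
  1. puppeteer: effectively launches a browser from Node and drives Chrome through whatever a user would do; often used for scheduled checks of page functionality;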

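CLI

Command Line Interface. Tools we use all the time, such as create-react-app and vue-cli, are familiar examples of a CLI.

Purpose: take over repetitive work and speed up development

Generate application templates quickly: vue-cli, for example, scaffolds an app from a few interactive questions
Create module template files: angular-cli creates components and modules; sequelize-cli creates models mapped to MySQL tables
Start services, e.g. ng serve
Lint code with eslint; the Vue and Angular CLIs both include this
Run automated tests; the Vue and Angular CLIs both include this
Compile and build; the Vue and Angular CLIs both include this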

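Compared with npm scripts

npm scripts can also implement a development workflow: configure commands under the scripts object in package.json and have them run the relevant js. So what does a CLI tool offer over npm scripts?

npm scripts belong to one project and only work inside it; a CLI can be installed globally and shared across projects;
Embedding the workflow in a business project through npm scripts couples the two tightly; a CLI moves the workflow code out, so the business code stays focused on the business;
A CLI tool can keep being iterated on, evolved, and refined;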

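How do you build a CLI?

Libraries involved:

commander: handles command-line input (commands, options, arguments) for the CLI

inquirer: provides interactive prompts in the terminal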

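Hands-on

The course uses a search for "柯基" (corgi) as the example.

Initialize the project

Run npm init to initialize the project;
Install superagent and cheerio;
Set the project's entry point;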

npm init

# Install the base dependencies
npm i -S superagent cheerio

# Add the start script under "scripts" in package.json
"start": "node index.js"

# Run
npm run start

Trying out superagent and cheerio

Use superagent to fetch the page HTML

const superagent = require('superagent');
const cheerio = require('cheerio');

const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
  if (err) {
    console.log(`Request failed: ${err}`);
    return;
  }

  console.log(res.text); // res is the response object; the page HTML is in res.text
});

Use cheerio to parse the fetched HTML

const superagent = require('superagent');
const cheerio = require('cheerio');

const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
  if (err) {
    console.log(`Request failed: ${err}`);
    return;
  }

  // console.log(res.text); // prints the page HTML

  const htmlText = res.text;
  const $ = cheerio.load(htmlText);

  $('meta').each((index, ele) => {
    console.log(`${index}: ${$(ele).attr('content')}`);
  });
});

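Parse the fetched HTML and scrape Baidu Images

Search for 柯基 and inspect the request under the Doc tab of the browser dev tools;

Full url:

image.baidu.com/search/inde…

Useful part of the url:

image.baidu.com/search/inde…

tn: Baidu Images;
ie: encoding;
word: encode(keyword);

The resource we are after is objURL, which lives in JSON;

Data source: in the page source, the resources are all stored in JSON and JS renders the images into the page, so they cannot be picked out of the DOM with cheerio;

Once you have an objURL you will see that the image URLs are escaped, in this form: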

https:\/\/gimg2.baidu.com\/image_search\/src=http%3A%2F%2Fimg11.51tietu.net%2Fpic%2F2019112217%2F3fhqw1z53c33fhqw1z53c3.jpg&refer=http%3A%2F%2Fimg11.51tietu.net&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1659948766&t=9de4acba8fd9610a9f8bee2374bd42e3

"objURL": "","XX":"XXX"

Q: How do we get objURL?

With a regular expression

/"objURL":"(.*?)",/;
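
To see what this regex does on escaped text, here is a minimal sketch. The sample string and the helper name extractObjUrls are made up for illustration; the real page source is much larger. The helper also undoes the \/ escaping shown above.

```javascript
// Hypothetical sample of the JSON-like text embedded in the page source
const sampleSource =
  '{"objURL":"https:\\/\\/example.com\\/a.jpg","XX":"XXX"},' +
  '{"objURL":"https:\\/\\/example.com\\/b.jpg","XX":"XXX"}';

// Hypothetical helper: collect every objURL value and turn `\/` back into `/`
function extractObjUrls(source) {
  const matches = source.matchAll(/"objURL":"(.*?)",/g);
  return Array.from(matches, match => match[1].replace(/\\\//g, '/'));
}

console.log(extractObjUrls(sampleSource));
// [ 'https://example.com/a.jpg', 'https://example.com/b.jpg' ]
```

matchAll hands back the capture group of each match directly, so no second regex pass is needed, and an input without matches simply yields an empty array.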

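Move the image-fetching logic into a separate file, img.handler.js

The keyword must be encoded;
The course notes recommend requesting over http to avoid https-related security problems; note that the code below still requests https;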
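
A small sketch of what encoding the keyword produces. buildSearchUrl is a hypothetical helper; the query parameters (tn, ie, word) are the ones listed earlier.

```javascript
// Hypothetical helper: build the search URL for a keyword.
// encodeURIComponent percent-encodes non-ASCII and reserved characters.
function buildSearchUrl(keyword) {
  const base = 'https://image.baidu.com/search/index';
  return `${base}?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(keyword)}`;
}

console.log(buildSearchUrl('柯基'));
// https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%9F%AF%E5%9F%BA
```

Without the encoding, a keyword containing characters such as & or a space would change the meaning of the query string.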

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;
    console.log(htmlText);
  });

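Q: Why does the request land on Baidu's security verification page?

Too many requests to Baidu Images within a short time;
The request carries no headers;

Header structure of a correct request: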

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

Note: keep the headers as close as possible to what the browser sends, especially Accept and UA;

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;
    console.log(htmlText);
  });

Get the list of image links

First use a regex to get the array of objURL matches;

const htmlText = res.text;
const imageMatches = htmlText.match(/"objURL":"(.*?)",/g);

Then use another regex to keep only the links;

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
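      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    // match() returns null when nothing matches, so fall back to an empty array
    const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];

    const imageUrlList = imageMatches.map(item => {
      // Read the capture group from the match result instead of the deprecated RegExp.$1
      const [, imageUrl] = item.match(/:"(.*?)"/);
      return imageUrl;
    });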

    console.log(imageUrlList);
  });

Get the image titles

In the page source, the image name is stored in fromPageTitle

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    // match() returns null when nothing matches, so fall back to an empty array
    const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];

    const imageUrlList = imageMatches.map(item => {
      // Read the capture group from the match result instead of the deprecated RegExp.$1
      const [, imageUrl] = item.match(/:"(.*?)"/);
      return imageUrl;
    });

    const titleMatches = htmlText.match(/"fromPageTitle":"(.*?)",/g) || [];

    const titleList = titleMatches.map(item => {
      const [, title] = item.match(/:"(.*?)"/);
      return title;
    });

    console.log(imageUrlList, titleList);
  });

The titles cannot be used as they are: they need cleaning up, and the regex logic should be wrapped in a function;

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Return the value of every `"key":"value"` pair found in str
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
  // matchAll exposes each capture group directly and yields nothing when there is no match
  return Array.from(str.matchAll(reg), match => match[1]);
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // remove every <strong> and <\/strong>, not just the first
        .replace(/[\\\/:*?"<>|]/g, '') // remove characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);
  });
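
The title cleanup can also be pulled into a standalone helper, which makes it easy to try on sample input. This is only a sketch: toSafeFileName is a hypothetical name, and the set of stripped characters is an assumption based on what Windows forbids in file names.

```javascript
// Hypothetical helper: turn a fromPageTitle value into a safe file name
function toSafeFileName(title) {
  return title
    .replace(/<\\?\/?strong>/g, '') // remove <strong>, </strong> and the escaped <\/strong>
    .replace(/[\\\/:*?"<>|]/g, '') // remove characters Windows forbids in file names
    .trim();
}

console.log(toSafeFileName('<strong>柯基<\\/strong>: cute/funny?'));
// 柯基 cutefunny
```

Regexes with the g flag are used because String.prototype.replace with a string pattern only replaces the first occurrence.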

Create a directory to store the images

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Return the value of every `"key":"value"` pair found in str
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
  // matchAll exposes each capture group directly and yields nothing when there is no match
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);

  // Skip creation if the directory already exists
  if (fs.existsSync(fullPath)) {
    console.log(`${pathname} already exists, skipping this step`);
    return;
  }

  // Create the directory
  fs.mkdirSync(fullPath);
  console.log(`Directory created: ${pathname}`);
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // remove every <strong> and <\/strong>, not just the first
        .replace(/[\\\/:*?"<>|]/g, '') // remove characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    mkImageDir('bd-images');
  });
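
As an alternative to checking existsSync first, fs.mkdirSync accepts a recursive option that makes the call safe to repeat. The sketch below assumes a hypothetical helper, ensureDir, and uses a temp directory in place of the project folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: create the directory if needed and return its full path.
// With `recursive: true`, mkdirSync does not throw when the directory already exists.
function ensureDir(baseDir, name) {
  const fullPath = path.resolve(baseDir, name);
  fs.mkdirSync(fullPath, { recursive: true });
  return fullPath;
}

// Usage: a temp directory stands in for the project folder
const base = fs.mkdtempSync(path.join(os.tmpdir(), 'bd-demo-'));
const dir = ensureDir(base, 'bd-images');
ensureDir(base, 'bd-images'); // calling it again is harmless
console.log(fs.existsSync(dir)); // true
```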

Download the images to disk

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');

const word = '柯基搞笑图片';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Return the value of every `"key":"value"` pair found in str
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
  // matchAll exposes each capture group directly and yields nothing when there is no match
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);

  // Skip creation if the directory already exists
  if (fs.existsSync(fullPath)) {
    console.log(`${pathname} already exists, skipping this step`);
    return;
  }

  // Create the directory
  fs.mkdirSync(fullPath);
  console.log(`Directory created: ${pathname}`);
}

// Download one image to disk
function downloadImage(url, name, index) {
  const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

  // Skip files that already exist
  if (fs.existsSync(fullPath)) {
    console.log(`Already exists: ${fullPath}`);
    return;
  }

  superagent.get(url).end((err, res) => {
    if (err) {
      console.log(`Failed to fetch image ${index + 1}: ${err}`);
      return;
    }

    // Skip empty responses
    if (JSON.stringify(res.body) === '{}') {
      console.log(`Image ${index + 1} is empty`);
      return;
    }

    // 'binary' is the encoding option; it is ignored when res.body is a Buffer
    fs.writeFile(fullPath, res.body, 'binary', err => {
      if (err) {
        console.log(`Image ${index + 1} failed to download: ${err}`);
        return;
      }
      console.log(`Image ${index + 1} downloaded from: ${url}`);
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // remove every <strong> and <\/strong>, not just the first
        .replace(/[\\\/:*?"<>|]/g, '') // remove characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    mkImageDir('bd-images');

    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index);
    });
  });

Add a progress bar

Install cli-progress;

Turn directory creation and image download into promises and switch to chained calls;

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar instance
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Return the value of every `"key":"value"` pair found in str
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
  // matchAll exposes each capture group directly and yields nothing when there is no match
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // Reject if the directory already exists
    if (fs.existsSync(fullPath)) {
      return reject(`${pathname} already exists, skipping this step`);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

// Download one image to disk
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

    // Reject if the file already exists
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() takes a single argument, so put the context into one message
        return reject(`Failed to fetch image ${index + 1}: ${err}`);
      }

      // Resolve without writing when the response is empty
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // 'binary' is the encoding option; it is ignored when res.body is a Buffer
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Image ${index + 1} failed to download: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from: ${url}`);
      });
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end(async (err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // remove every <strong> and <\/strong>, not just the first
        .replace(/[\\\/:*?"<>|]/g, '') // remove characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    total = imageUrlList.length;

    await mkImageDir('bd-images');

    bar.start(total, 0);

    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index)
        .catch(error => {
          // A try/catch around forEach cannot see async rejections, so handle them per download
          console.log('error >>>>> ', error);
        })
        .then(() => {
          // Count failed downloads too, so the bar always reaches the total
          finished++;
          bar.update(finished);

          if (finished === total) {
            bar.stop();
            console.log('All downloads have finished');
          }
        });
    });
  });
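
The progress bookkeeping can be separated from the download logic. The sketch below is one possible arrangement, not the course's implementation: runWithProgress is a hypothetical helper, and stand-in tasks replace real downloads so it runs without network access. Promise.allSettled waits for every task, whether it succeeds or fails.

```javascript
// Hypothetical helper: run every task, report progress after each one,
// and summarize successes and failures once all have settled.
async function runWithProgress(tasks, onProgress) {
  let finished = 0;
  const results = await Promise.allSettled(
    tasks.map(task =>
      task().finally(() => {
        finished += 1;
        onProgress(finished, tasks.length);
      })
    )
  );
  const failed = results.filter(result => result.status === 'rejected').length;
  return { total: tasks.length, failed, succeeded: tasks.length - failed };
}

// Usage with stand-in tasks instead of real downloads
const demoTasks = [
  () => Promise.resolve('ok'),
  () => Promise.reject(new Error('download failed')),
  () => Promise.resolve('ok'),
];

const demoRun = runWithProgress(demoTasks, (done, total) => {
  console.log(`${done}/${total}`);
});
demoRun.then(summary => console.log(summary));
```

In the crawler, each task would wrap one downloadImage call and the onProgress callback would call bar.update.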

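Delete the image directory automatically if it exists

This avoids aborting every time the directory is found to exist already.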

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar instance
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Return the value of every `"key":"value"` pair found in str
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
  // matchAll exposes each capture group directly and yields nothing when there is no match
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Delete the image directory if it already exists
function removeDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);
  console.log(`${pathname} already exists, deleting it`);

  // fs.rmSync accepts force and recursive; fs.rmdirSync has no force option
  // and its recursive option is deprecated
  fs.rmSync(fullPath, {
    force: true, // do not throw if the path is missing
    recursive: true, // remove the directory's contents as well
  });

  console.log(`Directory ${pathname} deleted!`);

  /*
    Alternative:
    const childProcess = require('child_process');
    childProcess.execSync(`rm -rf ${fullPath}`);
  */
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // Remove the directory first if it already exists
    if (fs.existsSync(fullPath)) {
      removeDir(pathname);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

// Download one image to disk
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

    // Reject if the file already exists
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() takes a single argument, so put the context into one message
        return reject(`Failed to fetch image ${index + 1}: ${err}`);
      }

      // Resolve without writing when the response is empty
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // 'binary' is the encoding option; it is ignored when res.body is a Buffer
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Image ${index + 1} failed to download: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from: ${url}`);
      });
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end(async (err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // remove every <strong> and <\/strong>, not just the first
        .replace(/[\\\/:*?"<>|]/g, '') // remove characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    total = imageUrlList.length;

    await mkImageDir('bd-images');

    bar.start(total, 0);

    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index)
        .catch(error => {
          // A try/catch around forEach cannot see async rejections, so handle them per download
          console.log('error >>>>> ', error);
        })
        .then(() => {
          // Count failed downloads too, so the bar always reaches the total
          finished++;
          bar.update(finished);

          if (finished === total) {
            bar.stop();
            console.log('All downloads have finished');
          }
        });
    });
  });
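
Deleting and recreating the directory can be tried in isolation. The sketch below uses fs.rmSync, which accepts both recursive and force; resetDir is a hypothetical helper and a temp directory stands in for bd-images inside the project.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: delete the directory (if any) and recreate it empty
function resetDir(fullPath) {
  // force: true ignores a missing path; recursive: true removes the contents too
  fs.rmSync(fullPath, { recursive: true, force: true });
  fs.mkdirSync(fullPath, { recursive: true });
}

// Usage in a temp directory
const target = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'bd-demo-')), 'bd-images');
resetDir(target); // works when the directory does not exist yet
fs.writeFileSync(path.join(target, 'old.png'), 'stale');
resetDir(target); // and clears previous downloads
console.log(fs.readdirSync(target)); // []
```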

Enter the keyword through a CLI

In img.handler.js, wrap the superagent request in a function, runImg, and export it.

#!/usr/bin/env node

const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer releases ship as ES modules, which require() may fail to load
const commander = require('commander');

const { runImg } = require('./img.handler.js');

const question = [
  {
    type: 'checkbox',
    name: 'channels',
    message: 'Select the channels to search',
    choices: [
      {
        name: 'Baidu Images',
        value: 'images',
      },
      {
        name: 'Baidu Videos',
        value: 'videos',
      },
    ],
  },
  {
    type: 'input',
    name: 'keyword',
    message: 'Enter the keyword to search for',
  },
  {
    type: 'number',
    name: 'counts',
    message: 'Enter the number of images to download (minimum 30)',
  },
];

inquirer.prompt(question).then(result => {
  const { keyword, channels, counts } = result;

  for (let channel of channels) {
    switch (channel) {
      case 'images':
        runImg(keyword, counts);
        break;
    }
  }
});
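
Before passing the count on, it may be worth normalizing it, since the answer can be empty or not a number. A minimal sketch, assuming a hypothetical helper named normalizeCount and the minimum of 30 mentioned in the prompt message:

```javascript
// Hypothetical helper: make sure the requested count is a usable integer.
// Anything missing, non-numeric, or below the minimum falls back to the minimum.
function normalizeCount(input, minimum = 30) {
  const count = Number.parseInt(input, 10);
  if (Number.isNaN(count) || count < minimum) return minimum;
  return count;
}

console.log(normalizeCount('45')); // 45
console.log(normalizeCount(NaN)); // 30
console.log(normalizeCount(5)); // 30
```

In the prompt handler this would read runImg(keyword, normalizeCount(counts)).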

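Download a specified number of images

Keep browsing Baidu Images and watch the requests made when paging

Full request:

image.baidu.com/search/acjs…

Trimmed request:

image.baidu.com/search/acjs…

The response is JSON; take middleURL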
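
A rough sketch of the paging arithmetic and of reading middleURL from a parsed page. Everything here is an assumption for illustration: pageOffsets and pickMiddleUrls are hypothetical helpers, the page size of 30 is inferred from the prompt's minimum, and the `{ data: [{ middleURL }] }` shape is a guess at the response layout rather than a documented format.

```javascript
// Hypothetical helper: offsets of the pages needed to collect `count` images,
// assuming each page returns `pageSize` items
function pageOffsets(count, pageSize = 30) {
  const pages = Math.max(1, Math.ceil(count / pageSize));
  return Array.from({ length: pages }, (_, page) => page * pageSize);
}

// Hypothetical helper: read middleURL from one parsed page.
// The `{ data: [{ middleURL }] }` shape is assumed for illustration only.
function pickMiddleUrls(pageJson) {
  return (pageJson.data || [])
    .map(item => item.middleURL)
    .filter(Boolean); // skip entries without a middleURL
}

console.log(pageOffsets(70)); // [ 0, 30, 60 ]
console.log(pickMiddleUrls({ data: [{ middleURL: 'https://example.com/a.jpg' }, {}] }));
// [ 'https://example.com/a.jpg' ]
```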

Complete code

./index.js

#!/usr/bin/env node

const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer releases ship as ES modules, which require() may fail to load
const commander = require('commander');

const { runImg } = require('./img.handler.js');

const question = [
  {
    type: 'checkbox',
    name: 'channels',
    message: 'Select the channels to search',
    choices: [
      {
        name: 'Baidu Images',
        value: 'images',
      },
      {
        name: 'Baidu Videos',
        value: 'videos',
      },
    ],
  },
  {
    type: 'input',
    name: 'keyword',
    message: 'Enter the keyword to search for',
  },
  {
    type: 'number',
    name: 'counts',
    message: 'Enter the number of images to download (minimum 30)',
  },
];

inquirer.prompt(question).then(result => {
  const { keyword, channels, counts } = result;

  for (let channel of channels) {
    switch (channel) {
      case 'images':
        runImg(keyword, counts);
        break;
    }
  }
});

./img.handler.js

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar instance
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Collect the value of every `"key":"value"` pair found in the page source
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes the capture group directly and yields an empty list when
  // nothing matches (str.match would return null and break the map call)
  return Array.from(str.matchAll(reg), match => match[1]);
}

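// Delete the image directory when it already exists
function removeDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);
  console.log(`Directory ${pathname} already exists, deleting it`);

  // fs.rmSync (Node 14.14+) replaces the deprecated recursive fs.rmdirSync
  fs.rmSync(fullPath, {
    force: true, // do not throw if the path is missing
    recursive: true, // delete the contents recursively
  });

  console.log(`Directory ${pathname} deleted!`);

  /*
    Alternative (Unix-like shells only):
    const childProcess = require('child_process');
    childProcess.execSync(`rm -rf ${fullPath}`);
  */
}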

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // If the directory already exists, remove it first
    if (fs.existsSync(fullPath)) {
      // return reject(`${pathname} already exists, skipping this step`);
      removeDir(pathname);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

const errorImgList = [];

// Download one image to disk
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    // Titles can be missing and can contain characters that are invalid in file names
    const safeName = String(name || 'image').replace(/[\\/:*?"<>|]/g, '');
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${safeName}.png`);

    // Do not overwrite an existing file
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() takes a single reason, so fold the details into one message
        return reject(`Failed to fetch ${url}: ${err}`);
      }

      // Check whether the response body is empty
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // 'binary': write the body as raw binary data
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Image ${index + 1} failed to download: ${err}`);
          // errorImgList.push(url);
          // return resolve(`Image ${index + 1} failed to download: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from ${url}`);
      });
    });
  });
}

function runImg(keyword, counts) {
  superagent
    .get(
      `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(
        keyword
      )}`
    )
    .set('Accept', header['Accept'])
    .set('Accept-Encoding', header['Accept-Encoding'])
    .set('Accept-Language', header['Accept-Language'])
    .set('Cache-Control', header['Cache-Control'])
    .set('Connection', header['Connection'])
    .set('User-Agent', header['User-Agent'])
    .set('sec-ch-ua', header['sec-ch-ua'])
    .end(async (err, res) => {
      if (err) {
        console.log(`Request failed: ${err}`);
        return;
      }

      const htmlText = res.text;

      let imageUrlList = getValueListByReg(htmlText, 'objURL');
      const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
        // Strip the highlight tags from the title
        item.replace('<strong>', '').replace('<\\/strong>', '')
      );

      // This version only reads the first result page, so `counts` can only trim the list
      if (Number.isInteger(counts) && counts > 0) {
        imageUrlList = imageUrlList.slice(0, counts);
      }

      console.log(imageUrlList, titleList);

      total = imageUrlList.length;
      finished = 0;

      try {
        await mkImageDir('bd-images');

        bar.start(total, 0);

        // Failed downloads count as finished too; otherwise a single failure
        // would keep the progress bar from ever stopping
        await Promise.all(
          imageUrlList.map((url, index) =>
            downloadImage(url, titleList[index], index)
              .catch(() => errorImgList.push(url))
              .then(() => bar.update(++finished))
          )
        );

        bar.stop();
        console.log('Congratulations, all images have been processed');
        console.log(errorImgList); // URLs that could not be downloaded
      } catch (error) {
        console.log('error >>>>> ', error);
      }
    });
}

module.exports = {
  runImg,
};
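
runImg starts every download at once. If that proves too aggressive for the target site, the number of parallel requests can be capped. Below is a minimal sketch, assuming a hypothetical runWithLimit helper that is not part of the original code; fakeDownload stands in for downloadImage, and failures are collected instead of aborting the batch.

```javascript
// Hypothetical helper: run `task` over `items` with at most `limit` calls in flight.
// Resolves to { done, failed } once every item has been attempted.
async function runWithLimit(items, limit, task) {
  const done = [];
  const failed = [];
  let next = 0;

  // Each worker keeps claiming the next unprocessed index until the list is exhausted.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      try {
        done.push(await task(items[index], index));
      } catch (error) {
        failed.push({ item: items[index], error });
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return { done, failed };
}

// Stand-in task; in the crawler this would call downloadImage(url, title, index).
const urls = ['a', 'b', 'c', 'd', 'e'];
const fakeDownload = async url => {
  if (url === 'c') throw new Error('simulated failure');
  return `saved ${url}`;
};

const summary = runWithLimit(urls, 2, fakeDownload);
summary.then(({ done, failed }) => {
  console.log(`${done.length} succeeded, ${failed.length} failed`);
});
```

Because the workers share one index counter and Node runs them on a single thread, no item is processed twice, and the `failed` list plays the role of errorImgList.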