Node in Practice

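Course Goal

From the command line, enter a keyword and a count, then download the matching images from a Baidu image search.

Course Outline

  • Build a crawler in Node that scrapes Baidu Images;
  • Package it as a CLI.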

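Node Crawlers

What is a crawler?

A crawler is a probing robot that automatically imitates human behavior across websites: clicking buttons, looking up data, or carrying back the information it sees, like a bug crawling tirelessly through a building.

  • Search engines: keep their own content and their index of other sites up to date; this is benign;
  • Apps: ticket-grabbing software fires thousands of requests per second and puts excessive load on the 12306 servers; this is malicious;
  • Individuals: use crawlers to fetch site content; this is best limited to learning and similar purposes.

Can a crawler scrape whatever it likes without restraint?

No. Every page a crawler visits consumes the site's traffic, bandwidth, and other server resources.

Q: How can a small site keep crawlers from eating its bandwidth?

  • Server-side checks: rate limiting and verifying the client's identity (cookies, page source data);
  • robots.txt;
  • meta tags.

robots.txt is an ASCII-encoded text file stored in a site's root directory. It usually tells crawlers which parts of the site should not be crawled, or which search engines are allowed to crawl it.

The protocol is a convention rather than something the server enforces: some crawlers honor it and others do not. Search engines that honor it recognize these directives and leave the affected pages out of their index.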

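User-agent: # the crawler this group of rules applies to (* matches every crawler)
Disallow: # empty: nothing is blocked; /: the root directory, so the whole site is blocked
Allow: # paths that may be crawled; /: the root directory, so the whole site is allowed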

# Allow all robots
User-agent: *
Disallow:

User-agent: *
Allow: /

# Allow a specific robot (replace name_spider with the crawler's actual name)
User-agent: name_spider
Allow: /

# Block all crawlers
User-agent: *
Disallow: /

# Keep all robots out of specific directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

# Keep all robots away from specific file types
User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$

Or in another form, as a meta tag inside the page:

<meta name='robots' content='noindex,nofollow' />
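
As a rough illustration of how a polite crawler might consult robots.txt rules before fetching a page, here is a minimal sketch. The helpers `parseRobots` and `isPathAllowed` are hypothetical and deliberately simplified: they only handle `User-agent`, `Allow`, and `Disallow` with plain prefix matching (no `*` or `$` wildcards), and the longest matching rule wins.

```javascript
// Hypothetical helpers: a simplified robots.txt check (prefix matching only).
function parseRobots(robotsTxt) {
  const groups = []; // each group: { agents: [...], rules: [{ allow, path }] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }

    lastWasAgent = false;
    // An empty value (e.g. "Disallow:") restricts nothing, so it adds no rule.
    if ((field === 'allow' || field === 'disallow') && current && value) {
      current.rules.push({ allow: field === 'allow', path: value });
    }
  }
  return groups;
}

function isPathAllowed(robotsTxt, userAgent, pathname) {
  const groups = parseRobots(robotsTxt);
  const agent = userAgent.toLowerCase();
  // Prefer a group that names this crawler; otherwise fall back to "*".
  const group =
    groups.find(g => g.agents.includes(agent)) || groups.find(g => g.agents.includes('*'));
  if (!group) return true;

  // The longest matching path prefix decides; no matching rule means allowed.
  let best = null;
  for (const rule of group.rules) {
    if (pathname.startsWith(rule.path) && (!best || rule.path.length > best.path.length)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}

const sample = ['User-agent: *', 'Disallow: /private/', 'Disallow: /tmp/'].join('\n');
console.log(isPathAllowed(sample, 'my-crawler', '/private/a.html')); // false
console.log(isPathAllowed(sample, 'my-crawler', '/images/a.png')); // true
```

A production crawler would more likely rely on an existing robots.txt parser; the sketch only shows the shape of the decision.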

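How do you build a crawler application?

  1. Decide which site and pages to crawl;
  2. Analyze the site's data and DOM;
  3. Choose the technology:
    1. Simulating browser requests:

      1. request: no longer maintained, not recommended;

      2. superagent: a module for issuing HTTP requests from Node or other server-side code; it supports the full set of ajax methods and lets you set headers and more (used here);

    2. Parsing the DOM:

      1. cheerio: its API resembles jQuery (used here);

      2. jsdom: can parse DOM text;

    3. Simulating user actions:

      1. puppeteer: effectively launches a browser from Node to simulate what Chrome does; often used for scheduled checks of page functionality.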

CLI

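A CLI (Command Line Interface) is a tool you interact with from the command line. create-react-app and vue-cli are two CLIs most of us already know well.

Purpose: take over repetitive work and improve development efficiency.

  1. Quickly generate application templates: vue-cli, for example, scaffolds an app from a few interactive questions;
  2. Create module template files: angular-cli creates components and modules; sequelize-cli creates models mapped to MySQL tables;
  3. Start services, e.g. ng serve;
  4. Lint code with eslint; the vue and angular CLIs both include this;
  5. Run automated tests; the vue and angular CLIs both include this;
  6. Compile and build; the vue and angular CLIs both include this.

Compared with npm scripts

npm scripts can also drive a development workflow: you configure npm commands in the scripts object of package.json and run the corresponding JS to reach the same goal. So what advantages does a CLI tool have over npm scripts?

  1. npm scripts belong to one specific project and only work inside it; a CLI can be installed globally and shared by many projects;
  2. Embedding the workflow in a business project through npm scripts couples the two too tightly; a CLI separates workflow code from business code, so the business code can focus on the business;
  3. A CLI tool can keep being iterated on, evolved, and refined over time.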

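How do you build a CLI?

Libraries involved:

commander: handles the command-line side of a CLI (commands, options, arguments)

inquirer: provides interactive prompts in the terminal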
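
To make concrete what a library such as commander takes off your hands, here is a minimal hand-rolled sketch that turns `--name value` arguments into an options object. `parseCliArgs` and the `--keyword`, `--count`, and `--verbose` flags are hypothetical; this is not commander's API, and a real CLI would normally let commander do the parsing and let inquirer ask for anything that is missing.

```javascript
// Hypothetical helper: parse `--name value` pairs and bare `--flag` switches.
function parseCliArgs(argv) {
  const options = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith('--')) continue; // ignore positional arguments in this sketch

    const name = arg.slice(2);
    const next = argv[i + 1];
    if (next === undefined || next.startsWith('--')) {
      options[name] = true; // a switch without a value
    } else {
      options[name] = next;
      i++; // skip the value we just consumed
    }
  }
  return options;
}

// In a real script you would pass process.argv.slice(2).
const options = parseCliArgs(['--keyword', 'corgi', '--count', '30', '--verbose']);
console.log(options); // { keyword: 'corgi', count: '30', verbose: true }
```

Note that every value arrives as a string, which is one reason dedicated libraries offer typed options and validation.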

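Hands-on

The course uses a search for “柯基” (corgi) as the example.

Initialize the project

  1. Run npm init to initialize the project;
  2. Install superagent and cheerio;
  3. Specify the project's entry point.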
npm init

// Install the base dependencies
npm i -S superagent cheerio

// Add the start script under "scripts" in package.json
"start": "node index.js"

// Run
npm run start

Trying out superagent and cheerio

Fetch a page's HTML with superagent

const superagent = require('superagent');
const cheerio = require('cheerio');

const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
  if (err) {
    console.log(`Request failed: ${err}`);
    return;
  }

  console.log(res); // the response object; the page's HTML is in res.text
});

Parse the fetched HTML with cheerio

const superagent = require('superagent');
const cheerio = require('cheerio');

const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
  if (err) {
    console.log(`Request failed: ${err}`);
    return;
  }

  // console.log(res); // the response object; the page's HTML is in res.text

  const htmlText = res.text;
  const $ = cheerio.load(htmlText);

  $('meta').each((index, ele) => {
    console.log(`${index}: ${$(ele).attr('content')}`);
  });
});

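Parse the fetched HTML and scrape Baidu Images

Search for 柯基 and inspect the request under the Doc tab of the browser's developer tools.

Full URL: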

image.baidu.com/search/inde…

URL trimmed to the useful parameters:

image.baidu.com/search/inde…

image.baidu.com/search/inde…

  • tn: selects Baidu Images;
  • ie: the character encoding;
  • word: the encoded keyword, i.e. encode(keyword).
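
A minimal sketch of assembling that URL from the three parameters. The base URL and the `tn` and `ie` values are the ones used in this article's listings; the function name `buildSearchUrl` is hypothetical. `URLSearchParams` is used here instead of `encodeURIComponent`; both percent-encode the Chinese keyword the same way, although `URLSearchParams` writes spaces as `+`.

```javascript
// Build the search URL from the three query parameters.
function buildSearchUrl(keyword) {
  const params = new URLSearchParams({
    tn: 'baiduimage', // selects Baidu Images
    ie: 'utf-8', // encoding of the query
    word: keyword, // percent-encoded automatically
  });
  return `https://image.baidu.com/search/index?${params.toString()}`;
}

console.log(buildSearchUrl('柯基'));
// https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%9F%AF%E5%9F%BA
```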

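The resource we need is objURL, which lives in JSON embedded in the page.

Data source: the page source shows that the resources are all stored in JSON and rendered into the page by JavaScript, so the images cannot be found by parsing the DOM with cheerio.

Searching the source for objURL shows that the image URLs are escaped, in this format: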

https:\/\/gimg2.baidu.com\/image_search\/src=http%3A%2F%2Fimg11.51tietu.net%2Fpic%2F2019112217%2F3fhqw1z53c33fhqw1z53c3.jpg&refer=http%3A%2F%2Fimg11.51tietu.net&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1659948766&t=9de4acba8fd9610a9f8bee2374bd42e3

"objURL": "","XX":"XXX"

Q: How do we extract objURL?

With a regular expression:

/"objURL":"(.*?)",/;
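
A small sketch of applying that pattern to every occurrence and undoing the `\/` escaping. The sample string and the helper name `extractObjUrls` are hypothetical; the real page source is much larger, but the matching works the same way.

```javascript
// Hypothetical sample: a fragment shaped like the JSON embedded in the page source.
const sampleSource =
  '{"objURL":"https:\\/\\/example.com\\/a.jpg","fromPageTitle":"A"},' +
  '{"objURL":"https:\\/\\/example.com\\/b.jpg","fromPageTitle":"B"}';

function extractObjUrls(source) {
  // matchAll needs the g flag and yields one match object per hit; [1] is the capture group.
  const matches = source.matchAll(/"objURL":"(.*?)",/g);
  // Undo the JSON escaping of "/" so the result is a normal URL.
  return Array.from(matches, match => match[1].replace(/\\\//g, '/'));
}

console.log(extractObjUrls(sampleSource));
// [ 'https://example.com/a.jpg', 'https://example.com/b.jpg' ]
```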

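Move the image fetching and processing logic into a separate file, img.handler.js.

  • keyword: encode it;
  • The course notes suggest requesting over http to sidestep https-related security problems, although the listings in this article use the https URL.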
const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;
    console.log(htmlText);
  });

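Q: Why does the request land on Baidu's security verification page?

  • Too many requests were sent to Baidu Images within a short time;

  • The request carried no headers.

Headers of a successful request: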

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

Note: keep the request headers as close to the browser's as possible, especially Accept and User-Agent.

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;
    console.log(htmlText);
  });

Get the list of image links

Use the regex to collect an array of objURL matches:

const htmlText = res.text;
const imageMatches = htmlText.match(/"objURL":"(.*?)",/g);

Then use a second regex to keep only the link:

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
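      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    // With the g flag, match() returns the full matches, or null when nothing matches.
    const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];

    // Re-match each item without the g flag so the capture group is available as [1].
    const imageUrlList = imageMatches.map(item => item.match(/:"(.*?)"/)[1]);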

    console.log(imageUrlList);
  });

Get the image titles

In the page source, the image name is stored under fromPageTitle.

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    // With the g flag, match() returns the full matches, or null when nothing matches.
    const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];

    // Re-match each item without the g flag so the capture group is available as [1].
    const imageUrlList = imageMatches.map(item => item.match(/:"(.*?)"/)[1]);

    const titleMatches = htmlText.match(/"fromPageTitle":"(.*?)",/g) || [];

    const titleList = titleMatches.map(item => item.match(/:"(.*?)"/)[1]);

    console.log(imageUrlList, titleList);
  });

The titles cannot be used as they are: they need cleaning up, and the regex logic should be wrapped in a helper.

const superagent = require('superagent');
const cheerio = require('cheerio');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

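function getValueListByReg(str, key) {
  // Collect the value of every `"key":"value"` pair found in the page source.
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes each capture group as match[1] and yields nothing when there
  // is no match, which avoids both the legacy RegExp.$1 and a crash on null.
  return Array.from(str.matchAll(reg), match => match[1]);
}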

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // strip every <strong>, </strong> and <\/strong>
        .replace(/[\\\/:*?"<>|]/g, '') // strip characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);
  });
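
Because the cleaned titles later become file names, it can help to isolate the clean-up in one small function that is easy to test on its own. The sketch below is one possible shape; `sanitizeFileName`, the `untitled` fallback, and the 100-character limit are assumptions, not part of the course code.

```javascript
// Hypothetical helper: turn a scraped title into a string that is safe to use as a file name.
function sanitizeFileName(title, fallback = 'untitled', maxLength = 100) {
  const cleaned = title
    .replace(/<\\?\/?strong>/g, '') // <strong>, </strong> and the escaped <\/strong>
    .replace(/[\\\/:*?"<>|]/g, '') // characters Windows rejects in file names
    .trim()
    .slice(0, maxLength); // keep the name comfortably short
  return cleaned || fallback; // never return an empty name
}

console.log(sanitizeFileName('<strong>corgi<\\/strong>: cute? pics')); // corgi cute pics
console.log(sanitizeFileName('???')); // untitled
```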

Create a directory to store the images

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');

const word = '柯基';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

function getValueListByReg(str, key) {
  // Collect the value of every `"key":"value"` pair found in the page source.
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes each capture group as match[1] and yields nothing when there
  // is no match, which avoids both the legacy RegExp.$1 and a crash on null.
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);

  // Check whether the directory already exists
  if (fs.existsSync(fullPath)) {
    console.log(`${pathname} already exists, skipping this step`);
    return;
  }

  // Create the directory
  fs.mkdirSync(fullPath);
  console.log(`Directory created: ${pathname}`);
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // strip every <strong>, </strong> and <\/strong>
        .replace(/[\\\/:*?"<>|]/g, '') // strip characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    mkImageDir('bd-images');
  });
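
As an alternative to checking `existsSync` first, `fs.mkdirSync` accepts `{ recursive: true }`, which creates missing parent directories and does not throw when the directory is already there. A minimal sketch, where `ensureDir` and the demo directory name are hypothetical:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: create the directory if needed and return its absolute path.
function ensureDir(dirPath) {
  // With recursive: true, an existing directory is not an error.
  fs.mkdirSync(dirPath, { recursive: true });
  return path.resolve(dirPath);
}

// Demo in the system temp directory so nothing in the project is touched.
const demoDir = path.join(os.tmpdir(), 'bd-images-demo');
ensureDir(demoDir);
ensureDir(demoDir); // calling it again is harmless
console.log(fs.existsSync(demoDir)); // true
```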

Download the images locally

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');

const word = '柯基搞笑图片';

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

function getValueListByReg(str, key) {
  // Collect the value of every `"key":"value"` pair found in the page source.
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes each capture group as match[1] and yields nothing when there
  // is no match, which avoids both the legacy RegExp.$1 and a crash on null.
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);

  // Check whether the directory already exists
  if (fs.existsSync(fullPath)) {
    console.log(`${pathname} already exists, skipping this step`);
    return;
  }

  // Create the directory
  fs.mkdirSync(fullPath);
  console.log(`Directory created: ${pathname}`);
}

// Download an image to the local directory
function downloadImage(url, name, index) {
  const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

  // Skip files that already exist
  if (fs.existsSync(fullPath)) {
    console.log(`Already exists: ${fullPath}`);
    return;
  }

  superagent.get(url).end((err, res) => {
    if (err) {
      console.log(err, `Failed to fetch the link, response: ${res}`);
      return;
    }

    // Skip responses with an empty body
    if (JSON.stringify(res.body) === '{}') {
      console.log(`Image ${index + 1} is empty`);
      return;
    }

    // Write the raw bytes; the 'binary' encoding only matters when res.body is a string
    fs.writeFile(fullPath, res.body, 'binary', err => {
      if (err) {
        console.log(`Failed to save image ${index + 1}: ${err}`);
        return;
      }
      console.log(`Image ${index + 1} downloaded from: ${url}`);
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end((err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // strip every <strong>, </strong> and <\/strong>
        .replace(/[\\\/:*?"<>|]/g, '') // strip characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    mkImageDir('bd-images');

    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index);
    });
  });
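
The listing saves every file with a .png extension regardless of what the server returns. One possible refinement, sketched here under assumptions, is to pick the extension from the response's Content-Type header (in superagent this would likely be read from `res.type` or `res.headers['content-type']`). The helper name `extensionFor` and the mapping table are hypothetical.

```javascript
// Hypothetical helper: choose a file extension from a Content-Type header value.
const EXTENSION_BY_TYPE = {
  'image/jpeg': '.jpg',
  'image/png': '.png',
  'image/gif': '.gif',
  'image/webp': '.webp',
};

function extensionFor(contentType, fallback = '.png') {
  if (!contentType) return fallback;
  // Drop parameters such as "; charset=binary" and normalize the case.
  const mimeType = contentType.split(';')[0].trim().toLowerCase();
  return EXTENSION_BY_TYPE[mimeType] || fallback;
}

console.log(extensionFor('image/jpeg')); // .jpg
console.log(extensionFor('IMAGE/WEBP; q=1')); // .webp
console.log(extensionFor('text/html')); // .png
```

Falling back to .png keeps the behavior of the original listing whenever the type is missing or unrecognized.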

Add a progress bar

Install cli-progress.

Convert directory creation and image download to promises, and switch to chained calls.

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar display
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

function getValueListByReg(str, key) {
  // Collect the value of every `"key":"value"` pair found in the page source.
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes each capture group as match[1] and yields nothing when there
  // is no match, which avoids both the legacy RegExp.$1 and a crash on null.
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // Check whether the directory already exists
    if (fs.existsSync(fullPath)) {
      return reject(`${pathname} already exists, skipping this step`);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

// Download an image to the local directory
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

    // Reject when the file already exists
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() passes on a single value, so fold the context into one message
        return reject(`Failed to fetch ${url}: ${err}`);
      }

      // Treat an empty body as an empty image
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // Write the raw bytes; the 'binary' encoding only matters when res.body is a string
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Failed to save image ${index + 1}: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from: ${url}`);
      });
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end(async (err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // strip every <strong>, </strong> and <\/strong>
        .replace(/[\\\/:*?"<>|]/g, '') // strip characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    total = imageUrlList.length;

    await mkImageDir('bd-images');

    bar.start(total, 0);

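    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index)
        .catch(error => {
          // Rejections are asynchronous, so a try/catch around forEach never sees them.
          console.log('error >>>>> ', error);
        })
        .then(() => {
          // Count failures too; otherwise the bar would never reach the total.
          finished++;
          bar.update(finished);

          if (finished === total) {
            bar.stop();
            console.log('Done: every image has been processed.');
          }
        });
    });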
  });
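
Another way to track a batch of promise-based downloads is `Promise.allSettled`, which waits for every task and reports each outcome instead of stopping at the first failure. The sketch below uses stand-in tasks rather than real downloads; `runAll`, `summarizeSettled`, and the `onProgress` callback are hypothetical names, and in the listing above `onProgress` would be where `bar.update` is called.

```javascript
// Hypothetical helper: summarize the array produced by Promise.allSettled.
function summarizeSettled(results) {
  const succeeded = results.filter(r => r.status === 'fulfilled').length;
  const reasons = results.filter(r => r.status === 'rejected').map(r => String(r.reason));
  return { succeeded, failed: reasons.length, reasons };
}

// Run every task, wait for all of them, and report how many succeeded or failed.
// `tasks` is an array of functions that each return a promise (one download each).
async function runAll(tasks, onProgress = () => {}) {
  let finished = 0;
  const results = await Promise.allSettled(
    tasks.map(task =>
      // finally() fires for success and failure alike, so progress always reaches the total.
      task().finally(() => onProgress(++finished, tasks.length))
    )
  );
  return summarizeSettled(results);
}

// Demo with stand-in tasks instead of real downloads.
runAll([
  () => Promise.resolve('image 1 saved'),
  () => Promise.reject(new Error('image 2 failed')),
]).then(summary => console.log(summary));
```

This starts all downloads at once, just like the forEach version; limiting concurrency would need an additional queue.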

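Delete the image directory automatically when it already exists

This avoids aborting the run every time the directory is found to exist.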

const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar display
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

function getValueListByReg(str, key) {
  // Collect the value of every `"key":"value"` pair found in the page source.
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll exposes each capture group as match[1] and yields nothing when there
  // is no match, which avoids both the legacy RegExp.$1 and a crash on null.
  return Array.from(str.matchAll(reg), match => match[1]);
}

// Delete the image directory when it already exists
function removeDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);
  console.log(`${pathname} already exists, deleting it`);

  // fs.rmSync (Node 14.14+) accepts `force`; fs.rmdirSync does not, and its
  // `recursive` option is deprecated.
  fs.rmSync(fullPath, {
    force: true, // do not fail if the path is missing
    recursive: true, // delete the contents as well
  });

  console.log(`Directory ${pathname} deleted!`);

  /*
    Alternative (Unix-like systems only):
    const childProcess = require('child_process');
    childProcess.execSync(`rm -rf ${fullPath}`);
  */
}

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // Check whether the directory already exists
    if (fs.existsSync(fullPath)) {
      // return reject(`${pathname} already exists, skipping this step`);
      removeDir(pathname);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

// Download an image to the local directory
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);

    // Reject when the file already exists
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() passes on a single value, so fold the context into one message
        return reject(`Failed to fetch ${url}: ${err}`);
      }

      // Treat an empty body as an empty image
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // Write the raw bytes; the 'binary' encoding only matters when res.body is a string
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Failed to save image ${index + 1}: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from: ${url}`);
      });
    });
  });
}

superagent
  .get(
    `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
  )
  .set('Accept', header['Accept'])
  .set('Accept-Encoding', header['Accept-Encoding'])
  .set('Accept-Language', header['Accept-Language'])
  .set('Cache-Control', header['Cache-Control'])
  .set('Connection', header['Connection'])
  .set('User-Agent', header['User-Agent'])
  .set('sec-ch-ua', header['sec-ch-ua'])
  .end(async (err, res) => {
    if (err) {
      console.log(`Request failed: ${err}`);
      return;
    }

    const htmlText = res.text;

    const imageUrlList = getValueListByReg(htmlText, 'objURL');
    const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
      item
        .replace(/<\\?\/?strong>/g, '') // strip every <strong>, </strong> and <\/strong>
        .replace(/[\\\/:*?"<>|]/g, '') // strip characters that are not allowed in file names
    );

    console.log(imageUrlList, titleList);

    total = imageUrlList.length;

    await mkImageDir('bd-images');

    bar.start(total, 0);

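    imageUrlList.forEach((url, index) => {
      downloadImage(url, titleList[index], index)
        .catch(error => {
          // Rejections are asynchronous, so a try/catch around forEach never sees them.
          console.log('error >>>>> ', error);
        })
        .then(() => {
          // Count failures too; otherwise the bar would never reach the total.
          finished++;
          bar.update(finished);

          if (finished === total) {
            bar.stop();
            console.log('Done: every image has been processed.');
          }
        });
    });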
  });
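
Deleting a directory recursively is risky if the path is ever built from user input, so it may be worth guarding the delete. The following sketch, with a hypothetical helper `resetDir`, empties a directory by deleting and recreating it, but refuses any target that resolves outside the given base directory. It relies on `fs.rmSync`, which is available from Node 14.14.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: empty `name` inside `baseDir` by deleting and recreating it.
function resetDir(baseDir, name) {
  const target = path.resolve(baseDir, name);
  const relative = path.relative(baseDir, target);

  // Refuse the base directory itself and anything outside it (e.g. "../..").
  const escapes =
    relative === '' ||
    relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative);
  if (escapes) {
    throw new Error(`Refusing to delete ${target}`);
  }

  fs.rmSync(target, { recursive: true, force: true }); // force: a missing path is fine
  fs.mkdirSync(target, { recursive: true });
  return target;
}

// Demo in the system temp directory.
const demoDir = resetDir(os.tmpdir(), 'bd-images-reset-demo');
fs.writeFileSync(path.join(demoDir, 'old.png'), 'stale');
resetDir(os.tmpdir(), 'bd-images-reset-demo');
console.log(fs.readdirSync(demoDir)); // []
```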

Enter the keyword through the CLI

In img.handler.js, wrap the superagent logic in a function and export it; the CLI entry file imports it as runImg.

#!/usr/bin/env node

const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer major versions (9+) are ESM-only and cannot be loaded with require()
const commander = require('commander');

const { runImg } = require('./img.handler.js');

const question = [
  {
    type: 'checkbox',
    name: 'channels',
    message: 'Select the channels to search',
    choices: [
      {
        name: 'Baidu Images',
        value: 'images',
      },
      {
        name: 'Baidu Videos',
        value: 'videos',
      },
    ],
  },
  {
    type: 'input',
    name: 'keyword',
    message: 'Enter the keyword to search for',
  },
  {
    type: 'number',
    name: 'counts',
    message: 'Enter the number of images to download (minimum 30)',
  },
];

inquirer.prompt(question).then(result => {
  const { keyword, channels, counts } = result;

  for (let channel of channels) {
    switch (channel) {
      case 'images':
        runImg(keyword, counts);
        break;
    }
  }
});
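
The prompt asks for at least 30 images, but nothing in the listing enforces that, and a number prompt may hand back NaN when the input is not numeric. A minimal sketch of normalizing the answer before calling runImg; `normalizeCount` and the upper limit of 300 are assumptions made for illustration.

```javascript
// Hypothetical helper: coerce the prompt answer into a usable image count.
function normalizeCount(answer, { min = 30, max = 300 } = {}) {
  const count = Math.floor(Number(answer));
  if (!Number.isFinite(count)) return min; // empty or non-numeric input
  return Math.min(Math.max(count, min), max); // clamp into [min, max]
}

console.log(normalizeCount(45)); // 45
console.log(normalizeCount('abc')); // 30
console.log(normalizeCount(5)); // 30
console.log(normalizeCount(1000)); // 300
```

In the CLI this could be applied as `runImg(keyword, normalizeCount(counts))`.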

Download a specified number of images

Keep inspecting Baidu Images and scroll to load the next page.

Full request:

image.baidu.com/search/acjs…

Trimmed request:

image.baidu.com/search/acjs…

The response is JSON; take middleURL from it.
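
To fetch a requested number of images, the paged endpoint has to be called several times. The sketch below only covers the bookkeeping: which offsets to request and how to collect middleURL values from the parsed responses. It assumes 30 results per page and a response shaped like `{ data: [{ middleURL: '...' }] }`; both are assumptions, the helper names are hypothetical, and the actual query parameter names must be taken from the request seen in the browser.

```javascript
// Hypothetical helper: the starting offset of each page request.
function pageOffsets(count, pageSize = 30) {
  const pages = Math.ceil(count / pageSize);
  return Array.from({ length: pages }, (_, page) => page * pageSize);
}

// Hypothetical helper: collect middleURL values from parsed JSON responses.
function collectMiddleUrls(responses, count) {
  const urls = [];
  for (const response of responses) {
    for (const item of response.data || []) {
      if (item && item.middleURL) urls.push(item.middleURL); // skip entries without a URL
    }
  }
  return urls.slice(0, count); // the last page may overshoot the requested count
}

console.log(pageOffsets(75)); // [ 0, 30, 60 ]
```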

Full code

  • ./index.js
#!/usr/bin/env node

const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer major versions (9+) are ESM-only and cannot be loaded with require()
const commander = require('commander');

const { runImg } = require('./img.handler.js');

const question = [
  {
    type: 'checkbox',
    name: 'channels',
    message: 'Select the channels to search',
    choices: [
      {
        name: 'Baidu Images',
        value: 'images',
      },
      {
        name: 'Baidu Videos',
        value: 'videos',
      },
    ],
  },
  {
    type: 'input',
    name: 'keyword',
    message: 'Enter the keyword to search for',
  },
  {
    type: 'number',
    name: 'counts',
    message: 'Enter the number of images to download (minimum 30)',
  },
];

inquirer.prompt(question).then(result => {
  const { keyword, channels, counts } = result;

  for (let channel of channels) {
    switch (channel) {
      case 'images':
        runImg(keyword, counts);
        break;
    }
  }
});
  • ./img.handler.js
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar

const word = '柯基可爱';

// Progress bar display
const bar = new cliProgress.SingleBar(
  {
    clearOnComplete: false,
  },
  cliProgress.Presets.shades_classic
);

let total = 0;
let finished = 0;

// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"

const header = {
  Accept:
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Accept2: 'text/plain, */*; q=0.01',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Cache-Control': 'max-age=0',
  Connection: 'keep-alive',
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};

// Collect the value of every `"key":"value"` pair found in the raw text
function getValueListByReg(str, key) {
  const reg = new RegExp(`"${key}":"(.*?)"`, 'g');

  // matchAll yields one match per occurrence (and nothing when there is none);
  // index 1 is the captured value, so the legacy RegExp.$1 is not needed
  return Array.from(str.matchAll(reg), match => match[1]);
}

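// Delete the image directory if it already exists
function removeDir(pathname) {
  const fullPath = path.resolve(__dirname, pathname);
  console.log(`Directory ${pathname} already exists, deleting it`);

  // fs.rmSync (Node 14.14+) supports both options; fs.rmdirSync has no `force` option
  fs.rmSync(fullPath, {
    force: true, // ignore a missing path
    recursive: true, // delete the contents recursively
  });

  console.log(`Directory ${pathname} deleted`);

  /*
    Alternative (Unix-like shells only):
    const childProcess = require('child_process');
    childProcess.execSync(`rm -rf ${fullPath}`);
  */
}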

// Create the directory that stores the images
function mkImageDir(pathname) {
  return new Promise((resolve, reject) => {
    const fullPath = path.resolve(__dirname, pathname);

    // Check whether the directory already exists
    if (fs.existsSync(fullPath)) {
      // return reject(`${pathname} already exists, skipping this step`);
      removeDir(pathname);
    }

    // Create the directory
    fs.mkdirSync(fullPath);
    console.log(`Directory created: ${pathname}`);
    return resolve();
  });
}

const errorImgList = [];

// Download one image to the local bd-images directory
function downloadImage(url, name, index) {
  return new Promise((resolve, reject) => {
    // Strip characters that are invalid in file names; fall back when the title is missing
    const safeName = String(name || 'untitled').replace(/[\\\/:*?"<>|]/g, '');
    const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${safeName}.png`);

    // Check whether the file already exists
    if (fs.existsSync(fullPath)) {
      return reject(`Already exists: ${fullPath}`);
    }

    superagent.get(url).end((err, res) => {
      if (err) {
        // reject() only forwards its first argument, so pass a single message
        return reject(`Failed to fetch ${url}: ${err}`);
      }

      // Check whether the response body is empty
      if (JSON.stringify(res.body) === '{}') {
        return resolve(`Image ${index + 1} is empty`);
      }

      // 'binary': write the file in binary format
      fs.writeFile(fullPath, res.body, 'binary', err => {
        if (err) {
          return reject(`Image ${index + 1} failed to download: ${err}`);
        }
        return resolve(`Image ${index + 1} downloaded from ${url}`);
      });
    });
  });
}

function runImg(keyword, counts) {
  superagent
    .get(
      `https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(
        keyword
      )}`
    )
    .set(header) // superagent accepts an object of header fields
    .end(async (err, res) => {
      if (err) {
        console.log(`Request failed: ${err}`);
        return;
      }

      const htmlText = res.text;

      // Remove the <strong> highlight tags (the closing tag is escaped as <\/strong> in the page source)
      const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(item =>
        item.replace(/<strong>|<\\\/strong>/g, '')
      );

      // This page only embeds the first batch of results; cap it at the requested count
      const limit = Number.isInteger(counts) && counts > 0 ? counts : Infinity;
      const imageUrlList = getValueListByReg(htmlText, 'objURL').slice(0, limit);

      total = imageUrlList.length;
      finished = 0;

      if (total === 0) {
        console.log('No image links found');
        return;
      }

      await mkImageDir('bd-images');

      bar.start(total, 0);

      imageUrlList.forEach((url, index) => {
        downloadImage(url, titleList[index], index)
          // Record failures here: a try/catch around forEach cannot catch async rejections
          .catch(() => errorImgList.push(url))
          .then(() => {
            // Count every settled download so the bar always reaches the total
            finished++;
            bar.update(finished);

            if (finished === total) {
              bar.stop();
              console.log('All downloads have finished');
              console.log(errorImgList);
            }
          });
      });
    });
}

module.exports = {
  runImg,
};
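
The completion tracking in runImg relies on the shared `finished` and `total` counters. As an alternative sketch, not the approach used above, the same bookkeeping can be expressed with `Promise.allSettled`, which waits for every download whether it succeeds or fails. `downloadAll` and `fakeDownload` are hypothetical names; `download` stands in for a function such as downloadImage.

```javascript
// `download` is a stand-in for a promise-returning function such as downloadImage
async function downloadAll(urls, download) {
  // allSettled never rejects: each result has status 'fulfilled' or 'rejected'
  const results = await Promise.allSettled(urls.map((url, index) => download(url, index)));
  const failed = urls.filter((_, index) => results[index].status === 'rejected');
  return { succeeded: urls.length - failed.length, failed };
}

// Usage with a fake downloader that rejects one URL
const fakeDownload = url =>
  url.includes('bad') ? Promise.reject(new Error('boom')) : Promise.resolve(url);

downloadAll(['a.png', 'bad.png', 'c.png'], fakeDownload).then(summary => console.log(summary));
```

The returned `failed` list plays the role of errorImgList, and the progress bar could be stopped once the returned promise resolves.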