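Course Objectives
From the command line, enter a keyword and a count to download images from Baidu image search results.
Course Outline
- Build a crawler with node to scrape Baidu Images;
- Package it as a CLI;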
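Node Crawler
What is a crawler?
A crawler is a probing robot that automatically imitates human behavior across websites: clicking buttons, looking up data, or carrying back the information it sees, like a bug tirelessly crawling around a building.
- Search engines: crawl to refresh their own content and their index of other websites; this is benign;
- Apps: ticket-grabbing software that fires thousands of requests per second and overloads the 12306 servers; this is malicious;
- Individuals: use crawlers to fetch website content; best kept to learning and similar purposes;
Can a crawler scrape everything without restraint?
No. Every crawled request consumes the website's traffic, bandwidth, or other server resources;
Q: How can a small website keep crawlers from eating its bandwidth?
- Server-side checks: rate limiting, verifying user identity (cookie, page metadata);
- robots.txt;
- meta tags;
robots.txt: an ASCII-encoded text file placed in the website's root directory. It usually tells crawlers which content on the site should not be crawled, or which search engines are allowed to crawl it;
This protocol is a convention, not an enforced rule: some search engines honor it and others do not. Engines that honor it read this metadata and skip indexing the page;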
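User-agent: # which crawler the following rules apply to
Disallow: # empty: nothing is disallowed; /: the root directory, i.e. the whole site, is disallowed
Allow: # paths that may be crawled; /: the whole site is allowed
# Allow all robots
User-agent: *
Disallow:
User-agent: *
Allow: /
# Allow a specific robot
User-agent: name_spider (replace with the crawler's actual name)
Allow:
# Block all crawlers
User-agent: *
Disallow: /
# Keep all robots out of specific directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
# Keep all robots away from specific file types
User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Or in another form, as a meta tag: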
<meta name='robots' content='noindex,nofollow' />
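To make the Disallow and Allow rules concrete, here is a minimal sketch of how a polite crawler might check a path before requesting it. The function names `parseRobots` and `isAllowed` are hypothetical, and the sketch assumes plain prefix matching only: it does not handle the `*` and `$` wildcards used in the file-type example.

```javascript
// Parse robots.txt text into groups of { agents, rules }.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (!line || idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // consecutive User-agent lines share one group
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((field === 'allow' || field === 'disallow') && current && value) {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Decide whether userAgent may fetch path (prefix matching only).
function isAllowed(robotsText, userAgent, path) {
  const groups = parseRobots(robotsText);
  const ua = userAgent.toLowerCase();
  // prefer a group that names this agent, otherwise fall back to "*"
  const group =
    groups.find(g => g.agents.includes(ua)) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    // the longest matching prefix wins; on a tie, Allow wins
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}

const sampleRobots = ['User-agent: *', 'Disallow: /private/', 'Disallow: /tmp/'].join('\n');
console.log(isAllowed(sampleRobots, 'name_spider', '/images/a.png')); // true
console.log(isAllowed(sampleRobots, 'name_spider', '/private/a.png')); // false
```

An empty `Disallow:` adds no rule at all, which is why it means "nothing is disallowed".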
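How do you build a crawler application?
- Identify the website and pages to crawl;
- Analyze the website's data and DOM;
- Choose the tech stack:
  - Simulating browser requests:
    - request: no longer maintained; not recommended;
    - superagent: a module for making proxied requests from node or the server side; supports the full set of ajax methods and lets you set headers, etc. (used here);
  - Parsing the DOM:
    - cheerio (used here);
  - Simulating user actions:
    - puppeteer: effectively launches a browser on the node side to simulate how Chrome runs; commonly used for periodic checks of page functionality;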
CLI
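Command Line Interface: familiar tools such as create-react-app and vue-cli are all CLIs.
Purpose: take over repetitive manual work and improve development efficiency
- Quickly generate application templates, e.g. vue-cli scaffolds an application from an interactive Q&A with the developer
- Create module template files, e.g. angular-cli creates components and modules; sequelize-cli creates models mapped to mysql tables
- Start services, e.g. ng serve
- eslint code checking; the vue and angular tooling generally includes this
- Automated testing; the vue and angular tooling generally includes this
- Compiling and building; the vue and angular tooling generally includes this
Comparison with npm scripts
npm scripts can also implement a development workflow: configure npm commands under the scripts object in package.json and run the corresponding js to reach the same goal. So what advantages does a cli tool have over npm scripts?
- npm scripts belong to one specific project and only work inside it; a cli can be installed globally and shared across projects;
- Embedding the workflow in a business project through npm scripts couples things too tightly; a cli separates workflow code from business code, so the business code can focus on the business;
- A cli tool can keep being iterated on, evolved, and refined over time;
How do you build a CLI?
Libraries involved:
commander: parses command-line arguments and commands for a cli
inquirer: provides interactive command-line prompts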
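To make concrete the kind of argument parsing a library like commander takes care of, here is a dependency-free sketch that uses Node's built-in `util.parseArgs` (available in Node 18.3+) as a stand-in. It is not commander's API. The flag names `--keyword`/`-k` and `--count`/`-n`, the function name `parseCliArgs`, and the default of 30 are assumptions made for illustration.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical flags for this course's tool: --keyword/-k and --count/-n.
function parseCliArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      keyword: { type: 'string', short: 'k' },
      count: { type: 'string', short: 'n' },
    },
  });
  if (!values.keyword) {
    throw new Error('Missing required option: --keyword');
  }
  // fall back to 30 images when no count is given
  const count = Number.parseInt(values.count ?? '30', 10);
  if (!Number.isInteger(count) || count <= 0) {
    throw new Error('--count must be a positive integer');
  }
  return { keyword: values.keyword, count };
}

// In a real entry file the input would be process.argv.slice(2).
console.log(parseCliArgs(['-k', '柯基', '-n', '60']));
```

With flags like these, the tool can be driven non-interactively from scripts, while inquirer covers the interactive question-and-answer flow.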
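Hands-on
The course uses a search for “柯基” (corgi) as the example.
Initialize the project
- Run npm init to initialize the project;
- Install superagent and cheerio;
- Specify the project's entry point;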
npm init
// install the base dependencies
npm i -S superagent cheerio
// edit the project's npm scripts
"start": "node index.js"
// run
npm run start
Trying out superagent and cheerio
Fetching a page's html with superagent
const superagent = require('superagent');
const cheerio = require('cheerio');
const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
console.log(res); // the full response object; the page HTML is in res.text
});
Parsing the fetched html with cheerio
const superagent = require('superagent');
const cheerio = require('cheerio');
const URL = 'http://www.baidu.com';
superagent.get(URL).end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
// console.log(res); // the full response object
const htmlText = res.text;
const $ = cheerio.load(htmlText);
$('meta').each((index, ele) => {
console.log(`${index}: ${$(ele).attr('content')}`);
});
});
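Parse the fetched html and scrape Baidu Images
Search for 柯基 and inspect the request under the Doc tab of the Network panel;
Full url:
Relevant url:
tn: Baidu Images; ie: character encoding; word: encode(keyword);
The resource we need is objURL, which lives in JSON;
Data source: the page source shows that the resources are all stored in JSON and rendered into the page by JS, so cheerio cannot reach them by parsing the dom;
Once objURL is extracted, the image URLs turn out to be escaped, in this format: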
https:\/\/gimg2.baidu.com\/image_search\/src=http%3A%2F%2Fimg11.51tietu.net%2Fpic%2F2019112217%2F3fhqw1z53c33fhqw1z53c3.jpg&refer=http%3A%2F%2Fimg11.51tietu.net&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1659948766&t=9de4acba8fd9610a9f8bee2374bd42e3
"objURL": "","XX":"XXX"
Q: How do we extract objURL?
Use a regular expression
/"objURL":"(.*?)",/;
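As a small illustration of this regex approach, the sketch below pulls every objURL out of a string shaped like the page source and undoes the `\/` escaping. The sample text, its example.com URLs, and the function name `extractObjUrls` are made up for the demonstration.

```javascript
// Sample text shaped like the page source; the URLs here are made up.
const sample =
  '{"objURL":"https:\\/\\/example.com\\/a.jpg","fromPageTitle":"A"},' +
  '{"objURL":"https:\\/\\/example.com\\/b.jpg","fromPageTitle":"B"}';

function extractObjUrls(text) {
  const reg = /"objURL":"(.*?)"/g;
  // matchAll yields one entry per match; index 1 is the capture group
  return [...text.matchAll(reg)].map(match => match[1].replace(/\\\//g, '/'));
}

console.log(extractObjUrls(sample));
// [ 'https://example.com/a.jpg', 'https://example.com/b.jpg' ]
```

The lazy `(.*?)` stops at the first closing quote, so each match stays within a single field.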
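Move the image-fetching logic into a separate file, img.handler.js
- keyword must be encoded;
- the course suggests requesting over http to avoid https-related security issues (the examples below use https);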
const superagent = require('superagent');
const cheerio = require('cheerio');
const word = '柯基';
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
console.log(htmlText);
});
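Q: Why does the request end up at Baidu's security verification?
- Too many requests to Baidu Images within a short period;
- The request carried no headers;
Header structure of a correct request: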
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
Note: keep the headers you send as close as possible to the browser's, especially Accept and UA;
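Copying headers from the browser by hand is error-prone, so one option is to paste the raw text and convert it. The sketch below turns "Name: value" lines into an object; `parseRawHeaders` is a hypothetical helper, not part of superagent. superagent's `.set()` does accept an object of fields, so the result could be passed in a single call.

```javascript
// Convert raw "Name: value" lines (as copied from DevTools) into an object.
function parseRawHeaders(raw) {
  const headers = {};
  for (const line of raw.split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx <= 0) continue; // skip blank or malformed lines
    headers[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return headers;
}

const raw = `Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive`;
const header = parseRawHeaders(raw);
console.log(header);
// superagent.get(url).set(header) would then send both fields in one call
```

Only the first colon on a line is treated as the separator, so values that contain colons survive intact.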
const superagent = require('superagent');
const cheerio = require('cheerio');
const word = '柯基';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
console.log(htmlText);
});
Get the list of image links
Use a regex to get the array of objURL matches;
const htmlText = res.text;
const imageMatches = htmlText.match(/"objURL":"(.*?)",/g);
Then use a second regex to extract only the links;
const superagent = require('superagent');
const cheerio = require('cheerio');
const word = '柯基';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];
const imageUrlList = imageMatches.map(item => {
// without the g flag, match returns the capture group at index 1
const imageUrl = item.match(/:"(.*?)"/);
return imageUrl[1];
});
console.log(imageUrlList);
});
Get the image titles
According to the page source, the image name is stored under: fromPageTitle
const superagent = require('superagent');
const cheerio = require('cheerio');
const word = '柯基';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageMatches = htmlText.match(/"objURL":"(.*?)",/g) || [];
const imageUrlList = imageMatches.map(item => {
// without the g flag, match returns the capture group at index 1
const imageUrl = item.match(/:"(.*?)"/);
return imageUrl[1];
});
const titleMatches = htmlText.match(/"fromPageTitle":"(.*?)",/g) || [];
const titleList = titleMatches.map(item => {
const title = item.match(/:"(.*?)"/);
return title[1];
});
console.log(imageUrlList, titleList);
});
The titles cannot be used as file names directly and need cleaning up; the regex logic should also be wrapped in a function;
const superagent = require('superagent');
const cheerio = require('cheerio');
const word = '柯基';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '').replace('//', '')
// .replace('\\', '')
// .replace('\/', '')
// .replace('/', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
// .replace('|', '')
);
console.log(imageUrlList, titleList);
});
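The chained `replace` calls only remove the first occurrence of each string. As one possible way to clean a title more thoroughly, the sketch below strips the `<strong>` markup (including the escaped `<\/strong>` form) and the characters Windows forbids in file names. `sanitizeTitle` is a hypothetical helper name, and the character set is an assumption based on the commented-out replacements.

```javascript
// Turn a fromPageTitle value into something safe to use as a file name.
function sanitizeTitle(title) {
  return title
    .replace(/<\\?\/?strong>/g, '') // <strong>, </strong>, and the escaped <\/strong>
    .replace(/[\\/:*?"<>|]/g, '') // characters Windows forbids in file names
    .trim();
}

console.log(sanitizeTitle('<strong>corgi<\\/strong> pics: cute?'));
// corgi pics cute
```

Using regexes with the `g` flag removes every occurrence, not just the first one.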
Create a directory to store the images
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const word = '柯基';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
// Create the directory that stores the images
function mkImageDir(pathname) {
const fullPath = path.resolve(__dirname, pathname);
// Skip creation if the directory already exists
if (fs.existsSync(fullPath)) {
console.log(`${pathname} already exists, skipping this step`);
return;
}
// Create the directory
fs.mkdirSync(fullPath);
console.log(`Directory created: ${pathname}`);
}
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '').replace('//', '')
// .replace('\\', '')
// .replace('\/', '')
// .replace('/', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
// .replace('|', '')
);
console.log(imageUrlList, titleList);
mkImageDir('bd-images');
});
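The exists-then-create check can also be collapsed into one call: `fs.mkdirSync` with `recursive: true` creates missing parent directories and does not throw when the directory is already there. The sketch below shows this; `ensureDir` is a hypothetical helper name, and the demo path under the system temp directory is only an example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create dirPath (and any missing parents); a no-op if it already exists.
function ensureDir(dirPath) {
  fs.mkdirSync(dirPath, { recursive: true });
  return dirPath;
}

const demoDir = ensureDir(path.join(os.tmpdir(), 'bd-images-demo'));
ensureDir(demoDir); // calling it again is safe
console.log(`Directory ready: ${demoDir}`);
```

This keeps previously downloaded files in place, which may or may not be what you want between runs.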
Download the images to the local disk
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const word = '柯基搞笑图片';
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
// Create the directory that stores the images
function mkImageDir(pathname) {
const fullPath = path.resolve(__dirname, pathname);
// Skip creation if the directory already exists
if (fs.existsSync(fullPath)) {
console.log(`${pathname} already exists, skipping this step`);
return;
}
// Create the directory
fs.mkdirSync(fullPath);
console.log(`Directory created: ${pathname}`);
}
// Download one image to the local directory
function downloadImage(url, name, index) {
const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);
// Skip files that already exist
if (fs.existsSync(fullPath)) {
console.log(`Already exists: ${fullPath}`);
return;
}
superagent.get(url).end((err, res) => {
if (err) {
console.log(err, `Failed to fetch the link, response: ${res}`);
return;
}
// Skip empty responses
if (JSON.stringify(res.body) === '{}') {
console.log(`Image ${index + 1} is empty`);
return;
}
// when res.body is a Buffer (as for image responses), the 'binary' encoding argument is ignored
fs.writeFile(fullPath, res.body, 'binary', err => {
if (err) {
console.log(`Image ${index + 1} failed to download: ${err}`);
return;
}
console.log(`Image ${index + 1} downloaded from: ${url}`);
});
});
}
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end((err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '').replace('//', '')
// .replace('\\', '')
// .replace('\/', '')
// .replace('/', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
// .replace('|', '')
);
console.log(imageUrlList, titleList);
mkImageDir('bd-images');
imageUrlList.forEach((url, index) => {
downloadImage(url, titleList[index], index);
});
});
Add a progress bar
Install cli-progress;
Convert directory creation and image downloading to promises, and switch to chained calls;
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar
const word = '柯基可爱';
// Progress bar display
const bar = new cliProgress.SingleBar(
{
clearOnComplete: false,
},
cliProgress.Presets.shades_classic
);
let total = 0;
let finished = 0;
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
// Create the directory that stores the images
function mkImageDir(pathname) {
return new Promise((resolve, reject) => {
const fullPath = path.resolve(__dirname, pathname);
// Reject if the directory already exists
if (fs.existsSync(fullPath)) {
return reject(new Error(`${pathname} already exists`));
}
// Create the directory
fs.mkdirSync(fullPath);
console.log(`Directory created: ${pathname}`);
return resolve();
});
}
// Download one image to the local directory
function downloadImage(url, name, index) {
return new Promise((resolve, reject) => {
const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);
// Reject if the file already exists
if (fs.existsSync(fullPath)) {
return reject(new Error(`Already exists: ${fullPath}`));
}
superagent.get(url).end((err, res) => {
if (err) {
// reject accepts a single argument, so pass only the error
return reject(err);
}
// Treat an empty response as done
if (JSON.stringify(res.body) === '{}') {
return resolve(`Image ${index + 1} is empty`);
}
// when res.body is a Buffer (as for image responses), the 'binary' encoding argument is ignored
fs.writeFile(fullPath, res.body, 'binary', err => {
if (err) {
return reject(new Error(`Image ${index + 1} failed to download: ${err}`));
}
return resolve(`Image ${index + 1} downloaded from: ${url}`);
});
});
});
}
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end(async (err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '').replace('//', '')
// .replace('\\', '')
// .replace('\/', '')
// .replace('/', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
// .replace('|', '')
);
console.log(imageUrlList, titleList);
total = imageUrlList.length;
await mkImageDir('bd-images');
bar.start(total, 0);
// try/catch cannot catch asynchronous rejections, so each promise gets its own catch
imageUrlList.forEach((url, index) => {
downloadImage(url, titleList[index], index)
.catch(error => {
console.log(`\nImage ${index + 1} failed: ${error}`);
})
.then(() => {
finished++;
bar.update(finished);
if (finished === total) {
bar.stop();
console.log('All downloads have finished');
}
});
});
});
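The counter-based completion check can also be written with `Promise.allSettled`, which waits for every download whether it succeeds or fails. The sketch below is one way to do that; `runAll` and the shape of its return value are assumptions for illustration, and the fake tasks stand in for calls to `downloadImage`.

```javascript
// Run every task, report progress after each one, and summarize the outcome.
async function runAll(tasks, onProgress) {
  let finished = 0;
  const results = await Promise.allSettled(
    tasks.map(task =>
      task().finally(() => {
        finished += 1;
        onProgress(finished, tasks.length); // e.g. bar.update(finished)
      })
    )
  );
  return {
    succeeded: results.filter(r => r.status === 'fulfilled').length,
    failed: results.filter(r => r.status === 'rejected').map(r => r.reason),
  };
}

// Fake tasks standing in for downloadImage(url, title, index).
const demoTasks = [
  () => Promise.resolve('ok'),
  () => Promise.reject(new Error('network error')),
];
runAll(demoTasks, (done, total) => console.log(`${done}/${total}`)).then(summary => {
  console.log(`succeeded: ${summary.succeeded}, failed: ${summary.failed.length}`);
});
```

Because failures are collected instead of thrown, the progress bar always reaches its total and the failed links can be reported at the end.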
Delete the directory automatically if it exists
This avoids exiting every time the directory is found to exist.
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar
const word = '柯基可爱';
// Progress bar display
const bar = new cliProgress.SingleBar(
{
clearOnComplete: false,
},
cliProgress.Presets.shades_classic
);
let total = 0;
let finished = 0;
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
// Delete the image directory if it already exists
function removeDir(pathname) {
const fullPath = path.resolve(__dirname, pathname);
console.log(`${pathname} already exists, deleting it`);
// fs.rmSync (Node 14.14+) supports force; fs.rmdirSync has no force option
fs.rmSync(fullPath, {
force: true, // do not throw if the path is missing
recursive: true, // delete the contents recursively
});
console.log(`Directory ${pathname} deleted`);
/*
Alternative (Unix-like shells only):
const process = require('child_process');
process.execSync(`rm -rf ${fullPath}`);
*/
}
// Create the directory that stores the images
function mkImageDir(pathname) {
return new Promise((resolve, reject) => {
const fullPath = path.resolve(__dirname, pathname);
// If the directory already exists, delete it first
if (fs.existsSync(fullPath)) {
removeDir(pathname);
}
// Create the directory
fs.mkdirSync(fullPath);
console.log(`Directory created: ${pathname}`);
return resolve();
});
}
// Download one image to the local directory
function downloadImage(url, name, index) {
return new Promise((resolve, reject) => {
const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);
// Reject if the file already exists
if (fs.existsSync(fullPath)) {
return reject(new Error(`Already exists: ${fullPath}`));
}
superagent.get(url).end((err, res) => {
if (err) {
// reject accepts a single argument, so pass only the error
return reject(err);
}
// Treat an empty response as done
if (JSON.stringify(res.body) === '{}') {
return resolve(`Image ${index + 1} is empty`);
}
// when res.body is a Buffer (as for image responses), the 'binary' encoding argument is ignored
fs.writeFile(fullPath, res.body, 'binary', err => {
if (err) {
return reject(new Error(`Image ${index + 1} failed to download: ${err}`));
}
return resolve(`Image ${index + 1} downloaded from: ${url}`);
});
});
});
}
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(word)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end(async (err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '').replace('//', '')
// .replace('\\', '')
// .replace('\/', '')
// .replace('/', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
// .replace('|', '')
);
console.log(imageUrlList, titleList);
total = imageUrlList.length;
await mkImageDir('bd-images');
bar.start(total, 0);
// try/catch cannot catch asynchronous rejections, so each promise gets its own catch
imageUrlList.forEach((url, index) => {
downloadImage(url, titleList[index], index)
.catch(error => {
console.log(`\nImage ${index + 1} failed: ${error}`);
})
.then(() => {
finished++;
bar.update(finished);
if (finished === total) {
bar.stop();
console.log('All downloads have finished');
}
});
});
});
Enter the keyword through the CLI
In img.handler.js, wrap the superagent request in a runImg function and export it.
#!/usr/bin/env node
const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer major versions ship as ES modules and may fail to load with require
const commander = require('commander');
const { runImg } = require('./img.handler.js');
const question = [
{
type: 'checkbox',
name: 'channels',
message: 'Select the channels to search',
choices: [
{
name: 'Baidu Images',
value: 'images',
},
{
name: 'Baidu Videos',
value: 'videos',
},
],
},
{
type: 'input',
name: 'keyword',
message: 'Enter the keyword to search for',
},
{
type: 'number',
name: 'counts',
message: 'Enter the number of images to download (minimum 30)',
},
];
inquirer.prompt(question).then(result => {
const { keyword, channels, counts } = result;
for (let channel of channels) {
switch (channel) {
case 'images':
runImg(keyword, counts);
break;
}
}
});
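Set the number of images to download
Keep inspecting Baidu Images and page through the results
Full request:
Trimmed request:
The response is JSON; take middleURL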
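The paging step is not spelled out here, so the following is only a sketch under stated assumptions: each page holds 30 images (suggested by the "minimum 30" prompt), and each JSON response has a `data` array whose items carry `middleURL`. The names `pageOffsets` and `collectMiddleUrls` and the response shape are hypothetical; check them against the real response before relying on them.

```javascript
const PAGE_SIZE = 30; // assumption: one page holds 30 images

// Offsets of the pages needed to cover `count` images, e.g. 70 -> [0, 30, 60].
function pageOffsets(count, pageSize = PAGE_SIZE) {
  const pages = Math.ceil(count / pageSize);
  return Array.from({ length: pages }, (_, i) => i * pageSize);
}

// Collect middleURL values from parsed JSON responses, capped at `count`.
// Assumed response shape: { data: [{ middleURL: '...' }, ...] }
function collectMiddleUrls(responses, count) {
  return responses
    .flatMap(json => (json.data || []).map(item => item.middleURL))
    .filter(Boolean) // drop items without a middleURL
    .slice(0, count);
}

console.log(pageOffsets(70)); // [ 0, 30, 60 ]
```

Each offset would drive one page request, and the parsed responses would then be passed to `collectMiddleUrls` together with the requested count.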
Full code
./index.js
#!/usr/bin/env node
const inquirer = require('inquirer'); // pin this package to ^7.x.x; newer major versions ship as ES modules and may fail to load with require
const commander = require('commander');
const { runImg } = require('./img.handler.js');
const question = [
{
type: 'checkbox',
name: 'channels',
message: 'Select the channels to search',
choices: [
{
name: 'Baidu Images',
value: 'images',
},
{
name: 'Baidu Videos',
value: 'videos',
},
],
},
{
type: 'input',
name: 'keyword',
message: 'Enter the keyword to search for',
},
{
type: 'number',
name: 'counts',
message: 'Enter the number of images to download (minimum 30)',
},
];
inquirer.prompt(question).then(result => {
const { keyword, channels, counts } = result;
for (let channel of channels) {
switch (channel) {
case 'images':
runImg(keyword, counts);
break;
}
}
});
./img.handler.js
const superagent = require('superagent');
const cheerio = require('cheerio');
const path = require('path');
const fs = require('fs');
const cliProgress = require('cli-progress'); // progress bar
const word = '柯基可爱';
// Progress bar display
const bar = new cliProgress.SingleBar(
{
clearOnComplete: false,
},
cliProgress.Presets.shades_classic
);
let total = 0;
let finished = 0;
// Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accept-Encoding: gzip, deflate, br
// Accept-Language: zh-CN,zh;q=0.9
// Cache-Control: max-age=0
// Connection: keep-alive
// User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
// sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
const header = {
Accept:
'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
// Accept2: 'text/plain, */*; q=0.01',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
Connection: 'keep-alive',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
};
function getValueListByReg(str, key) {
const reg = new RegExp(`"${key}":"(.*?)"`, 'g');
// matchAll yields every match with its capture group; the result is [] when nothing matches
return [...str.matchAll(reg)].map(match => match[1]);
}
// Delete the image directory if it already exists
function removeDir(pathname) {
const fullPath = path.resolve(__dirname, pathname);
console.log(`${pathname} already exists, deleting it`);
// fs.rmSync (Node 14.14+) supports force; fs.rmdirSync has no force option
fs.rmSync(fullPath, {
force: true, // do not throw if the path is missing
recursive: true, // delete the contents recursively
});
console.log(`Directory ${pathname} deleted`);
/*
Alternative (Unix-like shells only):
const process = require('child_process');
process.execSync(`rm -rf ${fullPath}`);
*/
}
// Create the directory that stores the images
function mkImageDir(pathname) {
return new Promise((resolve, reject) => {
const fullPath = path.resolve(__dirname, pathname);
// If the directory already exists, delete it first
if (fs.existsSync(fullPath)) {
removeDir(pathname);
}
// Create the directory
fs.mkdirSync(fullPath);
console.log(`Directory created: ${pathname}`);
return resolve();
});
}
// Links that failed to download
const errorImgList = [];
// Download one image to the local directory
function downloadImage(url, name, index) {
return new Promise((resolve, reject) => {
const fullPath = path.join(__dirname, 'bd-images', `${index + 1}-${name.replace('?', '')}.png`);
// Reject if the file already exists
if (fs.existsSync(fullPath)) {
return reject(new Error(`Already exists: ${fullPath}`));
}
superagent.get(url).end((err, res) => {
if (err) {
// reject accepts a single argument, so pass only the error
return reject(err);
}
// Treat an empty response as done
if (JSON.stringify(res.body) === '{}') {
return resolve(`Image ${index + 1} is empty`);
}
// when res.body is a Buffer (as for image responses), the 'binary' encoding argument is ignored
fs.writeFile(fullPath, res.body, 'binary', err => {
if (err) {
return reject(new Error(`Image ${index + 1} failed to download: ${err}`));
}
return resolve(`Image ${index + 1} downloaded from: ${url}`);
});
});
});
}
function runImg(keyword) {
superagent
.get(
`https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=${encodeURIComponent(
keyword
)}`
)
.set('Accept', header['Accept'])
.set('Accept-Encoding', header['Accept-Encoding'])
.set('Accept-Language', header['Accept-Language'])
.set('Cache-Control', header['Cache-Control'])
.set('Connection', header['Connection'])
.set('User-Agent', header['User-Agent'])
.set('sec-ch-ua', header['sec-ch-ua'])
.end(async (err, res) => {
if (err) {
console.log(`Request failed: ${err}`);
return;
}
const htmlText = res.text;
const imageUrlList = getValueListByReg(htmlText, 'objURL');
const titleList = getValueListByReg(htmlText, 'fromPageTitle').map(
item => item.replace('<strong>', '').replace('<\\/strong>', '')
// .replace('\\', '')
// .replace('//', '')
// .replace('/', '')
// .replace('|', '')
// .replace(':', '')
// .replace('*', '')
// .replace('?', '')
// .replace('<', '')
// .replace('>', '')
);
console.log(imageUrlList, titleList);
total = imageUrlList.length;
await mkImageDir('bd-images');
bar.start(total, 0);
// try/catch cannot catch asynchronous rejections, so each promise gets its own catch
imageUrlList.forEach((url, index) => {
downloadImage(url, titleList[index], index)
.catch(() => {
// remember the failed link and keep going
errorImgList.push(url);
})
.then(() => {
finished++;
bar.update(finished);
if (finished === total) {
bar.stop();
console.log('All downloads have finished');
console.log(errorImgList);
}
});
});
});
}
module.exports = {
runImg,
};