Writing a Simple Crawler in TypeScript

There are only two kinds of programming languages: the ones people always complain about and the ones nobody uses.

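First, pick a site to crawl. We will practice on Quotes to Scrape. Its quotes come from GoodReads, and the site was built by the team behind Scrapy, the open-source Python crawling framework (the company, formerly Scrapinghub, is now Zyte). It is also the site Scrapy recommends for getting started with crawling.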

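Initialize the project

Run the following commands in order.

Generate package.json: npm init -y
Install TypeScript: npm install typescript --save-dev
Install ts-node, which runs TS files directly: npm install ts-node --save-dev
Generate the tsconfig.json configuration file: npx tsc --init
Install superagent: npm install superagent @types/superagent --save

@types/superagent is the type declaration (.d.ts) package for superagent. superagent is written in JavaScript, so TypeScript cannot work out the types of its exports on its own. The .d.ts files describe the methods and properties exported from the .js files, which lets the compiler type-check calls to them.
💡 superagent is a lightweight HTTP request library for Node.js
Install cheerio: npm install cheerio @types/cheerio --save
💡 cheerio implements a subset of core jQuery (DOM manipulation) and is mainly used where server-side code needs to work with a DOM. Recent cheerio releases ship their own type declarations, in which case @types/cheerio is likely unnecessary.
Create a src directory, and inside it create index.ts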
```
// src/index.ts
console.log("hello world");
```

Add a dev script to package.json

// package.json
"scripts": {
  "dev": "ts-node src/index.ts"
}

Run npm run dev in the terminal. If "hello world" is printed, initialization is complete.

Write the crawler

Create the Crawler class

// src/index.ts
class Crawler {
  private page = 1
  private url = `http://quotes.toscrape.com/page/${this.page}/`
  constructor() {
    console.log(this.url);
  }
}

const crawler = new Crawler();
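
One detail worth noting about the class above: the url field is a template literal evaluated once, when the instance is created, so changing page later would not change url. If you eventually want to walk through several pages, a getter is one option. The sketch below is only an illustration; PagedUrl and next() are made-up names and are not part of the crawler in this post.

```typescript
// Hypothetical helper class (not part of the original crawler): the URL is
// recomputed from the current page each time it is read.
class PagedUrl {
  constructor(private page = 1) {}

  // A getter runs on every access, so it always reflects the current page.
  get url(): string {
    return `http://quotes.toscrape.com/page/${this.page}/`;
  }

  // Move to the following page.
  next(): void {
    this.page += 1;
  }
}

const pager = new PagedUrl();
const firstUrl = pager.url;
pager.next();
const secondUrl = pager.url;
console.log(firstUrl, secondUrl);
```

With a plain field instead of a getter, secondUrl would still point at page 1.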

Send the request with superagent

// src/index.ts
import superagent from 'superagent';

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`
  private html = '';

  async getHtml() {
    const result = await superagent.get(this.url);
    console.log(result.text);
    this.html = result.text;
  }
  constructor() {
    this.getHtml();
  }
}

const crawler = new Crawler();

superagent.get() returns a Promise-like request object, which is why getHtml uses async and await.

Run npm run dev. If the HTML of the page is printed to the console, the request succeeded.
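
To see the async/await flow in isolation, here is a minimal sketch that runs without any network access. fakeGet is a hypothetical stand-in for superagent.get(): it only assumes that the real call resolves to an object with a text field, as used in the code above.

```typescript
// Hypothetical stand-in for superagent.get(): resolves, after a short delay,
// with an object that has a `text` field. No real HTTP request is made.
function fakeGet(url: string): Promise<{ text: string }> {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ text: `<html>${url}</html>` }), 10);
  });
}

async function getHtml(url: string): Promise<string> {
  // `await` suspends this function until the Promise settles.
  const result = await fakeGet(url);
  return result.text;
}

// An async function itself returns a Promise, so the caller can chain on it
// and handle failures in one place.
getHtml("http://quotes.toscrape.com/page/1/")
  .then((html) => console.log(`received ${html.length} characters`))
  .catch((err) => console.error("request failed:", err));
```

Note that a constructor cannot await anything, so when the Crawler constructor calls this.getHtml(), a rejected request goes unhandled unless a catch is attached as in the last lines here.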

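Extract the data with cheerio

cheerio's API closely follows jQuery's, so it inherits a jQuery quirk: a selection is not a plain array, and its .map() is not Array.prototype.map(). The callback receives (index, element), and the result is a cheerio object, so you call .get() to turn it into an ordinary array.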

// src/index.ts
import superagent from 'superagent';
// Namespace import: newer cheerio releases no longer provide a default export
import * as cheerio from 'cheerio';

interface Quote {
  text: string;
  author: string;
  tagList: Array<string>;
}

interface QuoteData {
  time: number,
  data: Array<Quote>
}

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;

  constructor() {
    this.init();
  }

  async init() {
    const html = await this.getHtml();
    const quoteList = this.getQuoteData(html);
    console.log(quoteList);
  }

  async getHtml() {
    const result = await superagent.get(this.url);
    return result.text
  }

  getQuoteData(html: string): QuoteData {
    const $ = cheerio.load(html);
    const quotes = $('.quote');
    const quoteList: Array<Quote> = [];
    // .each() suits a loop run only for its side effect (pushing to quoteList)
    quotes.each((index, element) => {
      const text = $(element).find('.text').text();
      const author = $(element).find('.author').text();
      const tags = $(element).find('.tag');
      const tagList = tags
        .map((tagIndex, tagElement) => $(tagElement).text())
        .get() as Array<string>;
      quoteList.push({
        text,
        author,
        tagList,
      });
    });
    return {
      time: new Date().getTime(),
      data: quoteList
    };
  }
}

const crawler = new Crawler();

Run npm run dev. If the array of quotes from the site is printed to the console, the crawl succeeded.
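
The .map(...).get() chain used for tagList can look odd next to Array.prototype.map. As a rough illustration of the calling convention, here is a toy collection class. MiniCollection is a made-up name used only for this sketch; it is not how cheerio is implemented, it just mimics the (index, element) argument order and the need to unwrap the result with .get().

```typescript
// Toy stand-in for a jQuery-style collection. MiniCollection is hypothetical
// and only mimics the calling convention; it is not cheerio's implementation.
class MiniCollection<T> {
  constructor(private readonly items: T[]) {}

  // jQuery-style map: the callback receives (index, item), the reverse of
  // Array.prototype.map, and the result is another wrapped collection.
  map<U>(fn: (index: number, item: T) => U): MiniCollection<U> {
    return new MiniCollection(this.items.map((item, index) => fn(index, item)));
  }

  // get() unwraps the collection into a plain array.
  get(): T[] {
    return [...this.items];
  }
}

const tags = new MiniCollection(["love", "life"]);
const upper: string[] = tags.map((_index, tag) => tag.toUpperCase()).get();
console.log(upper);
```

Forgetting .get() leaves you holding the wrapper rather than an array, which is why the crawler calls it before storing tagList.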

Save the data to a JSON file

Create a data folder in the project root. Each crawl is saved there under a file name of the form quotes_[timestamp].json

// src/index.ts
import superagent from 'superagent';
// Namespace import: newer cheerio releases no longer provide a default export
import * as cheerio from 'cheerio';
import fs from 'fs';
import path from 'path';

interface Quote {
  text: string;
  author: string;
  tagList: Array<string>;
}

interface QuoteData {
  time: number,
  data: Array<Quote>
}

class Crawler {
  private page = 1;
  private url = `http://quotes.toscrape.com/page/${this.page}/`;

  constructor() {
    this.init();
  }

  async init() {
    const html = await this.getHtml();
    const quoteList = this.getQuoteData(html);
    this.saveJson(quoteList);
  }

  async getHtml() {
    const result = await superagent.get(this.url);
    return result.text
  }

  getQuoteData(html: string): QuoteData {
    const $ = cheerio.load(html);
    const quotes = $('.quote');
    const quoteList: Array<Quote> = [];
    // .each() suits a loop run only for its side effect (pushing to quoteList)
    quotes.each((index, element) => {
      const text = $(element).find('.text').text();
      const author = $(element).find('.author').text();
      const tags = $(element).find('.tag');
      const tagList = tags
        .map((tagIndex, tagElement) => $(tagElement).text())
        .get() as Array<string>;
      quoteList.push({
        text,
        author,
        tagList,
      });
    });
    return {
      time: new Date().getTime(),
      data: quoteList
    };
  }

  saveJson(quoteInfo: QuoteData) {
    const filePath = path.resolve(__dirname, `../data/quotes_${quoteInfo.time}.json`);
    fs.writeFileSync(filePath, JSON.stringify(quoteInfo, null, 2));
  }
}

const crawler = new Crawler();
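
fs.writeFileSync throws if the target folder does not exist, which is why the data folder has to be created by hand first. As a possible variation, the sketch below creates the folder on demand. saveJsonTo is a hypothetical helper, not part of the code above, and the sample quote is placeholder data written to the OS temp directory so the example can run anywhere.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface QuoteData {
  time: number;
  data: Array<{ text: string; author: string; tagList: string[] }>;
}

// Hypothetical helper: like saveJson, but it takes the target directory as a
// parameter and creates it when it is missing.
function saveJsonTo(dir: string, quoteInfo: QuoteData): string {
  // With recursive: true this does nothing if the directory already exists.
  fs.mkdirSync(dir, { recursive: true });
  const filePath = path.join(dir, `quotes_${quoteInfo.time}.json`);
  fs.writeFileSync(filePath, JSON.stringify(quoteInfo, null, 2));
  return filePath;
}

// Usage: write placeholder data into a scratch folder under the temp directory.
const outDir = path.join(os.tmpdir(), "quotes-demo", "data");
const savedPath = saveJsonTo(outDir, {
  time: Date.now(),
  data: [{ text: "sample text", author: "sample author", tagList: ["demo"] }],
});
console.log(`saved to ${savedPath}`);
```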

With that, a simple TS crawler is complete.