Building a Simple Web Crawler from Scratch with TypeScript and cheerio

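Let's build a writing habit together! This is day 2 of my participation in the 「掘金日新计划 · 4 月更文挑战」 (Juejin Daily Update Plan, April writing challenge).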

Tech stack used in this article: TypeScript, ts-node, superagent, and cheerio.

Getting started

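1. Create a new folder (named xxx here).

2. Open the folder in your IDE and run `npm init -y` in a terminal in that directory. This generates a package.json file.

3. Run `tsc --init`. It generates a tsconfig.json file next to package.json, where the TypeScript configuration can be adjusted. This assumes `tsc` is installed globally; otherwise install TypeScript first (see step 5) and run `npx tsc --init`.

4. Create a src directory and, inside it, a `crowller.ts` file that serves as the crawler's entry point.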

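5. Install the dependencies:

Install TypeScript: `npm install typescript -D`
Install ts-node: `npm install -D ts-node`, then change the scripts entry in package.json to `"dev": "ts-node ./src/crowller.ts"`

In crowller.ts, write `console.log('hello world')`, then run `npm run dev` in the terminal and check that hello world is printed.

Install cheerio: `npm install -D cheerio`
Install superagent: `npm install -D superagent`. A JavaScript library imported into TypeScript often cannot be used directly, because it carries no type information. The fix is a `.d.ts` declaration file that describes its types, so also run `npm install @types/superagent -D`.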
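
To make the role of a declaration file more concrete, here is a minimal sketch. The names `untypedClient`, `TinyResponse`, and `TinyClient` are hypothetical and are not part of superagent; they only imitate the situation where a JavaScript value has no types until a declaration describes its shape.

```typescript
// Hypothetical stand-in for a JavaScript library that ships no type information.
// Typed as `any`, the compiler accepts every call on it, including wrong ones.
const untypedClient: any = {
    get: (url: string) => ({ text: `<html>${url}</html>` }),
}

// What a declaration file contributes: a description of the shape only, no runtime code.
// (Hypothetical types for illustration; the real ones come from @types/superagent.)
interface TinyResponse {
    text: string
}
interface TinyClient {
    get(url: string): TinyResponse
}

// Once the shape is declared, the compiler checks arguments and return values.
const typedClient: TinyClient = untypedClient
const response = typedClient.get("https://example.com")
console.log(response.text)

// typedClient.get(42)      // would be rejected at compile time: 42 is not a string
// untypedClient.get(42)    // compiles, because `any` switches the check off
```

The declaration adds nothing at runtime; it only lets the compiler and the editor know what the library exposes.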

Writing crowller.ts, the crawler entry point

Import superagent to request the page, write a Crowller class with an entry method, and set the URL of the page to crawl.

import superagent from "superagent"

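class Crowller {
    // Page to crawl (the same URL is used in the rest of this article)
    private url = 'https://699pic.com/tupian/feiji.html'

    // Request the page and return its HTML as a string
    async getHTMLFn() {
        const result = await superagent.get(this.url)
        return result.text
    }

    // Entry method; the HTML is not used yet in this first draft
    async onFn() {
        const html = await this.getHTMLFn()
    }

    constructor() {
        this.onFn()
    }
}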

new Crowller()
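
In the draft above, the constructor calls `onFn()` without awaiting it, so a failed request would surface as an unhandled promise rejection. One possible alternative, sketched here under assumptions, is to keep the constructor synchronous and expose a method the caller can await. The `Fetcher` type, the `SafeCrowller` class, its `run` method, and the fake fetchers are hypothetical names introduced for illustration; in the real crawler the fetcher would wrap `superagent.get(url)`.

```typescript
// Hypothetical abstraction over "download this URL and return the HTML".
type Fetcher = (url: string) => Promise<string>

class SafeCrowller {
    private url: string
    private fetchHtml: Fetcher

    constructor(url: string, fetchHtml: Fetcher) {
        this.url = url
        this.fetchHtml = fetchHtml
    }

    // Returns the HTML, or null when the request fails, instead of rejecting.
    async run(): Promise<string | null> {
        try {
            return await this.fetchHtml(this.url)
        } catch (error) {
            console.error(`request to ${this.url} failed:`, error)
            return null
        }
    }
}

// Fake fetchers so the sketch runs without network access.
const okFetcher: Fetcher = async (url) => `<html><body>${url}</body></html>`
const failingFetcher: Fetcher = async () => {
    throw new Error("network down")
}

async function demo(): Promise<[string | null, string | null]> {
    const ok = await new SafeCrowller("https://example.com", okFetcher).run()
    const failed = await new SafeCrowller("https://example.com", failingFetcher).run()
    return [ok, failed]
}

demo().then(([ok, failed]) => console.log(ok !== null, failed === null))
```

Passing the fetcher in also makes the class testable offline, which is the only reason the fake fetchers exist here.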

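Writing the analysis code for the page to crawl

The page to crawl is: https://699pic.com/tupian/feiji.html

The target is the data-original attribute of each img wrapped in an a tag under the element with class="list".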
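
As a rough illustration of that selector, the following sketch runs cheerio against a small hand-written HTML fragment. The markup and the image paths are made up for the example and only imitate the structure described above; the real page may differ.

```typescript
import * as cheerio from "cheerio"

// Hand-written sample that imitates the described structure (not the real page).
const sampleHtml = `
<div class="list">
    <a href="/photo/1.html"><img data-original="//img.example.com/plane-1.jpg"></a>
    <a href="/photo/2.html"><img data-original="//img.example.com/plane-2.jpg"></a>
    <a href="/photo/3.html"><img src="placeholder.gif"></a>
</div>`

const $ = cheerio.load(sampleHtml)
const imageUrls: string[] = []

// Select every img that sits inside an a tag under .list
$(".list a img").each((_index, element) => {
    const src = $(element).attr("data-original")
    // The third img has no data-original attribute, so it is skipped
    if (src) {
        imageUrls.push("http:" + src)
    }
})

console.log(imageUrls)
```

Only the two images that carry a data-original attribute end up in `imageUrls`.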

Create an analyzer.ts file and write the page-analysis code in it.

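// A JavaScript library imported into TypeScript often cannot be used directly,
// because it carries no type information.
// The fix is a .d.ts declaration file that describes the library's types.
import fs from "fs"
import * as cheerio from "cheerio"      // namespace import; recent cheerio releases have no default export
import type { Analyzer } from "./crowller"   // type-only import, erased at compile time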

interface courseResult {
    time: number
    data: string[]
}

interface Content { 
    [propName: number]: string[]
}

// When a class implements an interface, its method declarations (including overloads)
// must be compatible with the ones declared in that interface
export default class xAnalyzy implements Analyzer {

    // Parse the page and extract the image URLs
    private getCourseInfo(html: string) {
        const $ = cheerio.load(html)                                // load the page HTML
        const courseInfos: string[] = [];
        // Visit every img wrapped in an a tag under .list
        $('.list a img').each((index, element) => {
            const src = $(element).attr('data-original')
            // Skip images without the attribute instead of storing "http:undefined"
            if (src) {
                courseInfos.push('http:' + src);
            }
        })
        console.log(courseInfos)
        return {
            time: new Date().getTime(),
            data: courseInfos
        }
    }

    // Data storage: merge the new result into what is already stored in the file
    generateJsonContent(courseInfo: courseResult, filePath: string) {
        let fileContent: Content = {}
        if(fs.existsSync(filePath)) {
            fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'))        // read the existing file
        }
        fileContent[courseInfo.time] = courseInfo.data
        return fileContent
    }

    public analyze(html: string, filePath: string) {
        const courseInfo: courseResult = this.getCourseInfo(html),
              fileContent = this.generateJsonContent(courseInfo, filePath)
        return JSON.stringify(fileContent)
    }
}
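
The storage step in `generateJsonContent` keeps one entry per run, keyed by timestamp. The sketch below reproduces that idea without touching the file system so it can be tried in isolation. `mergeSnapshot`, `Snapshot`, and `Store` are hypothetical names, and the timestamps are fixed values chosen for the example.

```typescript
interface Snapshot {
    time: number
    data: string[]
}

// Same shape as the Content interface above: timestamp -> list of image URLs.
interface Store {
    [time: number]: string[]
}

// Hypothetical helper: takes the previous JSON text (or null when no file exists yet)
// and returns the new JSON text with the snapshot added under its timestamp.
function mergeSnapshot(previousJson: string | null, snapshot: Snapshot): string {
    const store: Store = previousJson ? JSON.parse(previousJson) : {}
    store[snapshot.time] = snapshot.data
    return JSON.stringify(store)
}

const firstRun = mergeSnapshot(null, { time: 1000, data: ["http://img.example.com/a.jpg"] })
const secondRun = mergeSnapshot(firstRun, { time: 2000, data: ["http://img.example.com/b.jpg"] })

// Earlier runs are preserved; a run with the same timestamp would overwrite its entry.
console.log(Object.keys(JSON.parse(secondRun)))
```

Note that the numeric timestamps become string keys once the object goes through JSON, so the file ends up with keys such as "1000" and "2000".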

Finally, complete the crowller.ts file


import fs from "fs"
import path from "path"
import superagent from "superagent"

import xAnalyzer from "./analyzer"

// Define the Analyzer type
export interface Analyzer {
    analyze: (html: string, filePath: string) => string
}

class Crowller {
    private filePath = path.resolve(__dirname, "../data/course.json")     // resolve the path of the output file

    // Fetch the body of the page
    async getHTMLFn() {
        const result = await superagent.get(this.url)               // request the content at url
        return result.text
    }

    // Write the content to the file
    writeFile(content: string) {
        // writeFileSync does not create missing directories, so make sure ../data exists first
        fs.mkdirSync(path.dirname(this.filePath), { recursive: true })
        fs.writeFileSync(this.filePath, content)
    }

    async onFn() {
        const html = await this.getHTMLFn()
        const content = this.analyzer.analyze(html, this.filePath)
        this.writeFile(content)
    }

    constructor(private url: string, private analyzer: Analyzer) {
        this.onFn()
    }
}

const url = 'https://699pic.com/tupian/feiji.html'

const analyzer = new xAnalyzer()
new Crowller(url, analyzer)

The fs module is used here to write the extracted data to disk.
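
Because `Crowller` depends only on the `Analyzer` interface, a different page can be handled by passing in another implementation. The following is a minimal sketch of that design choice. `TitleAnalyzer`, `runAnalyzer`, and the sample HTML are hypothetical, and a regular expression stands in for cheerio only to keep the example free of dependencies.

```typescript
// Same contract as the Analyzer interface exported from crowller.ts.
interface Analyzer {
    analyze: (html: string, filePath: string) => string
}

// Hypothetical second analyzer: extracts the page title instead of image URLs.
class TitleAnalyzer implements Analyzer {
    analyze(html: string, _filePath: string): string {
        const match = /<title>([^<]*)<\/title>/i.exec(html)
        return JSON.stringify({ title: match ? match[1].trim() : null })
    }
}

// Any code written against the interface accepts either implementation.
function runAnalyzer(analyzer: Analyzer, html: string): string {
    return analyzer.analyze(html, "unused.json")
}

const output = runAnalyzer(new TitleAnalyzer(), "<html><head><title> Planes </title></head></html>")
console.log(output)
```

The crawler class itself would not need to change; only the object passed to its constructor would.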

Result:

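Closing

Official account (公众号): 小何成长. I post at a relaxed pace, and everything I write is either a pitfall I have run into myself or something I have learned.

If you are interested, feel free to follow me. I am 何小玍. Let's improve together.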
