Hacker Notes 99: Write your crawler to be friendly and never disrupt the site operator; crude, brute-force crawlers are the enemy of every programmer

This is the captured request for a search on the Alibaba Cloud developer community:

GET /abs/search/searchCommunity?queryWord=xiaozhuqiaozhi&limit=20&pageNo=1&from=pc&loc=m_search_community_item HTTP/2
Host: t.aliyun.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
Accept: application/json, text/plain, */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://www.aliyun.com/search?k=xiaozhuqiaozhi&scene=community
Origin: https://www.aliyun.com
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
Te: trailers
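
To make the capture concrete, here is a minimal sketch of replaying this request in Python with the requests library. The endpoint and query parameters come straight from the capture above; the header values are copied from it as well, and in practice you would take them from your own browser session.

import requests

# Endpoint and query parameters taken from the captured request above.
url = "https://t.aliyun.com/abs/search/searchCommunity"
params = {
    "queryWord": "xiaozhuqiaozhi",
    "limit": 20,
    "pageNo": 1,
    "from": "pc",
    "loc": "m_search_community_item",
}

# Send the headers a real browser sends, so the request looks like
# normal traffic rather than a bare API call.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) "
                  "Gecko/20100101 Firefox/133.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2",
    "Referer": "https://www.aliyun.com/search?k=xiaozhuqiaozhi&scene=community",
    "Origin": "https://www.aliyun.com",
}

resp = requests.get(url, params=params, headers=headers, timeout=5)
print(resp.status_code)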

And this is the corresponding response:

HTTP/2 200 OK
Date: Thu, 26 Dec 2024 06:37:46 GMT
Content-Type: application/json;charset=UTF-8
Vary: Accept-Encoding
Pragma: no-cache
Cache-Control: no-cache
Access-Control-Allow-Headers: Content-Type, Authorization, X-Requested-With, x-xsrf-token, Eagleeye-Pappname, Eagleeye-Sessionid, Eagleeye-Traceid
Access-Control-Allow-Methods: POST, GET, OPTIONS, DELETE
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: https://www.aliyun.com
Access-Control-Max-Age: 3600
X-Application-Context: bridge-aliyun-com:7001
Set-Cookie: JSESSIONID=548A07293254C3090CFD97A0D7CA8AFD; Path=/; HttpOnly
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000 ; includeSubDomains
Server: Tengine/Aserver
Eagleeye-Traceid: 213e37c417351950664124408eb202
Strict-Transport-Security: max-age=31536000
Timing-Allow-Origin: *

{"success":true,"code":"200","message":"retMsg:OK","data":{"cost":12,"trackIds":null,"success":true,"pageNo":1,"errorCode":null,"extendInfo":null,"message":"retMsg:OK","totalCount":0,"info":[]},"rt":17}

Focus on the JSON body of the response: the keyword "xiaozhuqiaozhi" clearly matched nothing, as shown by "totalCount":0.
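
Continuing from the request sketch above, a crawler might check this field before deciding whether to page any further; the field names follow the JSON shown above.

# resp is the response object from the earlier sketch.
data = resp.json()

# The outer "success"/"code" fields report request-level status;
# the nested "data" object carries the actual search results.
body = data.get("data", {})
total = body.get("totalCount", 0)

if total == 0:
    print("no results, nothing to crawl")  # stop here, do not keep paging
else:
    for item in body.get("info", []):
        print(item)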

Now consider this as a crawling scenario. Each request returns 20 results, so 1000 results take 50 requests. Some people call this URL as if it were a commercial API, sending none of the headers a normal request carries. If such a crawler runs frequently, the site operator will notice quickly, the anti-crawling defenses will be tightened in response, and the well-behaved crawlers will end up locked out along with it.
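
The arithmetic, for concreteness:

import math

# 20 results per request -> 1000 results need ceil(1000 / 20) requests
print(math.ceil(1000 / 20))  # 50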

Here are a few points worth keeping in mind. The hope is that everyone can be a friendly crawler developer and keep bad money from driving out good, especially when your crawler serves no hard requirement and is just a casual need.

For example, when using Python (a sketch combining all six points follows the list):

1- Keep a pool of 10 User-Agent strings and pick one at random for each request
2- Pull cookies out into one place and list them explicitly
3- Set the request timeout to 5 seconds
4- Wait 10 seconds between requests
5- Make no more than 10 requests per program run
6- Run the program once a day as a scheduled task
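
Putting the six rules together, a minimal sketch against the same search endpoint. The User-Agent strings are placeholders to be replaced with real ones from your own browsers, the cookie value is the example session id from the capture above, and the script name in the cron comment is hypothetical.

import random
import time
import requests

# Rule 1: a pool of ~10 User-Agent strings, one picked at random per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    # ... add more real browser UA strings here
]

# Rule 2: cookies pulled out into one place, easy to inspect and update.
COOKIES = {
    "JSESSIONID": "548A07293254C3090CFD97A0D7CA8AFD",  # example value from the capture
}

TIMEOUT = 5                 # Rule 3: 5-second request timeout
REQUEST_INTERVAL = 10       # Rule 4: 10 seconds between requests
MAX_REQUESTS_PER_RUN = 10   # Rule 5: no more than 10 requests per run

def crawl(keyword):
    url = "https://t.aliyun.com/abs/search/searchCommunity"
    for page in range(1, MAX_REQUESTS_PER_RUN + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        params = {"queryWord": keyword, "limit": 20, "pageNo": page,
                  "from": "pc", "loc": "m_search_community_item"}
        resp = requests.get(url, params=params, headers=headers,
                            cookies=COOKIES, timeout=TIMEOUT)
        body = resp.json().get("data", {})
        items = body.get("info", [])
        if not items:          # no more results, stop early
            break
        yield from items
        time.sleep(REQUEST_INTERVAL)  # be polite between requests

if __name__ == "__main__":
    # Rule 6: schedule this script once a day, e.g. a crontab entry like
    # 0 3 * * * /usr/bin/python3 crawler.py
    for item in crawl("xiaozhuqiaozhi"):
        print(item)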

When crawling: if you can cut the number of requests, cut it; if you can cut the time spent requesting, cut it; and disguise your traffic as a normal browser as well as you possibly can.