Hacker Notes 99: Write your crawler to be friendly and never disrupt the site operator; crude, brute-force crawlers are the enemy of every programmer

This is the captured request for a search on the Alibaba Cloud developer community:

GET /abs/search/searchCommunity?queryWord=xiaozhuqiaozhi&limit=20&pageNo=1&from=pc&loc=m_search_community_item HTTP/2
Host: t.aliyun.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
Accept: application/json, text/plain, */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://www.aliyun.com/search?k=xiaozhuqiaozhi&scene=community
Origin: https://www.aliyun.com
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
Te: trailers
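
To make the capture concrete, here is a minimal sketch of replaying this request in Python with the requests library. The endpoint and query parameters come straight from the capture above; the header values are copied from it as well, and in practice you would take them from your own browser session.

import requests

# Endpoint and query parameters taken from the captured request above.
url = "https://t.aliyun.com/abs/search/searchCommunity"
params = {
    "queryWord": "xiaozhuqiaozhi",
    "limit": 20,
    "pageNo": 1,
    "from": "pc",
    "loc": "m_search_community_item",
}

# Send the headers a real browser sends, so the request looks like
# normal traffic rather than a bare API call.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) "
                  "Gecko/20100101 Firefox/133.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2",
    "Referer": "https://www.aliyun.com/search?k=xiaozhuqiaozhi&scene=community",
    "Origin": "https://www.aliyun.com",
}

resp = requests.get(url, params=params, headers=headers, timeout=5)
print(resp.status_code)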

And this is the corresponding response:

HTTP/2 200 OK
Date: Thu, 26 Dec 2024 06:37:46 GMT
Content-Type: application/json;charset=UTF-8
Vary: Accept-Encoding
Pragma: no-cache
Cache-Control: no-cache
Access-Control-Allow-Headers: Content-Type, Authorization, X-Requested-With, x-xsrf-token, Eagleeye-Pappname, Eagleeye-Sessionid, Eagleeye-Traceid
Access-Control-Allow-Methods: POST, GET, OPTIONS, DELETE
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: https://www.aliyun.com
Access-Control-Max-Age: 3600
X-Application-Context: bridge-aliyun-com:7001
Set-Cookie: JSESSIONID=548A07293254C3090CFD97A0D7CA8AFD; Path=/; HttpOnly
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000 ; includeSubDomains
Server: Tengine/Aserver
Eagleeye-Traceid: 213e37c417351950664124408eb202
Strict-Transport-Security: max-age=31536000
Timing-Allow-Origin: *

{"success":true,"code":"200","message":"retMsg:OK","data":{"cost":12,"trackIds":null,"success":true,"pageNo":1,"errorCode":null,"extendInfo":null,"message":"retMsg:OK","totalCount":0,"info":[]},"rt":17}

Focus on the JSON body of the response: the keyword "xiaozhuqiaozhi" clearly matched nothing, as shown by "totalCount":0.
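
Continuing from the request sketch above, a crawler might check this field before deciding whether to page any further; the field names follow the JSON shown above.

# resp is the response object from the earlier sketch.
data = resp.json()

# The outer "success"/"code" fields report request-level status;
# the nested "data" object carries the actual search results.
body = data.get("data", {})
total = body.get("totalCount", 0)

if total == 0:
    print("no results, nothing to crawl")  # stop here, do not keep paging
else:
    for item in body.get("info", []):
        print(item)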

Now consider this as a crawling scenario. Each request returns 20 results, so 1000 results take 50 requests. Some people call this URL as if it were a commercial API, sending none of the headers a normal request carries. If such a crawler runs frequently, the site operator will notice quickly, the anti-crawling defenses will be tightened in response, and the well-behaved crawlers will end up locked out along with it.
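
The arithmetic, for concreteness:

import math

# 20 results per request -> 1000 results need ceil(1000 / 20) requests
print(math.ceil(1000 / 20))  # 50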

Here are a few points worth keeping in mind. The hope is that everyone can be a friendly crawler developer and keep bad money from driving out good, especially when your crawler serves no hard requirement and is just a casual need.

For example, when using Python (a sketch combining all six points follows the list):

1- Keep a pool of 10 User-Agent strings and pick one at random for each request
2- Pull cookies out into one place and list them explicitly
3- Set the request timeout to 5 seconds
4- Wait 10 seconds between requests
5- Make no more than 10 requests per program run
6- Run the program once a day as a scheduled task
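
Putting the six rules together, a minimal sketch against the same search endpoint. The User-Agent strings are placeholders to be replaced with real ones from your own browsers, the cookie value is the example session id from the capture above, and the script name in the cron comment is hypothetical.

import random
import time
import requests

# Rule 1: a pool of ~10 User-Agent strings, one picked at random per request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    # ... add more real browser UA strings here
]

# Rule 2: cookies pulled out into one place, easy to inspect and update.
COOKIES = {
    "JSESSIONID": "548A07293254C3090CFD97A0D7CA8AFD",  # example value from the capture
}

TIMEOUT = 5                 # Rule 3: 5-second request timeout
REQUEST_INTERVAL = 10       # Rule 4: 10 seconds between requests
MAX_REQUESTS_PER_RUN = 10   # Rule 5: no more than 10 requests per run

def crawl(keyword):
    url = "https://t.aliyun.com/abs/search/searchCommunity"
    for page in range(1, MAX_REQUESTS_PER_RUN + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        params = {"queryWord": keyword, "limit": 20, "pageNo": page,
                  "from": "pc", "loc": "m_search_community_item"}
        resp = requests.get(url, params=params, headers=headers,
                            cookies=COOKIES, timeout=TIMEOUT)
        body = resp.json().get("data", {})
        items = body.get("info", [])
        if not items:          # no more results, stop early
            break
        yield from items
        time.sleep(REQUEST_INTERVAL)  # be polite between requests

if __name__ == "__main__":
    # Rule 6: schedule this script once a day, e.g. a crontab entry like
    # 0 3 * * * /usr/bin/python3 crawler.py
    for item in crawl("xiaozhuqiaozhi"):
        print(item)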

When crawling: if you can cut the number of requests, cut it; if you can cut the time spent requesting, cut it; and disguise your traffic as a normal browser as well as you possibly can.