requests functionality
- simulates a browser requesting web pages
Installation
pip install requests
Workflow
- specify the URL
- send the request
- get the response data
- persist the data
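The four steps above can be sketched as one helper. This is a minimal sketch, not part of the original notes: the function name, the timeout, and the raise_for_status() error check are additions.

```python
import requests

def fetch_and_save(url: str, path: str) -> str:
    """Specify the URL, send the request, read the response, persist it."""
    response = requests.get(url, timeout=10)  # step 2: send the request
    response.raise_for_status()               # fail early on HTTP errors (addition)
    page_text = response.text                 # step 3: response data as a string
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(page_text)                   # step 4: persist to disk
    return page_text
```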
Example 1: crawl the Sogou homepage
import requests
url = "https://www.sogou.com/"
response = requests.get(url=url)
# text returns the response data as a string
page_text = response.text
print(page_text)
with open("./sougou2021.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
Example 2: build a simple web page collector
UA spoofing
import requests
key = input('enter a key word:')
# dynamic parameters
params = {'query': key}
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
url = 'https://www.sogou.com/web'
response = requests.get(url=url, params=params, headers=headers)
page_text = response.text
fileName = key + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
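You can see what the params dict does without hitting the network by building a PreparedRequest: requests percent-encodes each value into the query string. The keyword '爬虫' below is just an illustration:

```python
import requests

prepared = requests.Request(
    'GET', 'https://www.sogou.com/web', params={'query': '爬虫'}
).prepare()
# the dict is URL-encoded into the query string
print(prepared.url)  # https://www.sogou.com/web?query=%E7%88%AC%E8%99%AB
```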
Example 3: crawl Douban movies
Dynamically loaded data
Obtain the URL, request method, request parameters, and request headers
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
url = 'https://movie.douban.com/j/chart/top_list'
for i in range(1, 30):
    params = {
        'type': i,
        'interval_id': '100:90',
        'action': '',
        'start': '0',
        'limit': '20',
    }
    json_data = requests.get(url=url, headers=headers, params=params).json()
    print(json_data)
Example 4: POST request
POST requests take the data parameter instead of params (this page submits its parameters as form data)
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
for pageNum in range(1, 6):
    data = {
        'cname': '',
        'pid': '',
        'keyword': '上海',
        'pageIndex': pageNum,
        'pageSize': '10',
    }
    json_data = requests.post(url=url, headers=headers, data=data).json()['Table1']
    for dic in json_data:
        print(dic['addressDetail'])
Example 5: crawl Honor (vmall) store data
Here the form data is a JSON-formatted dict, so it goes in the json parameter rather than data
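The difference between data= and json= can be checked offline with a PreparedRequest: data= form-encodes the dict, while json= serializes it as a JSON body and sets the Content-Type accordingly. The URL and payload below are placeholders, not vmall's API:

```python
import requests

form_req = requests.Request('POST', 'https://example.com/api', data={'a': '1'}).prepare()
json_req = requests.Request('POST', 'https://example.com/api', json={'a': 1}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # a=1
print(json_req.headers['Content-Type'])  # application/json
# json_req.body is the JSON-serialized payload (bytes or str depending on version)
```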
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
# fetch shop ids in bulk
main_url = 'https://openapi.vmall.com/mcp/offlineshop/getShopList'
data = {"portal": 2, "lang": "zh-CN", "country": "CN", "brand": 1, "province": "北京", "city": "北京", "pageNo": 1, "pageSize": 20}
# unpack the innermost dicts; each one carries the shop_id needed for the detail request
json_data = requests.post(main_url, headers=headers, json=data).json()['shopInfos']
url = 'https://openapi.vmall.com/mcp/offlineshop/getShopById'
for dic in json_data:
    shop_id = dic['id']
    params = {
        'portal': '2',
        'version': '10',
        'country': 'CN',
        'shopId': shop_id,
        'lang': 'zh-CN',
    }
    shop_detail = requests.get(url, headers=headers, params=params).json()
    print(shop_detail)
The .json() result here is a dict whose shopInfos key maps to a list, and each list element is itself a dict => a dict wrapping a list of dicts
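That nesting can be walked like any dict-of-list-of-dicts. The payload below is a made-up stand-in with the same shape, not real vmall data:

```python
# hypothetical payload mimicking the shopInfos response shape
payload = {'shopInfos': [{'id': 1001, 'name': 'Shop A'},
                         {'id': 1002, 'name': 'Shop B'}]}
# outer dict -> list under 'shopInfos' -> one dict per shop
shop_ids = [shop['id'] for shop in payload['shopInfos']]
print(shop_ids)  # [1001, 1002]
```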