Web Scraping: the requests Library


What requests does

  • Simulates a browser visiting the web

Installation

pip install requests

Workflow

  1. Specify the URL
  2. Send the request
  3. Get the response data
  4. Persist the data

Example 1: Scraping the Sogou homepage

import requests

url = "https://www.sogou.com/"
response = requests.get(url=url)
# .text returns the response body as a string
page_text = response.text
print(page_text)

with open("./sougou2021.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
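The `.text` attribute is the raw response bytes decoded with `response.encoding`. A minimal offline sketch of the `text`/`content` distinction, using a hand-built `Response` object (`_content` is an internal requests attribute, set here only so no network call is needed):

```python
import requests

# build a Response locally so no network call is needed;
# _content is internal to requests, set here purely for demonstration
resp = requests.models.Response()
resp.status_code = 200
resp._content = "搜狗首页".encode("utf-8")  # raw bytes, as received from a server
resp.encoding = "utf-8"                     # tells .text how to decode them

print(type(resp.content))  # bytes: the raw response body
print(resp.text)           # str: the body decoded with resp.encoding
```

In real code you rarely set `encoding` yourself, but overriding it is the usual fix when `.text` comes back garbled.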

Example 2: Building a simple web page collector

UA spoofing

import requests

key = input('enter a key word:')
# build the query parameter dynamically
params = {'query': key}
# UA spoofing: make the request look like it comes from a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
url = 'https://www.sogou.com/web'
response = requests.get(url=url, params=params, headers=headers)
page_text = response.text
fileName = key + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
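Under the hood, `params=` is URL-encoded and appended to the URL as a query string. This can be observed offline with requests' own `PreparedRequest` (nothing is sent):

```python
from requests.models import PreparedRequest

# prepare the URL exactly as requests.get(url, params=...) would
req = PreparedRequest()
req.prepare_url("https://www.sogou.com/web", {"query": "爬虫"})

# non-ASCII values are percent-encoded automatically
print(req.url)  # https://www.sogou.com/web?query=%E7%88%AC%E8%99%AB
```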

Example 3: Scraping Douban movies

Dynamically loaded data

Obtain the URL, request method, request parameters, and request header information.

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
url = 'https://movie.douban.com/j/chart/top_list'
for i in range(1,30):
    params = {
        'type': i,
        'interval_id': '100:90',
        'action': '',
        'start': '0',
        'limit': '20',
    }
    json_data = requests.get(url=url,headers=headers,params=params).json()
    print(json_data)
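Each call returns parsed JSON. Pulling individual fields out is plain dict/list indexing; a sketch over a hypothetical sample shaped like the ranking response (the `title` and `score` field names and all values here are assumptions for illustration):

```python
# hypothetical sample shaped like the Douban ranking response:
# a JSON array of movie dicts (field names and values are made up)
sample = [
    {"title": "肖申克的救赎", "score": "9.7"},
    {"title": "霸王别姬", "score": "9.6"},
]

# collect one field from every element of the array
titles = [movie["title"] for movie in sample]
print(titles)  # ['肖申克的救赎', '霸王别姬']
```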

Example 4: POST requests

POST requests pass their parameters through the data argument instead of params (this page submits its parameters as form data).

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
for pageNum in range(1,6):
    data = {
        'cname': '',
        'pid': '',
        'keyword': '上海',
        'pageIndex': pageNum,
        'pageSize': '10',
    }
    json_data = requests.post(url=url,headers=headers,data=data).json()['Table1']
    for dic in json_data:
        print(dic['addressDetail'])
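The practical difference between `data=` and `json=` can be seen offline by preparing (but not sending) two requests; the example.com URL is only a placeholder:

```python
import requests

# prepare two POST requests without sending them; example.com is a placeholder
form_req = requests.Request("POST", "http://example.com", data={"keyword": "上海"}).prepare()
json_req = requests.Request("POST", "http://example.com", json={"keyword": "上海"}).prepare()

# data= produces a form-encoded body, json= a JSON body
print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(json_req.headers["Content-Type"])  # application/json
```

Matching the argument to what the browser actually sent (Form Data vs. Request Payload in dev tools) is what makes the server accept the request.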

Example 5: Scraping Honor (Vmall) store data

Here the payload is a JSON-formatted dict, so it is passed with the json argument rather than data.

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
# batch-fetch the shop list to collect every shop id
main_url = 'https://openapi.vmall.com/mcp/offlineshop/getShopList'
data = {"portal":2,"lang":"zh-CN","country":"CN","brand":1,"province":"北京","city":"北京","pageNo":1,"pageSize":20}
# unpack the response dict: the 'shopInfos' list holds the shop_id of every store
json_data = requests.post(main_url,headers=headers,json=data).json()['shopInfos']



url = 'https://openapi.vmall.com/mcp/offlineshop/getShopById'

for dic in json_data:
    shop_id = dic['id']
    params = {
        'portal': '2',
        'version': '10',
        'country': 'CN',
        'shopId': shop_id,
        'lang': 'zh-CN',
    }
    # use a new name so the list being iterated over is not overwritten
    detail_data = requests.get(url, headers=headers, params=params).json()
    print(detail_data)

The json_data returned is a dict; its shopInfos key maps to a list whose elements are themselves dicts, i.e. a dict containing a list of dicts.
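Extracting the IDs from that shape is just indexing the key, then looping over the list. A sketch over a hypothetical sample with the same dict-list-dict shape (the key name matches the code above; the values are made up):

```python
# hypothetical sample with the same shape as the getShopList response:
# a dict whose 'shopInfos' key holds a list of dicts (values are made up)
sample = {
    "shopInfos": [
        {"id": 1001, "name": "Store A"},
        {"id": 1002, "name": "Store B"},
    ]
}

# index the dict, then loop the list, then index each inner dict
shop_ids = [shop["id"] for shop in sample["shopInfos"]]
print(shop_ids)  # [1001, 1002]
```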
