Getting Started with Web Scraping

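Introduction to web scraping

Crawler: also called a spider (web spider)

Main functions:
    # Most software today sends and receives data over HTTP, so a crawler can:
    -simulate HTTP requests and fetch data from someone else's server
    -work around anti-scraping measures: these differ from site to site and can be fairly complex

How a crawler works:
    -send an HTTP request [requests, selenium] to a third-party server ----> parse the data you want out of the response [selenium, bs4] ----> store it [file, excel, mysql, redis, mongodb, etc.]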
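The three stages (fetch, parse, store) can be sketched end to end. This is only a minimal illustration under stated assumptions: it swaps in the standard library's `html.parser` and an in-memory `sqlite3` database in place of bs4 and MySQL, the fetch step is replaced by a hard-coded sample page, and the names `LinkParser`, `parse_links`, `save_links` and `SAMPLE_HTML` are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    """Stage 2: parse the wanted data out of the response body."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def save_links(conn, links):
    """Stage 3: store the parsed rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    conn.commit()


# Stage 1 (fetch) would normally be: html = requests.get(url).text
# A hard-coded page stands in for it here (assumption, for illustration only).
SAMPLE_HTML = (
    '<ul><li><a href="/p/1">First post</a></li>'
    '<li><a href="/p/2">Second post</a></li></ul>'
)

conn = sqlite3.connect(":memory:")
save_links(conn, parse_links(SAMPLE_HTML))
rows = conn.execute("SELECT href, title FROM links ORDER BY href").fetchall()
print(rows)
```

Keeping the parse and store steps as separate functions makes it easy to swap the parser for bs4 or the storage for another database later.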

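Is scraping legal:
    -Robots protocol: by convention a site places a robots.txt file in its root path, declaring which paths crawlers may fetch and which they may not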
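A crawler can check these rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`. The rules in `ROBOTS_TXT`, the crawler name `my-crawler` and the `example.com` URLs are made up for illustration. A real crawler would load the site's actual file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```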

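# Baidu is one big crawler
    -when you type a query into the Baidu search box and press Enter, the results come from Baidu's own database
    -Baidu crawls pages and link addresses across the internet around the clock --> and stores what it crawls in its own database
    -when you click a result, you are redirected to the real address
    -core: search, i.e. finding the data you want in a massive data set
    -SEO: search engine optimization, ranking higher in the free (organic) results
    -SEM: search engine marketing, paying for keywords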

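Basic use of the requests module

# pip3 install requests
# requests is a module for sending HTTP requests. It is not only for scraping: you will also use it later to call third-party APIs. Under the hood it wraps urllib3, which is a third-party library (not a built-in module)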

import requests
res = requests.get('https://www.cnblogs.com/liuqingzheng/p/16005866.html')
print(res.text)

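GET requests with parameters

1. Put them directly in the URL
    res=requests.get('https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3')
2. Pass them through the params argument
    res=requests.get('https://www.baidu.com/s',params={
        'wd':'美女',
        'name':'lqz'
    })
    # equivalent to https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3&name=lqz (requests URL-encodes the values)
3. URL encoding and decoding
    # Searching Baidu for '美女' shows the two characters percent-encoded in the address bar: %E7%BE%8E%E5%A5%B3
    from urllib import parse
    # res=parse.quote('美女')
    # print(res)
    res=parse.unquote('%E7%BE%8E%E5%A5%B3')
    print(res)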
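To see the round trip in one place, here is a small standard-library sketch of quoting, unquoting and building a whole query string with `urllib.parse`. The variable names are illustrative. The keyword and the `name=lqz` parameter are the ones used above.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "美女"

# Percent-encode a single value (UTF-8 bytes, each written as %XX)
encoded = quote(keyword)

# Decode it back to the original text
decoded = unquote(encoded)

# Encode a whole dict of parameters into a query string
query = urlencode({"wd": keyword, "name": "lqz"})
url = "https://www.baidu.com/s?" + query

print(encoded)
print(decoded)
print(url)
```

`urlencode` covers the common case of a flat dict of parameters. Passing `params=` to `requests.get` builds the query string for you in a similar way.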

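Sending request headers

HTTP requests carry headers, and some sites use them for anti-scraping ---> if a required header is missing, the site will not return normal data
1. User-Agent: the client type (browser, program, crawler, ...). We usually disguise ourselves as a browser
2. Referer: the address of the page the request came from
    # Login pages: a normal user opens the home page first and then logs in, so the login request carries a Referer. If a crawler posts straight to the login endpoint without a Referer, the server may treat the request as malicious and reject it
    # Image hotlink protection
3. Cookie: after login the server issues a cookie; sending that cookie is equivalent to being logged in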

header={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'referer': 'https://dig.chouti.com/',
    'Cookie': 'xxxx'
}
res=requests.get('https://dig.chouti.com/zone/news',headers=header)
print(res.text)

Sending cookies

1. Put them directly in the request headers
    res=requests.get('https://dig.chouti.com/zone/news',headers={
        'User-Agent': 'xxx',
        'Cookie': 'xxxx'  # send the cookie
    })
2. Use the cookies parameter: cookies are special and almost always needed, so the module exposes them as a separate parameter that takes a dict (or a CookieJar)
    data={
        'linkId':'36996038'
    }
    header={
        # client type
        'User-Agent': 'xxxx',
    }
    res=requests.post('https://dig.chouti.com/link/vote',
                      data=data,headers=header,cookies={'key':'value'})
    print(res.text)
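A cookie copied from the browser's developer tools arrives as one `name=value; name=value` string, while the `cookies` parameter expects a dict. A minimal conversion sketch follows. The helper name `cookie_header_to_dict` and the sample cookie values are hypothetical.

```python
def cookie_header_to_dict(raw_cookie):
    """Turn a raw Cookie header string into a dict usable as cookies=..."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split only on the first '=' because values may contain '=' themselves
        name, value = pair.strip().split("=", 1)
        cookies[name] = value
    return cookies


# Made-up cookie string for illustration
raw = "sessionid=abc123; token=x=y; theme=dark"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict can then be passed as `cookies=cookies` to `requests.get` or `requests.post`.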

Sending POST requests

data = {
    'username': '616564099@qq.com',
    'password': 'lqz123',
    'captcha': 'cccc',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login'
}
res = requests.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
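print(res.cookies)  # cookies from the response headers; after a successful login these are the logged-in cookies. It is a RequestsCookieJar, which can be used like a dict

# Visit the home page with the login cookies attached; without cookies=... the site treats the request as anonymous
res2 = requests.get('http://www.aa7a.cn/', cookies=res.cookies)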
print('616564099@qq.com' in res2.text)


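# A POST request can carry its body as data={} or json={}. On a DRF backend, print request.data to see what arrived
# data=dict uses the default encoding: urlencoded (application/x-www-form-urlencoded)
# json=dict uses JSON encoding (application/json)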
res = requests.post('http://www.aa7a.cn/user.php', json={})
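The difference between the two encodings is easiest to see by producing both bodies from the same dict. The sketch below uses only the standard library to approximate what is sent on the wire. It is an illustration of the two formats, not the exact serialization code inside requests, and the `payload` values are made up.

```python
import json
from urllib.parse import urlencode

# Hypothetical payload for illustration
payload = {"username": "lqz", "remember": 1}

# data=payload -> form body, Content-Type: application/x-www-form-urlencoded
form_body = urlencode(payload)

# json=payload -> JSON body, Content-Type: application/json
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

A backend that only reads form fields will not see values sent with `json=`, and vice versa, so the choice has to match what the server expects.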


# Using requests.session(): use it just like requests itself, but it maintains cookies automatically across requests
session=requests.session()
data = {
    'username': '616564099@qq.com',
    'password': 'lqz123',
    'captcha': 'cccc',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login'
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
res2 = session.get('http://www.aa7a.cn/')
print('616564099@qq.com' in res2.text)

The Response object

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
response = requests.get('https://www.jianshu.com', params={'name': 'lqz', 'age': 19},headers=header)
# response attributes
print(response.text)  # response body as text
print(response.content)  # response body as bytes
print(response.status_code)  # response status code
print(response.headers)  # response headers
print(response.cookies)  # response cookies
print(response.cookies.get_dict())  # convert the CookieJar object into a plain dict
print(response.cookies.items())  # all cookie keys and values
print(response.url)  # final URL of the request
print(response.history)  # if the request was redirected, the list of intermediate redirect responses
print(response.encoding)  # encoding used to decode response.text

Fetching binary data

import requests

res = requests.get('https://upload.jianshu.io/admin_banners/web_images/5067/5c739c1fd87cbe1352a16f575d2df32a43bea438.jpg')
with open('美女.jpg', 'wb') as f:
    f.write(res.content)

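# stream=True avoids loading the whole video into memory; iter_content() defaults to 1-byte chunks, so pass a chunk size
res1 = requests.get('https://vd3.bdstatic.com/mda-mk21ctb1n2ke6m6m/sc/cae_h264/1635901956459502309/mda-mk21ctb1n2ke6m6m.mp4', stream=True)
with open('美女.mp4', 'wb') as f:
    for chunk in res1.iter_content(chunk_size=1024):
        f.write(chunk)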
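Writing a large download piece by piece can be isolated in a small helper that accepts any iterable of byte chunks. This is a sketch under assumptions: `save_chunks` is a hypothetical name, and a list of byte strings stands in for a real response so the example runs offline.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for the chunks of a streamed response (assumption, for illustration)
fake_chunks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_chunks(fake_chunks, target)
print(total)
```

With a real download, the `chunks` argument would be something like `res.iter_content(chunk_size=8192)` on a response requested with `stream=True`.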

Parsing JSON

# With a separated front end and back end, the back end usually returns JSON,
# so you can parse it directly with res.json()

res = requests.get(
    'https://api.map.baidu.com/place/v2/search?ak=6E823f587c95f0148c19993539b99295&region=%E4%B8%8A%E6%B5%B7&query=%E8%82%AF%E5%BE%B7%E5%9F%BA&output=json')
print(res.text)
print(type(res.text))
print(res.json()['results'][0]['name'])
print(type(res.json()))
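Indexing straight into `['results'][0]['name']` raises an exception when the body is not JSON or the result list is empty. A more defensive sketch is shown below. It parses text with the standard `json` module, which is roughly what `res.json()` does with the response body. `SAMPLE_TEXT` is a made-up payload that only mirrors the `results` and `name` keys used above, and `first_result_name` is a hypothetical helper.

```python
import json

# Hypothetical response body; only the key layout follows the example above
SAMPLE_TEXT = '{"status": 0, "message": "ok", "results": [{"name": "KFC (Sample Store)"}]}'


def first_result_name(text):
    """Return the name of the first result, or None if it is not available."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # body was not valid JSON (for example an HTML error page)
    if not isinstance(data, dict):
        return None
    results = data.get("results") or []
    if not results:
        return None
    return results[0].get("name")


print(first_result_name(SAMPLE_TEXT))
print(first_result_name("<html>error</html>"))
```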