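95. Introduction to web crawlers, long-to-short links, introduction to and quick start with the requests module, GET requests, POST requests with data, and two ways to send cookies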

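Overview

  • Introduction to web crawlers
  • Long-to-short links
  • Introduction to and quick start with the requests module
  • GET requests with parameters
  • Encoding and decoding
  • Sending request headers
  • Sending POST requests with data
  • Two ways to send cookies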

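Today's content in detail

Introduction to web crawlers

1. Crawler: also called a web spider; a set of programs that fetch data from the internet ----> clean the data ----> store it

2. What you need to know to write crawlers
  1. Fetching data: send network requests (HTTP) and receive responses (an HTTP response has response headers and a response body ---> the data that matters is in the response body)
    Python modules: requests, selenium
  2. Cleaning data: parse the data that comes back ---> json, xml, html, binary
    json parsing, xml parsing
    Python modules: re, json, beautifulsoup4 (bs4), lxml, selenium

  3. Storing: save to files, mysql, redis, mongodb
    Python modules: file I/O (built-in open), pymysql, redis-py, pymongo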
    
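3. Anti-crawling mechanisms
  1. Rate limiting
  2. Banning IPs (countered with a proxy pool) and banning accounts (countered with a batch of throwaway accounts: a cookie pool)
  3. Encrypted information in the request headers, plus checks on referer, user-agent...
  4. The response data comes back encrypted
  5. CAPTCHAs (solving the CAPTCHA ---> third-party platforms)
  6. JS encryption ---> the code is minified ---> the encryption routine is still visible in the front end ---> it only looks obscure (current time, md5, sha algorithms, certain strings) ---> JS reverse engineering
  7. Mobile devices: unique device id

4. Search engines are all large crawlers
  e.g. type a keyword into the Baidu search box ---> Baidu searches its own database ---> the results are shown on the page
    How it works: Baidu crawls web pages non-stop and stores what it crawls in its database
      SEO: free; the goal is to get indexed and ranked near the top
      SEM: the top results marked as ads are paid for; the keywords are bought

5. Crawlers: if you can see it, you can crawl it
6. Crawler protocol: most websites publish a crawler protocol that states which paths may be crawled and which may not
  robots.txt (https://cn.bing.com/robots.txt)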
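
A minimal sketch of checking the crawler protocol before fetching a page, using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "my-crawler" user agent are made up for illustration and are parsed from a string so that nothing is downloaded; a real crawler would load the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of downloading them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name; it falls under the "*" rules.
allowed_home = parser.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

print(allowed_home)     # True
print(allowed_private)  # False
```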

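Long-to-short links

PS: https://www.cnblogs.com/liuqingzheng/p/16005866.htm
1. A short-link service (register a short domain, e.g. m.tb.cn). The long address to shorten:
    https://www.cnblogs.com/liuqingzheng/p/16005866.html
  1. Generate a random string, 9QqPdHKXc2n, and store it in a table
    id   random string   real address
    1    9QqPdHKXc2n     ...
  2. Return this address to the user: https://m.tb.cn/h.5bZAfFS?tk=9QqPdHKXc2n

  3. The user visits the short address ----> https://m.tb.cn/h.5bZAfFS?tk=9QqPdHKXc2n ---> the request reaches the short-link service
  4. The service extracts 9QqPdHKXc2n and looks it up in the database ---> real address: ...
  5. The service redirects to the real address, which completes the jump
e.g. sharing a product link from Taobao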
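
The steps above can be sketched in a few lines. This is only an in-memory illustration: the ShortLinkService class, the /s path and the dict standing in for the database table are assumptions, and a real service would answer step 5 with an HTTP redirect rather than returning a string.

```python
import secrets
import string
from typing import Optional
from urllib.parse import parse_qs, urlsplit

ALPHABET = string.ascii_letters + string.digits
SHORT_DOMAIN = "https://m.tb.cn"  # short domain taken from the notes


class ShortLinkService:
    """In-memory stand-in for the table: random string -> real address."""

    def __init__(self) -> None:
        self._table = {}

    def shorten(self, long_url: str, length: int = 11) -> str:
        # Step 1: generate a random string, retrying on the rare collision.
        while True:
            token = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if token not in self._table:
                break
        self._table[token] = long_url
        # Step 2: hand the short address back to the user.
        return f"{SHORT_DOMAIN}/s?tk={token}"

    def resolve(self, short_url: str) -> Optional[str]:
        # Steps 3-4: pull the token out of the query string and look it up.
        tokens = parse_qs(urlsplit(short_url).query).get("tk")
        if not tokens:
            return None
        # Step 5 would redirect to this address in a real service.
        return self._table.get(tokens[0])


service = ShortLinkService()
long_url = "https://www.cnblogs.com/liuqingzheng/p/16005866.html"
short_url = service.shorten(long_url)
print(short_url)
print(service.resolve(short_url))
```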

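Introduction to and quick start with the requests module

1. requests is a module for sending HTTP requests from Python
  1. It is not only for crawlers
  2. Backend services can use it to call other services

2. Install: pip3 install requests

3. Send a GET request with requests
    import requests

    res = requests.get('https://www.cnblogs.com/luonacx/p/17334679.html')
    print(type(res))  # <class 'requests.models.Response'>
    print(res)  # res is the response object: the HTTP response wrapped as a Python object; the response headers and the response body are both available on it
    print(res.text)  # the response body decoded to a string; here it is HTML markup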

GET requests with parameters

1. Source of requests.get()
  def get(url, params=None, **kwargs):
      return request("get", url, params=params, **kwargs)
  def request(method, url, **kwargs):
      """
      :return: :class:`Response <Response>` object
      :rtype: requests.Response
      """
      ...
      
2. Sending parameters with a GET request
    res = requests.get('https://www.cnblogs.com/luonacx/p/17334679.html',params={'color':"red",'age':18})
    print(res.url)  # https://www.cnblogs.com/luonacx/p/17334679.html?color=red&age=18
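
To see what params does without touching the network, the request can be built and prepared but not sent. A small sketch, assuming the requests package is installed; requests.Request(...).prepare() returns the prepared request whose url attribute holds the final address.

```python
import requests

# Build the request without sending it, so the final URL can be inspected offline.
prepared = requests.Request(
    "GET",
    "https://www.cnblogs.com/luonacx/p/17334679.html",
    params={"color": "red", "age": 18},
).prepare()
print(prepared.url)
# https://www.cnblogs.com/luonacx/p/17334679.html?color=red&age=18

# Non-ASCII values are percent-encoded as UTF-8 automatically.
prepared_cn = requests.Request(
    "GET",
    "https://www.baidu.com/s",
    params={"wd": "帅哥"},
).prepare()
print(prepared_cn.url)
# https://www.baidu.com/s?wd=%E5%B8%85%E5%93%A5
```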

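Encoding and decoding

1. URL encoding and decoding use quote and unquote from urllib.parse
  quote encodes
  unquote decodes

2. Example: in a Baidu search URL the keyword 帅哥 appears percent-encoded as wd=%E5%B8%85%E5%93%A5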
  res = requests.get('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E5%B8%85%E5%93%A5&fenlei=256&oq=%25E5%25B8%2585%25E5%2593%25A5&rsv_pq=f7fccd230004f1c3&rsv_t=f9cbQNG3FU1Ci8%2BrNQfSnhwBeqTm7dPEaQYr0KR1x5Nu4H7r7mDrkS8qPMw&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_btype=t&inputT=3194&rsv_sug3=22&rsv_sug1=16&rsv_sug7=101&rsv_sug2=0&rsv_sug4=3194&rsv_sug=1')

  from urllib.parse import quote, unquote
  print(unquote("%E5%B8%85%E5%93%A5"))  # 帅哥
  print(quote('帅哥'))  # %E5%B8%85%E5%93%A5
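
A slightly longer sketch using only the standard library. It also shows why the oq parameter in the Baidu URL above looks different from wd: it is the same keyword encoded twice, so every % has become %25. The shortened query built here is illustrative, not the full URL from the example.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

keyword = "帅哥"

# One round of encoding, as in the wd parameter.
encoded_once = quote(keyword)
print(encoded_once)   # %E5%B8%85%E5%93%A5

# Encoding the encoded text again turns each "%" into "%25", as in the oq parameter.
encoded_twice = quote(encoded_once)
print(encoded_twice)  # %25E5%25B8%2585%25E5%2593%25A5

# Decoding twice gets back to the original keyword.
round_trip = unquote(unquote(encoded_twice))
print(round_trip)     # 帅哥

# urlencode builds a whole query string; parse_qs reverses it.
query = urlencode({"wd": keyword, "ie": "utf-8"})
url = "https://www.baidu.com/s?" + query
decoded = parse_qs(urlsplit(url).query)
print(url)            # https://www.baidu.com/s?wd=%E5%B8%85%E5%93%A5&ie=utf-8
print(decoded["wd"])  # ['帅哥']
```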

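Sending request headers

1. Some sites cannot be scraped with a bare requests call. The usual cause is that the request headers lack fields such as User-Agent, Referer or Cookie: the GET request does not look enough like a real browser, so the site returns no data.

2. Sending request headers
header = {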
"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Referer":"http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2Farticle.php%3Fid%3D1516",
    "Cookie":"ECS_ID=edcbebd5bacf567154d1735a48662ce4501068db; ECS[visit_times]=1; _jzqa=1.2862947650398423000.1688624601.1688624601.1688624601.1; _jzqc=1; _jzqx=1.1688624601.1688624601.1.jzqsr=google%2Ecom|jzqct=/.-; _jzqckmp=1; Hm_lvt_c29657ca36c6c88e02fed9a397826038=1688624601; mediav=%7B%22eid%22%3A%22179539%22%2C%22ep%22%3A%22%22%2C%22vid%22%3A%22%22%2C%22ctn%22%3A%22%22%2C%22vvid%22%3A%22%22%2C%22_mvnf%22%3A1%2C%22_mvctn%22%3A0%2C%22_mvck%22%3A1%2C%22_refnf%22%3A0%7D; Qs_lvt_201322=1688624601%2C1688624666; cto_bundle=aLCec19Oa2QlMkJvSmRTc2FHTGpxMW5rYzhCQXVUNVlwTyUyRld2SUs3cHB2UHFHOHJYWjBGYlFSckhwd0t4VUdaUTE3eiUyRmZQUTh4T1ZtRzlJWnF1ZHYxMzQyRVZJeGlteUJEeDA3JTJCJTJCWUlMeTJQVFl0dDE3WiUyQm0za1ZsYklVWVdiUDFHMnhkODR1UlptJTJCNkNyMWI3cTlIdGNGZUYzUSUzRCUzRA; __xsptplusUT_422=1; Hm_lpvt_c29657ca36c6c88e02fed9a397826038=1688624705; Qs_pv_201322=2569050769411132400%2C3147331329153359400%2C2498690960945691600%2C2644197980350699500%2C4033125508421971000; __xsptplus422=422.1.1688624678.1688624708.4%234%7C%7C%7C%7C%7C%23%23L1aCda_egInUZIGoSu099vXK_Zt3kAjN%23; _qzja=1.1463486580.1688624600976.1688624600976.1688624600976.1688624704893.1688624708768..0.0.7.1; _qzjb=1.1688624600976.7.0.0.0; _qzjc=1; _qzjto=7.1.0; _jzqb=1.17.10.1688624601.1"
}
res = requests.get('http://www.aa7a.cn/',headers=header)
print(res.text)
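
One reason a bare request is easy to spot is the default User-Agent. The sketch below, which assumes the requests package is installed and sends nothing over the network, prints that default and then shows a custom headers dict replacing it once a Session prepares the request. The shortened Referer value is illustrative.

```python
import requests

# What requests announces itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # begins with "python-requests/"

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    ),
    "Referer": "http://www.aa7a.cn/",
}

# Prepare (but do not send) the request to inspect the headers that would go out.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "http://www.aa7a.cn/", headers=headers)
)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The custom User-Agent and Referer appear alongside the session defaults such as Accept and Accept-Encoding.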

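Sending POST requests with data

1. Two ways to send data in a POST request
  data=data -----> the body is urlencoded (form data) --> check Content-Type in the request headers
  json=json -----> the body is JSON --> check Content-Type in the request headers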
header = {
    'Referer': 'http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

data = {
    'username': '616564099@qq.com',
    'password': 'lqz123',
    'captcha': 'xxxx',
    'remember': 1,
    'ref': ' http://www.aa7a.cn/',
    'act': 'act_login'
}

# data= vs json= for the request body
# data=: the body is urlencoded --> check Content-Type in the request headers
# res = requests.post('http://www.aa7a.cn/user.php',headers=header,data=data)
# print(res.text)  # {"error":0,"ref":"http://www.aa7a.cn/"}

# json=: the body is JSON --> check Content-Type in the request headers
res = requests.post('http://www.aa7a.cn/user.php',headers=header,json=data)
print(res.text)
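
The difference between data= and json= is easiest to see by preparing both requests and comparing them, without sending anything. A sketch assuming the requests package is installed; the username and password are placeholder values, not real credentials.

```python
import json

import requests

url = "http://www.aa7a.cn/user.php"
payload = {"username": "user@example.com", "password": "secret"}  # placeholder values

# Prepare both variants without sending them.
form_req = requests.Request("POST", url, data=payload).prepare()
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_req.body)                     # username=user%40example.com&password=secret

print(json_req.headers["Content-Type"])  # application/json
json_body = json.loads(json_req.body)    # the body is the payload serialized as JSON
print(json_body)
```

Which of the two a server accepts depends on how its backend parses the body, so the Content-Type seen in the browser's network panel is the thing to copy.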

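Two ways to send cookies

1. Option 1: pass the cookies with the cookies parameter
  cookie = res.cookies  # res is the response to the login POST
  res1 = requests.get('http://www.aa7a.cn/',cookies=cookie)
  print('616564099@qq.com' in res1.text)  # True

2. Option 2: put the cookie in the request headers

Example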

header = {
    'Referer': 'http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

data = {
    'username': '616564099@qq.com',
    'password': 'lqz123',
    'captcha': 'xxxx',
    'remember': 1,
    'ref': ' http://www.aa7a.cn/',
    'act': 'act_login'
}

res = requests.post('http://www.aa7a.cn/user.php',headers=header,data=data)
print(res.text)  # {"error":0,"ref":"http://www.aa7a.cn/"}
print(res.cookies)  # <RequestsCookieJar[<Cookie ECS[password]=4a5e6ce9d1aba9de9b31abdf303bbdc2 for www.aa7a.cn/>, <Cookie ECS[user_id]=61399 for www.aa7a.cn/>, <Cookie ECS[username]=616564099%40qq.com for www.aa7a.cn/>, <Cookie ECS[visit_times]=1 for www.aa7a.cn/>, <Cookie ECS_ID=7efb1a828385435cc0e64a18e5ead1bdfbdaef67 for www.aa7a.cn/>]>

# Get the cookies returned after a successful login
# Option 1: pass the cookies with the cookies parameter
cookie = res.cookies
res1 = requests.get('http://www.aa7a.cn/',cookies=cookie)
print('616564099@qq.com' in res1.text)  # True

# Option 2: put the cookie in the request headers
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Cookie':'deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI3MzAyZDQ5Yy1mMmUwLTRkZGItOTZlZi1hZGFmZTkwMDBhMTEiLCJleHBpcmUiOiIxNjYxNjU0MjYwNDk4In0.4Y4LLlAEWzBuPRK2_z7mBqz4Tw5h1WeqibvkBG6GM3I; __snaker__id=ozS67xizRqJGq819; YD00000980905869%3AWM_TID=M%2BzgJgGYDW5FVFVAVQbFGXQ654xCRHj8; _9755xjdesxxd_=32; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1688616419; YD00000980905869%3AWM_NI=KpVsBKpke6xW0Ozhu07T2FswNWgn4UaBQdTBY7Z6X6f6CbQtBNZDQJ94vZ3PUgbkpIvoPsM5PprPUE2dtyRVhLPXd2cjrMI7MRtbiJtL2lJGNgPkbgbKwARmuFZz5K2JRDE%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eeabc744f4aefca3f03d89e78ab7d54a878e9fb1d468f4b28c83cc5383928daced2af0fea7c3b92a89b09a94c141b4b5abd7fb5df491b689f83cf58cfea7f27d8e89a98fb75fa39e9f85d74a9be996d7db3ea6ea8d83d480829296a5d442fbecbabbb25c97b9ae8cf343b6b1a094d03f94b0beb1d84ff2bcbc9ad15eac94a0b9e73aa1b6beaecc3d8f8ebc83ec42b894fe8dc94d8eb9a08ce16da295b9a2d167b79a969bb841f2969ea5c837e2a3; gdxidpyhxdE=8v%5CAccgIfMa0JDz9H9Ptnc%5Cc3owLo0EZ2IiWyPHAKTiu67%2BrKDu4bwfw5Wcy5as25LX3%2FpCICt2Gdf4%5CJ1WmqzAmbQaTarBZsdITdIu021%5C7VhnnOlNgw%2FcwwB5gYpzyegJoffKg6r2DfKJKKP%2FLokWEI0auXpasnO8CAetcf5ijccoZ%3A1688618634928; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjZHVfNTMyMDcwNzg0NjAiLCJleHBpcmUiOiIxNjkxMjA5NzM4NjA1In0.SZmVRy5HxKkmNTDCSQ5LEX-fkx3KyH0_jt0Wi-GEnLE; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1688617739',
}
data = {
    'linkId': '39196098'
}
res = requests.post('https://dig.chouti.com/link/vote', headers=header,data=data)
print(res.text)
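
Both options end up as a Cookie request header. The sketch below prepares one request each way, without sending, and compares the result. It assumes the requests package is installed; cookie_string_to_dict is a hypothetical helper for turning a cookie string copied from the browser into a dict, and the cookie values are made up.

```python
import requests


def cookie_string_to_dict(cookie_header: str) -> dict:
    """Split a 'k1=v1; k2=v2' string copied from the browser into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that have no "="
            cookies[name] = value
    return cookies


raw_cookie = "ECS_ID=abc123; ECS[visit_times]=1"  # made-up values
url = "http://www.aa7a.cn/"

# Option 1: the cookies parameter (accepts a dict or a cookie jar such as res.cookies).
via_param = requests.Request(
    "GET", url, cookies=cookie_string_to_dict(raw_cookie)
).prepare()

# Option 2: the raw string placed directly in the request headers.
via_header = requests.Request("GET", url, headers={"Cookie": raw_cookie}).prepare()

print(via_param.headers["Cookie"])
print(via_header.headers["Cookie"])
```

Option 1 is convenient when the cookies come from an earlier response; option 2 is convenient when the string is pasted from the browser's developer tools.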