Scraping Static Web Pages with Requests



import requests

r = requests.get('http://www.baidu.com/')
print(r.encoding)
print(r.status_code)
print(r.text)


r.text: the response body as text, decoded automatically using the character encoding declared in the response headers

r.encoding: the text encoding used for the response content

r.status_code: the HTTP status code of the response

r.content: the response body as raw bytes

r.json(): Requests' built-in JSON decoder
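The relationship between these attributes can be illustrated without any network traffic by constructing a Response object by hand. This is purely a sketch for illustration; in real use requests.get() fills these fields in from the server's reply:

```python
import requests

# Build a Response manually to show how the attributes relate.
# Normally requests populates these from the server's reply.
resp = requests.models.Response()
resp.status_code = 200
resp.encoding = 'utf-8'
resp._content = '{"q": "爬虫"}'.encode('utf-8')  # what r.content would hold

print(resp.status_code)  # 200
print(resp.content)      # raw bytes of the body
print(resp.text)         # bytes decoded with resp.encoding
print(resp.json())       # body parsed as JSON into a dict
```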

Passing URL Parameters

# Target URL: https://so.csdn.net/so/search?q=爬虫
import requests

payload = {'q': '爬虫'}  # renamed from dict to avoid shadowing the built-in
r = requests.get('https://so.csdn.net/so/search', params=payload)
print(r.url)
print(r.text)
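To see exactly how Requests encodes the parameters into the final URL, the request can be prepared without being sent, using the Request/PreparedRequest API:

```python
import requests

# Prepare the request without sending it, to inspect the encoded URL.
req = requests.Request('GET', 'https://so.csdn.net/so/search', params={'q': '爬虫'})
prepared = req.prepare()
print(prepared.url)  # non-ASCII parameter values are percent-encoded automatically
```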


Customizing Request Headers

Headers carry information about the request, the response, or other entities being transmitted.


The request headers can be inspected in the browser's developer tools.

Setting request headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Host': 'so.csdn.net'
}
payload = {'q': '爬虫'}  # renamed from dict to avoid shadowing the built-in
r = requests.get('https://so.csdn.net/so/search', params=payload, headers=headers)
print(r.status_code)
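Why set a User-Agent at all? When none is supplied, Requests identifies itself with a default header that many sites block. Both the default and an override can be checked locally without sending anything (the 'my-crawler/1.0' string below is a made-up placeholder):

```python
import requests

# The headers requests would send when none are supplied.
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # 'python-requests/<version>', easy for sites to block

# A custom header dict replaces the default on the prepared request.
req = requests.Request('GET', 'https://so.csdn.net/so/search',
                       headers={'User-Agent': 'my-crawler/1.0'})  # hypothetical UA
print(req.prepare().headers['User-Agent'])
```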


A status code of 200 means the request succeeded.

POST Requests

To switch from GET to POST, call requests.post instead of requests.get, and pass the form fields through the data argument instead of params.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Host': 'so.csdn.net'
}
payload = {'q': '爬虫'}  # renamed from dict to avoid shadowing the built-in
r = requests.post('https://so.csdn.net/so/search', data=payload, headers=headers)
print(r.status_code)
print(r.text)
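The difference between the two styles shows up clearly when both requests are prepared without being sent: params end up in the URL's query string, while data is form-encoded into the request body (example.com below is just a placeholder host):

```python
import requests

payload = {'q': '爬虫'}

get_req = requests.Request('GET', 'https://example.com/search', params=payload).prepare()
post_req = requests.Request('POST', 'https://example.com/search', data=payload).prepare()

print(get_req.url)    # parameters appear in the query string
print(post_req.url)   # the POST URL stays bare
print(post_req.body)  # parameters are form-encoded into the request body
print(post_req.headers['Content-Type'])  # application/x-www-form-urlencoded
```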

Setting a Timeout

A timeout of 20 seconds is a common choice.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
r = requests.get('https://blog.csdn.net/', headers=headers, timeout=20)
print(r.status_code)


If the request takes longer than the timeout, an exception (requests.exceptions.Timeout) is raised:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
r = requests.get('https://blog.csdn.net/', headers=headers, timeout=0.001)
print(r.status_code)
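In a real crawler the exception should be caught rather than left to crash the script. A minimal sketch; note that Timeout, like every requests error, derives from requests.exceptions.RequestException, so a second except clause can absorb other failures such as DNS or connection errors:

```python
import requests

try:
    r = requests.get('https://blog.csdn.net/', timeout=0.001)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.RequestException as exc:
    # Any other requests failure (DNS error, refused connection, ...).
    print('request failed:', type(exc).__name__)
```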


Hands-On: Scraping Douban's Top 250 Movies

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Host': 'movie.douban.com',
    'Cookie': 'll="118267"; bid=GDklS76UbIM; dbcl2="251015641:WiqhW39971k"; ck=BRDD; push_noty_num=0; push_doumail_num=0'
}
movie_list = []
for i in range(0, 10):
    # Each page lists 25 movies; the start parameter pages through them
    # (start=0 returns the same page as the bare top250 URL).
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    r = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    # Each movie's title block is a <div class="hd">.
    for div in soup.find_all('div', 'hd'):
        spans = div.find_all('span')
        movie_list.append(spans[0].text)  # the first <span> holds the Chinese title
print(movie_list)
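The parsing step can be verified offline against a small hand-written HTML snippet that mimics the structure of a Douban list page. The snippet and titles below are illustrative, and the stdlib html.parser is used so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# A tiny hand-written snippet mimicking Douban's markup: each movie
# sits in a <div class="hd"> whose first <span> is the Chinese title.
html = '''
<div class="hd"><a><span class="title">肖申克的救赎</span><span class="title"> / The Shawshank Redemption</span></a></div>
<div class="hd"><a><span class="title">霸王别姬</span></a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = [div.find_all('span')[0].text for div in soup.find_all('div', 'hd')]
print(titles)  # ['肖申克的救赎', '霸王别姬']
```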
