I. Network requests with requests
1. Installing requests
python -m pip install requests
2. GET requests with requests
import requests

r = requests.get('https://baidu.com')
print('url', r.url)
print('status code', r.status_code)
print('encoding', r.encoding)
print('response body', r.text)
print('HTTP headers returned by the server', r.headers)
print('headers the client sent to the server', r.request.headers)
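The call above assumes the request succeeds. In practice it is common to set a timeout and check the status code before using the response; a minimal sketch (the `fetch` helper and its URL are just placeholders for illustration):

```python
import requests

def fetch(url):
    """Fetch url, returning the Response on success or None on failure."""
    try:
        # timeout avoids hanging forever if the server never responds
        r = requests.get(url, timeout=5)
        # raises requests.HTTPError for 4xx/5xx status codes
        r.raise_for_status()
    except requests.RequestException as e:
        # RequestException is the base class for all requests errors
        print('request failed:', e)
        return None
    return r

fetch('https://baidu.com')
```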
3. Passing URL parameters in a GET request
import requests
url = 'https://baidu.com/s'
payload = {'wd': 'python', 'pn': 10}
r = requests.get(url, params=payload)
print('url', r.url)
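The params dict is URL-encoded into the query string for you. You can see the resulting URL without sending anything over the network by preparing the request first:

```python
from requests import Request

# Same URL and parameters as the example above
req = Request('GET', 'https://baidu.com/s', params={'wd': 'python', 'pn': 10})
prepared = req.prepare()
# The params dict has been encoded into the query string
print(prepared.url)  # https://baidu.com/s?wd=python&pn=10
```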
4. Custom request headers
To add HTTP headers to a request, simply pass a dict to the headers parameter.
import requests

url = 'https://baidu.com'
headers = {
    'Content-Type': 'application/json',
    'user-agent': '',
    'Authorization': 'Bearer '
}
r = requests.get(url, headers=headers)
5. POST requests with requests
import requests
import json

headers = {'Content-Type': 'application/json'}
url = 'https://httpbin.org/post'
data = {
    "data": [
        {
            "name": "product name",
            "price": 100
        }
    ]
}
r = requests.post(url, data=json.dumps(data), headers=headers)
print(r.text)
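requests can also serialize the body for you: passing the dict through the json parameter both dumps it to JSON and sets the Content-Type header to application/json automatically. Preparing the request shows this without hitting the network:

```python
from requests import Request

data = {"data": [{"name": "product name", "price": 100}]}
# json= serializes the dict and sets the Content-Type header for us
req = Request('POST', 'https://httpbin.org/post', json=data)
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/json
print(prepared.body)                     # the JSON-encoded request body
```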
II. Parsing web pages with Beautiful Soup
1. Installing Beautiful Soup and a parser
pip install beautifulsoup4
pip install lxml
2. Create an html_doc.html file to parse
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
        <a href="JavaScript:;" class="new">new1</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">This is a new class...<a href="JavaScript:;" class="new">new2</a></p>
</body>
</html>
3. Parsing with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')
print(soup.title)              # the document's <title> tag
print(soup.title.name)         # the tag name of <title>
print(soup.title.string)       # the text inside <title>
print(soup.title.parent.name)  # the tag name of <title>'s parent
print(soup.p)                  # the first <p> tag in the document
print(soup.p['class'])         # the class value of the first <p> tag
print(soup.a)                  # the first <a> tag in the document
print(soup.find_all('a'))      # all <a> tags in the document
print(soup.find(id="link3"))   # the element with id="link3"
# Extract the links from all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# https://example.com/elsie
# https://example.com/lacie
# https://example.com/tillie
# JavaScript:;
# JavaScript:;
# Get all the text content of the document:
print(soup.get_text())
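get_text() can be noisy because it keeps all the whitespace between tags; passing a separator and strip=True gives cleaner output. A small sketch using a fragment of the document above inlined as a string, so it runs without the file:

```python
from bs4 import BeautifulSoup

# A fragment of html_doc.html, inlined for a self-contained example
html = ('<p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story">This is a new class...'
        '<a href="JavaScript:;" class="new">new2</a></p>')
soup = BeautifulSoup(html, 'lxml')
# strip=True trims whitespace around each piece of text and drops
# whitespace-only pieces; separator joins the pieces with the given string
print(soup.get_text(separator=' | ', strip=True))
# The Dormouse's story | This is a new class... | new2
```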
III. Filtering DOM nodes
1. Getting tags with a given attribute
Use find_all("tag name", attrs={"attribute": "value"}) to find all DOM nodes with the given attribute value.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')
newList = soup.find_all("a", attrs={"class": "new"})
for new in newList:
    print(new.get_text())
2. Drilling down through tags layer by layer
select() takes a CSS selector that can describe a path through the DOM tree, so the scraper can target elements precisely.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')
a = soup.select("html body p.story a#link1.sister")
print(a)
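The full path above works, but a CSS selector does not need every ancestor; shorter selectors and select_one (which returns the first match instead of a list) are common. A sketch on an inline fragment of the document:

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, 'lxml')
# A shorter selector: direct child <a class="sister"> of p.story
print([a['id'] for a in soup.select('p.story > a.sister')])  # ['link1', 'link2']
# select_one returns the first matching tag (or None if nothing matches)
print(soup.select_one('#link2').get_text())  # Lacie
```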
3. Using the find method
Use find("tag", class_="value") to find the first tag with the given attribute value.
Note the trailing underscore in class_: class is a reserved keyword in Python. Besides class, you can filter on id, href, src and other attributes in the same way.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')
print("find by class attribute", soup.find("p", class_="title"))
print("find by id attribute", soup.find("a", id="link3"))
find also has some shorthand forms: find("div") returns the first <div> tag in the document, and find(id="link1") returns the element whose id is link1. find calls can be chained, e.g. soup.find("head").find("title").
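A small sketch of those shorthands and of chaining, on an inline fragment of the document so it runs without the file:

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>The Dormouse\'s story</title></head>'
        '<body><a id="link1">Elsie</a></body></html>')
soup = BeautifulSoup(html, 'lxml')
# find(id=...) searches by attribute alone, matching any tag name
print(soup.find(id="link1").get_text())        # Elsie
# Chained find calls: each call searches within the previous result
print(soup.find("head").find("title").string)  # The Dormouse's story
```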