Python: Making Network Requests and Parsing Web Pages


I. Network Requests with requests

1. Installing requests

python -m pip install requests

2. GET Requests with requests

import requests

r = requests.get('https://baidu.com')
print('url', r.url)
print('status code', r.status_code)
print('encoding', r.encoding)
print('body text', r.text)
print('response headers from the server', r.headers)
print('request headers sent by the client', r.request.headers)
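
Two other accessors are worth knowing here: r.content returns the raw response body as bytes (useful for binary data), and r.encoding can be assigned before reading r.text if the auto-detected encoding is wrong. A minimal sketch continuing from the response above:

print(type(r.content))  # <class 'bytes'> -- raw body, no decoding
r.encoding = 'utf-8'    # override the detected encoding before using r.text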

3. Passing URL Parameters in a GET Request

import requests

url = 'https://baidu.com/s'
payload = {'wd': 'python', 'pn': 10}
r = requests.get(url, params=payload)
print('url', r.url)
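
The parameters are URL-encoded and appended to the URL, so the printed value should look roughly like this (parameter order follows the dict):

url https://baidu.com/s?wd=python&pn=10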

4. Custom Request Headers

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter.

import requests

url = 'https://baidu.com'
headers = {
    'Content-Type': 'application/json',
    'user-agent': '',
    'Authorization': 'Bearer '
}

r = requests.get(url, headers=headers)
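
The user-agent and Authorization values above are placeholders to fill in yourself. To confirm the custom headers were actually sent, inspect r.request.headers as in the first example:

print(r.request.headers)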

5. POST Requests with requests

import requests
import json

headers = {'Content-Type': 'application/json'}
url = 'https://httpbin.org/post'
data = {
    "data": [
        {
            "name": "商品名称",
            "price": 100
        }
    ]
}
r = requests.post(url, data=json.dumps(data), headers=headers)

print(r.text)
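
requests can also do the JSON serialization itself: passing json=data makes it encode the body and set the Content-Type: application/json header automatically, so the example above can be shortened to:

r = requests.post(url, json=data)
print(r.text)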


II. Parsing Web Pages with Beautiful Soup

1. Installing Beautiful Soup and a Parser

pip install beautifulsoup4

pip install lxml
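
lxml is a fast third-party parser. If you would rather not install it, Python's built-in 'html.parser' also works, e.g. BeautifulSoup(html, 'html.parser'), though it is slower.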

2. Create an html_doc.html File to Parse

<html>

<head>
    <title>The Dormouse's story</title>
</head>

<body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
        <a href="JavaScript:;" class="new">new1</a>
        and they lived at the bottom of a well.
    </p>

    <p class="story">This is a new class...<a href="JavaScript:;" class="new">new2</a></p>
</body>

</html>

3. Parsing with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')
print(soup.title)  # the <title> tag of the HTML document
print(soup.title.name)  # the tag name of <title>
print(soup.title.string)  # the string inside the <title> tag
print(soup.title.parent.name)  # the tag name of <title>'s parent
print(soup.p)  # the first <p> tag in the document
print(soup.p['class'])  # the class value of the first <p> tag
print(soup.a)  # the first <a> tag in the document
print(soup.find_all('a'))  # all <a> tags in the document
print(soup.find(id="link3"))  # the element with id="link3"

# Extract the links from all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    # https://example.com/elsie
    # https://example.com/lacie
    # https://example.com/tillie
    # JavaScript:;
    # JavaScript:;

# Extract all text content from the document:

print(soup.get_text())

III. Filtering DOM Nodes

1. Finding Tags with a Given Attribute

Use find_all("tag-name", attrs={"attribute": "value"}) to find all DOM nodes with the specified attribute value.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')

newList = soup.find_all("a", attrs={"class": "new"})
for new in newList:
    print(new.get_text())
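
For the common case of matching on class, find_all also accepts the class_ keyword directly, which is equivalent to the attrs form above:

newList = soup.find_all("a", class_="new")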

2. Searching Layer by Layer Through Tags

With select("CSS selector") you can walk down the DOM tree layer by layer, so the scraper targets nodes precisely.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')

a = soup.select("html body p.story a#link1.sister")
print(a)
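
The full path above is just for illustration; shorter CSS selectors work just as well, and select_one returns only the first match:

print(soup.select("p.story a"))   # all <a> tags inside <p class="story">
print(soup.select_one("#link1"))  # the first element with id="link1"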

3. Using the find Method

Use find("tag", class_="value") to find the first tag with the specified attribute value.

Note that class_ is used here because class is a reserved word in Python. Besides the class attribute, you can also filter on id, href, src, and other attributes.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')

print("查找class属性", soup.find("p", class_="title"))
print("查找id属性", soup.find("a", id="link3"))

find also has some shorthand forms: find("div") returns the DOM for the first div tag, and find(id="link1") returns the DOM for the element whose id is link1. find calls can also be chained, e.g. soup.find("head").find("title").
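
A minimal sketch of these shorthand forms against html_doc.html:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("html_doc.html", "r", encoding="utf-8"), 'lxml')

print(soup.find(id="link1"))            # the element whose id is "link1"
print(soup.find("head").find("title"))  # chained find: <title> inside <head>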