Python Beginner Series, Part 19: HTML Parsing

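1. Introduction

The previous article covered the requests library, which sends requests to a web application and retrieves its data. What comes back, though, is raw HTML. In practice, for example in a web crawler, we need to extract only the data we care about. How do we do that?

Regular expressions are one option, but they are somewhat cumbersome. The Python ecosystem offers better tools:

BeautifulSoup: a DOM tree parsing library with a concise syntax
XPath: a path-expression query language for XML/HTML
CSS selectors: element-locating syntax modeled on web stylesheets

This article shows how to extract data from HTML with these tools, which is enough to build a basic crawler.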

2. Examples

2.1. Environment setup

Install the required libraries:

pip install beautifulsoup4 lxml

2.2. BeautifulSoup examples

from bs4 import BeautifulSoup

# Prepare a sample HTML document
html = """
<html>
<head>
 <title>商品列表</title>
</head>
 <body>
  <div class="products">
   <a href="/item1" class="hot">电脑</a>
   <a href="/item2">手机</a>
   <a href="/item3">PAD</a>
  </div>
 </body>
</html>
"""

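# Create the BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html, 'lxml')

Find the first a tag whose class is hot:

# find() returns the first match, or None if nothing matches
hot_item = soup.find('a', class_='hot')
print(hot_item.text)

Find all a tags:

# find_all() returns a list of every match
all_links = soup.find_all('a')
print([link['href'] for link in all_links])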

Using CSS selectors:

# Select the <a> tags that are direct children of div.products
products = soup.select('div.products > a')
print([link.text for link in products])

# Links whose href contains "item"
partial_links = soup.select('a[href*="item"]')
print([link.text for link in partial_links])
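The selector calls above assume every match exists and every link has an href. The following is a minimal sketch of two defensive habits that help on messier pages: select_one for a single optional match, and Tag.get for an optional attribute. The third link deliberately has no href; that variation is an assumption added for illustration and is not part of the original sample.

```python
from bs4 import BeautifulSoup

# Sample fragment; the last <a> has no href on purpose (illustrative assumption)
html = """
<div class="products">
  <a href="/item1" class="hot">电脑</a>
  <a href="/item2">手机</a>
  <a>PAD</a>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# select_one returns the first match, or None when nothing matches
hot = soup.select_one("a.hot")
missing = soup.select_one("a.sold-out")

# Tag.get returns a default instead of raising KeyError like link['href'] would
hrefs = [a.get("href", "") for a in soup.select("div.products > a")]

print(hot.text)
print(missing)
print(hrefs)
```

Checking for None before reading .text avoids an AttributeError when a page omits the element you expected.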

2.3. XPath examples

from lxml import etree

# Prepare a sample HTML document
html_doc = """
<html>
<head>
 <title>商品列表</title>
</head>
 <body>
  <div class="products">
   <a href="/item1" class="hot">电脑</a>
   <a href="/item2">手机</a>
   <a href="/item3">PAD</a>
  </div>
 </body>
</html>
"""

# Parse the HTML; etree.HTML returns the root element of the tree
html = etree.HTML(html_doc)

# Get the href of every <a> tag
links = html.xpath('//a/@href')
print(links)

# Locate elements whose class contains "hot" (a substring match)
hot_items = html.xpath('//*[contains(@class, "hot")]')
for item in hot_items:
    print(item.text)

A slightly more complex scenario:

# Get the <a> elements that are direct children of the first div
first_div_links = html.xpath('/html/body/div[1]/a')
for item in first_div_links:
    print(item.text)

# Elements whose text contains "手机" (mobile phone)
keyword_items = html.xpath('//*[contains(text(), "手机")]')
for item in keyword_items:
    print(item.text)
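The XPath queries above return either elements or attribute strings. As a rough sketch of how the two can be combined, the block below pairs each link's href with its text and also uses a position predicate. The variable names (root, links, second) are illustrative choices, not part of the original example.

```python
from lxml import etree

html_doc = """
<html>
 <body>
  <div class="products">
   <a href="/item1" class="hot">电脑</a>
   <a href="/item2">手机</a>
   <a href="/item3">PAD</a>
  </div>
 </body>
</html>
"""

root = etree.HTML(html_doc)

# Each result is an element, so attributes and text can be read together
links = [
    {"href": a.get("href"), "text": a.text}
    for a in root.xpath('//div[@class="products"]/a')
]

# XPath positions are 1-based: a[2] is the second <a> child
second = root.xpath('//div[@class="products"]/a[2]/text()')

print(links)
print(second)
```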

2.4. Combined example

Suppose we have an HTML fragment containing a product list and need to extract the product information from it:

<!-- 商品列表 -->
<div class="product-list">
  <div class="item" data-id="1001">
    <h3 class="title">无线耳机</h3>
    <p class="price">￥199.00</p>
    <div class="tags">
      <span>热卖</span>
      <span>新品</span>
    </div>
  </div>
  <!-- 更多商品... -->
</div>

2.4.1. BeautifulSoup version

from bs4 import BeautifulSoup
from lxml import etree  # used by the XPath version in 2.4.2

# Prepare a sample HTML fragment
html_doc="""
<!-- 商品列表 -->
<div class="product-list">
  <div class="item" data-id="1001">
    <h3 class="title">无线耳机</h3>
    <p class="price">￥199.00</p>
    <div class="tags">
      <span>热卖</span>
      <span>新品</span>
    </div>
  </div>
  <!-- 更多商品... -->
</div>
"""

# BeautifulSoup implementation
soup = BeautifulSoup(html_doc, 'lxml')
items = soup.select('div.product-list > div.item')
for item in items:
    print({
        'title': item.find('h3').text.strip(),
        # [1:] drops the leading currency symbol "￥"
        'price': item.find('p', class_='price').text[1:],
        'tags': [tag.text for tag in item.select('.tags span')]
    })

2.4.2. XPath version

# XPath implementation (reuses html_doc and the etree import from 2.4.1)
html = etree.HTML(html_doc)
products = html.xpath('//div[@class="product-list"]/div[@class="item"]')
for p in products:
    print({
        'title': p.xpath('.//h3/text()')[0].strip(),
        'price': p.xpath('.//p[@class="price"]/text()')[0][1:],
        'tags': p.xpath('.//div[@class="tags"]/span/text()')
    })

2.4.3. CSS selector version

# CSS selector implementation
soup = BeautifulSoup(html_doc, 'lxml')
for item in soup.select('div.item'):
    print({
        'title': item.select('h3.title')[0].text,
        'price': item.select('p.price')[0].text,
        'tags': [tag.text for tag in item.select('.tags span')]
    })
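The three versions above assume every product carries a title, a price, and tags. Real listings often omit fields, so here is a hedged sketch of a more tolerant extraction. The helpers text_or_none and parse_price are hypothetical names introduced for this example, and the second product (id 1002, with no price or tags) is invented sample data.

```python
from bs4 import BeautifulSoup

# The second item is invented for illustration: it has no price and no tags
html_doc = """
<div class="product-list">
  <div class="item" data-id="1001">
    <h3 class="title">无线耳机</h3>
    <p class="price">￥199.00</p>
    <div class="tags"><span>热卖</span><span>新品</span></div>
  </div>
  <div class="item" data-id="1002">
    <h3 class="title">蓝牙音箱</h3>
  </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_price(raw):
    """Strip a leading yen sign and convert to float; None stays None."""
    if raw is None:
        return None
    return float(raw.lstrip("￥¥"))


soup = BeautifulSoup(html_doc, "lxml")
products = []
for item in soup.select("div.product-list > div.item"):
    products.append({
        "id": item.get("data-id"),
        "title": text_or_none(item, "h3.title"),
        "price": parse_price(text_or_none(item, "p.price")),
        "tags": [t.get_text(strip=True) for t in item.select(".tags span")],
    })

for product in products:
    print(product)
```

Collecting the records in a list instead of printing them directly makes it easier to write them to a file or database later.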

2.5. Syntax details

2.5.1. Syntax comparison of the three approaches
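One way to compare the three syntaxes is to express the same two queries in each of them. The sketch below does that against the product-link sample from section 2.2; it is an illustration of equivalent queries, not an exhaustive comparison, and the variable names are arbitrary.

```python
from bs4 import BeautifulSoup
from lxml import etree

html_doc = """
<html>
 <body>
  <div class="products">
   <a href="/item1" class="hot">电脑</a>
   <a href="/item2">手机</a>
   <a href="/item3">PAD</a>
  </div>
 </body>
</html>
"""

soup = BeautifulSoup(html_doc, "lxml")
root = etree.HTML(html_doc)

# Query 1: text of the <a> tags with class "hot"
bs_hot = [a.text for a in soup.find_all("a", class_="hot")]   # BeautifulSoup API
css_hot = [a.text for a in soup.select("a.hot")]              # CSS selector
xpath_hot = root.xpath('//a[@class="hot"]/text()')            # XPath

# Query 2: href of every <a> directly under div.products
bs_hrefs = [a["href"] for a in soup.find("div", class_="products").find_all("a", recursive=False)]
css_hrefs = [a["href"] for a in soup.select("div.products > a")]
xpath_hrefs = root.xpath('//div[@class="products"]/a/@href')

print(bs_hot, css_hot, xpath_hot)
print(bs_hrefs, css_hrefs, xpath_hrefs)
```

Note that XPath can return text and attribute values directly, while the BeautifulSoup and CSS selector calls return tags from which the value is then read.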

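2.5.2. XPath syntax details

Of the three approaches, XPath is the most flexible and the most complex, which lets it handle more demanding scenarios. Let's look at its syntax in detail.

2.5.2.1. Document node types

2.5.2.2. Path expressions

Absolute path: starts from the document root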

/html/body/div[1]/a

Relative path: starts from the current context node

./div//span
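To make the difference between the two path styles concrete, here is a small sketch using lxml. The HTML fragment is invented for this illustration. It also shows a common pitfall: a path that starts with // searches the whole document even when it is evaluated on a sub-element.

```python
from lxml import etree

# Invented fragment: one <span> outside the container, one nested inside it
html_doc = """
<html><body>
  <span>outside</span>
  <div class="products">
    <a href="/item1">A</a>
    <div class="tags"><p><span>inside</span></p></div>
  </div>
</body></html>
"""

root = etree.HTML(html_doc)

# Absolute path: evaluated from the document root
absolute = root.xpath('/html/body/div[1]/a')

# Relative path: evaluated from a context node, here div.products
container = root.xpath('//div[@class="products"]')[0]
relative = container.xpath('./div//span/text()')

# Pitfall: without the leading dot, // ignores the context node
everywhere = container.xpath('//span/text()')

print([a.text for a in absolute])
print(relative)
print(everywhere)
```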

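2.5.2.3. Node selectors

More complex examples:

div[@class='container']              # div whose class attribute is exactly 'container'
a[contains(@href, 'example')]        # links whose href contains "example"
li[position()>1 and position()<5]    # the 2nd through 4th li elements
input[@type='text' or @type='email'] # text or email input fields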
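The predicates above can be tried out directly with lxml. The following sketch runs each one against a small invented fragment; the element contents and attribute values are assumptions chosen only so that every predicate has something to match and something to reject.

```python
from lxml import etree

# Invented fragment covering all four predicates
html_doc = """
<div class="container">
  <a href="https://example.com/a">in</a>
  <a href="https://other.org/b">out</a>
  <ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ul>
  <input type="text" name="user"/>
  <input type="email" name="mail"/>
  <input type="password" name="pw"/>
</div>
"""

root = etree.HTML(html_doc)

# Exact match on the class attribute
containers = root.xpath("//div[@class='container']")

# Substring match on href
example_links = root.xpath("//a[contains(@href, 'example')]/text()")

# position() is 1-based, so this keeps the 2nd, 3rd, and 4th <li>
middle_items = root.xpath("//li[position()>1 and position()<5]/text()")

# 'or' combines two attribute tests
text_inputs = root.xpath("//input[@type='text' or @type='email']/@name")

print(len(containers), example_links, middle_items, text_inputs)
```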

For more detail, I suggest consulting this reference: www.w3cschool.cn/xpath/xpath…