Parsing HTML with Python and BeautifulSoup - Getting Text Inside and Outside Tags

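1. Problem

When parsing HTML, you sometimes need the text inside a tag (for example, a link's label) separately from the loose text that sits next to it, outside any child tag. Asking the enclosing element for its text returns everything beneath it as one string, which is not what we want here. We need a way to get the text inside and outside the tags separately.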
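To make the distinction concrete, here is a minimal sketch on a shortened, made-up table cell (the markup and the variable names are assumptions for illustration only). It compares the text of the whole cell with the text inside the link and with the text nodes that are direct children of the cell.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical cell used only for illustration
html = '<td><a class="playerLink" href="#">Garcia, Leury</a> SS CHW - Traded</td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.td

# Everything under <td>, inside and outside <a>, concatenated
all_text = td.get_text()

# Text inside the <a> tag only
inside = td.a.get_text()

# Text nodes that are direct children of <td>, i.e. outside any child tag
outside = [s.strip() for s in td.find_all(string=True, recursive=False) if s.strip()]

print(all_text)  # Garcia, Leury SS CHW - Traded
print(inside)    # Garcia, Leury
print(outside)   # ['SS CHW - Traded']
```

Passing recursive=False limits the search to direct children, which is what keeps the link's own text out of the "outside" list.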

2. Solution

Method 1: Using BeautifulSoup's get_text() method

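BeautifulSoup provides the get_text() method, which returns all the text beneath the object it is called on, in document order. Called on the whole document, it returns both the text inside the tags and the text outside them, joined into a single string with the original whitespace preserved.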

Code example:

from bs4 import BeautifulSoup

html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# Get all text content in the document
text = soup.get_text()

# Print the text
print(text)

Output:

Garcia, Leury

SS CHW - Traded from Royal Disappointments


Almonte, Abraham

OF SEA - Traded from Royal Disappointments


Pillar, Kevin

OF TOR - Traded from Royal Disappointments


Sierra, Moises

LF TOR - Traded from Royal Disappointments


Paulino, Felipe

SP KC


- Traded from Royal Disappointments
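The blank lines in this output come from the whitespace between tags. As a hedged sketch on a shortened copy of the markup (the "#" hrefs are placeholders), the separator and strip arguments of get_text(), or the stripped_strings generator, return the same fragments without that whitespace.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table cell; the "#" hrefs are placeholders
html = """
<td>
<a class="playerLink" href="#">Garcia, Leury</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="#">Almonte, Abraham</a>
OF SEA - Traded from Royal Disappointments
<br>
</td>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the fragments that remain
joined = soup.get_text(separator=" | ", strip=True)
print(joined)

# stripped_strings yields the same trimmed fragments one at a time
pieces = list(soup.stripped_strings)
print(pieces)
```

Here the fragments alternate between link text and trailing text, but that pairing relies on the markup being regular, so it is easy to break.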

Method 2: Using BeautifulSoup's Tag.next attribute

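A BeautifulSoup Tag object has a next attribute (an older alias of next_element) that returns whatever was parsed immediately after it. For an <a> tag with text, that is the string inside the link, and that string's own next is the text following the closing </a>. Looping over the links and following next twice therefore yields the text outside each tag.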
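As a small illustration of this traversal, the sketch below follows next_element step by step on a one-line, made-up cell and compares the result with next_sibling, which skips the link's children and goes straight to the node after it.

```python
from bs4 import BeautifulSoup

# One-line, hypothetical cell used only for illustration
html = '<td><a class="playerLink" href="#">Garcia, Leury</a> SS CHW<br></td>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="playerLink")

first = link.next_element      # the string inside <a>
second = first.next_element    # the string right after </a>
sibling = link.next_sibling    # the node after </a>, reached in one step

print(str(first))           # Garcia, Leury
print(str(second).strip())  # SS CHW
print(second == sibling)    # True
```

If a link contains another tag instead of plain text, the first hop lands on that tag, so the two-hop shortcut only holds for links whose sole child is a string.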

Code example:

from bs4 import BeautifulSoup

html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

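# Find every player link
links = soup.find_all("a", {"class": "playerLink"})

# For each link, print the text inside it and the text that follows it
for link in links:
    name = link.text.strip()
    if not name:
        # The icon-only link nested inside <span> has no text; skip it
        continue
    print(name)
    # .next is an older alias of .next_element: the first hop lands on the
    # string inside <a>, the second on the string right after </a>
    print(link.next_element.next_element.strip())

Output:

Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC

Note that for the last player the text after the <span> ("- Traded from Royal Disappointments") is not reached by this two-hop lookup, because the <span> sits between "SP KC" and that text.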
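When other tags sit between a link and part of its trailing text, as with the <span> in the last entry, hopping a fixed number of elements is fragile. One possible alternative, sketched here on a shortened copy of the markup, is to walk next_siblings and collect bare text nodes until the next <br>. The helper name text_after, the "#" hrefs and the icon.gif path are assumptions made for this example.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened copy of the table cell; hrefs and image path are placeholders
html = """
<td>
<a class="playerLink" href="#">Sierra, Moises</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="#">Paulino, Felipe</a>
SP KC
<span title="note"><a class="playerLink" href="#"><img src="icon.gif"/></a></span>
- Traded from Royal Disappointments
<br>
</td>
"""
soup = BeautifulSoup(html, "html.parser")


def text_after(link):
    """Join the bare text nodes that follow `link` up to the next <br>."""
    parts = []
    for node in link.next_siblings:
        if isinstance(node, Tag) and node.name == "br":
            break  # end of this player's row
        if isinstance(node, NavigableString) and node.strip():
            parts.append(node.strip())
    return " ".join(parts)


rows = []
for link in soup.find_all("a", class_="playerLink"):
    name = link.get_text(strip=True)
    if name:  # skip the icon-only link nested in <span>
        rows.append((name, text_after(link)))

for name, details in rows:
    print(f"{name}: {details}")
```

Because only direct siblings of the link are inspected, text nested inside the <span> is ignored while the text on either side of it is kept.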