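When parsing HTML, you sometimes need the text inside a tag and the text that sits outside it (for example, right after a link) as separate values. Reading the text of the enclosing element directly returns everything it contains as one block, which is not what we want. We therefore need a way to retrieve the text inside and outside the tags separately.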
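2. Solution
Method 1: Use BeautifulSoup's get_text() method
BeautifulSoup provides the get_text() method, which returns all the text in an HTML document or element as a single string, covering both the text inside child tags and the text between them.
Code example: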
from bs4 import BeautifulSoup
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
# Get all text content
text = soup.get_text()
# Print the text
print(text)
Output (blank lines omitted):
Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC
- Traded from Royal Disappointments
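The raw string returned by get_text() also carries the blank lines and line breaks from the markup. The following is a minimal sketch, assuming a shortened version of the table cell above (the "#" hrefs are placeholders), of two standard BeautifulSoup ways to tidy that up: the separator and strip arguments of get_text(), and the stripped_strings generator.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table cell used above
html = """
<td align="left">
<a class="playerLink" href="#">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="#">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the fragments that remain
clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text)

# stripped_strings yields the same trimmed fragments one at a time
fragments = list(soup.stripped_strings)
print(fragments)
```

Either form still mixes link text and trailing text in one sequence, so it cleans up the output but does not tell you which fragment came from inside a tag.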
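Method 2: Use BeautifulSoup's Tag.next attribute
A BeautifulSoup Tag object has a next attribute (an older name for next_element) that returns the node parsed immediately after it. For a link, the first next is the link's own text, and calling next again on that text gives the text node that follows the closing </a>. By looping over the link tags and following next twice, we can read the text outside each tag.
Code example: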
from bs4 import BeautifulSoup
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
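# Find all player link tags
links = soup.find_all("a", {"class": "playerLink"})
# For each link, print its own text and the text node that follows it
for link in links:
    name = link.get_text(strip=True)
    # Skip the icon-only link inside the <span>, which has no text
    if not name:
        continue
    print(name)
    print(link.next.next.strip())
Output:
Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC
The trailing "- Traded from Royal Disappointments" after the <span> is not printed, because it does not directly follow a link that has text.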
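Because next follows parse order rather than document structure, it returns whatever node happens to come next, which may be a tag instead of text. A possibly sturdier sketch, assuming the same markup shape (shortened here, with placeholder "#" URLs), reads next_sibling to get the node directly after each closing </a> and pairs it with the link's own text. The helper name split_link_text is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical version of the table cell used above
html = """
<td align="left">
<a class="playerLink" href="#">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="#">
Paulino, Felipe
</a>
SP KC
<span title="note">
<a class="playerLink" href="#">
<img src="#"/>
</a>
</span>
- Traded from Royal Disappointments
</td>
"""


def split_link_text(soup):
    """Return (text inside link, text right after link) pairs."""
    pairs = []
    for link in soup.find_all("a", class_="playerLink"):
        inside = link.get_text(strip=True)
        sibling = link.next_sibling
        # Accept only a plain text node; a tag or None means no trailing text
        outside = sibling.strip() if isinstance(sibling, NavigableString) else ""
        # Links that contain only an icon have no text and are skipped
        if inside:
            pairs.append((inside, outside))
    return pairs


pairs = split_link_text(BeautifulSoup(html, "html.parser"))
for name, detail in pairs:
    print(name, "|", detail)
```

Keeping the inside and outside text in a tuple makes the separation explicit, so later code does not have to rely on the two values alternating line by line.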