Searching for a Specific Word with BeautifulSoup in Python

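The goal is to extract the subtitle ssid from a Crunchyroll page, that is, the number at the end of a particular link. For example, given the following link: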

http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757

The value wanted is "154757", but the current Python script does not get that far:

import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup

feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')

  1. Solution

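The original script imports urllib2 and re, but neither is needed, and it stops after building the soup without ever searching for the link. The code can be simplified: feedparser, requests, and BeautifulSoup are enough to do the job. The revised code below searches each page for the subtitle links that carry the number: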

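import feedparser
import requests
from bs4 import BeautifulSoup

# Parse the Crunchyroll anime RSS feed
d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for entry in d.entries:
    # Fetch the episode page for this feed entry
    r = requests.get(entry.link)
    # Name the parser explicitly so bs4 does not guess one and emit a warning
    soup = BeautifulSoup(r.text, 'html.parser')

    # Find the elements that hold the subtitle information
    subtitles = soup.find_all('span', {'class': 'showmedia-subtitle-text'})

    # Print the href of every subtitle link; the ssid is its query parameter
    for span in subtitles:
        for a in span.find_all('a'):
            print(a['href'])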

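This code parses the Crunchyroll RSS feed and, for each entry, does the following:

  1. Fetches the HTML of the linked page with the requests library.
  2. Parses that HTML with BeautifulSoup.
  3. Finds the elements that hold the subtitle information.
  4. Prints the href of each subtitle link, which carries the ssid as a query parameter.

The output looks similar to this: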

/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981

You can process these results further as needed, for example by extracting the ssid from each link and storing the values in a list.
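
As one way to do that last step, here is a minimal sketch that pulls the ssid out of each href with the standard library's urllib.parse, assuming Python 3. The helper name extract_ssid and the hrefs list are illustrative and not part of the original script; the sample links are copied from the article.

```python
from urllib.parse import urlsplit, parse_qs


def extract_ssid(href):
    """Return the ssid query parameter of href as a string, or None if absent."""
    query = urlsplit(href).query          # works for relative and absolute links
    values = parse_qs(query).get("ssid")  # parse_qs maps each key to a list of values
    return values[0] if values else None


# Sample links taken from the article (hypothetical stand-in for the scraped hrefs)
hrefs = [
    "/etotama/episode-11-catrat-shuffle-678659?ssid=166007",
    "/etotama/episode-11-catrat-shuffle-678659?ssid=165969",
    "http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/"
    "episode-1-wriggling-memories-678035?ssid=154757",
]

# Keep only the links that actually carry an ssid
ssids = [s for s in (extract_ssid(h) for h in hrefs) if s is not None]
print(ssids)
```

In the scraping loop, calling extract_ssid(a['href']) in place of the print would collect the numbers directly. Parsing the query string rather than slicing the URL keeps the result correct if other parameters are ever added to the link.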