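We ran into a problem: our Python code returns an empty list when extracting job titles from indeed.com. Our goal is to collect job titles that contain "digital marketing" but do not contain qualifiers such as "intern", "sales", "agency", "talent", or "consulting". Here is the code we tried: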
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://au.indeed.com/jobs?q=digital+marketing+-intern+-sales+-agency+-talent+-consulting&l=&limit=20&ts=1546381706970&rq=1&fromage=last"

# Request the URL stated above
page = requests.get(URL)

# Parse the response with the HTML parser, so Python can work with
# individual elements rather than one long string
soup = BeautifulSoup(page.text, "html.parser")

# Print the soup in a more readable format
print(soup.prettify())

# Extract the basic data elements
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row result"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

print(extract_job_title_from_result(soup))

Our output is: []
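As a side note on the query itself: the search URL encodes each exclusion as a `-term` token inside the `q` parameter. A minimal sketch of assembling such a query string with the standard library follows. The helper name `build_search_url` is hypothetical, and only the `q`, `l`, and `limit` parameters from the original URL are reproduced; the remaining parameters (`ts`, `rq`, `fromage`) are left out here because their meaning is not described in this article.

```python
from urllib.parse import urlencode


def build_search_url(phrase, excluded, limit=20, base="https://au.indeed.com/jobs"):
    """Hypothetical helper: build a search URL with '-term' exclusions."""
    # Prefix every excluded word with '-' and join everything with spaces;
    # urlencode turns the spaces into '+' signs.
    terms = [phrase] + ["-" + word for word in excluded]
    params = {"q": " ".join(terms), "l": "", "limit": limit}
    return base + "?" + urlencode(params)


url = build_search_url(
    "digital marketing",
    ["intern", "sales", "agency", "talent", "consulting"],
)
print(url)
```

Building the URL this way keeps the list of excluded words in one place, which makes it easier to adjust than editing the query string by hand.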
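2. Solution:
After investigating, we found that the problem lies in soup.find_all(name="div", attrs={"class":"row result"}), which looks for div tags with class="row result". However, indeed.com recently updated its site structure, and the job titles now sit inside div tags with class="jobsearch-SerpJobCard". Because no div matches the old class, the outer loop never runs and jobs stays empty.
We therefore changed the code to: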
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "jobsearch-SerpJobCard"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs
With this change, the job titles can be extracted from indeed.com successfully.
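When a selector that used to work suddenly returns nothing, one way to investigate is to count which class names actually appear on the page's div tags. The sketch below is one possible approach, not necessarily how the change was found here. It uses only the standard library's `html.parser` so that it runs without any third-party package; the names `DivClassCounter` and `count_div_classes` are hypothetical, and `sample_html` is made-up markup for illustration, not real page content.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Counts how often each class name appears on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute may be missing or have no value.
        class_value = dict(attrs).get("class") or ""
        for name in class_value.split():
            self.counts[name] += 1


def count_div_classes(html):
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up markup, for illustration only
sample_html = """
<div class="jobsearch-SerpJobCard"><a data-tn-element="jobTitle" title="Digital Marketing Manager">Digital Marketing Manager</a></div>
<div class="jobsearch-SerpJobCard"><a data-tn-element="jobTitle" title="Marketing Manager">Marketing Manager</a></div>
<div class="footer"></div>
"""

print(count_div_classes(sample_html).most_common(5))
```

On a real page, passing page.text to count_div_classes and looking at the most frequent names is a quick way to spot the repeating container that holds each result.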
Complete code example:
# Import dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://au.indeed.com/jobs?q=digital+marketing+-intern+-sales+-agency+-talent+-consulting&l=&limit=20&ts=1546381706970&rq=1&fromage=last"

# Request the URL stated above
page = requests.get(URL)

# Parse the response with the HTML parser, so Python can work with
# individual elements rather than one long string
soup = BeautifulSoup(page.text, "html.parser")

# Print the soup in a more readable format
print(soup.prettify())

# Extract the basic data elements
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "jobsearch-SerpJobCard"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

print(extract_job_title_from_result(soup))
Output:
['Customer Services Manager', 'Digital Marketing', 'Marketing Manager', 'Digital Marketing', 'Digital Marketing Strategist', 'Marketing Manager', 'Digital Marketing Manager', 'Marketing Manager', 'Marketing Manager', 'Account Executive']
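Note that this output still includes titles such as 'Customer Services Manager' and 'Account Executive', which do not contain "digital marketing". This suggests the search query is not matched against the title alone. A minimal sketch of filtering the extracted titles locally follows; `filter_titles` is a hypothetical helper, and it uses simple case-insensitive substring matching, which is an assumption about the desired behavior (for instance, "intern" would also exclude a title containing "International").

```python
def filter_titles(titles, required, excluded):
    """Keep titles that contain `required` and none of the `excluded` words."""
    required = required.lower()
    excluded = [word.lower() for word in excluded]
    kept = []
    for title in titles:
        lowered = title.lower()
        if required in lowered and not any(word in lowered for word in excluded):
            kept.append(title)
    return kept


# The titles from the output shown above
titles = [
    "Customer Services Manager", "Digital Marketing", "Marketing Manager",
    "Digital Marketing", "Digital Marketing Strategist", "Marketing Manager",
    "Digital Marketing Manager", "Marketing Manager", "Marketing Manager",
    "Account Executive",
]

filtered = filter_titles(
    titles,
    "digital marketing",
    ["intern", "sales", "agency", "talent", "consulting"],
)
print(filtered)
# ['Digital Marketing', 'Digital Marketing', 'Digital Marketing Strategist', 'Digital Marketing Manager']
```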
We hope this article is helpful.