Python code fails to extract job titles from indeed.com


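We ran into a problem: our Python code returned an empty list when extracting job titles from indeed.com. The goal was to collect job titles that contain "digital marketing" but none of the qualifiers "intern", "sales", "agency", "talent", or "consulting". This is the code we tried: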

# Import dependencies
import requests
import bs4
from bs4 import BeautifulSoup

import pandas as pd
import time

URL = "https://au.indeed.com/jobs?q=digital+marketing+-intern+-sales+-agency+-talent+-consulting&l=&limit=20&ts=1546381706970&rq=1&fromage=last"

# Request the URL stated above
page = requests.get(URL)

# Parse the response with the HTML parser so that Python can work with
# individual elements rather than one long string
soup = BeautifulSoup(page.text, "html.parser")

# Print the soup in a more readable format
print(soup.prettify())

# Extract the basic data elements
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row result"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

extract_job_title_from_result(soup)

The output is []

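Solution: After investigating, we found that the problem lies in soup.find_all(name="div", attrs={"class":"row result"}), which looks for div tags with class="row result". However, indeed.com had recently updated its page structure, and the job titles now sit inside div tags with class="jobsearch-SerpJobCard".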
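To see which selector matches, it can help to run both against a small piece of markup. The sketch below uses a hand-written HTML fragment (an assumption for illustration, not actual Indeed markup) and a hypothetical helper named count_cards.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; real Indeed markup may differ.
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard unifiedRow row result">
  <a data-tn-element="jobTitle" title="Digital Marketing Manager" href="#">
    Digital Marketing Manager
  </a>
</div>
"""

def count_cards(soup, class_value):
    """Count <div> tags matched by the given class value (hypothetical helper)."""
    return len(soup.find_all(name="div", attrs={"class": class_value}))

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A value containing a space is compared against the whole class attribute.
old_count = count_cards(sample_soup, "row result")

# A single class name matches any tag that carries that class.
new_count = count_cards(sample_soup, "jobsearch-SerpJobCard")

print(old_count, new_count)
```

In Beautiful Soup, a class value that contains a space matches only when the tag's class attribute is exactly that string, so a selector like "row result" is fragile whenever the site adds or reorders classes. A single class name is usually the safer choice.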

We therefore changed the code to:

def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "jobsearch-SerpJobCard"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

With this change, we could extract the job titles from indeed.com.
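Because the site's markup has already changed once, a slightly more defensive variant may be worth considering. The following is a minimal sketch, not the original author's code: the function name extract_job_titles and its card_class parameter are hypothetical, and the HTML fragment is hand-written for illustration.

```python
from bs4 import BeautifulSoup

def extract_job_titles(soup, card_class="jobsearch-SerpJobCard"):
    """Return job titles from result cards.

    card_class is a parameter so it can be updated when the site changes.
    """
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": card_class}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            # Fall back to the link text if the title attribute is missing,
            # instead of raising KeyError as a["title"] would.
            title = a.get("title") or a.get_text(strip=True)
            if title:
                jobs.append(title)
    return jobs

# Hand-written fragment for illustration only; real Indeed markup may differ.
demo_soup = BeautifulSoup(
    '<div class="jobsearch-SerpJobCard">'
    '<a data-tn-element="jobTitle" title="Digital Marketing">link</a>'
    '<a data-tn-element="jobTitle">Marketing Manager</a>'
    "</div>",
    "html.parser",
)

demo_titles = extract_job_titles(demo_soup)
print(demo_titles)
```

Passing the class name as an argument keeps the selector in one place, so a future markup change needs only a one-line update at the call site.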

Complete code example:

# Import dependencies
import requests
import bs4
from bs4 import BeautifulSoup

import pandas as pd
import time

URL = "https://au.indeed.com/jobs?q=digital+marketing+-intern+-sales+-agency+-talent+-consulting&l=&limit=20&ts=1546381706970&rq=1&fromage=last"

# Request the URL stated above
page = requests.get(URL)

# Parse the response with the HTML parser so that Python can work with
# individual elements rather than one long string
soup = BeautifulSoup(page.text, "html.parser")

# Print the soup in a more readable format
print(soup.prettify())

# Extract the basic data elements
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "jobsearch-SerpJobCard"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

print(extract_job_title_from_result(soup))

Output:

['Customer Services Manager', 'Digital Marketing', 'Marketing Manager', 'Digital Marketing', 'Digital Marketing Strategist', 'Marketing Manager', 'Digital Marketing Manager', 'Marketing Manager', 'Marketing Manager', 'Account Executive']
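The list above still contains titles such as 'Customer Services Manager' and 'Account Executive' that do not include "digital marketing", so the search query alone does not enforce the stated goal. One option is to filter the titles locally after scraping. This is a minimal sketch that assumes case-insensitive substring matching is acceptable; the names REQUIRED, EXCLUDED, and filter_titles are hypothetical.

```python
REQUIRED = "digital marketing"
EXCLUDED = ("intern", "sales", "agency", "talent", "consulting")

def filter_titles(titles, required=REQUIRED, excluded=EXCLUDED):
    """Keep titles that contain `required` and none of the `excluded` words."""
    kept = []
    for title in titles:
        lowered = title.lower()
        if required in lowered and not any(word in lowered for word in excluded):
            kept.append(title)
    return kept

# The titles printed by the script above.
scraped_titles = [
    "Customer Services Manager", "Digital Marketing", "Marketing Manager",
    "Digital Marketing", "Digital Marketing Strategist", "Marketing Manager",
    "Digital Marketing Manager", "Marketing Manager", "Marketing Manager",
    "Account Executive",
]

filtered_titles = filter_titles(scraped_titles)
print(filtered_titles)
```

Substring matching is deliberately simple: "intern" would also exclude a title containing "International", so a word-boundary check may be preferable for stricter filtering.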

We hope this article is helpful.