How to Build an Amazon Price Tracker with Python, Selenium, and Twilio

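Checking regularly whether a product's price has dropped is tedious. This tutorial walks you through building an automated price tracker in Python.

We will use the Selenium and Beautiful Soup libraries to retrieve the prices for a given product search, then compare them against the price you are hoping the item will drop to.

If any price falls within your budget, Twilio sends you an SMS saying so. The message also includes the number of products within your price range.

We also need a job that runs the script every day. Heroku Scheduler takes care of this, and the tutorial shows how to set up a scheduled job with it.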

Prerequisites

To follow along with this tutorial, you will need:

[Python] installed.
A free [Twilio] account.
A basic understanding of web scraping.

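Introduction to Selenium and Beautiful Soup

Selenium is a tool for browser automation, automated testing, web scraping, and interacting with web pages. It lets a program drive a browser and read the pages it loads. We will use Selenium together with Beautiful Soup to do the scraping.

Beautiful Soup is a Python package for parsing HTML and XML documents. We will use it to scrape a list of shoes on Amazon. The Selenium web driver uses a real web browser to visit a site, so the activity looks like an ordinary user browsing rather than a bot. This helps because some sites restrict scraping by unauthenticated clients.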

Searching for a product on Amazon

The first step is to install Selenium, Webdriver Manager, and Beautiful Soup with the command below.

$ pip install selenium beautifulsoup4 webdriver-manager

Next, create a Python file for your code and paste in the imports below.

from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

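If you search for a product on Amazon, you will notice that the search term is always embedded in the page URL. We can use this to write a function that searches for air force 1. The search URL for air force 1 looks like this in the browser: https://www.amazon.com/s?k=air+force+1&ref=nb_sb_noss_2

Let's write a function that generates the URL for a search.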

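def get_url(search_text):
    """Generate a url from search text"""
    # Amazon separates the words of a search term with "+"
    search_text = search_text.replace(" ", "+")
    url = f"https://www.amazon.com/s?k={search_text}&ref=nb_sb_noss_1"
    # add page query for pagination; "{}" is filled in later with url.format(page)
    url += "&page={}"
    return url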
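Search terms typed by hand may contain characters such as & that are not valid inside a query string. As a minimal sketch, assuming a hypothetical helper named build_search_url (it is not part of the original script), the standard library's urllib.parse.quote_plus can encode the term before it goes into the URL:

```python
from urllib.parse import quote_plus

AMAZON_SEARCH = "https://www.amazon.com/s"  # search endpoint seen in the URL above


def build_search_url(search_text, page=1):
    """Return an Amazon search URL for a term and page (hypothetical helper)."""
    # quote_plus turns spaces into "+" and escapes reserved characters such as "&"
    return f"{AMAZON_SEARCH}?k={quote_plus(search_text)}&page={page}"


print(build_search_url("air force 1", 2))
```

Unlike get_url(), this variant takes the page number directly instead of leaving a "{}" placeholder to fill in later; either approach works.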

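Extracting prices from the search results

To extract the prices, right-click on the Amazon search results page, click Inspect, and open the Elements tab. Look for an HTML tag that is unique to each item; this makes extraction easier.

When you click a tag in the Elements tab, the browser highlights the part of the page that the tag covers. We have to dig deeper to find the tags and attributes that uniquely identify a price. Of the available fields, the data-component-type attribute with the value s-search-result looks like a good way to identify each item.

If you dig further into that element, you will see a <span> tag that displays the price. We will use it in our code.

To extract the price, add the following code to your project.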

def extract_record(single_item):
    """Extract and return data from a single item in the search"""
    # because some products don't have prices you have to
    # use try-except block to catch AttributeError
    try:
        # Get product prices from page HTML
        price_parent = single_item.find("span", "a-price")
        price = price_parent.find("span", "a-offscreen").text
    except AttributeError:
        return None
    return price
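To see how this lookup behaves when a result has no price, it can help to run it against a small piece of HTML. The sketch below uses a hand-written sample that only imitates the structure described above; real Amazon markup is much larger and may change. The function name price_of is hypothetical, and it uses explicit None checks in place of the try-except block.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the structure described above (an assumption,
# not a copy of a real Amazon page).
SAMPLE_HTML = """
<div data-component-type="s-search-result">
  <span class="a-price"><span class="a-offscreen">$89.99</span></span>
</div>
<div data-component-type="s-search-result">
  <span class="a-text">Currently unavailable</span>
</div>
"""


def price_of(item):
    """Return the price text of one search result, or None if it has no price."""
    price_parent = item.find("span", "a-price")
    if price_parent is None:
        return None
    offscreen = price_parent.find("span", "a-offscreen")
    return offscreen.text if offscreen is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
items = soup.find_all("div", {"data-component-type": "s-search-result"})
prices = [price_of(item) for item in items]
print(prices)
```

The second sample item has no a-price span, so it yields None, which is the case the caller has to filter out.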

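Retrieving all prices in the search results

To extract all the prices, we go back to the HTML and retrieve every element whose data-component-type attribute has the value "s-search-result".

Next, paste in the code below.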

def main(search_term, max_price):
    """This function will accept the search term and the maximum price you are expecting the product to be"""
    # startup the webdriver
    options = Options()
    options.headless = True  # run Chrome without opening a browser window
    # replace this path with the location of chromedriver on your machine
    driver = webdriver.Chrome('/home/muhammed/Desktop/dev/blog-repo/twilioXseleniumXpython/chromedriver', options=options)
    prices_list = []  # this will hold the list of prices
    url = get_url(search_term)  # takes the search term to get_url() function above.

    # For loop to get each item in the first 5 pages of the search
    for page in range(1, 6):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, "html.parser")  # retrieve and parse HTML text.
        results = soup.find_all("div", {"data-component-type": "s-search-result"})  # get every search result element
        for item in results:
            record = extract_record(item)  # takes each item to extract_record() function above to get the prices
            if record:
                prices_list.append(record)

The code above goes through the first 5 pages of search results and collects the prices.

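Retrieving prices within the specified budget

To retrieve the prices that are less than or equal to your budget, you need a loop. The code above collects the prices, but they contain characters such as $ and , which make them impossible to compare as numbers.

To strip these symbols, add the following code to the main() function.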

    # goes through `prices_list` and eliminates the symbols
    new_prices = [s.replace("$", "").replace(",", "") for s in prices_list]
    new = []  # this will contain the new list of prices ready for comparison.
    # converts the values from string to float so that the comparison can be done
    prices_float = [float(i) for i in new_prices]
    # For loop to handle the comparison
    for i in prices_float:
        if i <= max_price:
            new.append(i)
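The cleaning and comparison steps can also be pulled out into small functions that are easy to try on their own. This is a sketch under the assumption that every price is a string such as "$1,299.00"; the names parse_price and prices_within_budget are hypothetical and do not appear in the original script. Strings that cannot be converted are skipped rather than crashing the run.

```python
def parse_price(text):
    """Convert a price string such as "$1,299.00" to a float, or None if it cannot be parsed."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def prices_within_budget(raw_prices, max_price):
    """Return the parsed prices that are less than or equal to max_price."""
    parsed = (parse_price(p) for p in raw_prices)
    return [p for p in parsed if p is not None and p <= max_price]


sample = ["$89.99", "$1,299.00", "N/A", "$300.00"]
print(prices_within_budget(sample, 300))
```

With the sample list, only the first and last prices fall within a budget of 300; "N/A" is ignored.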

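Adding Twilio Programmable SMS

Accessing your Twilio credentials

Your Account SID and Auth Token let you connect to the Twilio API.

Note: your Account SID and Auth Token must always be kept secret!

Handling alerts

In this section we need Twilio's Python package, which lets you send and receive SMS messages through Twilio's Programmable SMS API.

Run the command below to install the Twilio package locally.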

$ pip install twilio

Next, import the Twilio client with the code below.

from twilio.rest import Client

Next, go to the Twilio Console page and get your Twilio phone number.

Then paste the code below into the main() function.

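    # account_sid and auth_token are assumed to hold your Account SID and
    # Auth Token, loaded from somewhere outside the source code
    client = Client(account_sid, auth_token)
    # To send SMS to mobile phone
    client.messages.create(
        to="+2348888888",  # Your phone number, don't forget to add the country code. This should be hidden
        from_="+14688888",  # Your Twilio phone number. This should be hidden
        body=f"There are {len(new)} air force 1s within budget, ${max_price}",  # message that will be sent to your mobile phone
    )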
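One common way to keep the Account SID and Auth Token out of the source code is to read them from environment variables. The sketch below assumes the variable names TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN and a hypothetical helper load_twilio_credentials; neither is taken from the original script.

```python
import os


def load_twilio_credentials(env=os.environ):
    """Return (account_sid, auth_token) from the environment, or raise a clear error."""
    try:
        # Variable names are an assumption; use whatever names you configured
        return env["TWILIO_ACCOUNT_SID"], env["TWILIO_AUTH_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Possible use inside main(), assuming both variables are set:
# account_sid, auth_token = load_twilio_credentials()
# client = Client(account_sid, auth_token)
```

On Heroku, variables like these can be set as config vars (for example with heroku config:set) so they never appear in the repository.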

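Testing

To test the application, decide on the item you want to search for and your maximum price, and the program does the rest. Add the call main("air force 1", 300) at the end of your code, with your own values as arguments, and run it.

An SMS will be sent to your phone.

Note that I built this project with a free Twilio phone number, which is why the message contains the text Sent from Twilio trial account.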

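Adding a scheduled job

At the moment we have to keep going to the terminal to run the script, which gets annoying. In this section I will show you how to create a scheduled job on Heroku so that Heroku runs your script automatically.

To do this, follow the steps below.

If you do not have a Heroku account yet, create one. Then install the Heroku CLI on your local machine.

Next, create a requirements.txt file. It lists all the dependencies needed to run the project. Paste the following lines into requirements.txt.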

selenium==4.1.0
beautifulsoup4==4.10.0
webdriver-manager==3.5.2
twilio==7.3.2
bs4==0.0.1

Then create another file, runtime.txt, where you specify your Python version, as shown below.

python-3.8.10

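Go to the Heroku dashboard and click the New button to create an app for the project. Remember to give it a name. Next, log in to the Heroku CLI with $ heroku login, then initialize a git repository for your project.

Link the project to the Heroku remote repository. You can do both with the following commands.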

$ git init
$ heroku git:remote -a <name-of-your-heroku-app>

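Go to the Heroku dashboard, click your app, then click the "Add buildpack" button. Next, press the Python button to add it to the buildpacks. You will also need to add buildpacks for Chromedriver and Headless Google Chrome.

Click the Add buildpack button, paste in the links for the following buildpacks, and save the changes.

[Headless Google Chrome]
[Chromedriver]

Now we can publish the app by committing the code to the repository and deploying it to Heroku with Git.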

$ git add .
$ git commit -m "initial commit"
$ git push heroku master

If the push succeeds, you will see the build output in your terminal.

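In your terminal, run $ heroku run bash. This command opens a shell in the project we just deployed to Heroku. To check that you are on the right track, you can go ahead and run the script there.

To add the scheduler, open the Resources tab on the Heroku dashboard and press the Find more add-ons button.

Then press Install Heroku Scheduler and follow the prompts to complete the process.

Once you have added the Scheduler, click it, then click the Create job button. You will be taken to a form where you fill in when you want the code to run and the actual command that should be executed.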
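The command entered in the scheduler form is easier to adjust if the search term and budget come from the command line instead of being hard-coded. A minimal sketch with argparse follows; the file name tracker.py and the helper parse_args are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the search term and maximum price from the command line."""
    parser = argparse.ArgumentParser(description="Amazon price tracker")
    parser.add_argument("search_term", help="product to search for")
    parser.add_argument("max_price", type=float, help="highest acceptable price in dollars")
    return parser.parse_args(argv)


# At the bottom of the script (main() is the function defined earlier):
# if __name__ == "__main__":
#     args = parse_args()
#     main(args.search_term, args.max_price)
```

With this in place, and assuming the script is saved as tracker.py, the scheduled command could be python tracker.py "air force 1" 300.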

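Conclusion

In this tutorial you learned how to scrape product prices from Amazon and how to use Twilio Programmable SMS. We also built a program that alerts you when prices drop: it counts the products within your price range and sends an alert to your phone every day.

I hope you can apply what you learned here to your future projects.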
