Source: towardsdatascience.com/scheduled-w… "Scheduled Web Scraping with Django and Heroku"
Translator: 子竹聆风
Build a Django app that scrapes a job board daily
We often need a lot of training data for machine learning, and web scraping can be a way to acquire it.
In the past, there was a company I really wanted to work for. They didn't have a data science posting at the time, but as soon as they did, I wanted to apply.
The solution? Scrape their job board daily so I was notified anytime a new job was posted.
Let's build a simple Django app, deploy it to Heroku, and scrape a job board daily.
Setup the app
Create the directory for the app and cd into it.
mkdir jobs && cd jobs
Open this in whatever code editor you prefer. I'm using Sublime.
Create and start our virtual environment. Then install the packages we’ll need.
python -m venv env
source env/bin/activate
pip3 install django psycopg2 django-heroku bs4 gunicorn
Create the project (Django's version of a web application).
django-admin startproject jobs
cd into the project and create an app for scraping.
cd jobs
django-admin startapp scraping
Create a Job model
We'll only need to define one model in this app, a Job model. This represents jobs that we'll collect.
Overwrite /scraping/models.py with the following.
from django.db import models
from django.utils import timezone


class Job(models.Model):
    url = models.CharField(max_length=250, unique=True)
    title = models.CharField(max_length=250)
    location = models.CharField(max_length=250)
    created_date = models.DateTimeField(default=timezone.now)

    def __str__(self):
        return self.title

    class Meta:
        ordering = ['title']

    class Admin:
        pass
Register your model in /scraping/admin.py. This allows us to view records in Django’s default admin panel (we’ll get to this shortly).
from django.contrib import admin
from scraping.models import Job

admin.site.register(Job)
Add scraping to installed apps in /jobs/settings.py.
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'scraping',
]
Setup the database
Set up the database. I love Postgres, so we'll use that.
At this point, ensure you've previously installed Postgres on your Mac with brew and started it (this is beyond the scope of this article).
Create a database for this project on the command line. You can open the postgres console with the below command.
psql -d template1
Create a user and a database, then exit.
create user django_user;
create database django_jobs owner django_user;
\q
In /jobs/settings.py, update DATABASES. In other frameworks you’d want to scope these specifically for the development environment, but here we won’t worry about it. It will work on Heroku anyway (we’ll get there shortly).
Note these are the user and database names we created above.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'django_jobs',
        'USER': 'django_user',
        'HOST': '',
        'PORT': '',
    }
}
Create migrations, and migrate the database from the command line.
python manage.py makemigrations
python manage.py migrate
This will create a table called scraping_job. This is a django namespace convention because it belongs to the scraping app.
Now create a superuser and give it a password in your command line.
python manage.py createsuperuser --email admin@example.com --username admin
Test the current app
We’ve done some work so far but we have no idea if anything works. Let’s test it out before going further.
On the command line.
python manage.py runserver
Then navigate to http://127.0.0.1:8000/admin in your browser. Login with the superuser you just created.
After logging in, click “jobs” under “scraping”, then “add job” in the top right. Now fill in some made up information and click “save”. If you can see the job you created, everything works up until this point!
Custom django-admin commands
We’re going to setup a custom django-admin command that scrapes a job board. This is what we’ll automatically schedule at the infrastructure level in order to automate scraping.
Inside the /scraping module, create a directory called /management, and a directory inside /management called /commands. Then create two Python files, _private.py and scrape.py, in /commands.
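The structure above can be created from the shell. The __init__.py files below are an assumption not mentioned in the original steps; they mark the directories as Python packages, which Django's command discovery traditionally requires:

```shell
# run from the project root (the directory containing manage.py)
mkdir -p scraping/management/commands
# package markers so Django can discover the command (assumption: empty files suffice)
touch scraping/management/__init__.py
touch scraping/management/commands/__init__.py
# the two files described above
touch scraping/management/commands/_private.py
touch scraping/management/commands/scrape.py
```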
Drop this code into scrape.py.
from django.core.management.base import BaseCommand
from django.db.utils import IntegrityError
from urllib.request import urlopen

from bs4 import BeautifulSoup

from scraping.models import Job


class Command(BaseCommand):
    help = "collect jobs"

    # define logic of command
    def handle(self, *args, **options):
        # collect html
        html = urlopen('https://jobs.lever.co/opencare')
        # convert to soup
        soup = BeautifulSoup(html, 'html.parser')
        # grab all postings
        postings = soup.find_all("div", class_="posting")
        for p in postings:
            url = p.find('a', class_='posting-btn-submit')['href']
            title = p.find('h5').text
            location = p.find('span', class_='sort-by-location').text
            # save in db; the unique url column rejects duplicates
            try:
                Job.objects.create(url=url, title=title, location=location)
                print('%s added' % (title,))
            except IntegrityError:
                print('%s already exists' % (title,))
        self.stdout.write('job complete')
Putting the code in a directory structure like this, then defining a Command class with a handle function, tells Django this is a custom django-admin command.
Here we're using Beautiful Soup to scrape a Lever job board, then save the postings in our database. I have no affiliation, but the job board I've chosen belongs to a fantastic company called Opencare. You should apply if they post something!
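To see what those selectors match, here is a self-contained sketch run against a hardcoded HTML fragment. The fragment is a made-up approximation of a posting's structure, inferred from the selectors above, not Lever's actual markup:

```python
from bs4 import BeautifulSoup

# made-up HTML shaped like what the selectors in scrape.py expect (assumption)
html = """
<div class="posting">
  <h5>Data Scientist</h5>
  <span class="sort-by-location">Toronto, Ontario</span>
  <a class="posting-btn-submit" href="https://jobs.lever.co/opencare/123">Apply</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all("div", class_="posting"):
    url = p.find('a', class_='posting-btn-submit')['href']      # posting link
    title = p.find('h5').text                                   # job title
    location = p.find('span', class_='sort-by-location').text   # job location
    print(title, location, url)
```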
You can now run this from your command line like below.
python manage.py scrape
And you'll see a line of output for each posting added.
Run it again and you'll see that each posting already exists.
That’s because we’ve prevented adding duplicate job records in the scrape.py code above.
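The duplicate protection comes from unique=True on the url field: inserting the same url a second time fails at the database level. Here's a minimal sketch of the same mechanism using stdlib sqlite3 (an illustration, not the app's actual code):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# UNIQUE on url mirrors the unique=True constraint on the Job model
conn.execute('CREATE TABLE job (url TEXT UNIQUE, title TEXT)')

def save_job(url, title):
    # mirrors the try/except in scrape.py: the database rejects repeats
    try:
        conn.execute('INSERT INTO job VALUES (?, ?)', (url, title))
        return '%s added' % title
    except sqlite3.IntegrityError:
        return '%s already exists' % title

print(save_job('https://jobs.lever.co/opencare/123', 'Data Scientist'))
# → Data Scientist added
print(save_job('https://jobs.lever.co/opencare/123', 'Data Scientist'))
# → Data Scientist already exists
```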
If you have a database administration program set up (like DBeaver), you can also inspect rows in your database. We won't get into that here, but you should see one row per scraped posting.
Deploying to production
Now let’s get this onto Heroku.
Freeze your requirements so Heroku knows what to install on deploy.
pip3 freeze > requirements.txt
On the command line, run nano .gitignore and add below.
.DS_Store
jobs/__pycache__
scraping/__pycache__
Then ctrl+x, y, enter to save and close (on Mac). This prevents deploying unnecessary files.
Create a file named Procfile in your root, and paste the below inside. This tells Heroku to boot up a web dyno, and the second command migrates the db.
web: gunicorn jobs.wsgi
release: python manage.py migrate
The file tree will look like this (showing only the files we've touched):

jobs/
    jobs/
        settings.py
        urls.py
        wsgi.py
    scraping/
        management/
            commands/
                _private.py
                scrape.py
        admin.py
        models.py
    .gitignore
    manage.py
    Procfile
    requirements.txt
Make sure you have a Heroku account, then log in with heroku login on the command line.
Create an app with any name you want, but it needs to be unique across all the apps on Heroku. I ran heroku create django-scraper-879, where the app name is django-scraper-879. You'll need to pick your own.
Now add these lines to the very bottom of settings.py. django_heroku takes care of some settings configuration (like static files / WhiteNoise) for you.
import django_heroku
django_heroku.settings(locals())
Update DEBUG in settings. We don't want to deploy to production in debug mode.
DEBUG = False
Add the files to git as below.
git init
git add -A
git commit -m 'first commit'
Now push our app to Heroku with this.
git push heroku master
Scheduling the job
You can manually run the job from your local command line with heroku run python manage.py scrape but it would be annoying to have to manually run it every day.
Let’s automate this.
Log into your Heroku console and click on “Resources,” then “find more add-ons”.
Now find and click on the Heroku Scheduler add-on. Try a ctrl+f for "schedule" to help locate it; Heroku has a ton of potential add-ons.
Now add it to your app.
Click on it and create a job that runs every day at 12am UTC. It's not nice to ping websites more than necessary!
Input your command, python manage.py scrape, and save it.
Save and we’re done!
Now just wait until 12am UTC (or whatever time you picked) and your database will populate.
Conclusion
We touched on a lot of things here: Django, Heroku, scheduling, web scraping, and Postgres.
While I used the example of wanting to know when a company posted a new job, there are lots of reasons you might want to scrape a website.
- E-commerce companies want to monitor competitor prices.
- Executive recruiters may want to see when a company posts a job opening.
- You or I may want to see when a new competition is added on Kaggle.
This tutorial was just to illustrate what’s possible. Let me know if you’ve built anything cool with web scraping in the comments.