怎么用Python写爬虫抓取网页数据机器学习首先面临的一个问题就是准备数据，数据的来源大概有这么几种：公司积累数据，购买

机器学习首先面临的一个问题就是准备数据，数据的来源大概有这么几种：公司积累数据，购买，交换，政府机构及企业公开的数据，通过爬虫从网上抓取。本篇介绍怎么写一个爬虫从网上抓取公开的数据。

很多语言都可以写爬虫，但是不同语言的难易程度不同，Python作为一种解释型的胶水语言，上手简单、入门容易，标准库齐全，还有丰富的各种开源库，语言本身提供了很多提高开发效率的语法糖，开发效率高，总之“人生苦短，快用Python”(Life is short, you need Python!)。在Web网站开发，科学计算，数据挖掘/分析，人工智能等很多领域广泛使用。

开发环境配置，Python3.5.2，Scrapy1.2.1，使用pip安装scrapy，命令：pip3 install Scrapy，此命令在Mac下会自动安装Scrapy的依赖包，安装过程中如果出现网络超时，多试几次。

创建工程

首先创建一个Scrapy工程，工程名为：kiwi，命令：scrapy startproject kiwi，将创建一些文件夹和文件模板。

定义数据结构

settings.py是一些设置信息，items.py用来保存解析出来的数据，在此文件里定义一些数据结构，示例代码：

[url=]

[/url]

-*- coding: utf-8 -*-

Define here the models for your scraped items

See documentation in:

http://www.smpeizi.com/en/latest/topics/items.html

import

scrapy

class

AuthorInfo(scrapy.Item):

authorName = scrapy.Field()

作者昵称

authorUrl = scrapy.Field()

作者Url

class

ReplyItem(scrapy.Item):

content = scrapy.Field()

回复内容

time = scrapy.Field()

发布时间

author = scrapy.Field()

回复人(AuthorInfo)

class

TopicItem(scrapy.Item):

title = scrapy.Field()

帖子标题

url = scrapy.Field()

帖子页面Url

content = scrapy.Field()

帖子内容

time = scrapy.Field()

发布时间

author = scrapy.Field()

发帖人(AuthorInfo)

reply = scrapy.Field()

回复列表(ReplyItem list)

replyCount = scrapy.Field()

回复条数

[url=]

[/url]

上面TopicItem中嵌套了AuthorInfo和ReplyItem list，但是初始化类型必须是scrapy.Field()，注意这三个类都需要从scrapy.Item继续。

创建爬虫蜘蛛

工程目录spiders下的kiwi_spider.py文件是爬虫蜘蛛代码，爬虫代码写在这个文件里。示例以爬豆瓣群组里的帖子和回复为例。

[url=]

[/url]

-*- coding: utf-8 -*-

from

scrapy.selector

import

Selector

from

scrapy.spiders

import

CrawlSpider, Rule

from

scrapy.linkextractors

import

LinkExtractor

from

kiwi.items

import

TopicItem, AuthorInfo, ReplyItem

class

KiwiSpider(CrawlSpider):

name =

kiwi

allowed_domains = [

douban.com

]

anchorTitleXPath =

a/text()

anchorHrefXPath =

a/@href

start_urls = [

https://www.pzzs168.com/group/topic/90895393/?start=0

]

rules = (

Rule(

LinkExtractor(allow=(r

/group/[^/]+/discussion\?start=\d+

,)),

callback=

parse_topic_list

follow=True

Rule(

LinkExtractor(allow=(r

/group/topic/\d+/$

,)),

帖子内容页面

callback=

parse_topic_content

follow=True

Rule(

LinkExtractor(allow=(r

/group/topic/\d+/\?start=\d+

,)),

帖子内容页面

callback=

parse_topic_content

follow=True

)

帖子详情页面

def

parse_topic_content(self, response):

标题XPath

titleXPath =

//html/head/title/text()

帖子内容XPath

contentXPath =

//div[@class="topic-content"]/p/text()

发帖时间XPath

timeXPath =

//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()

发帖人XPath

authorXPath =

//div[@class="topic-doc"]/h3/span[@class="from"]

item = TopicItem()

当前页面Url

item[

url

] = response.url

标题

titleFragment = Selector(response).xpath(titleXPath)

item[

title

] = str(titleFragment.extract()[0]).strip()

帖子内容

contentFragment = Selector(response).xpath(contentXPath)

strs = [line.extract().strip()

for

line

contentFragment]

item[

content

] =

.join(strs)

发帖时间

timeFragment = Selector(response).xpath(timeXPath)

timeFragment:

item[

time

] = timeFragment[0].extract()

发帖人信息

authorInfo = AuthorInfo()

authorFragment = Selector(response).xpath(authorXPath)

authorFragment:

authorInfo[

authorName

] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]

authorInfo[

authorUrl

] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

item[

author

] = dict(authorInfo)

回复列表XPath

replyRootXPath = r

//div[@class="reply-doc content"]

回复时间XPath

replyTimeXPath = r

div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()

回复人XPath

replyAuthorXPath = r

div[@class="bg-img-green"]/h4

replies = []

itemsFragment = Selector(response).xpath(replyRootXPath)

for

replyItemXPath

itemsFragment:

replyItem = ReplyItem()

回复内容

contents = replyItemXPath.xpath(

p/text()

)

strs = [line.extract().strip()

for

line

contents]

replyItem[

content

] =

.join(strs)

回复时间

timeFragment = replyItemXPath.xpath(replyTimeXPath)

timeFragment:

replyItem[

time

] = timeFragment[0].extract()

回复人

replyAuthorInfo = AuthorInfo()

authorFragment = replyItemXPath.xpath(replyAuthorXPath)

authorFragment:

replyAuthorInfo[

authorName

] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]

replyAuthorInfo[

authorUrl

] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

replyItem[

author

] = dict(replyAuthorInfo)

添加进回复列表

replies.append(dict(replyItem))

100

101

item[

] = replies

102

yield

item

103

104

帖子列表页面

105

def

parse_topic_list(self, response):

106

帖子列表XPath(跳过表头行)

107

topicRootXPath = r

//table[@class="olt"]/tr[position()>1]

108

单条帖子条目XPath

109

titleXPath = r

td[@class="title"]

110

发帖人XPath

111

authorXPath = r

td[2]

112

回复条数XPath

113

replyCountXPath = r

td[3]/text()

114

发帖时间XPath

115

timeXPath = r

td[@class="time"]/text()

116

117

topicsPath = Selector(response).xpath(topicRootXPath)

118

for

topicItemPath

topicsPath:

119

item = TopicItem()

120

titlePath = topicItemPath.xpath(titleXPath)

121

item[

title

] = titlePath.xpath(self.anchorTitleXPath).extract()[0]

122

item[

url

] = titlePath.xpath(self.anchorHrefXPath).extract()[0]

123

发帖时间

124

timePath = topicItemPath.xpath(timeXPath)

125

timePath:

126

item[

time

] = timePath[0].extract()

127

发帖人

128

authorPath = topicItemPath.xpath(authorXPath)

129

authInfo = AuthorInfo()

130

authInfo[

authorName

] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]

131

authInfo[

authorUrl

] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]

132

item[

author

] = dict(authInfo)

133

回复条数

134

replyCountPath = topicItemPath.xpath(replyCountXPath)

135

item[

replyCount

] = replyCountPath[0].extract()

136

137

item[

content

] =

138

yield

item

139

140

parse_start_url = parse_topic_content

[url=]

[/url]

特别注意

1、KiwiSpider需要改成从CrawlSpider类继承，模板生成的代码是从Spider继承的，那样的话不会去爬rules里的页面。

2、parse_start_url = parse_topic_list 是定义入口函数，从CrawlSpider类的代码里可以看到parse函数回调的是parse_start_url函数，子类可以重写这个函数，也可以像上面代码那样给它赋值一个新函数。

3、start_urls里是入口网址，可以添加多个网址。

4、rules里定义在抓取到的网页中哪些网址需要进去爬，规则和对应的回调函数，规则用正则表达式写。上面的示例代码，定义了继续抓取帖子详情首页及分页。

5、注意代码里用dict()包装的部分，items.py文件里定义数据结构的时候，author属性实际需要的是AuthorInfo类型，赋值的时候必须用dict包装起来，item['author'] = authInfo 赋值会报Error。

6、提取内容的时候利用XPath取出需要的内容，有关XPath的资料参看：XPath教程。开发过程中可以利用浏览器提供的工具查看XPath，比如Firefox 浏览器中的FireBug、FirePath插件，对于https://www.aiidol.com/group/python/discussion?start=0这个页面，XPath规则“//td[@class="title"]”可以获取到帖子标题列表，示例：

上图红框中可以输入XPath规则，方便测试XPath的规则是否符合要求。新版Firefox可以安装 Try XPath 这个插件查看XPath，Chrome浏览器可以安装 XPath Helper 插件。

使用随机UserAgent

为了让网站看来更像是正常的浏览器访问，可以写一个Middleware提供随机的User-Agent，在工程根目录下添加文件useragentmiddleware.py，示例代码：

[url=]

[/url]

-*-coding:utf-8-*-

import

random

from

scrapy.downloadermiddlewares.useragent

import

UserAgentMiddleware

class

RotateUserAgentMiddleware(UserAgentMiddleware):

def

__init__

(self, user_agent=

self.user_agent = user_agent

def

process_request(self, request, spider):

ua = random.choice(self.user_agent_list)

ua:

request.headers.setdefault(

User-Agent

, ua)

for more user agent strings,you can find it in http://www.idiancai.com/pages/useragentstring.php

user_agent_list = [ \

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1

Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11

, \

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6

, \

Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6

, \

Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1

, \

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5

, \

Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5

, \

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3

, \

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3

, \

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3

, \

Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3

, \

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24

, \

Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24

]

[url=]

[/url]

修改settings.py，添加下面的设置，

DOWNLOADER_MIDDLEWARES = {

kiwi.useragentmiddleware.RotateUserAgentMiddleware

: 1,}

同时禁用cookie，COOKIES_ENABLED = False。

运行爬虫

切换到工程根目录，输入命令：scrapy crawl kiwi，console窗口可以看到打印出来的数据，或者使用命令“scrapy crawl kiwi -o result.json -t json”将结果保存到文件里。

上述代码中FundEquity类的属性值使用getter/setter函数方式定义的，这种方式可以对值进行检查。__str__(self)函数类似其它语言里的toString()。

在命令行运行fund_spider.py代码，console窗口会输出净值数据。

小结

从以上的示例代码中可见少量代码就能把豆瓣网上小组中的帖子和回复数据抓取、内容解析、存储下来，可见Python语言的简洁、高效。

例子的代码比较简单，唯一比较花时间的是调 XPath规则，借助于浏览器辅助插件工具能大大提高效率。

例子中没有提及Pipeline(管道)、Middleware(中间件) 这些复杂东西。没有考虑爬虫请求太频繁导致站方封禁IP(可以通过不断更换HTTP Proxy 方式破解)，没有考虑需要登录才能抓取数据的情况(代码模拟用户登录破解)。

实际项目中提取内容的XPath规则、正则表达式这类易变动的部分不应该硬编码写在代码里，网页抓取、内容解析、解析结果的存储等应该使用分布式架构的方式独立运行。总之实际生产环境中运行的爬虫系统需要考虑的问题很多，github上也有一些开源的网络爬虫系统，可以参考。

更多java技术资讯可关注:itheimaGZ获取