What problem a web crawler solves
By scripting the retrieval of the web page data we want and organizing it into a structured file, we can cut down the time spent on repetitive work and improve efficiency.
Libraries to prepare
We implement this with Python 3.7, using the BeautifulSoup, lxml, and requests libraries. On a Mac you can install them from the terminal with the following commands.
Install beautifulsoup4 (the actual package name on PyPI; the bs4 package is just a stub that pulls it in)
pip3 install beautifulsoup4
Install lxml
pip3 install lxml
Install requests
pip3 install requests
Web crawling
Here we crawl the Animal Crossing encyclopedia pages' HTML to extract the text data needed for development and to download the images.
Scraping the text data
We scrape the Animal Crossing fish encyclopedia data and save it as a JSON file.
First we locate the relevant tag in the HTML, then write the Python code to extract it.
As in the screenshot above, right-click the page and choose "View Page Source" to inspect the HTML.
In the page source shown above (line 896 in the screenshot), you can see that the data we want is contained in this table tag:
<table id="CardSelectTr" class="CardSelect wikitable sortable poke-zt" style="width:100%;text-align: center;margin-top:5px;color:#000">
Alternatively, we can open the browser's developer tools and inspect the element to locate the table that holds the data.
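Since the table in the snippet above also carries an id (CardSelectTr), looking it up by id is a slightly more robust alternative to matching the full class string. A minimal sketch against a trimmed-down, made-up copy of the markup:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the wiki page; the real table has many more rows.
html = '''
<table id="CardSelectTr" class="CardSelect wikitable sortable poke-zt">
  <tr><th>name</th></tr>
  <tr><td>Sea bass</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# Matching by id stays correct even if the class list on the tag changes.
table = soup.find('table', id='CardSelectTr')
print(len(table.find_all('tr')))  # 2
```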
Python code
from bs4 import BeautifulSoup
import json
import requests
import copy
import os

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "close",
    "Cookie": "_gauges_unique_hour=1; _gauges_unique_day=1; _gauges_unique_month=1; _gauges_unique_year=1; _gauges_unique=1",
    "Referer": "https://wiki.biligame.com/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"
}

url = 'https://wiki.biligame.com/dongsen/%E8%99%AB%E5%9B%BE%E9%89%B4'

def fish(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Find the encyclopedia table by its class attribute
    table = soup.find('table', attrs={'class': 'CardSelect wikitable sortable poke-zt'})
    results = table.find_all('tr')
    print('Number of results', len(results))
    # Initialize the row template
    rows = []
    row = {'title': '', 'place': '', 'feature': '', 'northMonth': '', 'southMonth': '', 'appearDay': '', 'price': '', 'imgName': ''}
    # Iterate over every table row
    for result in results:
        # Collect all the td cells in this row
        data = result.find_all('td')
        # Skip rows with no td cells (e.g. the header row)
        if len(data) == 0:
            continue
        # Fill in the corresponding fields
        row['title'] = data[0].getText()
        row['place'] = data[1].getText()
        row['feature'] = data[2].getText()
        row['northMonth'] = data[3].getText()
        row['southMonth'] = data[4].getText()
        row['appearDay'] = data[5].getText()
        row['price'] = data[6].getText()
        imgurl = data[0].find('img').get('src')
        filename = os.path.basename(imgurl)
        row['imgName'] = filename
        # Append the row, then copy it so the next iteration fills a fresh dict
        rows.append(row)
        row = copy.copy(row)
    # Serialize to JSON; ensure_ascii=False keeps the Chinese text readable
    jsonList = json.dumps(rows, ensure_ascii=False)
    print(jsonList)
    # Write the JSON file into the current directory
    with open("fish.json", "w", encoding='utf-8') as f:
        f.write(jsonList)
        print("Finished writing the file...")

fish(url)
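The fish.json produced above is a plain JSON array, so it can be loaded back with the standard json module for verification. A minimal round-trip sketch with a made-up record (dummy values, only a subset of the real fields):

```python
import json

# A made-up record in the same shape the scraper writes (dummy values).
rows = [{'title': '鲈鱼', 'price': '400', 'imgName': 'a.png'}]

# Same serialization as the script: ensure_ascii=False keeps Chinese readable.
text = json.dumps(rows, ensure_ascii=False)

loaded = json.loads(text)
print(loaded[0]['title'], loaded[0]['price'])  # 鲈鱼 400
```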
Downloading the images
Find where the images live under the table tag, then perform the download.
<img alt="シャンハイガニ.png" src="https://patchwiki.biligame.com/images/dongsen/thumb/9/92/lknkm6txgzxvpzijekvcba0g0ofcpag.png/80px-%E3%82%B7%E3%83%A3%E3%83%B3%E3%83%8F%E3%82%A4%E3%82%AC%E3%83%8B.png" decoding="async" width="80" height="80" class="img-kk" srcset="https://patchwiki.biligame.com/images/dongsen/thumb/9/92/lknkm6txgzxvpzijekvcba0g0ofcpag.png/120px-%E3%82%B7%E3%83%A3%E3%83%B3%E3%83%8F%E3%82%A4%E3%82%AC%E3%83%8B.png 1.5x, https://patchwiki.biligame.com/images/dongsen/9/92/lknkm6txgzxvpzijekvcba0g0ofcpag.png 2x" data-file-width="128" data-file-height="128" />
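The img tag above also carries a srcset attribute with 1.5x and 2x candidates; if you want the full-resolution file rather than the 80px thumbnail in src, you can pick the last (highest-density) entry. A small sketch, using a trimmed-down, made-up copy of the attribute value:

```python
# srcset is a comma-separated list of "URL descriptor" pairs; in the tag
# above, the 2x entry points at the original (non-thumbnail) file.
srcset = ('https://example.com/thumb/120px-a.png 1.5x, '
          'https://example.com/a.png 2x')

candidates = [part.strip().split() for part in srcset.split(',')]
best_url = candidates[-1][0]  # last entry = highest density in this list
print(best_url)  # https://example.com/a.png
```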
Python code
from bs4 import BeautifulSoup
import requests
import os
import shutil

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "close",
    "Cookie": "_gauges_unique_hour=1; _gauges_unique_day=1; _gauges_unique_month=1; _gauges_unique_year=1; _gauges_unique=1",
    "Referer": "https://wiki.biligame.com/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"
}

url = 'https://wiki.biligame.com/dongsen/%E5%8D%9A%E7%89%A9%E9%A6%86%E5%9B%BE%E9%89%B4'

# Download an image.
# Here we use the requests library's special stream parameter. When stream
# is set to True, the request downloads only the HTTP response headers and
# keeps the connection open; the response body is not downloaded until
# Response.content (or the raw stream) is accessed.
def download_jpg(image_url, image_localpath):
    response = requests.get(image_url, stream=True)
    if response.status_code == 200:
        with open(image_localpath, 'wb') as f:
            # Let urllib3 decode the transfer encoding (gzip/deflate) for us
            response.raw.decode_content = True
            shutil.copyfileobj(response.raw, f)

# Download the fish images
def downloadImg(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    for pic_href in soup.find_all('div', class_='floatnone'):
        for pic in pic_href.find_all('img'):
            imgurl = pic.get('src')
            dir = os.path.abspath('.')
            filename = os.path.basename(imgurl)
            imgpath = os.path.join(dir, filename)
            print('Downloading %s' % imgurl)
            download_jpg(imgurl, imgpath)

downloadImg(url)
Running the code above downloads the images into the current directory.
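The download_jpg above pipes response.raw to disk via shutil.copyfileobj; the same chunked-copy pattern can also be written out explicitly, which makes the mechanics clearer. A sketch, run here against an in-memory stand-in rather than a live response:

```python
import io

def save_stream(stream, path, chunk_size=8192):
    """Write a binary stream to disk in fixed-size chunks -- the same
    pattern copyfileobj (and requests' iter_content loop) uses."""
    with open(path, 'wb') as f:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)

# Demonstrate with an in-memory stand-in for response.raw.
fake = io.BytesIO(b'\x89PNG fake image bytes')
save_stream(fake, 'demo.png')
print(open('demo.png', 'rb').read())
```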
Problems you may run into
Installed from the terminal but the module still isn't found?
You can fix this from PyCharm's package manager as follows:
1. File -> Other Settings -> Preferences for New Projects
2. Click the + button at the bottom
3. Type the library you want into the search box, click Install Package in the lower left, and wait for the installation to finish
Note: given limited time and my own level, there are bound to be shortcomings; pointers and corrections are welcome.