This tutorial scrapes every rom listed on the Winkawaks official site ( www.winkawaks.org ), using lxml.etree + requests to read the pages and download the files, and pandas to assemble the data relation table.
Environment Versions
pandas==1.1.5
requests==2.26.0
lxml==4.6.3
Data Source Analysis
1. NeoGeo Roms (283 entries)
2. CPS2 Roms (250 entries)
3. CPS1 Roms (170 entries)
Rom categories on the Winkawaks site | Image source: Winkawaks website screenshot
The three Rom categories share the same link structure and page format, so a single piece of code can parse them all. Their links are:
# NeoGeo
https://www.winkawaks.org/roms/neogeo/index.htm
https://www.winkawaks.org/roms/neogeo/[package-name]-download.htm
# CPS1
https://www.winkawaks.org/roms/cps1/index.htm
https://www.winkawaks.org/roms/cps1/[package-name]-download.htm
# CPS2
https://www.winkawaks.org/roms/cps2/index.htm
https://www.winkawaks.org/roms/cps2/[package-name]-download.htm
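Since only the system name and package name vary between the three categories, the URLs can be generated from two strings. A minimal sketch (the helper names `index_url` and `download_page_url` are my own, not from the original code):

```python
BASE = 'https://www.winkawaks.org/roms'

def index_url(system):
    # Index page listing all roms for a system (neogeo / cps1 / cps2)
    return '{}/{}/index.htm'.format(BASE, system)

def download_page_url(system, package):
    # Per-rom download page, derived from the package name
    return '{}/{}/{}-download.htm'.format(BASE, system, package)

print(index_url('cps2'))
print(download_page_url('neogeo', '1944j'))
```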
Page layout of the NeoGeo Roms list | Image source: Winkawaks website screenshot
The download link contains a dynamically generated hash, so it cannot be constructed in advance; fetch it live with requests:
https://dl.winkawaks.org/roms/1944j/ad7cf87d4c480dd7764ccff0b3d32c71caf29e5d/1944j.zip
Download link for 2020 Super Baseball (set 2) | Image source: Winkawaks website screenshot
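The extraction of that hashed link can be exercised offline against a small HTML fragment mimicking the download page's structure (the fragment below is an assumption for illustration, not the site's real markup):

```python
from lxml import etree

# Minimal stand-in for a download page; the real page's markup may differ.
sample = '''
<html><body>
  <div id="rom-url">
    <div>Download:</div>
    <div><a href="//dl.winkawaks.org/roms/1944j/abc123/1944j.zip">1944j.zip</a></div>
  </div>
</body></html>
'''
doc = etree.HTML(sample)
# Same XPath the scraper uses: second <div> under #rom-url holds the link
download_href = 'https:' + doc.xpath('//*[@id="rom-url"]/div[2]/a/@href')[0]
print(download_href)
```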
Anti-Scraping Handling
Copy the User-Agent from your browser and attach it to the download requests, to avoid a "403 Forbidden" response:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.62'}
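Attaching this header can be checked without any network traffic by building a urllib request object and reading the header back (a small sketch; the URL is just a placeholder):

```python
import urllib.request

# Browser User-Agent copied from DevTools; the exact string is only an example.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/98.0.4758.102 Safari/537.36'}

# Build the request with the header attached (no network call is made here).
req = urllib.request.Request('https://dl.winkawaks.org/roms/example.zip',
                             headers=headers)
print(req.get_header('User-agent'))
```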
Full Code
import urllib.request

import requests
from lxml import etree


def download_file(folder, source_url):
    content = requests.get(source_url).text
    html = etree.HTML(content)
    tree_href = html.xpath('//*[@id="rom-system-index"]/div')
    pre_url = source_url.replace('index.htm', '')
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36 Edg/98.0.1108.62'}
    href_list = []
    for i in range(1, len(tree_href) - 1):
        href = pre_url + tree_href[i].xpath('a[2]/@href')[0]
        href = href.replace('.htm', '-download.htm')
        sub_content = etree.HTML(requests.get(href).text)
        download_href = 'https:' + sub_content.xpath('//*[@id="rom-url"]/div[2]/a/@href')[0]
        # File name is the last segment of the download URL
        filename = download_href.split('/')[-1]
        # Local save path
        save_path = folder + '/' + filename
        print('Downloading file {} ... {}'.format(i, download_href))
        # Fetch the file with the browser User-Agent and write it to disk
        hurl = urllib.request.Request(download_href, headers=headers)
        file_data = urllib.request.urlopen(hurl).read()
        with open(save_path, 'wb') as f:
            f.write(file_data)
        href_list.append(download_href)
    print('Downloads finished')
    return href_list
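The URL rewriting inside the loop can be sketched in isolation, using the 1944j package from the screenshot above as sample input:

```python
# Index URL with 'index.htm' stripped off, as in download_file
pre_url = 'https://www.winkawaks.org/roms/neogeo/index.htm'.replace('index.htm', '')

# An entry href from the index page becomes a download-page URL
href = pre_url + '1944j.htm'
download_page = href.replace('.htm', '-download.htm')
print(download_page)

# Once the real hashed link is resolved, the filename is its last segment
download_href = ('https://dl.winkawaks.org/roms/1944j/'
                 'ad7cf87d4c480dd7764ccff0b3d32c71caf29e5d/1944j.zip')
filename = download_href.split('/')[-1]
print(filename)
```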
Formatting the Key Rom Data
Split the table into three columns: title, url (the rom's page link), and filename (the local file name).
The Roms data table | Image source: CodeItEasy
Full Code
import requests
from lxml import etree
import pandas as pd


def getURL(source_url):
    content = requests.get(source_url).text
    html = etree.HTML(content)
    tree_href = html.xpath('//*[@id="rom-system-index"]/div')
    pre_url = source_url.replace('index.htm', '')
    href_list = []
    title_list = []
    filename_list = []
    for i in range(1, len(tree_href) - 1):
        href = pre_url + tree_href[i].xpath('a[2]/@href')[0]
        filename = href.split('/')[-1].replace('.htm', '.zip')
        title = tree_href[i].xpath('a[2]/text()')[0]
        href_list.append(href)
        filename_list.append(filename)
        title_list.append(title)
    games_data = dict()
    games_data['title'] = title_list
    games_data['url'] = href_list
    games_data['filename'] = filename_list
    df_games = pd.DataFrame(games_data)
    # Name the workbook after the system (neogeo / cps1 / cps2)
    df_games.to_excel(pre_url.split('/')[-2] + '.xlsx')
    print('Data saved')
    return href_list
Console output | Image source: CodeItEasy
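Assembling the three lists into the DataFrame can be tested in isolation with toy data (the row below is made up for illustration):

```python
import pandas as pd

# Toy rows standing in for scraped title / url / filename values
games_data = {
    'title': ['1944 - The Loop Master (Japan)'],
    'url': ['https://www.winkawaks.org/roms/cps2/1944j.htm'],
    'filename': ['1944j.zip'],
}
df_games = pd.DataFrame(games_data)
print(list(df_games.columns))
print(df_games.shape)
```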
Original work by the CodeItEasy WeChat official account (lsir_34567); editor: 原虫子
CSDN author: 刘先生的u写倒了 ( blog.csdn.net/weixin_4379… )
WeChat Official Accounts author: CodeItEasy
Juejin author: 刘先生的u写倒了