Python Crawler (7): Recognizing Image CAPTCHAs with a Cloud CAPTCHA-Solving Platform

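Website login pages usually use a CAPTCHA to tell whether the visitor is a machine, so that automated repeated logins do not consume server resources.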

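CAPTCHAs fall roughly into the following categories:

1: Image CAPTCHA (digits and letters)

2: Drag-the-slider CAPTCHA

3: Click-the-text CAPTCHA

4: Click CAPTCHA

5: Draw-a-pattern (grid) CAPTCHA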

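The first four types are fairly common; I have not yet run into the fifth.

For image CAPTCHAs, I use a cloud CAPTCHA-solving platform here.

A quick Baidu search shows that tutorials on recognizing this kind of image CAPTCHA (made up of digits and letters) usually use the Yundama platform: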

www.yundama.com/demo.html

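I wanted to test with this platform too, but the site is currently unreachable.

A colleague recommended a platform he had used before: Chaojiying (超级鹰) www.chaojiying.com/

This platform works well enough, but it is not free.

For testing, buying the 1-yuan trial package is enough.

Usage is straightforward. Here is the example from the official documentation: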

#!/usr/bin/env python
# coding:utf-8

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: picture ID of the CAPTCHA whose result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

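# Usage
# Replace 96001 with a software ID generated under User Center >> Software ID.
chaojiying = Chaojiying_Client('xxxx', 'xxxx', '96001')
# Replace a.jpg with the path to your local image file (on Windows the path may need //).
with open('a.jpg', 'rb') as f:
    im = f.read()
# The second argument is the CAPTCHA type (the official sample uses 1902); see the pricing page on the official site.
print(chaojiying.PostPic(im, 8001))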

Return value:

{'err_no': 0, 'err_str': 'OK', 'pic_id': '8170409478072900001', 'pic_str': '7261', 
'md5': 'f01b7182f1afec940fd0559970c9dd8d'}
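
The returned dictionary can be checked before the recognized text is used. A minimal sketch, assuming only the fields visible in the sample above (`err_no`, `err_str`, `pic_id`, `pic_str`) and assuming that `err_no == 0` means success; the helper name `extract_code` is hypothetical and not part of the Chaojiying client:

```python
def extract_code(result):
    """Return (pic_id, pic_str) from a PostPic response, or raise on failure.

    Assumption: err_no == 0 means success, as in the sample response.
    """
    if result.get('err_no') != 0:
        raise RuntimeError('recognition failed: %s' % result.get('err_str'))
    return result['pic_id'], result['pic_str']


# The sample response shown in the article
sample = {'err_no': 0, 'err_str': 'OK', 'pic_id': '8170409478072900001',
          'pic_str': '7261', 'md5': 'f01b7182f1afec940fd0559970c9dd8d'}
pic_id, code = extract_code(sample)
print(code)  # prints 7261
```

Keeping `pic_id` around is useful because `ReportError(pic_id)` takes it as its argument if the target site later rejects the recognized code.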

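CAPTCHA image:

In real use, we usually save the CAPTCHA as an image like the one shown above and pass it to Chaojiying for recognition.

Here I recognize the CAPTCHA of the Gushiwen (classical Chinese poetry) website.

Gushiwen website: www.gushiwen.cn/

Login page URL: so.gushiwen.cn/user/login.…

CAPTCHA URL: so.gushiwen.cn/RandCode.as…

Fetching the CAPTCHA image is simple; see "Python Crawler (3): Scraping Image Data".

I won't belabor it; here is the code: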

# 1: specify the URL
url = "https://so.gushiwen.cn/RandCode.ashx"
# 2: send the GET request
response = requests.get(url=url)
# 3: get the response data; .content returns the raw bytes
content = response.content
# 4: save the image to disk
with open('./yanzhengma.jpg','wb') as fe:
    fe.write(content)
    print('Download complete')

# Usage
# Replace 96001 with a software ID generated under User Center >> Software ID.
chaojiying = Chaojiying_Client('xxxxx', 'xxxx', '96001')
# Read the CAPTCHA image saved above.
with open('yanzhengma.jpg', 'rb') as f:
    im = f.read()
# The second argument is the CAPTCHA type; see the pricing page on the official site.
print(chaojiying.PostPic(im, 8001))

This way the CAPTCHA on the login page can be fetched and recognized.
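
One caveat: a CAPTCHA is normally bound to the visitor's session, so the image should probably be downloaded with the same `requests.Session` that later submits the login form; otherwise the server may compare the answer against a different code. A minimal sketch under that assumption. The function `fetch_and_solve` and its `solver` parameter are hypothetical names; `solver` stands for any callable that maps image bytes to text, for example a thin wrapper around `PostPic`.

```python
CAPTCHA_URL = "https://so.gushiwen.cn/RandCode.ashx"


def fetch_and_solve(session, solver, url=CAPTCHA_URL):
    """Download the CAPTCHA through `session` and return the recognized text.

    `session` is expected to behave like requests.Session: it keeps the
    cookies set by the CAPTCHA request, so a later session.post(...) to the
    login form is sent with the same cookies.
    """
    response = session.get(url, timeout=10)
    response.raise_for_status()        # stop early on HTTP errors
    return solver(response.content)   # image bytes -> recognized text


# Usage (needs network access and a Chaojiying account):
#   import requests
#   session = requests.Session()
#   client = Chaojiying_Client('xxxx', 'xxxx', '96001')
#   code = fetch_and_solve(session,
#                          lambda im: client.PostPic(im, 8001)['pic_str'])
#   # then submit the login form with the same session
```

Passing the solver in as a callable keeps the download step independent of any particular recognition service, so it can be swapped or stubbed out while testing.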

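Later I found that Baidu Cloud offers an OCR text-recognition service, and it is free, which is great. I have not looked into it in depth, though; a screenshot is attached below.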

That is roughly how to use a cloud CAPTCHA-solving platform to recognize CAPTCHAs.

If you have suggestions, please leave a comment below.