Crawler HTML storage: detect the encoding, compress and store only the useful parts


My articles are essentially my personal notes; if anything is wrong, please point it out, much appreciated.

My email: silenceandsharp@163.com

This article is based on Python 3.

First, detecting the byte encoding:

   pip install cchardet
   Note: this library is no longer actively maintained, and it occasionally misbehaves!
    
import cchardet

# detect() takes bytes and returns a dict with 'encoding' and 'confidence'
x = cchardet.detect('我是菜鸡'.encode(encoding='GBK'))
print(x)
print(cchardet.detect('为反对三法司'.encode('gbk')))

Output:

{'encoding': None, 'confidence': None}
{'encoding': 'GB18030', 'confidence': 0.9900000095367432}
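
As the first output line shows, detect() can come back with encoding None when the byte string is short or ambiguous. A small wrapper with a fallback default helps; this is a sketch of my own, not part of cchardet, and the helper name detect_encoding and the utf-8 default are assumptions:

import cchardet

def detect_encoding(raw: bytes, default: str = 'utf-8') -> str:
    """Best-effort encoding detection with a fallback default."""
    result = cchardet.detect(raw)      # {'encoding': ..., 'confidence': ...}
    encoding = result.get('encoding')
    # cchardet may return None for short or ambiguous input,
    # so fall back to a sensible default instead of failing later.
    return encoding if encoding else default

print(detect_encoding('我是菜鸡'.encode('GBK')))      # falls back to utf-8 here
print(detect_encoding('为反对三法司'.encode('gbk')))  # GB18030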

Next, compressed storage:

import zlib
import htmlmin  # pip install htmlmin

html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Bootstrap Case</title>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
</head>
<body> 
<div class="container">
  <h2>Well</h2>
  <div class="well">Basic Well</div>
</div>
</body>
</html>
"""
s = 'slfsjdalfkasflkkdkaleeeeeeeeeeeeeeeeeeeeeeeeeeeelaaalkllfksaklfasdll  kkkkkk123'
# zlib compression
zlib_s = zlib.compress(s.encode('utf8'))
# zlib_s = zlib.compress(s.encode('gbk'))
print('s'.ljust(12), len(s))
print('zlib_s'.ljust(12), len(zlib_s))
# compare with htmlmin's minification (zlib returns bytes, htmlmin returns a str)
zlib_html = zlib.compress(html.encode('utf8'))
print('html'.ljust(12), len(html))
print('zlib_html'.ljust(12), len(zlib_html))
htmlmin_html = htmlmin.minify(html, remove_empty_space=True)
print('htmlmin_html'.ljust(12), len(htmlmin_html))
# zlib decompression: round trip back to the original string
ss = zlib.decompress(zlib_s)
print(ss.decode('utf8'))
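
For completeness, here is one way the compressed bytes might be written to and read back from disk in a crawler. The file naming via an MD5 hash of the URL is my own illustration, not from the original:

import hashlib
import zlib

def save_html(url: str, html_text: str, encoding: str = 'utf-8') -> str:
    """Compress the page and write it to a file named after the URL's hash."""
    path = hashlib.md5(url.encode('utf-8')).hexdigest() + '.zlib'
    with open(path, 'wb') as f:
        f.write(zlib.compress(html_text.encode(encoding)))
    return path

def load_html(path: str, encoding: str = 'utf-8') -> str:
    """Read the compressed file back and restore the original HTML string."""
    with open(path, 'rb') as f:
        return zlib.decompress(f.read()).decode(encoding)

p = save_html('https://example.com/', html)   # `html` from the snippet above
assert load_html(p) == html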

Conclusion:

    1. Compress only the parts you actually need; something like <head> is of little use once stored, so don't store it at all (this example doesn't demonstrate that fully; see the sketch after this list).
    2. zlib achieves a higher compression ratio than htmlmin.
    3. Storing the HTML at all is a safeguard against the product team or the client changing requirements later.
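
A minimal sketch of point 1, assuming lxml is available (pip install lxml); extracting only the <body> before compressing is my own illustration of the idea, not code from the original:

import zlib
from lxml import html as lxml_html  # pip install lxml

def compress_useful_part(page: str) -> bytes:
    """Keep only the <body> subtree (dropping <head> etc.), then zlib-compress it."""
    tree = lxml_html.fromstring(page)
    body = tree.find('body')
    useful = lxml_html.tostring(body, encoding='unicode') if body is not None else page
    return zlib.compress(useful.encode('utf-8'))

compressed = compress_useful_part(html)   # `html` from the snippet above
print(len(html), len(compressed))         # the body-only payload is smaller still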