The difference between response.text and response.content in the Python requests library

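Today, while requesting a Slovak website, I found that the returned page was garbled. Trying a few common encodings at random did not help, and the problem only went away once I switched from response.text to response.content. So I took a quick look at how the two differ, and at how to detect a site's encoding.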

First, let's jump into the source and look at how the two are defined (both are properties of the Response object).

```
def content(self):
    """Content of the response, in bytes."""

def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``charset_normalizer`` or ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """
```

The docstrings show that content returns bytes (the raw binary data), while text returns Unicode, that is, a decoded str.
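To make the bytes versus str distinction concrete, here is a minimal sketch that needs no network access. The Slovak sample string is my own, and the raw variable only stands in for what response.content would hold. It shows that text is just those bytes decoded with some codec, and that picking the wrong codec garbles the result without raising any error.

```python
# Sample Slovak text; the bytes stand in for what response.content would hold.
original = "Žltý kôň"
raw = original.encode("windows-1250")

# Decoding with the codec the bytes were written in recovers the text.
right = raw.decode("windows-1250")

# Decoding with ISO-8859-1 never raises; it silently maps some bytes
# to the wrong characters instead.
wrong = raw.decode("iso-8859-1")

print(type(raw).__name__, type(right).__name__)
print(repr(right))
print(repr(wrong))
```

The silent failure is the annoying part: nothing tells you the codec was wrong, you just get odd characters.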

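In other words, use response.text when you want text, and response.content when you want any other kind of data, such as an image. Some sites use unusual encodings, though, and response.text does not give you the correct text for them, so whenever I use etree.HTML I pass it content directly.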

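So why does the text come out garbled? Because requests first tries to get the response encoding from the Content-Type response header, like this: Content-Type: text/html; charset=utf-8. But the Content-Type of the site I was requesting looked like this: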

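This site does not set a charset parameter in its Content-Type. If the parameter were present, requests would decode the body with it. When it finds none, requests assumes that any text content type is encoded as ISO-8859-1. I checked in the debugger and, sure enough, that is what happened.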
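The header-only rule described here can be sketched roughly as follows. This is my own simplified approximation built on the standard library, not the code requests actually runs, and encoding_from_content_type is a hypothetical name.

```python
from email.message import Message
from typing import Optional


def encoding_from_content_type(content_type: str) -> Optional[str]:
    """Hypothetical helper approximating the header-only rule.

    Returns the declared charset if there is one, ISO-8859-1 for
    text/* types without a charset, and None otherwise.
    """
    if not content_type:
        return None

    # Let the standard library parse the header and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html; charset=utf-8"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("image/png"))
```

With a bare text/html header the fallback kicks in, which matches what I saw in the debugger.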

[Screenshot: the encoding shown in the debugger]

Since the page's encoding is not the default, how do we detect its real encoding? requests ships a helper that sniffs the encoding declared in the page: requests.utils.get_encodings_from_content(response.text)

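All it does is look for an encoding declaration in the page's markup, so whatever it returns is literally present in the page. Let's do a bit more debugging.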
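The idea of pulling the declaration out of the markup can be sketched with a couple of regular expressions. The patterns and the name find_declared_encodings are my own simplification and are not copied from requests. It works even on wrongly decoded text, because the declaration itself is plain ASCII and ISO-8859-1 leaves ASCII untouched.

```python
import re
from typing import List

# Simplified patterns: a charset inside a <meta> tag, or an XML declaration.
META_CHARSET_RE = re.compile(r'<meta[^>]*?charset=["\']?([\w.:-]+)', re.IGNORECASE)
XML_DECL_RE = re.compile(r'^<\?xml[^>]*?encoding=["\']([\w.:-]+)["\']', re.IGNORECASE)


def find_declared_encodings(html: str) -> List[str]:
    """Return every encoding name declared in the markup, in order found."""
    return META_CHARSET_RE.findall(html) + XML_DECL_RE.findall(html)


sample = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
    "</head><body></body></html>"
)
print(find_declared_encodings(sample))
```

An empty list means the page declares nothing, and you are back to guessing.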

[Screenshot: the real encoding in the debugger]

Good, that did return the real encoding. Let's go back to the page and look for it.

[Screenshot: the encoding declared in the page]

Nice, the page does declare its encoding, so we can take content and decode it ourselves.

content = response.content.decode('windows-1250')
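Putting the steps together, here is a rough sketch that goes straight from raw bytes to text. decode_page is a hypothetical helper of mine, and both the utf-8 fallback and the 4096-byte search window are assumptions rather than anything requests does.

```python
import codecs
import re

# Bytes pattern, so the declaration can be found before any decoding happens.
_META_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w.:-]+)', re.IGNORECASE)


def decode_page(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset declared in the page, if any."""
    encoding = fallback
    match = _META_RE.search(raw[:4096])  # declarations normally sit near the top
    if match:
        candidate = match.group(1).decode("ascii")
        try:
            codecs.lookup(candidate)  # make sure Python knows this codec
            encoding = candidate
        except LookupError:
            pass  # unknown name: keep the fallback
    # errors="replace" keeps going on stray bytes instead of raising.
    return raw.decode(encoding, errors="replace")


page = (
    '<html><head><meta charset="windows-1250"></head>'
    "<body>Žltý kôň</body></html>"
).encode("windows-1250")
print(decode_page(page))
```

With requests, raw would be response.content. The other route, which the text docstring quoted earlier points to, is to set response.encoding = 'windows-1250' before reading response.text.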

Reference: www.cnblogs.com/cc11001100/…