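How to Use the urllib Package in Python

Python's standard library includes several modules that make it easy to work with data from the internet. The urllib package is one of them. It can be used to fetch data from the internet and perform common processing tasks. Within urllib, the request module is used to open and read URLs, the error module defines the exceptions raised when something goes wrong, and the parse module helps break a URL into its components. There is also robotparser, for working with the robots.txt files you may find on web servers. In this tutorial, we will look at some of the modules in the urllib package.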
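As a quick taste of two of these modules, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without any network access. The example URL, the robots.txt rules, and the 'my-bot' agent name are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components (no network access needed).
parts = urlparse('http://httpbin.org/get?color=Blue#top')
print(parts.scheme)    # http
print(parts.netloc)    # httpbin.org
print(parts.path)      # /get
print(parts.query)     # color=Blue
print(parts.fragment)  # top

# Feed robots.txt rules to the parser directly instead of downloading them.
# The rules and the 'my-bot' agent name are hypothetical.
robots = RobotFileParser()
robots.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(robots.can_fetch('my-bot', 'http://example.com/private/page'))  # False
print(robots.can_fetch('my-bot', 'http://example.com/public/page'))   # True
```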

How to Fetch Data

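First, we can set up a virtual environment for Python by running the **virtualenv .** command in the directory of our choice. Don't forget to activate it with **source ./Scripts/activate**. Our virtual environment is named vurllib (short for virtualized urllib), and the prompt now reads **(vurllib) vurllib $**, which shows that the environment is ready.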

Now let's open the project in PyCharm and add a new file for trying out some urllib examples.


Importing urllib

Before we can use the code inside the urllib package, we need to import it. The following line imports the request module of the urllib package.

urllib_examples.py

import urllib.request

This gives us access to the functions we'll be testing shortly. But first, we need some external URLs to work with.

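httpbin to the Rescue

Httpbin is an excellent web service for testing HTTP libraries. It has several useful endpoints that let you test almost everything you need from an HTTP library. Check it out at httpbin.org.

Setting the URL and Fetching Data

Now we can specify a URL to work with and store it in the **url** variable. To make a request to this URL, we call the urlopen() function and pass in the variable holding the URL. The response is stored in the **result** variable.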

import urllib.request

# specify the URL to get data from
url = 'http://httpbin.org/xml'

# open the URL and fetch some data
result = urllib.request.urlopen(url)
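The object returned by urlopen() can also be used as a context manager, which closes the response once the block ends. The sketch below shows that pattern. To keep it runnable without network access, it assumes a data: URL, which the default opener handles in Python 3.4 and later, in place of the httpbin address.

```python
import urllib.request

# A data: URL carries its payload inline, so no network is involved.
# It stands in for 'http://httpbin.org/xml' in this sketch.
url = 'data:text/plain;charset=utf-8,Hello%20urllib'

# The with statement closes the response automatically.
with urllib.request.urlopen(url) as result:
    body = result.read()

print(body)  # b'Hello urllib'
```

The same with statement works unchanged for an http:// URL.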

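Checking the HTTP Response Code

The HTTP response code tells us whether a specific HTTP request completed successfully. Responses are grouped into five classes.

Informational responses (100-199)
Successful responses (200-299)
Redirects (300-399)
Client errors (400-499)
Server errors (500-599)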

import urllib.request

# specify the URL to get data from
url = 'http://httpbin.org/xml'

# open the URL and fetch some data
result = urllib.request.urlopen(url)

# Print the resulting http status code
print('Result code: {0}'.format(result.status))

When we run the code above, we see the 200 OK status code, which means everything went well!
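The five classes of response codes can be told apart by the first digit of the code. The helper below is a small illustrative sketch of that idea. The status_category name is our own and is not part of urllib.

```python
def status_category(code):
    """Return the class name for an HTTP status code (hypothetical helper)."""
    categories = {
        1: 'Informational',
        2: 'Success',
        3: 'Redirection',
        4: 'Client error',
        5: 'Server error',
    }
    # Integer division by 100 keeps only the leading digit.
    return categories.get(code // 100, 'Unknown')

for code in (200, 301, 404, 503):
    print(code, status_category(code))
```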

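HTTP Response Headers

The response from the server also includes HTTP headers. These are text fields that the web server sends back when it receives an HTTP request. The response headers carry several kinds of information, and we can inspect them with the **getheaders()** function.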

import urllib.request

# specify the URL to get data from
url = 'http://httpbin.org/xml'

# open the URL and fetch some data
result = urllib.request.urlopen(url)

# Print the resulting http status code
print('Result code: {0}'.format(result.status))

# print the response data headers
print('Headers: ---------------------')
print(result.getheaders())

Result

[('Date', 'Mon, 09 Mar 2020 16:05:38 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]

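Above we can see the headers the server sent back, as returned by the **getheaders()** function. If you only want a single header value, you can use the **getheader()** function instead. The headers come back as a list of tuples. Here we have values for Date, Content-Type, Content-Length, Connection, Server, Access-Control-Allow-Origin, and Access-Control-Allow-Credentials. Interesting!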
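Because the headers arrive as a list of (name, value) tuples, it can be handy to turn them into a dictionary for lookups. The sketch below uses a trimmed copy of the sample output above rather than a live response. The header_map name is our own.

```python
# Header list copied from the sample output above (trimmed).
headers = [
    ('Date', 'Mon, 09 Mar 2020 16:05:38 GMT'),
    ('Content-Type', 'application/xml'),
    ('Content-Length', '522'),
    ('Server', 'gunicorn/19.9.0'),
]

# Header names are case-insensitive, so normalize the keys before lookups.
header_map = {name.lower(): value for name, value in headers}

print(header_map['content-type'])         # application/xml
print(int(header_map['content-length']))  # 522
```

With a live response, result.getheader('Content-Type') returns the same single value without building a dictionary.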

Reading the Response Data

Now we need to read the actual data, or payload, contained in the HTTP response. To do that, we can use the read() and decode() functions like so.

import urllib.request

# specify the URL to get data from
url = 'http://httpbin.org/xml'

# open the URL and fetch some data
result = urllib.request.urlopen(url)

# Print the resulting http status code
print('Result code: {0}'.format(result.status))

# print the response data headers
print('Headers: ---------------------')
print(result.getheaders())

# print the actual response data
print('Returned data: ---------------------')
print(result.read().decode('utf-8'))

Result

Returned data: ---------------------
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>

We can visit the same URL in a web browser to see how it renders this data.
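The decode('utf-8') call in the code above hard-codes the encoding. When the server declares a charset in its Content-Type header, that value can be used instead. The sketch below assumes a hypothetical decode_body helper and uses an email.message.Message as a stand-in for result.headers, which is a subclass of that class.

```python
from email.message import Message

def decode_body(raw, headers, fallback='utf-8'):
    """Decode raw bytes using the charset from Content-Type, if present."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset)

# Stand-in for result.headers; the header value is made up for this sketch.
headers = Message()
headers['Content-Type'] = 'text/plain; charset=iso-8859-1'

print(decode_body(b'caf\xe9', headers))  # café
```

With a live response, the call would be decode_body(result.read(), result.headers).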

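GET and POST with urllib

In the section above, we saw how to use urllib to fetch data from a web service. Now we want to see how to send information to a web server. Most commonly, this is done with a GET or POST HTTP request. A GET request uses parameters encoded directly in the URL, which is a fairly common way to issue a query to a web service such as a Bing search. If you are trying to create or update something on a web server, you would typically use a POST request. There are other HTTP methods to learn, such as PUT, PATCH, and DELETE, but GET and POST are sufficient most of the time, and they are the two we will test here.

Requesting the GET Endpoint

In the code below, we start by once again setting a simple URL, httpbin.org/get. Then we read the HTTP status code again and use read() and decode() to read the returned data.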

import urllib.request

# set up Url for the request
url = 'http://httpbin.org/get'

result = urllib.request.urlopen(url)

print('Result code: {0}'.format(result.status))
print('Returned data: ----------------------')
print(result.read().decode('utf-8'))

Result

C:\python\vurllib\Scripts\python.exe C:/python/vurllib/urllib_examples.py
Result code: 200
Returned data: ----------------------
{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5e667d77-8282fd705e85709035d2c830"
  }, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin.org/get"
}

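Notice that the args key in the response is empty. This means we did not send any data along with the request. We can do that, however, and that's what we'll do next.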
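Since httpbin answers with JSON, the decoded text can be handed to the standard json module to inspect keys such as args programmatically. The sketch below parses a trimmed copy of the response shown above instead of making a live request.

```python
import json

# Trimmed copy of the response body shown above.
body = '''
{
  "args": {},
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.8"
  },
  "url": "http://httpbin.org/get"
}
'''

data = json.loads(body)
print(data['args'])                   # {}
print(data['headers']['User-Agent'])  # Python-urllib/3.8
```

With a live response, the input would be result.read().decode('utf-8') instead of the literal string.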

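Creating an args Payload

To pass data in the payload, we can use a simple Python dictionary with some arbitrary data, just for the sake of example. The data first needs to be URL-encoded with the **urlencode()** function. The result of that operation is stored in the **data** variable. Finally, we make the request with the **urlopen()** function, passing in the URL and the data joined by a question mark character.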

import urllib.request
import urllib.parse

# set up Url for the request
url = 'http://httpbin.org/get'

# define sample data to pass to the GET request
args = {
    'color': 'Blue',
    'shape': 'Circle',
    'is_active': True
}

# url-encoded data before passing as arguments
data = urllib.parse.urlencode(args)

# issue the request with the data params as part of the URL
result = urllib.request.urlopen(url + '?' + data)

print('Result code: {0}'.format(result.status))
print('Returned data: ----------------------')
print(result.read().decode('utf-8'))

Result


C:\python\vurllib\Scripts\python.exe C:/python/vurllib/urllib_examples.py
Result code: 200
Returned data: ----------------------
{
  "args": {
    "color": "Blue", 
    "is_active": "True", 
    "shape": "Circle"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5e668013-78946ef0a23939d07b2ceff8"
  }, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin.org/get?color=Blue&shape=Circle&is_active=True"
}

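Looking at the result above, we notice two new things. The args key is now populated with the payload data we are interested in. Also, notice that all of the data is encoded in the URL itself. That is how a GET request works.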
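To see exactly what urlencode() produces, the encoding step can be run on its own without contacting a server. The sketch below also shows how characters that are not URL-safe get escaped and how parse_qs() reverses the encoding. The 'q' example value is made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

args = {'color': 'Blue', 'shape': 'Circle', 'is_active': True}

# Non-string values such as True are converted with str() before encoding.
query = urlencode(args)
print(query)  # color=Blue&shape=Circle&is_active=True

# Characters that are unsafe in a URL are escaped; spaces become '+'.
print(urlencode({'q': 'blue circle & square'}))  # q=blue+circle+%26+square

# parse_qs reverses the encoding; every value comes back as a list of strings.
print(parse_qs(query))
```

This is also why is_active appears as the string "True" in the httpbin response: query strings carry text only.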

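Making a POST Request

POST works differently from GET. The same args dictionary can still be used as the payload, but it needs to be encoded to bytes before the POST request is made. This is done with the encode() function, one of the built-in string methods in Python, which uses UTF-8 by default. For a POST request, we do not append the parameters to the URL. Instead, we use the data parameter of the urlopen() function. When data is passed directly to urlopen(), urllib automatically switches to the POST method behind the scenes. There is no need to tell urllib to use POST instead of GET.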

import urllib.request
import urllib.parse

# set up Url for the request
url = 'http://httpbin.org/post'

# define sample data to pass to the POST request
args = {
    'color': 'Blue',
    'shape': 'Circle',
    'is_active': True
}

# url-encode the data, then encode the resulting string to bytes
data = urllib.parse.urlencode(args)
data = data.encode()

# issue the request with a data parameter to use POST
result = urllib.request.urlopen(url, data=data)

print('Result code: {0}'.format(result.status))
print('Returned data: ----------------------')
print(result.read().decode('utf-8'))

Result

C:\python\vurllib\Scripts\python.exe C:/python/vurllib/urllib_examples.py
Result code: 200
Returned data: ----------------------
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "color": "Blue", 
    "is_active": "True", 
    "shape": "Circle"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "38", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5e6683a5-777d0378401b31982e213810"
  }, 
  "json": null, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin.org/post"
}

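Can you spot the differences in the response we get from httpbin? That's right, the payload data is now inside the form key rather than args. Also, notice that the url key has no data embedded in the URL itself. So we can see the difference between GET and POST and how each carries its payload data.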
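The automatic switch from GET to POST can be observed without sending anything. The sketch below builds two urllib.request.Request objects, one with data and one without, and asks each which method it would use. The variable names are our own.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = urllib.parse.urlencode({'color': 'Blue'}).encode()

# Building a Request does not send anything over the network.
without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```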

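Errors with urllib

Handling errors is not always the most fun thing to do, but it is necessary. The web is inherently error-prone, so programs that make HTTP requests should be prepared for these situations. You may run into a case where the server responds with an HTTP error code. Or the URL you are trying to fetch data from no longer exists. Then again, there could be a network problem that causes the request to time out. Any number of things can cause problems for the program. To mitigate these situations, you can wrap the HTTP request in a try/except block in Python. Below are a few examples of how to do this.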

import urllib.request
from urllib.error import HTTPError, URLError
from http import HTTPStatus

url = 'http://httpbin.org/html'

# wrap the web request in a try/except block
try:
    result = urllib.request.urlopen(url)
    print('Result code: {0}'.format(result.status))
    if (result.getcode() == HTTPStatus.OK):
        print(result.read().decode('utf-8'))

# happens on a non-success error code
except HTTPError as err:
    print('There was an HTTP Error with code: {0}'.format(err.code))

# happens when there is something wrong with the URL itself
except URLError as err:
    print('There has been a catastrophic failure. {0}'.format(err.reason))

The first example actually has no error and works just fine. We are using urllib to fetch httpbin.org/html, which holds some text from Herman Melville's novel Moby-Dick. We can see this result in PyCharm.


What if we change the code like this? Notice line 5, which now has an invalid URL.

import urllib.request
from urllib.error import HTTPError, URLError
from http import HTTPStatus

url = 'http://i-dont-exist.org/'

# wrap the web request in a try/except block
try:
    result = urllib.request.urlopen(url)
    print('Result code: {0}'.format(result.status))
    if (result.getcode() == HTTPStatus.OK):
        print(result.read().decode('utf-8'))

# happens on a non-success error code
except HTTPError as err:
    print('There was an HTTP Error with code: {0}'.format(err.code))

# happens when there is something wrong with the URL itself
except URLError as err:
    print('There has been a catastrophic failure. {0}'.format(err.reason))

This time, the result is quite different. Our except block handles the error gracefully and displays a user-friendly message.
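One detail in the code above is worth spelling out: HTTPError is a subclass of URLError, so the HTTPError clause has to come first or the URLError clause would catch both. The sketch below illustrates this with hand-built exceptions instead of real network failures. The classify helper and the error messages are our own.

```python
import io
from urllib.error import HTTPError, URLError

print(issubclass(HTTPError, URLError))  # True

def classify(exc):
    """Raise exc and report which except clause caught it."""
    try:
        raise exc
    except HTTPError as err:
        return 'HTTPError {0}'.format(err.code)
    except URLError as err:
        return 'URLError {0}'.format(err.reason)

# Hand-built exceptions stand in for real network failures.
not_found = HTTPError('http://httpbin.org/status/404', 404, 'Not Found',
                      None, io.BytesIO())
no_host = URLError('host not found')

print(classify(not_found))  # HTTPError 404
print(classify(no_host))    # URLError host not found
```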


Httpbin also provides a way to check for a 404 status code. We can test that error condition like so, and notice that we now get a different error.

import urllib.request
from urllib.error import HTTPError, URLError
from http import HTTPStatus

url = 'http://httpbin.org/status/404'

# wrap the web request in a try/except block
try:
    result = urllib.request.urlopen(url)
    print('Result code: {0}'.format(result.status))
    if (result.getcode() == HTTPStatus.OK):
        print(result.read().decode('utf-8'))

# happens on a non-success error code
except HTTPError as err:
    print('There was an HTTP Error with code: {0}'.format(err.code))

# happens when there is something wrong with the URL itself
except URLError as err:
    print('There has been a catastrophic failure. {0}'.format(err.reason))
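The HTTPStatus enum imported in these examples can also turn a numeric code such as err.code into a readable phrase. The sketch below assumes a hypothetical describe_status helper and runs without a network connection.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code with its standard phrase, e.g. '404 Not Found'."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not defined in HTTPStatus end up here.
        return '{0} (unrecognized status code)'.format(code)
    return '{0} {1}'.format(status.value, status.phrase)

print(describe_status(404))  # 404 Not Found
print(describe_status(200))  # 200 OK
```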


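Some Shortcomings of urllib

The urllib module is fairly easy to use, but it does have some drawbacks compared to other libraries. One is that urlopen() on its own only issues GET or POST requests. Other verbs such as PUT, PATCH, and DELETE are less common, and using them means building a urllib.request.Request with an explicit method, which is less convenient than a library with a dedicated function for each verb. A second drawback is that urllib does not automatically decode the returned data for you. If you are writing an application that has to deal with unknown data sources or several encodings, that becomes cumbersome to work with. Cookies, authentication, and sessions are not handled by urlopen() out of the box and require wiring up handler classes by hand. Working with JSON responses takes an extra step, and timeouts have to be passed explicitly. An alternative to urllib that we can try is Python Requests.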
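For completeness, here is a rough sketch of how a verb other than GET or POST can be expressed with the standard library: urllib.request.Request accepts a method argument (Python 3.3 and later). The payload and the httpbin.org/put target are only illustrative, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical payload; httpbin.org/put is used only as an example target.
payload = json.dumps({'color': 'Green'}).encode('utf-8')

request = urllib.request.Request(
    'http://httpbin.org/put',
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='PUT',
)

print(request.get_method())                # PUT
# Request stores header names capitalized, hence 'Content-type'.
print(request.get_header('Content-type'))  # application/json

# Sending it would look like this (not executed here):
# with urllib.request.urlopen(request, timeout=10) as result:
#     print(json.loads(result.read().decode('utf-8')))
```

The amount of setup needed here, compared with a single call in a higher-level library, is the kind of friction described above.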

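Python urllib Summary

In this tutorial, we learned a little about using urllib, which is part of the Python standard library, to fetch internet data in Python. To access a URL with urllib, you use the urlopen() function, which is part of urllib.request. The data returned from the server needs to be converted with the decode() function. To make a POST request with urlopen(), all you need to do is include the data parameter, and urllib changes the HTTP verb under the hood. We also saw some examples of HTTPError and URLError and how to handle them.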