Study Notes - First Web Crawler

Open Jupyter.

First, import the request module from Python's urllib library:

from urllib.request import urlopen

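urlopen opens and reads a remote object fetched over the network. It is a general-purpose function that can read HTML files, image files, and any other file stream.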

html = urlopen("http://www.naver.com")
print(html.read())
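
Note that html.read() returns bytes rather than str, which is why the printed output starts with b'...'. Here is a minimal sketch of turning it into text. It assumes the server declares its charset in the Content-Type header and falls back to UTF-8 otherwise. read_text is a hypothetical helper name, and the data: URL is only a stand-in so the example runs without network access; any http:// URL can be passed instead.

```python
import base64
from urllib.request import urlopen


def read_text(url, fallback_charset="utf-8"):
    """Fetch url and return its body as str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, exactly as sent by the server
        # Use the charset from the Content-Type header when there is one
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset, errors="replace")


# Offline stand-in for a real page: made-up HTML packed into a data: URL
page = "<html><body><h1>Héllo</h1></body></html>".encode("utf-8")
demo_url = "data:text/html;charset=utf-8;base64," + base64.b64encode(page).decode("ascii")

text = read_text(demo_url)
print(type(text).__name__)
print(text)
```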

Crawl result: the raw HTML of the page, printed as a bytes object.

BeautifulSoup

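The BeautifulSoup library formats and organizes messy web content by locating HTML tags, and exposes the XML structure as easy-to-use Python objects.

BeautifulSoup is not part of the Python standard library, so it normally has to be installed separately (for example with pip install beautifulsoup4). In my Jupyter setup it was available out of the box, which saved time; that is typically because the distribution behind Jupyter (Anaconda, for example) bundles it, not because Jupyter itself provides it. (The most commonly used object in the library is, fittingly, the BeautifulSoup object.)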
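
Whether bs4 is importable depends on the environment, so it can be worth checking before running the rest of the notebook. A small sketch using only the standard library; is_installed is a hypothetical helper name, not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    print("bs4 is available")
else:
    print("bs4 is missing; install the beautifulsoup4 package first")
```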

Now adjust the example from the beginning of these notes:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)

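We import urlopen, then call html.read() to get the page's HTML content. That content is passed to a BeautifulSoup object, which parses it into a tree of nested tags.

To extract a tag, walk down the tree by attribute. Each of the following reaches the same h1 as bsObj.h1: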

        bsObj.html.body.h1 
        bsObj.body.h1 
        bsObj.html.h1
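
To see that these paths really land on the same element, here is a short sketch that runs offline. The inline HTML is made up for illustration and only stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><div>text</div></body></html>"
)
soup = BeautifulSoup(sample, "html.parser")

# Four ways of spelling the path to the first h1
paths = [soup.html.body.h1, soup.body.h1, soup.html.h1, soup.h1]
for tag in paths:
    print(tag)

# Each lookup returns the first matching descendant, so all four are one object
all_same = all(tag is paths[0] for tag in paths)
print(all_same)
```

The shorter forms work because attribute access searches all descendants, not just direct children. The longer forms are only needed when the same tag name appears in several places and you want a specific one.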

With a basic skeleton in place, it is time to think about what can go wrong.

Problems

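When the page does not exist on the server (or an error occurs while retrieving it):

The server answers with an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases urlopen raises an HTTPError exception.

Solution: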

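from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return None, abort the program, or fall back to another plan
else:
    # The program continues. Note: if the except block above already returns
    # or breaks, the else clause is unnecessary and this code will not run.
    pass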
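
An HTTPError also carries the status code (e.code) and a short reason (e.reason), which is usually more useful than printing the exception itself. The sketch below wraps the pattern in a function. fetch and always_404 are hypothetical names, the example.com URL is a placeholder, and the fake opener exists only so the error path can be exercised without a network connection.

```python
from email.message import Message
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response, or None if the server answered with an HTTP error."""
    try:
        return opener(url)
    except HTTPError as e:
        # e.code is the numeric status, e.reason the short description
        print(f"HTTP error {e.code} ({e.reason}) for {url}")
        return None


def always_404(url):
    # Hypothetical stand-in for urlopen that simulates a missing page
    raise HTTPError(url, 404, "Not Found", Message(), None)


result = fetch("http://example.com/missing.html", opener=always_404)
print(result)
```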

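When the server does not exist:

If the server cannot be reached at all (the link is dead or the URL is mistyped), urlopen does not return None. It raises URLError, the parent class of HTTPError, so this case needs its own except clause:

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("The server could not be reached")
else:
    # The program continues
    pass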
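
In current Python 3, an unreachable host surfaces as a URLError exception rather than as a None return value. Because HTTPError is a subclass of URLError, the order of the except clauses matters when both are handled. This is a sketch of one way to separate the two cases. open_page is a hypothetical helper, and the demo assumes port 1 on the local machine is closed, so the connection fails without touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_page(url, timeout=5):
    """Return (response, error_message); exactly one of the two is None."""
    try:
        return urlopen(url, timeout=timeout), None
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return None, f"server answered with HTTP {e.code}"
    except URLError as e:
        return None, f"server could not be reached: {e.reason}"


# Assumption: nothing listens on local port 1, so the connection is refused
response, error = open_page("http://127.0.0.1:1/", timeout=2)
print(response)
print(error)
```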

When the tag you access does not exist:

If the tag you ask for does not exist, BeautifulSoup returns None. Accessing a child tag on that None object then raises an AttributeError.

Solution:

try:
    badContent = bsObj.body.li
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
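
An alternative to catching AttributeError is to walk the path one tag at a time and stop at the first missing one. This is a sketch under that assumption, not something taken from the library itself. find_nested is a hypothetical helper, and the inline HTML is made up for illustration.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow tag_names one step at a time; return None as soon as one is missing."""
    node = root
    for name in tag_names:
        node = node.find(name)
        if node is None:
            return None
    return node


soup = BeautifulSoup("<html><body><h1>Hello</h1></body></html>", "html.parser")

found = find_nested(soup, "body", "h1")
missing = find_nested(soup, "body", "ul", "li")  # there is no ul, so no error
print(found)
print(missing)
```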

Reorganized code:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The page is missing (HTTPError) or the server is unreachable (URLError)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so .h1 could not be looked up
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Page or title could not be found")
else:
    print(title)

A good book to recommend:

Web Scraping with Python (Chinese edition: Python网络数据采集)