Basic Usage of BeautifulSoup

Keep creating, grow faster! This is day 12 of my participation in the Juejin "Daily New Plan · October Writing Challenge".

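What is BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. Like lxml, it is mainly used to parse documents and pull data out of them.

Disadvantage: it is slower than lxml.

Advantage: its interface is user-friendly and easy to work with.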

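First import BeautifulSoup, then open a local HTML file with 'utf-8' encoding and pass it to the BeautifulSoup constructor. The second argument names the parser and should be given explicitly (if it is omitted, BeautifulSoup picks a parser itself and emits a warning); here it is 'lxml', which requires the lxml package to be installed. The call returns the parsed document object. Accessing a tag name as an attribute, such as root.a, returns the first tag with that name in the document, and attrs returns the set of attributes on that tag as a dictionary.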

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')
print(root.a)
print(root.a.attrs)
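
The same steps can be tried without a file on disk. The following is a minimal sketch, assuming a small inline HTML string in place of ip.html (the markup is made up for illustration) and Python's built-in 'html.parser' instead of 'lxml', so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for ip.html
html = '<div><a href="/one" title="w">first</a><a href="/two">second</a></div>'

# 'html.parser' ships with Python; 'lxml' works the same way if it is installed
root = BeautifulSoup(html, 'html.parser')

first_link = root.a          # first <a> tag in the document
print(first_link)            # the whole tag, including its text
print(first_link.attrs)      # all attributes as a dict
print(first_link['href'])    # a single attribute, read like a dict key
```

Note that root.a only ever returns the first match; the second link is not reachable this way.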

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')
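# find returns the first tag that matches
print(root.find("a"))
# keyword arguments filter on attributes: the first <a> whose title is 'w'
print(root.find("a",title='w'))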

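class is a special attribute. If we try to look up a tag with class as the keyword argument, the editor flags an error, because class is a reserved keyword in Python and cannot be used as a name. Adding a trailing underscore, class_, makes the lookup work as expected.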

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')
print(root.find(class_='tom'))
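
As a rough illustration of class lookups, the sketch below uses a made-up inline HTML string and the built-in 'html.parser' (both are assumptions, not part of the original example). It also shows that the same filter can be written with an attrs dictionary, and that a tag with several classes matches on any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are only for illustration
html = '<ul><li class="tom">Tom</li><li class="jerry big">Jerry</li></ul>'
root = BeautifulSoup(html, 'html.parser')

by_keyword = root.find(class_='tom')            # trailing underscore avoids the keyword
by_attrs = root.find(attrs={'class': 'tom'})    # equivalent dictionary form
multi = root.find(class_='big')                 # matches one of several classes

print(by_keyword)
print(by_attrs)
print(multi['class'])   # class is multi-valued, so this is a list
```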

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')


# find_all returns a list of every matching tag, here all <b> tags
print(root.find_all('b'))

To search on several conditions at once, pass a list; a tag that matches any one of the items counts as a match.

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')


print(root.find_all(['b','a']))
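
To make the "any one of them" behaviour concrete, here is a small sketch with a made-up HTML string and the built-in 'html.parser' (both assumptions). Results come back in document order, regardless of the order of names in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three different inline tags
html = '<p><b>bold</b> and <a href="#">link</a> and <i>italic</i> and <b>more</b></p>'
root = BeautifulSoup(html, 'html.parser')

# A tag is returned if its name is 'b' OR 'a'; <i> is skipped
matches = root.find_all(['b', 'a'])
names = [tag.name for tag in matches]
print(names)
```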

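find_all accepts a limit argument that caps the number of results: as soon as that many matches have been found, the search stops and returns them. Here limit is 2, so the search stops after the first two matches.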

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')
print(root.find_all(['li','a'],limit=2))
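
The effect of limit is easiest to see next to an unlimited search. The sketch below assumes a made-up HTML string and the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three <li> tags followed by one <a>
html = '<ul><li>one</li><li>two</li><li>three</li></ul><a href="#">link</a>'
root = BeautifulSoup(html, 'html.parser')

everything = root.find_all(['li', 'a'])            # no cap: all four tags
first_two = root.find_all(['li', 'a'], limit=2)    # stops after two matches

print(len(everything), len(first_two))
print([tag.get_text() for tag in first_two])
```

Because the search runs in document order, the two results here are the first two li tags, and the a tag is never reached.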

from bs4 import BeautifulSoup

root = BeautifulSoup(open('ip.html', encoding='utf-8'),'lxml')
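# select takes a CSS selector and returns a list of all matching tags
print(root.select('li'))
# '.tom' is a class selector: every tag whose class includes 'tom'
print(root.select('.tom'))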
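
As a small self-contained illustration of select, the sketch below assumes a made-up HTML string and the built-in 'html.parser'. It also uses select_one, which returns only the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id and class names are only for illustration
html = '<ul id="names"><li class="tom">Tom</li><li>Jerry</li></ul>'
root = BeautifulSoup(html, 'html.parser')

all_items = root.select('li')                    # tag selector
tom_items = root.select('.tom')                  # class selector
first_tom = root.select_one('#names > li.tom')   # id, child and class combined

print([tag.get_text() for tag in all_items])
print([tag.get_text() for tag in tom_items])
print(first_tom.get_text())
```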

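The get_text method and the string attribute

get_text returns the text content of a tag, including the text of any tags nested inside it. string returns the text only when the tag holds a single piece of text; if the tag mixes text with nested tags, string returns None.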
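
The difference between the two can be checked with a short sketch. The HTML string and the id values are made up for illustration, and the built-in 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with plain text, one with text plus a nested tag
html = '<div><p id="plain">hello</p><p id="nested">hello <b>world</b></p></div>'
root = BeautifulSoup(html, 'html.parser')

plain = root.find(id='plain')
nested = root.find(id='nested')

print(plain.string)        # the tag's only child is text, so the text is returned
print(nested.string)       # text plus a nested <b>: None
print(nested.get_text())   # text of the tag and of everything inside it
```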