Review: extracting tags, attributes, and text with bs4
Getting basic attributes, tags, and text
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title.string)
print(type(soup.name))  # the name of the soup object is an ordinary str
print(soup.name)
print(soup.attrs)
print(soup.a)
print(soup.a.string)  # .string drops the <!-- --> markers and returns the comment's text
print(type(soup.a.string))
Output
The Dormouse's story
<class 'str'>
[document]
{}
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>
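The output above shows two things worth a closer look: soup.attrs is empty because attributes live on individual tags, and soup.a.string silently returns a Comment. Below is a minimal sketch, assuming a shortened copy of the sample HTML and the built-in "html.parser" rather than "lxml"; the helper visible_text is a hypothetical name, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""

# Assumption: "html.parser" ships with Python; the examples above use "lxml".
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
print(first_link.name)          # the tag name
print(first_link.attrs)         # all attributes of this tag as a dict
print(first_link["href"])       # one attribute; raises KeyError if it is missing
print(first_link.get("class"))  # class is multi-valued, so bs4 returns a list
print(first_link.get("title"))  # .get() returns None for a missing attribute


def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is a comment or missing."""
    text = tag.string
    if text is None or isinstance(text, Comment):
        return ""
    return str(text)


link_texts = [visible_text(a) for a in soup.find_all("a")]
print(link_texts)
```

Checking isinstance(text, Comment) before using .string is one way to avoid treating commented-out markup as visible text.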
Child nodes and related navigation in bs4 (partial demo)
Using ancestor, parent, sibling, child, and descendant nodes in bs4 (partial demo)
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- E
lsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie<
/a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.head.contents)  # .contents returns all direct children of an element as a list
"""
[<title>The Dormouse's story</title>]
"""
print(soup.a.contents)  # children are split at tag boundaries; line breaks in the text count too
"""
[' E\nlsie ']
"""
# children
for child in soup.body.children:
    print(child)  # each direct child node, i.e. one entry per closed tag
# for child in soup.body.descendants:
#     print(child)  # every tag plus everything nested inside it
Output of children
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie<
/a> and
</a><a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
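The children listing above also contains the text nodes that sit between tags, which is easy to miss in printed output. Below is a minimal sketch, assuming a trimmed-down copy of the sample HTML and the built-in "html.parser" rather than "lxml", that keeps only Tag children and compares them with the tags reached through .descendants.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time</p>
</body>"""

# Assumption: "html.parser" ships with Python; the examples above use "lxml".
soup = BeautifulSoup(html, "html.parser")

# .children yields direct children only, including whitespace text nodes.
all_children = list(soup.body.children)
tag_children = [c for c in all_children if isinstance(c, Tag)]
print(len(all_children), len(tag_children))
print([c.name for c in tag_children])

# .descendants walks the whole subtree depth-first, so the nested <b> shows up too.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
print(descendant_tags)
```

Filtering with isinstance(node, Tag) is one option; soup.body.find_all(recursive=False) is another way to get only the direct child tags.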
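Output of descendants: the data is peeled away one layer at a time until only strings remain
<p class="title" name="dromouse"><b>The Dormouse's story</b></p> # first level under body
<b>The Dormouse's story</b> # one level down, with the p peeled off
The Dormouse's story # one more level down, with the b peeled off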
<p class="story">Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie<
/a> and
</a><a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>
E
lsie
,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie<
/a> and
</a>
Lacie<
/a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>
Tilli
e
;
and they lived at the bottom of a well.
<p class="story">...</p>
...
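The parent, ancestor, and sibling navigation named in the heading of this part is not shown in the code above. Below is a minimal sketch of those attributes, assuming a cleaned-up fragment of the sample HTML (comment replaced by plain text) and the built-in "html.parser" rather than "lxml".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html><body><p class="story">Once upon a time
<a id="link1" href="http://example.com/elsie">Elsie</a>,
<a id="link2" href="http://example.com/lacie">Lacie</a> and
<a id="link3" href="http://example.com/tillie">Tillie</a>;</p></body></html>"""

# Assumption: "html.parser" ships with Python; the examples above use "lxml".
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find("a", id="link2")

# .parent is the enclosing tag; .parents walks upward to the soup object itself.
parent_name = link2.parent.name
ancestor_names = [p.name for p in link2.parents]
print(parent_name, ancestor_names)

# .next_sibling is often a text node, not a tag.
next_node = link2.next_sibling
print(repr(next_node))

# find_next_sibling / find_previous_sibling skip over text nodes.
next_tag = link2.find_next_sibling("a")
prev_tag = link2.find_previous_sibling("a")
print(prev_tag["id"], next_tag["id"])

# .next_siblings is a generator over everything after this node at the same level.
later_ids = [s["id"] for s in link2.next_siblings if isinstance(s, Tag)]
print(later_ids)
```

Because plain .next_sibling and .previous_sibling frequently land on the text between tags, the find_*_sibling methods are usually the more convenient choice when only tags matter.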