Using BeautifulSoup (Part 2)

A review of extracting tags, attributes, and text with bs4

Getting basic attributes, tags, and text

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,"lxml")

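print(soup.title.string)
print(type(soup.name))  # the name is a plain str
print(soup.name)        # the BeautifulSoup object itself is named "[document]"
print(soup.attrs)       # the document object has no attributes, so this is {}

print(soup.a)           # the first <a> tag
print(soup.a.string)    # the comment markers are stripped; only the comment text is shown
print(type(soup.a.string))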

Output

The Dormouse's story
<class 'str'>
[document]
{}
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie 
<class 'bs4.element.Comment'>
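soup.attrs prints {} because the BeautifulSoup object itself carries no attributes; they live on the individual tags. The sketch below is only an illustration. It assumes the built-in "html.parser" backend and a shortened version of the sample HTML, and the helper name visible_string is hypothetical. It reads attributes from a tag and checks whether .string returned a Comment rather than ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened sample; "html.parser" is assumed so no extra parser is required
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                    # first <a> tag in the document
href = first["href"]              # single-valued attribute, returned as a str
classes = first["class"]          # class is multi-valued, returned as a list
missing = first.get("title")      # .get() returns None instead of raising KeyError


def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


links = soup.find_all("a")
texts = [visible_string(a) for a in links]
print(href, classes, missing, texts)
```

Checking isinstance(s, Comment) matters because printing a Comment looks the same as printing ordinary text, as the "Elsie" line in the output above shows.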

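Child nodes and related navigation in bs4

bs4 can navigate to ancestor, parent, sibling, child, and descendant nodes. Only children and descendants are shown here.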

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- E
lsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie<
/a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

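soup = BeautifulSoup(html, "lxml")

# .contents returns all direct children of an element as a list
print(soup.head.contents)
"""
[<title>The Dormouse's story</title>]
"""

# Direct children only; text nodes count too, including any newlines in them
print(soup.a.contents)
"""
[' E\nlsie ']
"""

# .children iterates over the same direct children that .contents lists
for child in soup.body.children:
    print(child)

# .descendants walks every nested tag and string, not just the direct children
# for child in soup.body.descendants:
#     print(child)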

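Output of children. In the sample HTML the closing tag of the second link is split across two lines, so the parser keeps "<" and "/a>" as text and escapes them as &lt; and &gt;.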

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie&lt;
/a&gt; and
</a><a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
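The blank lines in the children output come from whitespace-only text nodes that sit between the <p> tags. If only the tags are wanted, one possible approach is to filter on the node type. This is a minimal sketch on a simplified document, assuming the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified document; the "\n" between tags become text nodes of their own
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.body.children)    # tags and newline strings
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children))                   # counts the newline text nodes too
print([t.name for t in tag_children])
```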

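Output of descendants: each level is peeled away in turn until only a string is left

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>  # first level under body
<b>The Dormouse's story</b>                                       # with the p layer removed
The Dormouse's story                                              # with the b layer removed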


<p class="story">Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie&lt;
/a&gt; and
</a><a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters;
and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- E
lsie --></a>
 E
lsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie&lt;
/a&gt; and
</a>
Lacie<
/a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tilli
e</a>
Tilli
e
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
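The parent, ancestor, and sibling navigation mentioned earlier is not demonstrated in the output above. As a rough sketch, again on a simplified document and assuming the built-in "html.parser" backend, it could look like this:

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
link1 = soup.find("a", id="link1")

parent_name = link1.parent.name                      # the enclosing tag
ancestor_names = [t.name for t in link1.parents]     # every enclosing node, innermost first
next_node = link1.next_sibling                       # may be a text node, here ", "
next_link = link1.find_next_sibling("a")             # skips text nodes, next <a> only
prev_link = next_link.find_previous_sibling("a")     # back to the first link

print(parent_name, ancestor_names, repr(str(next_node)), next_link["id"])
```

.next_sibling returns whatever node comes next, which is often a string, so find_next_sibling is used when a tag is needed.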