Web Scraping Notes

BeautifulSoup

The prettify() method of a BeautifulSoup object outputs the parsed string with standard indentation.
soup.title.string returns the text content of the title node in the HTML.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser

print(soup.prettify())    # re-indented markup
print(soup.title.string)  # text of the <title> node

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

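Node selectors

Accessing a node's name directly as an attribute selects that node element; reading its string attribute then returns the text inside the node.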

print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.head)
print(soup.p)

Output:

<title>The Dormouse's story</title>
The Dormouse's story
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>

Note: when several nodes share the same name, this approach matches only the first one, as with the p node above.
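The first-match behaviour can be checked with a small sketch. It assumes a trimmed copy of the sample HTML and the built-in html.parser (so it runs without lxml); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample document (an assumption for this sketch).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p              # attribute access returns only the first <p>
all_p = soup.find_all("p")    # find_all returns every <p> in document order
second_p = all_p[1]

print(first_p["class"])       # ['title']
print(len(all_p))             # 3
print(second_p.get_text())    # Once upon a time
```

To reach a later node of the same name, index into the list returned by find_all() instead of relying on soup.p.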

Extracting information

Getting the name

print(soup.p.name)

Getting attributes

print(soup.p.attrs)
print(soup.p.attrs['name'])

Getting the content

print(soup.p.string)
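A minimal sketch of the three lookups together, assuming a shortened two-paragraph HTML snippet and the built-in html.parser. It also shows two details that are easy to miss: class comes back as a list because it is a multi-valued attribute, and string is None when a node has more than one child.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = (
    '<p class="title big" name="dormouse"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a href="#">Elsie</a> more</p>'
)
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.name)        # p  (the tag name, not the "name" attribute)
print(p.attrs)       # {'class': ['title', 'big'], 'name': 'dormouse'}
print(p["name"])     # dormouse  (shorthand for p.attrs['name'])
print(p["class"])    # ['title', 'big']
print(p.get("id"))   # None  (get() avoids a KeyError for missing attributes)
print(p.string)      # The Dormouse's story  (single nested child, so string resolves)

second = soup.find_all("p")[1]
print(second.string)      # None  (several children, so string is ambiguous)
print(second.get_text())  # Once Elsie more
```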

Related selection: direct child nodes

print(soup.p.contents)

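Result (the list shown is for the second p, e.g. soup.find_all('p')[1].contents; soup.p.contents itself returns only [<b>The Dormouse's story</b>] for the first p):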

['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']

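Another way to get the direct children: children. Unlike contents, which returns a list, children returns an iterator.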

print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i,child)

Descendant nodes, descendants:

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i,child)
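The difference between children and descendants is easier to see on a node with nested tags. The sketch below assumes a small hypothetical paragraph and the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with one level of nesting inside the first link.
html = '<p class="story">Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

children = list(p.children)        # direct children only (text nodes included)
descendants = list(p.descendants)  # every nested node, depth first

print(len(children))     # 4
print(len(descendants))  # 7

# Keep only the tags among the descendants, skipping text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)         # ['a', 'span', 'a']
```

Both iterators yield text nodes as well as tags, which is why the counts are higher than the number of a elements.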

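Direct parent node, parent: a node has only one parent but can have many children. Note the difference in type: parent returns a single Tag, while children returns an iterator.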

print(soup.p.parent)

Ancestor nodes, parents (analogous to descendants for child nodes):

for i, parent in enumerate(soup.p.parents):
    print(i,parent)
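As a rough illustration of parent versus parents, the sketch below walks upward from a deeply nested node. The markup is a hypothetical minimal document, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document for this sketch.
html = "<html><body><div><p><b>text</b></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

print(b.parent.name)  # p  (a single Tag)

# parents is a generator that climbs up to the BeautifulSoup object itself,
# whose name is the special value '[document]'.
ancestor_names = [ancestor.name for ancestor in b.parents]
print(ancestor_names)  # ['p', 'div', 'body', 'html', '[document]']
```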

Sibling nodes:

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes
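The four sibling attributes can be tried on a short snippet. This is a sketch under the assumption of a hypothetical one-line paragraph and the built-in html.parser; note that a sibling is often a text node rather than a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with three links separated by text.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
second = soup.find(id="link2")

print(repr(second.next_sibling))      # ' and '  (a text node, not the next <a>)
print(repr(second.previous_sibling))  # ', '

following = list(second.next_siblings)      # generator, nearest sibling first
preceding = list(second.previous_siblings)  # generator, nearest sibling first

print(following[1]["id"])  # link3
print(preceding[1]["id"])  # link1
```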

Method selectors

find_all(name, attrs, recursive, text, **kwargs)
Finds all elements that match the given conditions.
（1） name

soup.find_all(name='ul')

(2) attrs

soup.find_all(attrs={'id': 'list-1'})
# Attributes can also be passed directly as keyword arguments
soup.find_all(id='list-1')
soup.find_all(class_='element')  # class is a Python keyword, so a trailing underscore is required

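(3) text
Matches against the text of nodes; the value can be a string or a regular expression. Since bs4 4.4.0 this argument is named string, and text is kept as the older name.

import re
soup.find_all(text=re.compile('link'))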
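The find_all() calls above refer to ul and li markup that is not part of the earlier sample, so the sketch below supplies a small hypothetical list document to run them against. It assumes the built-in html.parser and bs4 4.4.0 or later, where the text filter is spelled string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list markup chosen to match the ids and classes used above.
html = """
<div>
<ul class="list" id="list-1">
<li class="element">Foo link</li>
<li class="element">Bar</li>
</ul>
<ul class="list small" id="list-2">
<li class="element">Baz link</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all(name="ul")                    # filter by tag name
by_attrs = soup.find_all(attrs={"id": "list-1"})  # filter by attribute dict
by_kwarg = soup.find_all(id="list-1")             # same filter as a keyword
items = soup.find_all(class_="element")           # class needs the underscore
texts = soup.find_all(string=re.compile("link"))  # returns matching text nodes

print(len(uls), len(by_attrs), len(items))  # 2 1 3
print(texts)                                # ['Foo link', 'Baz link']
```

With a string filter and no tag name, the result is a list of text nodes rather than tags; use the parent attribute of each one to get back to the enclosing element.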

find() works like find_all(), but returns a single element: the first match.

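find_parents() and find_parent(): the former returns all ancestor nodes; the latter returns the direct parent (or, when a filter is given, the nearest ancestor that matches it).
find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes; the latter returns the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes; the latter returns the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node; the latter returns the first matching node.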
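A short sketch of these relative search methods, starting from one list item. The markup and ids are hypothetical, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list inside a div, followed by a paragraph.
html = (
    '<div id="outer"><ul id="list-1">'
    '<li>one</li><li id="mid">two</li><li>three</li>'
    '</ul><p>after</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
mid = soup.find(id="mid")

print(mid.find_parent("ul")["id"])   # list-1
print(mid.find_parent("div")["id"])  # outer  (nearest matching ancestor)

# A list of names is accepted as a filter; results come nearest first.
parent_names = [tag.name for tag in mid.find_parents(["ul", "div"])]
print(parent_names)                  # ['ul', 'div']

print(mid.find_next_sibling("li").get_text())      # three
print(mid.find_previous_sibling("li").get_text())  # one
print(len(mid.find_next_siblings("li")))           # 1

# find_next / find_all_next search everything after the node in document
# order, not only its siblings.
print(mid.find_next("p").get_text())                 # after
later_tags = [tag.name for tag in mid.find_all_next(True)]
print(later_tags)                                    # ['li', 'p']
```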