python_Xpath (web scraping)

Shortcut: Ctrl+Shift+X (presumably for toggling the XPath Helper browser extension)

Syntax

<body>
     <div>
          <a href="/ershoufang/dongcheng/"  title="北京东城在售二手房 ">东城</a>
          <a href="/ershoufang/xicheng/"  title="北京西城在售二手房 ">西城</a>
          <a href="/ershoufang/chaoyang/"  title="北京朝阳在售二手房 ">朝阳</a>
    </div>
</body>

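/: parent-child relationship, e.g. div/a[2]. To select the nth matching tag, append [n] to the tag name; the index starts at 1, not 0.

//: descendant relationship. The matched tag may sit at any depth below the preceding one, not only directly under it, e.g. //body//a.

@: locates a tag by its attribute, e.g. //a[@title="北京西城在售二手房 "]. The trailing space inside the quotes is part of the attribute value in the sample above, so it must be kept.

@attribute-name: extracts the value of the named attribute from a tag, e.g. //a[@title="北京西城在售二手房 "]/@title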
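The four rules above can be tried directly against the sample markup. The following is a minimal sketch using lxml; the variable names (SAMPLE_HTML, second_link_text, and so on) are chosen here for illustration and are not part of the original post.

```python
from lxml import etree

# Sample markup copied from the Syntax section.
SAMPLE_HTML = """
<body>
     <div>
          <a href="/ershoufang/dongcheng/"  title="北京东城在售二手房 ">东城</a>
          <a href="/ershoufang/xicheng/"  title="北京西城在售二手房 ">西城</a>
          <a href="/ershoufang/chaoyang/"  title="北京朝阳在售二手房 ">朝阳</a>
    </div>
</body>
"""

tree = etree.HTML(SAMPLE_HTML)

# "/" with a 1-based index: the text of the second <a> directly under a <div>.
second_link_text = tree.xpath('//div/a[2]/text()')

# "//" descendant step: the href of every <a> at any depth below <body>.
all_hrefs = tree.xpath('//body//a/@href')

# "@" inside a predicate: locate the <a> element by its title attribute.
# The trailing space in the value is part of the sample data.
xicheng = tree.xpath('//a[@title="北京西城在售二手房 "]')

# "/@name" as the last step: return the attribute value instead of the element.
xicheng_title = tree.xpath('//a[@title="北京西城在售二手房 "]/@title')

print(second_link_text)   # text nodes, returned as a list of strings
print(all_hrefs)          # attribute values, returned as a list of strings
print(xicheng[0].text)    # element objects expose their text via .text
print(xicheng_title)
```

Note that xpath() always returns a list, even when only one node matches, so a single result still has to be taken out with an index such as [0].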

Code

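from lxml import etree
import requests

url = 'http://xczx.news.cn/mlxc.htm'
# Fetch the page; the timeout keeps the script from hanging on a slow server.
response = requests.get(url=url, timeout=10)
response.raise_for_status()
# If the server does not declare a charset, requests may guess wrong and
# Chinese text comes out garbled, so use the encoding detected from the body.
response.encoding = response.apparent_encoding
report = response.text

# Parse the HTML and collect the text of every <a> that sits directly inside
# a <span> anywhere under the element with id="content-list".
html_tree = etree.HTML(report)
tags = html_tree.xpath('//div[@id="content-list"]//span/a/text()')

# xpath() returns a list of strings; join them into one string.
text = ''.join(tags)

print(text)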
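The script above depends on the live page, whose layout may change. To experiment with the same XPath expression offline, the parsing step can be pulled into a small function and fed a local string. This is only a sketch: FAKE_PAGE is hypothetical markup invented for illustration (it is not the real page), and extract_link_texts is a helper name assumed here.

```python
from lxml import etree

# Hypothetical markup: an assumption made for illustration, not the real page.
FAKE_PAGE = """
<div id="content-list">
  <ul>
    <li><span><a href="/a.htm"> First headline </a></span></li>
    <li><span><a href="/b.htm">Second headline</a></span></li>
    <li><span>no link here</span></li>
  </ul>
</div>
"""


def extract_link_texts(html, xpath='//div[@id="content-list"]//span/a/text()'):
    """Return the stripped, non-empty text nodes matched by the XPath expression."""
    tree = etree.HTML(html)
    return [t.strip() for t in tree.xpath(xpath) if t.strip()]


titles = extract_link_texts(FAKE_PAGE)
# One title per line is easier to read than a single concatenated string.
print('\n'.join(titles))
```

Keeping the results as a list, rather than concatenating them right away, leaves the choice of separator to the caller.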