XPath and JSONPath in Python


jsonpath

After converting a JSON string into a Python dict, use jsonpath to extract data from it.

  • $: the root node
  • .: a child node, equivalent to / in XPath
  • ..: a descendant node at any depth, equivalent to // in XPath

Code demo:

from jsonpath import jsonpath

data = {
    'uid': 123456,
    'title': '深耕医疗很多年',
    'photo': 'http://www.baidu.com',
    'author': {
        'name':'小陈',
        'photo': 'https://www.xfz.cn/',
        'authors': ['ccb', 'ccb2']
    },
    'source': '网上冲浪',
    'keywords': ['动脉网', '新华网']
}

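# jsonpath() returns a list of every match (or False when nothing matches),
# so [0] takes the first hit
title = jsonpath(data, '$.title')[0]
name = jsonpath(data, '$..name')[0]  # '..' finds 'name' at any depth (here: author.name)
source = jsonpath(data, '$.source')[0]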
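To make the difference between . and .. concrete, here is a minimal sketch in plain Python of what a descendant lookup does. It is only an illustration, not how the jsonpath package is implemented. The helper name find_descendants is hypothetical, the data is a trimmed copy of the dict above, and the real library may return matches in a different order.

```python
def find_descendants(obj, key):
    """Yield every value stored under `key` at any depth (roughly what '$..key' selects)."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep descending: the value itself may contain more matches
            yield from find_descendants(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_descendants(item, key)


data = {
    'title': '深耕医疗很多年',
    'photo': 'http://www.baidu.com',
    'author': {
        'name': '小陈',
        'photo': 'https://www.xfz.cn/',
    },
}

# '$.photo' only looks at the direct children of the root
child_photo = data['photo']
# '$..photo' looks at every level below the root
all_photos = list(find_descendants(data, 'photo'))
all_names = list(find_descendants(data, 'name'))

print(child_photo)
print(all_photos)
print(all_names)
```

The child lookup sees only the top-level photo, while the descendant lookup also reaches the one nested under author.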

XPath

Syntax

<body>
	 <div>
	      <a href="/ershoufang/dongcheng/"  title="北京东城在售二手房 ">东城</a>
	      <a href="/ershoufang/xicheng/"  title="北京西城在售二手房 ">西城</a>
	      <a href="/ershoufang/chaoyang/"  title="北京朝阳在售二手房 ">朝阳</a>
	</div>
</body>

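/: parent-child relationship, e.g. div/a[2]. To select the n-th tag, append [n] after the tag name (positions start at 1).

//: descendant relationship. The matched tag may be nested at any depth and does not have to be a direct child.

@: locate a tag by one of its attributes, e.g. //a[@title="北京西城在售二手房 "]

@attribute_name: extract the value of the named attribute from a tag, e.g. //a[@title="北京西城在售二手房 "]/@title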
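The four rules above can be tried directly on the sample markup. The following is a small sketch using lxml. The variable names are illustrative, and the markup is copied from the snippet above, including the trailing space inside each title value, which the attribute match has to reproduce exactly.

```python
from lxml import etree

html = '''
<body>
  <div>
    <a href="/ershoufang/dongcheng/" title="北京东城在售二手房 ">东城</a>
    <a href="/ershoufang/xicheng/" title="北京西城在售二手房 ">西城</a>
    <a href="/ershoufang/chaoyang/" title="北京朝阳在售二手房 ">朝阳</a>
  </div>
</body>
'''

tree = etree.HTML(html)

# /  : direct child, [2] picks the second <a> under the div (1-based)
second = tree.xpath('//div/a[2]/text()')

# // : every <a> anywhere in the document
all_links = tree.xpath('//a/text()')

# @  : filter by attribute value (note the trailing space in the title)
by_title = tree.xpath('//a[@title="北京西城在售二手房 "]/text()')

# @attribute_name at the end of the path returns attribute values
hrefs = tree.xpath('//a/@href')

print(second)
print(all_links)
print(by_title)
print(hrefs)
```

Every xpath() call here returns a list, even when only one node matches, so take [0] when a single value is needed.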

Code

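from lxml import etree
import requests

# Fetch the page and take its decoded text
url = 'http://xczx.news.cn/mlxc.htm'
response = requests.get(url=url)
report = response.text

# Parse the HTML, then select the text of every <a> that is a direct child of
# a <span> anywhere under the div whose id is "content-list"
html_tree = etree.HTML(report)
tags = html_tree.xpath('//div[@id="content-list"]//span/a/text()')

# xpath() returns a list of strings here; join them into one string
text = ''.join(tags)

print(text)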

Other