Python scraping: a workaround for XPath not supporting ends-with

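My articles double as personal notes. If you spot a mistake, please point it out; I would be very grateful.

My email: silenceandsharp@163.com

This article is based on Python 3.

References (a VPN may be needed to open them):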

stackoverflow.com/questions/2…

www.iro.umontreal.ca/~lapalme/Fo…

If you cannot open the links above, that is fine. The main content is below:

The ends-with function is part of xpath 2.0 but browsers (you indicate you're testing with chrome) generally only support 1.0. So you'll have to implement it yourself with a combination of string-length, substring and equals

substring(@id, string-length(@id) - string-length('register') +1) = 'register'

In short: XPath 2.0 supports ends-with, but Python's lxml module implements XPath 1.0, not XPath 2.0.
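As a quick check of that claim, here is a minimal sketch that calls ends-with through lxml and catches the resulting error. The sample markup and the example.com URL are made up for illustration; the exact error message may vary between lxml versions, so it is only printed, not relied on.

```python
from lxml import etree

# Hypothetical sample markup, not taken from a real page.
doc = etree.HTML('<a href="http://example.com/a.shtml">link</a>')

try:
    # ends-with() is an XPath 2.0 function, so the XPath 1.0 engine rejects it.
    doc.xpath("//a[ends-with(@href, 'shtml')]")
    supported = True
except etree.XPathEvalError as exc:
    supported = False
    print("ends-with is not available:", exc)

print(supported)
```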

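So the workaround is to emulate ends-with using substring.

XPath substring syntax:
substring(string, start, length):
Explanation:
Returns the portion of string starting at position start (the first character is numbered 1) and running for length characters; if length is not given, the rest of the string is returned.
With start = string-length(@id) - string-length('register') + 1 and no length, the call returns the last string-length('register') characters of @id, which are then compared with 'register'.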
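To make the index arithmetic concrete, the sketch below emulates substring in plain Python and uses it to rebuild the ends-with test. The helper names xpath_substring and xpath_ends_with are hypothetical, and the emulation assumes integer arguments only (real XPath 1.0 also rounds fractional positions and handles NaN), so treat it as an illustration of the idea rather than a full implementation.

```python
def xpath_substring(s, start, length=None):
    """Rough emulation of XPath 1.0 substring() for integer arguments.

    Positions are 1-based: a character at position p is kept when
    start <= p, and additionally p < start + length if length is given.
    """
    first = max(start, 1)
    if length is None:
        return s[first - 1:]
    last_exclusive = start + length
    if last_exclusive <= first:
        return ""
    return s[first - 1:last_exclusive - 1]


def xpath_ends_with(value, suffix):
    """Mirror of: substring(v, string-length(v) - string-length(s) + 1) = s"""
    return xpath_substring(value, len(value) - len(suffix) + 1) == suffix


# "user-register" has 13 characters and "register" has 8,
# so start = 13 - 8 + 1 = 6 and the tail from position 6 is "register".
print(xpath_substring("user-register", 6))           # register
print(xpath_ends_with("user-register", "register"))  # True
# If the value is shorter than the suffix, start drops below 1,
# the whole value is returned, and the comparison fails.
print(xpath_ends_with("reg", "register"))            # False
```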

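Also, if you do not need a strict match, contains is a simpler substitute. Note that it matches the text anywhere in the string, not only at the end.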
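The difference between the two checks can be seen in a small sketch. The markup and example.com URLs below are invented for illustration: the second link contains "shtml" in its path but does not end with it.

```python
from lxml import etree

# Hypothetical sample markup: only the first href actually ends with "shtml".
SAMPLE = (
    "<div>"
    '<a href="http://example.com/news/a.shtml">ends with shtml</a>'
    '<a href="http://example.com/shtml/index.html">only contains shtml</a>'
    "</div>"
)

doc = etree.HTML(SAMPLE)

# contains() matches "shtml" anywhere in the attribute value.
by_contains = doc.xpath("//a[contains(@href, 'shtml')]/@href")

# The substring-based expression only matches when "shtml" is the suffix.
by_suffix = doc.xpath(
    "//a[substring(@href, string-length(@href) - string-length('shtml') + 1)"
    " = 'shtml']/@href"
)

print(by_contains)
print(by_suffix)
```

Here contains returns both links, while the substring expression returns only the first one.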

Example code:

from lxml import etree

# Sample link whose href starts with "http" and ends with "shtml"
x = '<a href="http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml" target="_blank" >老艾侃股:美降息将推动外资加速流入</a>'
html = etree.HTML(x)
# XPath 1.0 substitute for ends-with(@href, 'shtml')
print(html.xpath("//a[starts-with(@href,'http') and substring(@href, string-length(@href) - string-length('shtml') +1) = 'shtml']/@href"))
# Looser check: matches 'shtml' anywhere in the href
print(html.xpath("//a[starts-with(@href,'http') and contains(@href,'shtml')]/@href"))

Output:


['http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml']
['http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml']
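If the suffix changes from call to call, one option is to pass it as an XPath variable instead of pasting it into the expression, which avoids quoting problems. The sketch below assumes lxml's support for passing variables as keyword arguments to xpath(); the helper name hrefs_ending_with and the sample markup are hypothetical.

```python
from lxml import etree


def hrefs_ending_with(doc, suffix):
    """Return href values of <a> elements whose href ends with `suffix`."""
    expr = (
        "//a[substring(@href, string-length(@href) - string-length($suffix) + 1)"
        " = $suffix]/@href"
    )
    # Keyword arguments are exposed to the expression as XPath variables.
    return doc.xpath(expr, suffix=suffix)


# Hypothetical sample markup for illustration.
sample_doc = etree.HTML(
    '<a href="http://example.com/a.shtml">a</a>'
    '<a href="http://example.com/b.html">b</a>'
)

print(hrefs_ending_with(sample_doc, "shtml"))
print(hrefs_ending_with(sample_doc, ".html"))
```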