
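My articles are really my personal notes. If you spot a mistake, please point it out; I'd be very grateful.
My email: silenceandsharp@163.com
This article is based on Python 3.
References (may require a VPN to open):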
stackoverflow.com/questions/2…
www.iro.umontreal.ca/~lapalme/Fo…
If you can't open the links above, don't worry. Here is the main content:
The ends-with function is part of xpath 2.0 but browsers (you indicate you're testing with chrome) generally only support 1.0. So you'll have to implement it yourself with a combination of string-length, substring and equals
substring(@id, string-length(@id) - string-length('register') +1) = 'register'
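In short: XPath 2.0 supports ends-with, but Python's lxml module implements XPath 1.0, not XPath 2.0, so ends-with is not available there.
So we use the following workaround instead, built on substring and string-length.
Syntax of XPath's substring: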
substring(string, start, length):
Explanation:
Returns the portion of string starting at position start (the first character is numbered 1) for length characters; if length is not given, the rest of the string is returned.
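To make the 1-based indexing concrete, here is a minimal sketch that evaluates substring directly through lxml. The empty `<root/>` element and the sample strings are just placeholders I made up so there is something to evaluate the expressions against.

```python
from lxml import etree

# A throwaway element; the expressions below only use string literals.
root = etree.XML("<root/>")

# Three arguments: 5 characters starting at position 1 (positions are 1-based).
first_five = root.xpath("substring('register', 1, 5)")
# Two arguments: everything from position 4 to the end of the string.
from_fourth = root.xpath("substring('register', 4)")
# The ends-with trick: start = length(string) - length(suffix) + 1.
tail = root.xpath(
    "substring('user-register', "
    "string-length('user-register') - string-length('register') + 1)"
)

print(first_five)   # regis
print(from_fourth)  # ister
print(tail)         # register
```

If the tail equals the suffix, the string ends with it, which is exactly what the workaround expression compares.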
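Also, if you don't need a strict match, you can simply use contains instead. Keep in mind that contains matches the text anywhere in the string, not only at the end.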
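A small sketch of where the two approaches differ. The two example.com links are made up for illustration: only the first one really ends with shtml, while the second merely has shtml in the middle of its path.

```python
from lxml import etree

# Hypothetical links: the second href contains 'shtml' but does not end with it.
doc = etree.HTML(
    '<a href="http://example.com/news.shtml">a</a>'
    '<a href="http://example.com/shtml/index.php">b</a>'
)

# Strict: compare the tail of @href with the suffix.
strict = doc.xpath(
    "//a[substring(@href, string-length(@href) - string-length('shtml') + 1)"
    " = 'shtml']/@href"
)
# Loose: 'shtml' may appear anywhere in @href.
loose = doc.xpath("//a[contains(@href, 'shtml')]/@href")

print(strict)  # ['http://example.com/news.shtml']
print(loose)   # ['http://example.com/news.shtml', 'http://example.com/shtml/index.php']
```

So contains is fine when the suffix is unlikely to show up elsewhere in the value, and the substring version is the safer choice otherwise.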
Sample code:
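from lxml import etree

# Sample anchor tag whose href starts with 'http' and ends with 'shtml'
x='<a href="http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml" target="_blank" >老艾侃股:美降息将推动外资加速流入</a>'
html=etree.HTML(x)
# Strict match: emulate ends-with using substring and string-length
print(html.xpath("//a[starts-with(@href,'http') and substring(@href, string-length(@href) - string-length('shtml') +1) = 'shtml']/@href"))
# Loose match: contains accepts 'shtml' anywhere in the href
print(html.xpath("//a[starts-with(@href,'http') and contains(@href,'shtml')]/@href"))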
Output:
['http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml']
['http://finance.sina.com.cn/zl/stock/2019-07-24/zl-ihytcitm4323120.shtml']
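If the suffix changes often, repeating it twice inside the expression gets tedious. One possible way around that, as a rough sketch: lxml lets you pass XPath variables as keyword arguments to xpath(), so the suffix can be written as $suffix. The helper name hrefs_ending_with and the example.com links are my own inventions, not part of lxml.

```python
from lxml import etree

# XPath 1.0 emulation of ends-with; $suffix is filled in by lxml at call time.
ENDS_WITH_HREF = (
    "//a[substring(@href, string-length(@href) - string-length($suffix) + 1)"
    " = $suffix]/@href"
)

def hrefs_ending_with(tree, suffix):
    """Hypothetical helper: return every a/@href that ends with suffix."""
    # Keyword arguments to xpath() become XPath variables ($suffix here).
    return tree.xpath(ENDS_WITH_HREF, suffix=suffix)

page = etree.HTML(
    '<a href="http://example.com/a.shtml">a</a>'
    '<a href="http://example.com/b.html">b</a>'
)

shtml_links = hrefs_ending_with(page, "shtml")
html_links = hrefs_ending_with(page, ".html")
print(shtml_links)  # ['http://example.com/a.shtml']
print(html_links)   # ['http://example.com/b.html']
```

Passing the suffix as a variable also avoids having to escape quotes inside the expression string.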