Using XPath
-CSS selectors
-XPath selectors: XPath (XML Path Language) is a language for addressing parts of an XML document
    -nodename  selects all child nodes named nodename
    -/         selects from the root node; /body/div matches every div directly under body
    -//        selects matching nodes anywhere below the current node, regardless of position
    -.         selects the current node
    -..        selects the parent of the current node
    -@         selects attributes
doc = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html' id='id_a'>Name: My image 1 <br/><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
</div>
</body>
</html>
'''
from lxml import etree
html = etree.HTML(doc)
a = html.xpath('//a[1]/following-sibling::*[2]/@href')  # 2nd sibling element after the first <a>
print(a)  # ['image3.html']
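The path expressions above can also be tried with nothing but the standard library: xml.etree.ElementTree supports a limited XPath subset (enough for //, . and @ predicates, though not axes like following-sibling, which need lxml). A minimal sketch on a trimmed-down version of the doc above:

```python
# Minimal sketch of the XPath subset supported by the stdlib ElementTree.
# lxml implements full XPath 1.0; ElementTree covers only a small subset.
import xml.etree.ElementTree as ET

doc = """<div id='images'>
    <a href='image1.html'>Name: My image 1</a>
    <a href='image2.html'>Name: My image 2</a>
</div>"""

root = ET.fromstring(doc)
hrefs = [a.get('href') for a in root.findall('.//a')]  # // : match at any depth
print(hrefs)   # ['image1.html', 'image2.html']

first = root.find("a[@href='image1.html']")            # @ : attribute predicate
print(first.text)  # Name: My image 1
```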
Selenium action chains
-used for slider (drag) captchas
-chromedriver downloads: http://npm.taobao.org/mirrors/chromedriver/
-Form 1: one-shot drag and drop
    actions = ActionChains(bro)
    actions.drag_and_drop(source, target)
    actions.perform()
-Form 2: hold the element down, then move it in small steps
    ActionChains(bro).click_and_hold(source).perform()
    distance = target.location['x'] - source.location['x']
    track = 0
    while track < distance:
        ActionChains(bro).move_by_offset(xoffset=2, yoffset=0).perform()
        track += 2
    ActionChains(bro).release().perform()
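The while-loop above moves the held element in fixed 2-pixel increments until the accumulated track covers the distance. The step math can be checked without a browser; this pure-Python sketch just collects the offsets such a loop would emit (the distance value is made up):

```python
# Pure-Python sketch of the slider step math: move in fixed `step`-pixel
# increments until the accumulated track covers `distance`. No browser
# involved; in real code each entry becomes one move_by_offset() call.
def slide_steps(distance, step=2):
    """Return the list of x-offsets the loop would emit."""
    steps = []
    track = 0
    while track < distance:
        steps.append(step)
        track += step
    return steps

moves = slide_steps(7)
print(moves)       # [2, 2, 2, 2]
print(sum(moves))  # 8 -- note it can overshoot by up to step-1 pixels
```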
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the webdriver automation flag
bro = webdriver.Chrome(executable_path='../chromedriver.exe', options=options)
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
bro.maximize_window()
bro.implicitly_wait(5)
try:
    # fill in the 12306 username/password form
    username = bro.find_element(by=By.ID, value='J-userName')
    username.send_keys('1111')
    password = bro.find_element(by=By.ID, value='J-password')
    password.send_keys('1111')
    time.sleep(3)
    btn = bro.find_element(by=By.ID, value='J-login')
    btn.click()
    time.sleep(1)
    # drag the slider captcha all the way to the right
    span = bro.find_element(by=By.ID, value='nc_1_n1z')
    ActionChains(bro).click_and_hold(span).perform()
    ActionChains(bro).move_by_offset(xoffset=300, yoffset=0).perform()
    time.sleep(3)
except Exception as e:
    print(e)
finally:
    bro.close()
Auto-login with a captcha-solving platform
Image captchas are hard for our program to crack locally, so we send the captcha image to a third-party solving service and submit the answer it returns; the service charges per request.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from chaojiying import ChaojiyingClient
from PIL import Image
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('http://www.chaojiying.com/apiuser/login/')
bro.implicitly_wait(10)
bro.maximize_window()
try:
    username = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[1]/input')
    password = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[2]/input')
    code = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[3]/input')
    btn = bro.find_element(by=By.XPATH, value='/html/body/div[3]/div/div[3]/div[1]/form/p[4]/input')
    username.send_keys('306334678')
    password.send_keys('lqz123')
    # screenshot the whole page, then crop out just the captcha image
    bro.save_screenshot('main.png')
    img = bro.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div[1]/form/div/img')
    location = img.location
    size = img.size
    print(location)
    print(size)
    img_tu = (int(location['x']), int(location['y']),
              int(location['x'] + size['width']), int(location['y'] + size['height']))
    img = Image.open('./main.png')
    fram = img.crop(img_tu)
    fram.save('code.png')
    # send the cropped captcha to the solving service; it charges per
    # request, so call PostPic only once (1902 is the captcha-type code)
    chaojiying = ChaojiyingClient('306334678', 'lqz123', '937234')
    with open('code.png', 'rb') as f:
        im = f.read()
    res = chaojiying.PostPic(im, 1902)
    print(res)
    res_code = res['pic_str']
    code.send_keys(res_code)
    time.sleep(5)
    btn.click()
    time.sleep(10)
except Exception as e:
    print(e)
finally:
    bro.close()
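The crop box passed to Image.crop above is plain arithmetic on the element's location and size as Selenium reports them. Isolated as a helper (the sample coordinates are made up):

```python
# Sketch of the crop-box arithmetic used above: given an element's
# location and size dicts (as Selenium reports them), build the
# (left, top, right, bottom) tuple that PIL's Image.crop expects.
def crop_box(location, size):
    return (int(location['x']),
            int(location['y']),
            int(location['x'] + size['width']),
            int(location['y'] + size['height']))

box = crop_box({'x': 100, 'y': 50}, {'width': 120, 'height': 40})
print(box)  # (100, 50, 220, 90)
```

Note that on high-DPI displays the screenshot may be larger than the CSS pixel coordinates, in which case the box needs to be scaled by the device pixel ratio.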
Scraping JD product listings with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.common.keys import Keys
def get_goods(driver):
    try:
        goods = driver.find_elements(by=By.CLASS_NAME, value='gl-item')
        for good in goods:
            name = good.find_element(by=By.CSS_SELECTOR, value='.p-name em').text
            price = good.find_element(by=By.CSS_SELECTOR, value='.p-price i').text
            commit = good.find_element(by=By.CSS_SELECTOR, value='.p-commit a').text
            url = good.find_element(by=By.CSS_SELECTOR, value='.p-name a').get_attribute('href')
            img = good.find_element(by=By.CSS_SELECTOR, value='.p-img img').get_attribute('src')
            if not img:
                # lazy-loaded images keep the real URL in data-lazy-img
                img = 'https://' + good.find_element(by=By.CSS_SELECTOR, value='.p-img img').get_attribute('data-lazy-img')
            print('''
            name:    %s
            price:   %s
            link:    %s
            image:   %s
            reviews: %s
            ''' % (name, price, url, img, commit))
        # click the next-page link, then recurse to scrape the next page
        button = driver.find_element(by=By.PARTIAL_LINK_TEXT, value='下一页')
        button.click()
        time.sleep(1)
        get_goods(driver)
    except Exception as e:
        print(e)
def spider(url, keyword):
    driver = webdriver.Chrome(executable_path='./chromedriver.exe')
    driver.get(url)
    driver.implicitly_wait(10)
    try:
        input_tag = driver.find_element(by=By.ID, value='key')
        input_tag.send_keys(keyword)
        input_tag.send_keys(Keys.ENTER)  # submit the search
        get_goods(driver)
    finally:
        driver.close()

if __name__ == '__main__':
    spider('https://www.jd.com/', keyword='精品内衣')
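get_goods calls itself once per result page, so a very deep result set could in principle hit Python's recursion limit. The same pagination works as a flat loop; here is a sketch against a hypothetical PageStub standing in for the real driver (in Selenium, find_element raises when no next-page link exists, which is what ends the loop):

```python
# Recursion-free pagination sketch. PageStub is a hypothetical stand-in
# for the Selenium driver: it "has" N pages and raises when asked to go
# past the last one, just as find_element raises when the next-page
# link is missing.
class PageStub:
    def __init__(self, pages):
        self.page = 1
        self.pages = pages

    def next_page(self):
        if self.page >= self.pages:
            raise Exception('no next-page link')
        self.page += 1

def crawl_all(driver):
    visited = []
    while True:
        visited.append(driver.page)   # scrape the current page here
        try:
            driver.next_page()        # click '下一页' in real code
        except Exception:
            break                     # last page reached
    return visited

print(crawl_all(PageStub(3)))  # [1, 2, 3]
```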
Introduction to Scrapy
-a crawling framework: everything is already wired up, you just write your code in the designated spots
-django: batteries-included for anything web-related
-scrapy: batteries-included for anything crawling-related
Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping): it extracts the data you need from websites in a fast, simple, extensible way. Today it is used far more broadly, for data mining, monitoring, automated testing, consuming data returned by APIs, and general-purpose web crawling.
-mac/linux:
    pip3 install scrapy
-windows: pip3 install scrapy usually works; if it fails, install the dependencies by hand:
    1. pip3 install wheel
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/
    6. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy
-create a crawler project (the counterpart of django-admin startproject):
    scrapy startproject myfirstscrapy
-generate a spider inside the project:
    scrapy genspider cnblogs www.cnblogs.com
-run the spider (--nolog suppresses log output):
    scrapy crawl cnblogs --nolog
-or create run.py so the spider can be launched from Python:
    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'cnblogs', '--nolog'])
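The argv list handed to execute in run.py is ordinary Python data, so it can be built by a small helper, e.g. to toggle --nolog from one place. crawl_argv below is a hypothetical helper, not part of Scrapy, and building the list requires no Scrapy install:

```python
# Hypothetical helper for run.py: build the argv list that
# scrapy.cmdline.execute expects, toggling --nolog in one place.
def crawl_argv(spider, nolog=True):
    argv = ['scrapy', 'crawl', spider]
    if nolog:
        argv.append('--nolog')
    return argv

print(crawl_argv('cnblogs'))  # ['scrapy', 'crawl', 'cnblogs', '--nolog']

# In run.py this would become:
# from scrapy.cmdline import execute
# execute(crawl_argv('cnblogs'))
```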