Python Crawler Series ---- Scrapy (7): Using an IP Proxy Pool

Original article: blog.csdn.net

I. Manually updating the IP pool

1. Add an IP pool to the settings file:

IPPOOL=[
	{"ipaddr":"61.129.70.131:8080"},
	{"ipaddr":"61.152.81.193:9100"},
	{"ipaddr":"120.204.85.29:3128"},
	{"ipaddr":"219.228.126.86:8123"},
	{"ipaddr":"61.152.81.193:9100"},
	{"ipaddr":"218.82.33.225:53853"},
	{"ipaddr":"223.167.190.17:42789"}
]

These IPs can be gathered from sites such as 快代理, 代理66, 有代理, 西刺代理, and guobanjia. If you get a message like "由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败" (the connection attempt failed because the connected party did not properly respond after a period of time) or "由于目标计算机积极拒绝,无法连接" (no connection could be made because the target machine actively refused it), the IP itself is the problem; just swap it out. Quite a few of the IPs above turned out to be dead; a quick way to weed them out is sketched after the log below.

2017-04-16 12:38:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 1 times): TCP connection timed out: 10060: 由于连接方在 一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
this is ip:182.241.58.70:51660
2017-04-16 12:38:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://news.sina.com.cn/> (failed 2 times): TCP connection timed out: 10060: 由于连接方在 一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
this is ip:49.75.59.243:28549
2017-04-16 12:38:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-04-16 12:38:33 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-04-16 12:38:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 12:38:53 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://news.sina.com.cn/> (failed 3 times): TCP connection timed out: 10060: 由于 连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
2017-04-16 12:38:54 [scrapy.core.scraper] ERROR: Error downloading <GET http://news.sina.com.cn/>
Traceback (most recent call last):
  File "f:\software\python36\lib\site-packages\twisted\internet\defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "f:\software\python36\lib\site-packages\twisted\python\failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "f:\software\python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。.
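
Before putting addresses into IPPOOL it saves time to test each candidate once outside of Scrapy. A minimal sketch with requests (my addition; the 3-second timeout and the target URL are arbitrary choices):

import requests

candidates = [
    "61.129.70.131:8080",
    "120.204.85.29:3128",
]

# Keep only the proxies that can fetch the target page within 3 seconds.
alive = []
for ipaddr in candidates:
    try:
        r = requests.get("http://news.sina.com.cn/",
                         proxies={"http": "http://" + ipaddr},
                         timeout=3)
        if r.status_code == 200:
            alive.append({"ipaddr": ipaddr})
    except requests.RequestException:
        print("dead proxy:", ipaddr)

print(alive)  # paste the surviving entries into IPPOOL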

In Scrapy, the downloader middleware that handles proxy settings is HttpProxyMiddleware, whose class path is:

scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
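
(As the deprecation warning in the run log further down shows, on Scrapy 1.3 this scrapy.contrib path still works but is deprecated; the non-deprecated path is scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware.)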


2. Modify the middleware file middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random
from scrapy import signals
from myproxies.settings import IPPOOL


class MyproxiesSpiderMiddleware(object):

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # Pick a random entry from the pool and attach it to the request.
        thisip = random.choice(IPPOOL)
        print("this is ip:" + thisip["ipaddr"])
        request.meta["proxy"] = "http://" + thisip["ipaddr"]


3. Set DOWNLOADER_MIDDLEWARES in settings

DOWNLOADER_MIDDLEWARES = {
#    'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
     'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':543,
     'myproxies.middlewares.MyproxiesSpiderMiddleware':125
}
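
The numbers are middleware priorities: lower values sit closer to the engine, and process_request methods are called in increasing order, so the custom middleware at 125 has already written request.meta["proxy"] by the time the request reaches the downloader.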


4. The spider file:

# -*- coding: utf-8 -*-
import scrapy


class ProxieSpider(scrapy.Spider):


    def __init__(self):
        self.headers = {
            'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
            'Accept-Encoding':'gzip, deflate',
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        }


    name = "proxie"
    allowed_domains = ["sina.com.cn"]
    start_urls = ['http://news.sina.com.cn/']

    def parse(self, response):
        print(response.body)
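
Note that Scrapy never reads the self.headers attribute defined in __init__ on its own; if you want those headers sent, attach them to each request yourself. A minimal sketch (my addition, not part of the original spider):

    def start_requests(self):
        # Attach the custom headers to every start URL.
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)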


5. Run the spider

scrapy crawl proxie


The output:

G:\Scrapy_work\myproxies>scrapy crawl proxie
2017-04-16 12:23:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: myproxies)
2017-04-16 12:23:14 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproxies', 'NEWSPIDER_MODULE': 'myproxies.spiders', 'SPIDER_MODULES': ['myproxies.spiders']}
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-04-16 12:23:14 [py.warnings] WARNING: f:\software\python36\lib\site-packages\scrapy\utils\deprecate.py:156: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
  ScrapyDeprecationWarning)

2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['myproxies.middlewares.MyproxiesSpiderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-16 12:23:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-16 12:23:14 [scrapy.core.engine] INFO: Spider opened
2017-04-16 12:23:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 12:23:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
this is ip:222.92.111.234:1080
2017-04-16 12:23:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://news.sina.com.cn/> (referer: None)
b'<html>\n<head>\n<meta http-equiv="Pragma" content="no-cache">\n<meta http-equiv="Expires" content="-1">\n<meta http-equiv="Cache-Control" content="no-cache">\n<link rel="SHORTCUT ICON" href="/favicon.ico">\n\n<title>Login</title>\n<script language="JavaScript">\n\nvar base64EncodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";\nvar base64DecodeChars = new Array(\n    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63,\n    52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1,\n    -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,\n    15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1,\n    -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,\n    41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1);\n\nfunction base64encode(str) {\n    var out, i, len;\n    var c1, c2, c3;\n\n    len = str.length;\n    i = 0;\n    out = "";\n    while(i < len) {\n\tc1 = str.charCodeAt(i++) & 0xff;\n\tif(i == len)\n\t{\n\t    out += base64EncodeChars.charAt(c1 >> 2);\n\t    out += base64EncodeChars.charAt((c1 & 0x3) << 4);\n\t    out += "==";\n\t    break;\n\t}\n\tc2 = str.charCodeAt(i++);\n\tif(i == len)\n\t{\n\t    out += base64EncodeChars.charAt(c1 >> 2);\n\t    out += base64EncodeChars.charAt(((c1 & 0x3)<< 4) | ((c2 & 0xF0) >> 4));\n\t    out += base64EncodeChars.charAt((c2 & 0xF) << 2);\n\t    out += "=";\n\t    break;\n\t}\n\tc3 = str.charCodeAt(i++);\n\tout += base64EncodeChars.charAt(c1 >> 2);\n\tout += base64EncodeChars.charAt(((c1 & 0x3)<< 4) | ((c2 & 0xF0) >> 4));\n\tout += base64EncodeChars.charAt(((c2 & 0xF) << 2) | ((c3 & 0xC0) >>6));\n\tout += base64EncodeChars.charAt(c3 & 0x3F);\n    }\n    return out;\n}\n\nfunction base64decode(str) {\n    var c1, c2, c3, c4;\n    var i, len, out;\n\n    len = str.length;\n    i = 0;\n    out = "";\n    while(i < len) {\n\t/* c1 */\n\tdo {\n\t    c1 = base64DecodeChars[str.charCodeAt(i++) & 0xff];\n\t} while(i < len && c1 == -1);\n\tif(c1 == -1)\n\t    break;\n\n\t/* c2 */\n\tdo {\n\t    c2 = base64DecodeChars[str.charCodeAt(i++) & 0xff];\n\t} while(i < len && c2 == -1);\n\tif(c2 == -1)\n\t    break;\n\n\tout += String.fromCharCode((c1 << 2) | ((c2 & 0x30) >> 4));\n\n\t/* c3 */\n\tdo {\n\t    c3 = str.charCodeAt(i++) & 0xff;\n\t    if(c3 == 61)\n\t\treturn out;\n\t    c3 = base64DecodeChars[c3];\n\t} while(i < len && c3 == -1);\n\tif(c3 == -1)\n\t    break;\n\n\tout += String.fromCharCode(((c2 & 0XF) << 4) | ((c3 & 0x3C) >> 2));\n\n\t/* c4 */\n\tdo {\n\t    c4 = str.charCodeAt(i++) & 0xff;\n\t    if(c4 == 61)\n\t\treturn out;\n\t    c4 = base64DecodeChars[c4];\n\t} while(i < len && c4 == -1);\n\tif(c4 == -1)\n\t    break;\n\tout += String.fromCharCode(((c3 & 0x03) << 6) | c4);\n    }\n    return out;\n}\n   \t \n\n\n/*\nif (window.opener) {\n\twindow.opener.location.href = document.location.href;\n\tself.close();\n}\t\n*/\n\nif (top.location != document.location) top.location.href = document.location.href;\n\nvar is_DOM = (document.getElementById)? true : false;\nvar is_NS4 = (document.layers && !is_DOM)? true : false;\n\nvar sAgent = navigator.userAgent;\nvar bIsIE = (sAgent.indexOf("MSIE") > -1)? true : false;\nvar bIsNS = (is_NS4 || (sAgent.indexOf("Netscape") > -1))? true : false;\nvar bIsMoz5 = ((sAgent.indexOf("Mozilla/5") > -1) && !bIsIE)? 
true : false;\n\nif (is_NS4 || bIsMoz5)\t{\n    document.writeln("<style type=\\"text/css\\">");\n    document.writeln(".spacer { background-image : url(\\"/images/tansparent.gif\\"); background-repeat : repeat; }");\n    document.writeln(".operadummy {}");\n    document.writeln("</style>");\n}else if (is_DOM) {\n    document.writeln("<style type=\\"text/css\\">");\n    document.writeln("body {");\n    document.writeln("\tfont-family: \\"Verdana\\", \\"Arial\\", \\"Helvetica\\", \\"sans-serif\\";");\n    //document.writeln("\tfont-size: x-small;");\n    document.writeln("\tbackground-color : #FFFFFF;");\n    document.writeln("\tbackground-image: URL(\\"/images/logon.gif\\");");\n    document.writeln("\tbackground-repeat: no-repeat;");\n\tdocument.writeln("\tbackground-position: center;");\n    document.writeln("}");\n    document.writeln(".spacer {}");\n    document.writeln(".operadummy {}");\n    document.writeln("</style>");\n//} else if (document.all) {\n//    document.write(\'<link rel="stylesheet" href="ie4.css" type="text/css">\');\n}\n\t   \nfunction stripSpace(x)\n{\n\treturn x.replace(/^\\W+/,"");\n}\n\nfunction toggleDisplay(style2)\n{\n\tif (style2.display == "block") {\n\t\tstyle2.display = "none";\n\t\tstyle2.visibility = "hidden";\n\t} else {\n\t\tstyle2.display = "block";\n\t\tstyle2.visibility = "";\n\t}\n}\n\nfunction toggleLayer(whichLayer)\n{\n\tif (document.getElementById)\n\t{\n\t\t// this is the way the standards work\n\t\tvar style2 = document.getElementById(whichLayer).style;\n\t\ttoggleDisplay(style2);\n\t}\n\telse if (document.all)\n\t{\n\t\t// this is the way old msie versions work\n\t\tvar style2 = document.all[whichLayer].style;\n//\t\tstyle2.display = style2.display? "":"block";\n\t\ttoggleDisplay(style2);\n\t}\n\telse if (document.layers)\n\t{\n\t\t// this is the way nn4 works\n\t\tvar style2 = document.layers[whichLayer].style;\n//\t\tstyle2.display = style2.display? "":"block";\n\t\ttoggleDisplay(style2);\n\t}\n}\n\nvar today = new Date();\nvar expires = new Date(today.getTime() + (365 * 24 * 60 * 60 * 1000));\nvar timer = null; \nvar nlen = 0;\n\t\t\t\nfunction Set_Cookie(name,value,expires,path,domain,secure) \n{\n    document.cookie = name + "=" +escape(value) +\n        ( (expires) ? ";expires=" + expires.toGMTString() : "") +\n        ( (path) ? ";path=" + path : "") + \n        ( (domain) ? ";domain=" + domain : "") +\n        ( (secure) ? ";secure" : "");\n}\n\nSet_Cookie("has_cookie", "1", expires);\nvar has_cookie = Get_Cookie("has_cookie") == null ? 
false : true;  \n\t\nfunction Get_Cookie(name)\n{\n    var start = document.cookie.indexOf(name+"=");\n    var len = start+name.length+1;\n    if ((!start) && (name != document.cookie.substring(0,name.length))) return null;\n    if (start == -1) return null;\n    var end = document.cookie.indexOf(";",len);\n    if (end == -1) end = document.cookie.length;\n    return unescape(document.cookie.substring(len,end));\n}\n \n  \t\t\t\t\nfunction save_cookies() \n{\n\tvar fm = document.forms[0];\n\t\n\tcookie_name = "mingzi";\n    if (has_cookie && fm.save_username_info.checked) {\n        Set_Cookie(cookie_name, fm.un.value, expires);\n\t} else if (Get_Cookie(cookie_name)) {\n\t\tdocument.cookie = cookie_name + "=" +\n\t\t\t\t\t\t  "; expires=Thu, 01-Jan-70 00:00:01 GMT";\n\t}\n         \n\tcookie_name = "kouling";\n    if (has_cookie && fm.save_username_info.checked) {\n        Set_Cookie(cookie_name, fm.pw.value, expires);\n\t} else if (Get_Cookie(cookie_name)) {\n\t\tdocument.cookie = cookie_name + "=" +\n\t\t\t\t\t\t  "; expires=Thu, 01-Jan-70 00:00:01 GMT";\n\t}\n}\n\nvar admin_pw = null;\nfunction get_cookies() \n{\n\tvar fm = document.forms[0];\n    admin_id = Get_Cookie("mingzi");\t\n    if (admin_id != null) {\n        fm.admin_id.value = base64decode(admin_id);\n        fm.save_username_info.checked = true;\n    }\n    admin_pw = Get_Cookie("kouling");\n    if (admin_pw != null) {\n        fm.admin_pw.value = base64decode(admin_pw);\n        fm.save_username_info.checked = true;\n        nlen = fm.admin_pw.value.toString().length;\n\t\tstar = "***********************************";\n\t\tfm.admin_pw.value += star.substring(0, 31 - nlen);\n    } else {\n\t\tfm.admin_pw.value = "";\n\t}\n    fm.pw.value = fm.admin_pw.value;\n\tfm.admin_id.select();\n\tfm.admin_id.focus();\n}\n\nfunction checkPassword()\n{\n   var fm = document.forms[0];\n   if (fm.admin_pw.value != fm.pw.value) {\n\t   nlen = fm.admin_pw.value.toString().length;\n\t   if (nlen>31) nlen = 31;\n   }\n}\t  \n\n\nfunction acceptCheckIt(ok)\n{\n\tif (!eval(ok)) {\n        top.location.href = "/index.html";\n\t\treturn;\n\t}\n\tvar fm = document.forms[0];\n\tvar d = new Date();\n\tfm.time.value = d.getTime().toString().substring(4,13);\n\tname = fm.admin_id.value; //stripSpace(fm.admin_id.value);\n\tpass = fm.admin_pw.value; //stripSpace(fm.admin_pw.value);\n\tif (   (name.length > 0) \n\t\t&& (pass.length > 0)\n\t   ) { \n\t\t\t fm.un.value=base64encode(name);\n\t\t\t if (pass != fm.pw.value) { // password changed\n\t\t\t\t fm.pw.value=base64encode(pass);\n\t\t\t } else {\n\t\t\t\t fm.pw.value=base64encode(pass.substring(0,nlen));\n\t\t\t }\n\t\t\t save_cookies();\n\t\t\t fm.admin_id.value="";\n\t\t\t fm.admin_pw.value="";\n\t\t\t fm.submit();\n\t }\n}\n\nfunction checkIt() \n{\n   \n\t\tacceptCheckIt(true);\n   \n}  \n \t\nfunction cancelIt() \n{\n   return false;\n}\t\t\n\n   \nfunction auto_submit() \n{\n   var fm = document.forms[0];\n   get_cookies();\n   fm.admin_id.select();//focus(); \n\n   return checkIt();\n}\t\n\t\nfunction testSelect()\n{\n\tdocument.forms[0].admin_pw.select();\n}\t\n\n\nfunction write_one_check_box(txt)\n{\n   if (has_cookie) {\n\t  document.writeln("<tr align=\'center\' valign=\'middle\'>");\n\t  document.writeln("<td align=\'center\' colspan=\'2\' style=\'color:white;font-size:10pt;\'>");\n\t  document.writeln("<in"+"put name=\'"+txt+"\' type=\'checkbox\' tabindex=\'3\'>");\n\t  document.writeln("Remember my name and password</td></tr>");\n   }\n}   \n  \n\t\nfunction reloadNow()\n{\n    
document.location = document.location;\n}\n\nvar margin_top = 0;   \nif (document.layers || bIsMoz5) {\n\tmargin_top = (window.innerHeight - 330) / 2;\n\tif (margin_top < 0) margin_top = 0;\n\t\n\twindow.onResize = reloadNow;\n} \n\t\n</script> \n</head>\n\n<body bgcolor="White" link="Black" onLoad="get_cookies();">\n\n<noscript>\n<h1>This WebUI administration tool requires scripting support.</h1>\n<h2>Please obtain the latest version of browsers which support the Javascript language or\nenable scripting by changing the browser setting \nif you are using the latest version of the browsers.\n</h2>\n</noscript>\t\n \n<div id="div1" style="display:block">\n<FORM method="POST" name="login" autocomplete="off" ACTION="/index.html">   \n<script language="javascript">\n\tif (bIsMoz5 && (margin_top > 0)) {\n\t\tdocument.writeln("<table width=\'100%\' border=\'0\' cellspacing=\'0\' cellpadding=\'0\' style=\'margin-top: " + margin_top + "px;\'>");\n\t} else {\n\t\tdocument.writeln("<table width=\'100%\' height=\'100%\' border=\'0\' cellspacing=\'0\' cellpadding=\'0\'>");\n\t}\n</script>\n\n<tr align="center" valign="middle" style="width: 471px; height: 330px;">\n<td align="center" valign="middle" scope="row">\n\n\t<script language="javascript">\n\tif (is_NS4 || bIsMoz5) {\n\t\tdocument.writeln("<table background=\'/images/logon.gif\' width=\'471\' height=\'330\' border=\'0\' cellpadding=\'0\' cellspacing=\'0\'>");\n\t} else {\n\t\tdocument.writeln("<table border=\'0\' cellpadding=\'0\' cellspacing=\'0\'>");\n\t}\n\t</script>\n  \n\t<tr align="center" valign="middle">\n\t<script language="javascript">\n\t\tdocument.writeln("<td width=\'100%\' align=\'center\' valign=\'middle\'>");\n\t</script>\n\n \t\t<table bgcolor=\'\' background=\'\' border=\'0\'>\n\t\t<tr align="center" valign="middle">\n\t\t\t<th align="right" style="color:white;font-size:10pt;">Admin Name: </th>\n\t\t\t<td align="left" style="color:white;font-size:10pt;"><INPUT type=text name="admin_id" tabindex="1" SIZE="21" MAXLENGTH="31" VALUE="">\n\t\t\t</td>\n\t\t</tr>\n\t\t<tr align="center" valign="middle">\n\t\t\t<th align="right" style="color:white;font-size:10pt;">Password: </th>\n\t\t\t<td align="left" style="color:white;font-size:10pt;"><INPUT type="password" name="admin_pw" tabindex="2" onFocus="testSelect();" onChange="checkPassword();" SIZE="21" MAXLENGTH="31" VALUE=""> \n\t\t\t</td>\n\t\t</tr>\n\t\t\n\t\t<script language="javascript">\n\t\t\twrite_one_check_box("save_username_info");\n\t\t</script>\t\n\t\t\n\t\t<tr align="center" valign="middle">\n\t\t\t<td> </td>\n\t\t\t<td align="left">\n\t\t\t<INPUT type="button" value=" Login " onClick="checkIt();" tabindex=\\ "4\\">\n\t\t\t</td>\n\t\t</tr>\n\t\t\n\t\t</table>\n\t\t\n\t</td>\n\n\t</tr>\n\t</table>\n\n</td>\n</tr>\n</table>\n<INPUT type="hidden" name="time" VALUE="0">\n<INPUT type="hidden" name="un" VALUE="">\n<INPUT type="hidden" name="pw" VALUE="">\n</FORM>\n</div>\n\n<div id="div2" style="display:none">\n<pre>\n \n</pre>\n<bar />\n<center>\n<FORM name="additional">\n\t<INPUT type="button" value="Accept" onclick="acceptCheckIt(true);">\n\t \n\t<INPUT type="button" value="Decline" onclick="acceptCheckIt(false);">\n</FORM>\n\n</center>\n</div>\n\n</body>\n</html>\n\n\n'
2017-04-16 12:23:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-16 12:23:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12111,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 16, 4, 23, 15, 198955),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 16, 4, 23, 14, 706603)}
2017-04-16 12:23:15 [scrapy.core.engine] INFO: Spider closed (finished)

G:\Scrapy_work\myproxies>


Example project: download.csdn.net/detail/u011…

II. Automatically updating the IP pool

Here is a small class, proxies.py, that fetches IPs automatically; run it once and the fetched IPs are saved to a txt file:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import lxml
from multiprocessing import Process, Queue
import random
import json
import time

class Proxies(object):


    """docstring for Proxies"""
    def __init__(self, page=3):
        self.proxies = []
        self.verify_pro = []
        self.page = page
        self.headers = {
        'Accept': '*/*',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8'
        }
        self.get_proxies()
        self.get_proxies_nn()

    def get_proxies(self):
        page = random.randint(1,10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nt/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower()+'://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def get_proxies_nn(self):
        page = random.randint(1,10)
        page_stop = page + self.page
        while page < page_stop:
            url = 'http://www.xicidaili.com/nn/%d' % page
            html = requests.get(url, headers=self.headers).content
            soup = BeautifulSoup(html, 'lxml')
            ip_list = soup.find(id='ip_list')
            for odd in ip_list.find_all(class_='odd'):
                protocol = odd.find_all('td')[5].get_text().lower() + '://'
                self.proxies.append(protocol + ':'.join([x.get_text() for x in odd.find_all('td')[1:3]]))
            page += 1

    def verify_proxies(self):
        # proxies not yet verified
        old_queue = Queue()
        # verified proxies
        new_queue = Queue()
        print ('verify proxy........')
        works = []
        for _ in range(15):
            works.append(Process(target=self.verify_one_proxy, args=(old_queue,new_queue)))
        for work in works:
            work.start()
        for proxy in self.proxies:
            old_queue.put(proxy)
        for work in works:
            old_queue.put(0)
        for work in works:
            work.join()
        self.proxies = []
        while 1:
            try:
                self.proxies.append(new_queue.get(timeout=1))
            except:
                break
        print ('verify_proxies done!')


    def verify_one_proxy(self, old_queue, new_queue):
        while 1:
            proxy = old_queue.get()
            if proxy == 0:break
            protocol = 'https' if 'https' in proxy else 'http'
            proxies = {protocol: proxy}
            try:
                if requests.get('http://www.baidu.com', proxies=proxies, timeout=2).status_code == 200:
                    print ('success %s' % proxy)
                    new_queue.put(proxy)
            except:
                print ('fail %s' % proxy)


if __name__ == '__main__':
    a = Proxies()
    a.verify_proxies()
    print (a.proxies)
    proxie = a.proxies 
    with open('proxies.txt', 'a') as f:
       for proxy in proxie:
             f.write(proxy+'\n')


Run it: python proxies.py


The IPs are saved to the proxies.txt file.
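
Each line of proxies.txt holds one proxy in the scheme://host:port form produced by the code above, for example (placeholder addresses):

http://61.135.217.7:80
http://111.155.116.245:8123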


Change the proxy middleware file middlewares.py to the following:

import random
import time
import scrapy
from scrapy import log


# logger = logging.getLogger()

class ProxyMiddleWare(object):
    """docstring for ProxyMiddleWare"""
    def process_request(self, request, spider):
        '''Attach a proxy to the request object.'''
        proxy = self.get_random_proxy()
        print("this is request ip:" + proxy)
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        '''Handle the returned response.'''
        # If the response status is not 200, re-issue the current request.
        if response.status != 200:
            proxy = self.get_random_proxy()
            print("this is response ip:" + proxy)
            # Attach a new proxy to the current request.
            request.meta['proxy'] = proxy
            return request
        return response

    def get_random_proxy(self):
        '''Read a random proxy from the file.'''
        while 1:
            with open('G:\\Scrapy_work\\myproxies\\myproxies\\proxies.txt', 'r') as f:
                proxies = f.readlines()
            if proxies:
                break
            else:
                time.sleep(1)
        proxy = random.choice(proxies).strip()
        return proxy
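
Returning a Request from process_response halts the middleware chain for that response and reschedules the returned request, so a non-200 page goes back into the queue with a fresh proxy instead of reaching the spider.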

Then update the settings file:

DOWNLOADER_MIDDLEWARES = {
#    'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
     'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':None,
     'myproxies.middlewares.ProxyMiddleWare':125,
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware':None
}

Run the spider:

scrapy crawl proxie


The output:


Example project: download.csdn.net/detail/u011…

III. Using Crawlera (paid)

Crawlera is a downloader middleware service offered by Scrapinghub. It provides a large pool of servers and IPs, and Scrapy can route its requests to the target site through Crawlera.

Crawlera official site: scrapinghub.com/crawlera/
Crawlera documentation: doc.scrapinghub.com/crawlera.ht…

1. Registering on the Crawlera platform

First of all, registration is free, and usage is free too apart from some special paid customizations.

1) Go to https://dash.scrapinghub.com/account/signup/


Fill in a username, password, and email address to register a Crawlera account and activate it.






Create a new project.


Select Scrapy...

2. Deploying to the Scrapy project


1) Install scrapy-crawlera

pip install scrapy-crawlera


2) Modify settings.py


If you configured proxy IPs earlier, comment them out and switch to Crawlera's proxy. The key step is enabling the Crawlera middleware in the configuration file, as shown below:

DOWNLOADER_MIDDLEWARES = {
#    'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
#     'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':None,
#     'myproxies.middlewares.ProxyMiddleWare':125,
#    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware':None
     'scrapy_crawlera.CrawleraMiddleware': 600
}

For Crawlera to take effect, add the API credentials you created (if the API key is filled in, the pass can be an empty string):
CRAWLERA_ENABLED = True
CRAWLERA_USER = '<API key>'
CRAWLERA_PASS = ''

CRAWLERA_USER is the API key you receive after registering for Crawlera.


CRAWLERA_PASS is the Crawlera password; by default it is left blank.

For higher crawling throughput, you can disable the AutoThrottle extension, raise the maximum number of concurrent requests, and set a download timeout:

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 32
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600
If DOWNLOAD_DELAY is set anywhere in your code, also add this to settings.py:
CRAWLERA_PRESERVE_DELAY = True


If your spider keeps cookies, add this to the default request headers:

DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  # 'Accept-Language': 'zh-CN,zh;q=0.8',
  'X-Crawlera-Cookies': 'disable'
}


3. Running the crawler

Once all of this is set up you can run your spider. Every request now goes out through Crawlera; the log looks like this:

G:\Scrapy_work\myproxies>scrapy crawl proxie
2017-04-16 15:49:40 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: myproxies)
2017-04-16 15:49:40 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myproxies', 'NEWSPIDER_MODULE': 'myproxies.spiders', 'SPIDER_MODULES': ['myproxies.spiders']}
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy_crawlera.CrawleraMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-16 15:49:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-16 15:49:40 [scrapy.core.engine] INFO: Spider opened
2017-04-16 15:49:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-16 15:49:40 [root] INFO: Using crawlera at http://proxy.crawlera.com:8010?noconnect (user: f3b8ff0381fc46c7b6834aa85956fc82)
2017-04-16 15:49:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-16 15:49:41 [scrapy.core.engine] DEBUG: Crawled (407) <GET http://www.655680.com/> (referer: None)
2017-04-16 15:49:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <407 http://www.655680.com/>: HTTP status code is not handled or not allowed
2017-04-16 15:49:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-16 15:49:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'crawlera/request': 1,
 'crawlera/request/method/GET': 1,
 'crawlera/response': 1,
 'crawlera/response/error': 1,
 'crawlera/response/error/bad_proxy_auth': 1,
 'crawlera/response/status/407': 1,
 'downloader/request_bytes': 285,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 196,
 'downloader/response_count': 1,
 'downloader/response_status_count/407': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 16, 7, 49, 41, 546403),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 16, 7, 49, 40, 827892)}
2017-04-16 15:49:41 [scrapy.core.engine] INFO: Spider closed (finished)

G:\Scrapy_work\myproxies>

It returns a 407 error... The documentation has nothing on 407. A post found via Google says a 407 from Crawlera is an authentication error: there may be a typo in the API key, or the wrong credentials are being used.
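
One quick way to check whether the key itself is the problem is to push a single request through the Crawlera proxy outside of Scrapy. A rough sketch with requests (my addition; httpbin.org is just a convenient test target, and <API key> must be replaced with your own key):

import requests

# The API key is the proxy username and the password is empty,
# matching CRAWLERA_USER / CRAWLERA_PASS above; a 407 here means bad auth.
proxy = "http://<API key>:@proxy.crawlera.com:8010"
r = requests.get("http://httpbin.org/ip",
                 proxies={"http": proxy}, timeout=30)
print(r.status_code, r.text)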


Some more searching revealed that the project I had created was a Scrapy Cloud project, not a Crawlera one; when I went to create a Crawlera subscription I found that it is paid.


Since it's paid, forget it. Two other approaches mentioned online are Scrapy + GoAgent and Scrapy + Tor (highly anonymous free proxies); I haven't tried either.