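After further experimentation, this is where things stand:
1. Hold off on proxy IPs for now. Passing a rotating IP through the proxies parameter of requests performed worse than using no proxy at all: with a proxy, the crawl got through only a few to a dozen or so companies, or died on the very first one. Without a proxy, requests go out from the local machine's own IP and the crawl gets through several dozen companies.
(I suspect I am using the proxy IP incorrectly. These are the two ways I tried:)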
# Method 1: set the proxy for both the http and https keys
proxyHost_Port = "42.179.174.174:37798"  # the proxy IP obtained beforehand
ip_host = proxyHost_Port.split(":")[0]
ip_port = proxyHost_Port.split(":")[1]
proxyMeta = "http://{}:{}".format(ip_host, ip_port)
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}
resp = requests.get(targetUrl, proxies=proxies, timeout=10)
# Method 2: set the proxy for the https key only
proxyHost_Port = "42.179.174.174:37798"
ip_host = proxyHost_Port.split(":")[0]
ip_port = proxyHost_Port.split(":")[1]
proxyMetas = "https://{}:{}".format(ip_host, ip_port)
ip_proxies = {"https": proxyMetas}
response = requests.get(url, headers=header, proxies=ip_proxies).content.decode()
If anything here is wrong, I would appreciate corrections from more experienced readers.
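One possible cause, offered as an assumption rather than a diagnosis: in requests, the dictionary key ("http" or "https") selects which target URLs go through the proxy, while the scheme inside the proxy URL describes how to talk to the proxy itself. In recent requests/urllib3 versions an "https://" proxy URL means a TLS connection to the proxy, which a plain HTTP proxy usually cannot accept, so method 2 may fail for that reason. The helper below is a minimal sketch of the common convention; build_proxies is a hypothetical name, not part of requests.

```python
def build_proxies(host_port, scheme="http"):
    """Return a requests-style proxies dict for a 'host:port' string.

    The same proxy URL is used for both http and https targets; the
    scheme describes the connection to the proxy, not to the target.
    """
    host, sep, port = host_port.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected 'host:port', got {!r}".format(host_port))
    proxy_url = "{}://{}:{}".format(scheme, host, port)
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("42.179.174.174:37798")
# Hypothetical usage, left commented out because it needs a live proxy:
# resp = requests.get(target_url, proxies=proxies, timeout=10)
```

Whether this changes the results depends on the proxy provider; a proxy that really is served over TLS would need scheme="https".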
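2. Before each run, connect the computer to a phone hotspot or Wi-Fi network on a different subnet. (Restarting the phone and turning the hotspot back on assigns a new IP automatically.) If the program keeps running on the office Wi-Fi or the home network, the IP behind that LAN may get blocked or banned. The goal is for every crawl to go out from a different network IP.
3. Rotate the cookie: use a different cookie for every run. It helps to borrow a few Tianyancha accounts and, each time, log in through the browser and copy the latest Cookie from the Network panel.
4. Increase the interval between requests. At one company every 10 seconds, after 20 to 30 companies the site returns the JS code below (possibly related to a click-the-image verification) or asks for a fresh login. At 30 seconds apart, this happens only after about 100 companies.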
<html><script>
var arg1='5B779F8DF18794593B56C4F1ADC4C8239316DE99';
var _0x4818=['\x63\x73\x4b\x48\x77\x71\x4d\x49','\x5a\x73\x4b\x4a\x77\x72\x38\x56\x65\x41\x73\x79','\x55\x63\x4b\x69\x4e\x38\x4f\x2f\x77\x70\x6c\x77\x4d\x41\x3d\x3d','\x4a\x52\x38\x43\x54\x67\x3d\x3d','\x59\x73\x4f\x6e\x62\x53\x45\x51\x77\x37\x6f\x7a\x77\x71\x5a\x4b\x65\x73\x4b\x55\x77\x37\x6b\x77\x58\x38\x4f\x52\x49\x51\x3d\x3d','\x77\x37\x6f\x56\x53\x38\x4f\x53\x77\x6f\x50\x43\x6c\x33\x6a\x43\x68\x4d\x4b\x68\x77\x36\x48\x44\x6c\x73\x4b\x58\x77\x34\x73\x2f\x59\x73\x4f\x47','\x66\x77\x56\x6d\x49\x31\x41\x74\x77\x70\x6c\x61\x59\x38\x4f\x74\x77\x35\x63\x4e\x66\x53\x67\x70\x77\x36\x4d\x3d','\x4f\x63\x4f\x4e\x77\x72\x6a\x43\x71\x73\x4b\x78\x54\x47\x54\x43\x68\x73\x4f\x6a\x45\x57\x45\x38\x50\x63\x4f\x63\x4a\x38\x4b\x36','\x55\x38\x4b\x35\x4c\x63\x4f\x74\x77\x70\x56\x30\x45\x4d\x4f\x6b\x77\x34\x37\x44\x72\x4d\x4f\x58','\x48\x4d\x4f\x32\x77\x6f\x48\x43\x69\x4d\x4b\x39\x53\x6c\x58\x43\x6c\x63\x4f\x6f\x43\x31\x6b\x3d','\x61\x73\x4b\x49\x77\x71\x4d\x44\x64\x67\x4d\x75\x50\x73\x4f\x4b\x42\x4d\x4b\x63\x77\x72\x72\x43\x74\x6b\x4c\x44\x72\x4d\x4b\x42\x77\x36\x34\x64','\x77\x71\x49\x6d\x4d\x54\x30\x74\x77\x36\x52\x4e\x77\x35\x6b\x3d','\x44\x4d\x4b\x63\x55\x30\x4a\x6d\x55\x77\x55\x76','\x56\x6a\x48\x44\x6c\x4d\x4f\x48\x56\x63\x4f\x4e\x58\x33\x66\x44\x69\x63\x4b\x4a\x48\x51\x3d\x3d','\x77\x71\x68\x42\x48\x38\x4b\x6e\x77\x34\x54\x44\x68\x53\x44\x44\x67\x4d\x4f\x64\x77\x72\x6a\x43\x6e\x63\x4f\x57\x77\x70\x68\x68\x4e\x38\x4b\x43\x47\x63\x4b\x71\x77\x36\x64\x48\x41\x55\x35\x2b\x77\x72\x67\x32\x4a\x63\x4b\x61\x77\x34\x49\x45\x4a\x63\x4f\x63\x77\x72\x52\x4a\x77\x6f\x5a\x30\x77\x71\x46\x39\x59\x67\x41\x56','\x64\x7a\x64\x32\x77\x35\x62\x44\x6d\x33\x6a\x44\x70\x73\x4b\x33\x77\x70\x59\x3d','\x77\x34\x50\x44\x67\x63\x4b\x58\x77\x6f\x33\x43\x6b\x63\x4b\x4c\x77\x72\x35\x71\x77\x72\x59\x3d','\x77\x72\x4a\x4f\x54\x63\x4f\x51\x57\x4d\x4f\x67','\x77\x71\x54\x44\x76\x63\x4f\x6a\x77\x34\x34\x37\x77\x72\x34\x3d','\x77\x35\x58\x44\x71\x73\x4b\x68\x4d\x46\x31\x2f','\x77\x72\x41\x79\x48\x73\x4f\x66\x77\x
70\x70\x63','\x4a\x33\x64\x56\x50\x63\x4f\x78\x4c\x67\x3d\x3d','\x77\x72\x64\x48\x77\x37\x70\x39\x5a\x77\x3d\x3d','\x77\x34\x72\x44\x6f\x38\x4b\x6d\x4e\x45\x77\x3d','\x49\x4d\x4b\x41\x55\x6b\x42\x74','\x77\x36\x62\x44\x72\x63\x4b\x51\x77\x70\x56\x48\x77\x70\x4e\x51\x77\x71\x55\x3d','\x64\x38\x4f\x73\x57\x68\x41\x55\x77\x37\x59\x7a\x77\x72\x55\x3d','\x77\x71\x6e\x43\x6b\x73\x4f\x65\x65\x7a\x72\x44\x68\x77\x3d\x3d','\x55\x73\x4b\x6e\x49\x4d\x4b\x57\x56\x38\x4b\x2f','\x77\x34\x7a\x44\x6f\x63\x4b\x38\x4e\x55\x5a\x76','\x63\x38\x4f\x78\x5a\x68\x41\x4a\x77\x36\x73\x6b\x77\x71\x4a\x6a','\x50\x63\x4b\x49\x77\x34\x6e\x43\x6b\x6b\x56\x62','\x4b\x48\x67\x6f\x64\x4d\x4f\x32\x56\x51\x3d\x3d','\x77\x70\x73\x6d\x77\x71\x76\x44\x6e\x47\x46\x71','\x77\x71\x4c\x44\x74\x38\x4f\x6b\x77\x34\x63\x3d','\x77\x37\x77\x31\x77\x34\x50\x43\x70\x73\x4f\x34\x77\x71\x41\x3d','\x77\x71\x39\x46\x52\x73\x4f\x71\x57\x4d\x4f\x71','\x62\x79\x42\x68\x77\x37\x72\x44\x6d\x33\x34\x3d','\x4c\x48\x67\x2b\x53\x38\x4f\x74\x54\x77\x3d\x3d','\x77\x71\x68\x4f\x77\x37\x31\x35\x64\x73\x4f\x48','\x55\x38\x4f\x37\x56\x73\x4f\x30\x77\x71\x76\x44\x76\x63\x4b\x75\x4b\x73\x4f\x71\x58\x38\x4b\x72','\x59\x69\x74\x74\x77\x35\x44\x44\x6e\x57\x6e\x44\x72\x41\x3d\x3d','\x59\x4d\x4b\x49\x77\x71\x55\x55\x66\x67\x49\x6b','\x61\x42\x37\x44\x6c\x4d\x4f\x44\x54\x51\x3d\x3d','\x77\x70\x66\x44\x68\x38\x4f\x72\x77\x36\x6b\x6b','\x77\x37\x76\x43\x71\x4d\x4f\x72\x59\x38\x4b\x41\x56\x6b\x35\x4f\x77\x70\x6e\x43\x75\x38\x4f\x61\x58\x73\x4b\x5a\x50\x33\x44\x43\x6c\x63\x4b\x79\x77\x36\x48\x44\x72\x51\x3d\x3d','\x77\x6f\x77\x2b\x77\x36\x76\x44\x6d\x48\x70\x73\x77\x37\x52\x74\x77\x6f\x39\x38\x4c\x43\x37\x43\x69\x47\x37\x43\x6b\x73\x4f\x52\x54\x38\x4b\x6c\x57\x38\x4f\x35\x77\x72\x33\x44\x69\x38\x4f\x54\x48\x73\x4f\x44\x65\x48\x6a\x44\x6d\x63\x4b\x6c\x4a\x73\x4b\x71\x56\x41\x3d\x3d','\x4e\x77\x56\x2b','\x77\x37\x48\x44\x72\x63\x4b\x74\x77\x70\x4a\x61\x77\x70\x5a\x62','\x77\x70\x51\x73\x77\x71\x76\x44\x69\x48\x70\x75\x77\x36\x49\x3d','\x59\x4d\x4b
\x55\x77\x71\x4d\x4a\x5a\x51\x3d\x3d','\x4b\x48\x31\x56\x4b\x63\x4f\x71\x4b\x73\x4b\x31','\x66\x51\x35\x73\x46\x55\x6b\x6b\x77\x70\x49\x3d','\x77\x72\x76\x43\x72\x63\x4f\x42\x52\x38\x4b\x6b','\x4d\x33\x77\x30\x66\x51\x3d\x3d','\x77\x36\x78\x58\x77\x71\x50\x44\x76\x4d\x4f\x46\x77\x6f\x35\x64'];(function(_0x4c97f0,_0x1742fd){var _0x4db1c=function(_0x48181e){while(--_0x48181e){_0x4c97f0['\x70\x75\x73\x68'](_0x4c97f0['\x73\x68\x69\x66\x74']());}};var _0x3cd6c6=function(){var _0xb8360b={'\x64\x61\x74\x61':{'\x6b\x65\x79':'\x63\x6f\x6f\x6b\x69\x65','\x76\x61\x6c\x75\x65':'\x74\x69\x6d\x65\x6f\x75\x74'},'\x73\x65\x74\x43\x6f\x6f\x6b\x69\x65':function(_0x20bf34,_0x3e840e,_0x5693d3,_0x5e8b26){_0x5e8b26=_0x5e8b26||{};var _0xba82f0=_0x3e840e+'\x3d'+_0x5693d3;var _0x5afe31=0x0;for(var _0x5afe31=0x0,_0x178627=_0x20bf34['\x6c\x65\x6e\x67\x74\x68'];_0x5afe31<_0x178627;_0x5afe31++){var _0x41b2ff=_0x20bf34[_0x5afe31];_0xba82f0+='\x3b\x20'+_0x41b2ff;var _0xd79219=_0x20bf34[_0x41b2ff];_0x20bf34['\x70\x75\x73\x68'](_0xd79219);_0x178627=_0x20bf34['\x6c\x65\x6e\x67\x74\x68'];if(_0xd79219!==!![]){_0xba82f0+='\x3d'+_0xd79219;}}_0x5e8b26['\x63\x6f\x6f\x6b\x69\x65']=_0xba82f0;},'\x72\x65\x6d\x6f\x76\x65\x43\x6f\x6f\x6b\x69\x65':function(){return'\x64\x65\x76';},'\x67\x65\x74\x43\x6f\x6f\x6b\x69\x65':function(_0x4a11fe,_0x189946){_0x4a11fe=_0x4a11fe||function(_0x6259a2){return _0x6259a2;};var _0x25af93=_0x4a11fe(new RegExp('\x28\x3f\x3a\x5e\x7c\x3b\x20\x29'+_0x189946['\x72\x65\x70\x6c\x61\x63\x65'](/([.$?*|{}()[]\/+^])/g,'\x24\x31')+'\x3d\x28\x5b\x5e\x3b\x5d\x2a\x29'));var _0x52d57c=function(_0x105f59,_0x3fd789){_0x105f59(++_0x3fd789);};_0x52d57c(_0x4db1c,_0x1742fd);return _0x25af93?decodeURIComponent(_0x25af93[0x1]):undefined;}};var _0x4a2aed=function(){var _0x124d17=new RegExp('\x5c\x77\x2b\x20\x2a\x5c\x28\x5c\x29\x20\x2a\x7b\x5c\x77\x2b\x20\x2a\x5b\x27\x7c\x22\x5d\x2e\x2b\x5b\x27\x7c\x22\x5d\x3b\x3f\x20\x2a\x7d');return 
_0x124d17['\x74\x65\x73\x74'](_0xb8360b['\x72\x65\x6d\x6f\x76\x65\x43\x6f\x6f\x6b\x69\x65']['\x74\x6f\x53\x74\x72\x69\x6e\x67']());};_0xb8360b['\x75\x70\x64\x61\x74\x65\x43\x6f\x6f\x6b\x69\x65']=_0x4a2aed;var _0x2d67ec='';var _0x120551=_0xb8360b['\x75\x70\x64\x61\x74\x65\x43\x6f\x6f\x6b\x69\x65']();if(!_0x120551){_0xb8360b['\x73\x65\x74\x43\x6f\x6f\x6b\x69\x65'](['\x2a'],'\x63\x6f\x75\x6e\x74\x65\x72',0x1);}else if(_0x120551){_0x2d67ec=_0xb8360b['\x67\x65\x74\x43\x6f\x6f\x6b\x69\x65'](null,'\x63\x6f\x75\x6e\x74\x65\x72');}else{_0xb8360b['\x72\x65\x6d\x6f\x76\x65\x43\x6f\x6f\x6b\x69\x65']();}};_0x3cd6c6();}(_0x4818,0x15b));var _0x55f3=function(_0x4c97f0,_0x1742fd){var _0x4c97f0=parseInt(_0x4c97f0,0x10);var _0x48181e=_0x4818[_0x4c97f0];if(!_0x55f3['\x61\x74\x6f\x62\x50\x6f\x6c\x79\x66\x69\x6c\x6c\x41\x70\x70\x65\x6e\x64\x65\x64']){(function(){var _0xdf49c6=Function('\x72\x65\x74\x75\x72\x6e\x20\x28\x66\x75\x6e\x63\x74\x69\x6f\x6e\x20\x28\x29\x20'+'\x7b\x7d\x2e\x63\x6f\x6e\x73\x74\x72\x75\x63\x74\x6f\x72\x28\x22\x72\x65\x74\x75\x72\x6e\x20\x74\x68\x69\x73\x22\x29\x28\x29'+'\x29\x3b');var _0xb8360b=_0xdf49c6();var _0x389f44='\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x2b\x2f\x3d';_0xb8360b['\x61\x74\x6f\x62']||(_0xb8360b['\x61\x74\x6f\x62']=function(_0xba82f0){var _0xec6bb4=String(_0xba82f0)['\x72\x65\x70\x6c\x61\x63\x65'](/=+$/,'');for(var _0x1a0f04=0x0,_0x18c94e,_0x41b2ff,_0xd79219=0x0,_0x5792f7='';_0x41b2ff=_0xec6bb4['\x63\x68\x61\x72\x41\x74'](_0xd79219++);~_0x41b2ff&&(_0x18c94e=_0x1a0f04%0x4?_0x18c94e*0x40+_0x41b2ff:_0x41b2ff,_0x1a0f04++%0x4)?_0x5792f7+=String['\x66\x72\x6f\x6d\x43\x68\x61\x72\x43\x6f\x64\x65'](0xff&_0x18c94e>>(-0x2*_0x1a0f04&0x6)):0x0){_0x41b2ff=_0x389f44['\x69\x6e\x64\x65\x78\x4f\x66'](_0x41b2ff);}return 
_0x5792f7;});}());_0x55f3['\x61\x74\x6f\x62\x50\x6f\x6c\x79\x66\x69\x6c\x6c\x41\x70\x70\x65\x6e\x64\x65\x64']=!![];}if(!_0x55f3['\x72\x63\x34']){var _0x232678=function(_0x401af1,_0x532ac0){var _0x45079a=[],_0x52d57c=0x0,_0x105f59,_0x3fd789='',_0x4a2aed='';_0x401af1=atob(_0x401af1);for(var _0x124d17=0x0,_0x1b9115=_0x401af1['\x6c\x65\x6e\x67\x74\x68'];_0x124d17<_0x1b9115;_0x124d17++){_0x4a2aed+='\x25'+('\x30\x30'+_0x401af1['\x63\x68\x61\x72\x43\x6f\x64\x65\x41\x74'](_0x124d17)['\x74\x6f\x53\x74\x72\x69\x6e\x67'](0x10))['\x73\x6c\x69\x63\x65'](-0x2);}_0x401af1=decodeURIComponent(_0x4a2aed);for(var _0x2d67ec=0x0;_0x2d67ec<0x100;_0x2d67ec++){_0x45079a[_0x2d67ec]=_0x2d67ec;}for(_0x2d67ec=0x0;_0x2d67ec<0x100;_0x2d67ec++){_0x52d57c=(_0x52d57c+_0x45079a[_0x2d67ec]+_0x532ac0['\x63\x68\x61\x72\x43\x6f\x64\x65\x41\x74'](_0x2d67ec%_0x532ac0['\x6c\x65\x6e\x67\x74\x68']))%0x100;_0x105f59=_0x45079a[_0x2d67ec];_0x45079a[_0x2d67ec]=_0x45079a[_0x52d57c];_0x45079a[_0x52d57c]=_0x105f59;}_0x2d67ec=0x0;_0x52d57c=0x0;for(var _0x4e5ce2=0x0;_0x4e5ce2<_0x401af1['\x6c\x65\x6e\x67\x74\x68'];_0x4e5ce2++){_0x2d67ec=(_0x2d67ec+0x1)%0x100;_0x52d57c=(_0x52d57c+_0x45079a[_0x2d67ec])%0x100;_0x105f59=_0x45079a[_0x2d67ec];_0x45079a[_0x2d67ec]=_0x45079a[_0x52d57c];_0x45079a[_0x52d57c]=_0x105f59;_0x3fd789+=String['\x66\x72\x6f\x6d\x43\x68\x61\x72\x43\x6f\x64\x65'](_0x401af1['\x63\x68\x61\x72\x43\x6f\x64\x65\x41\x74'](_0x4e5ce2)^_0x45079a[(_0x45079a[_0x2d67ec]+_0x45079a[_0x52d57c])%0x100]);}return _0x3fd789;};_0x55f3['\x72\x63\x34']=_0x232678;}if(!_0x55f3['\x64\x61\x74\x61']){_0x55f3['\x64\x61\x74\x61']={};}if(_0x55f3['\x64\x61\x74\x61'][_0x4c97f0]===undefined){if(!_0x55f3['\x6f\x6e\x63\x65']){var 
_0x5f325c=function(_0x23a392){this['\x72\x63\x34\x42\x79\x74\x65\x73']=_0x23a392;this['\x73\x74\x61\x74\x65\x73']=[0x1,0x0,0x0];this['\x6e\x65\x77\x53\x74\x61\x74\x65']=function(){return'\x6e\x65\x77\x53\x74\x61\x74\x65';};this['\x66\x69\x72\x73\x74\x53\x74\x61\x74\x65']='\x5c\x77\x2b\x20\x2a\x5c\x28\x5c\x29\x20\x2a\x7b\x5c\x77\x2b\x20\x2a';this['\x73\x65\x63\x6f\x6e\x64\x53\x74\x61\x74\x65']='\x5b\x27\x7c\x22\x5d\x2e\x2b\x5b\x27\x7c\x22\x5d\x3b\x3f\x20\x2a\x7d';};_0x5f325c['\x70\x72\x6f\x74\x6f\x74\x79\x70\x65']['\x63\x68\x65\x63\x6b\x53\x74\x61\x74\x65']=function(){var _0x19f809=new RegExp(this['\x66\x69\x72\x73\x74\x53\x74\x61\x74\x65']+this['\x73\x65\x63\x6f\x6e\x64\x53\x74\x61\x74\x65']);return this['\x72\x75\x6e\x53\x74\x61\x74\x65'](_0x19f809['\x74\x65\x73\x74'](this['\x6e\x65\x77\x53\x74\x61\x74\x65']['\x74\x6f\x53\x74\x72\x69\x6e\x67']())?--this['\x73\x74\x61\x74\x65\x73'][0x1]:--this['\x73\x74\x61\x74\x65\x73'][0x0]);};_0x5f325c['\x70\x72\x6f\x74\x6f\x74\x79\x70\x65']['\x72\x75\x6e\x53\x74\x61\x74\x65']=function(_0x4380bd){if(!Boolean(~_0x4380bd)){return _0x4380bd;}return this['\x67\x65\x74\x53\x74\x61\x74\x65'](this['\x72\x63\x34\x42\x79\x74\x65\x73']);};_0x5f325c['\x70\x72\x6f\x74\x6f\x74\x79\x70\x65']['\x67\x65\x74\x53\x74\x61\x74\x65']=function(_0x58d85e){for(var _0x1c9f5b=0x0,_0x1ce9e0=this['\x73\x74\x61\x74\x65\x73']['\x6c\x65\x6e\x67\x74\x68'];_0x1c9f5b<_0x1ce9e0;_0x1c9f5b++){this['\x73\x74\x61\x74\x65\x73']['\x70\x75\x73\x68'](Math['\x72\x6f\x75\x6e\x64'](Math['\x72\x61\x6e\x64\x6f\x6d']()));_0x1ce9e0=this['\x73\x74\x61\x74\x65\x73']['\x6c\x65\x6e\x67\x74\x68'];}return _0x58d85e(this['\x73\x74\x61\x74\x65\x73'][0x0]);};new _0x5f325c(_0x55f3)['\x63\x68\x65\x63\x6b\x53\x74\x61\x74\x65']();_0x55f3['\x6f\x6e\x63\x65']=!![];}_0x48181e=_0x55f3['\x72\x63\x34'](_0x48181e,_0x1742fd);_0x55f3['\x64\x61\x74\x61'][_0x4c97f0]=_0x48181e;}else{_0x48181e=_0x55f3['\x64\x61\x74\x61'][_0x4c97f0];}return _0x48181e;};var arg3=null;var arg4=null;var arg5=null;var 
arg6=null;var arg7=null;var arg8=null;var arg9=null;var arg10=null;var l=function(){while(window[_0x55f3('0x1', '\x58\x4d\x57\x5e')]||window['\x5f\x5f\x70\x68\x61\x6e\x74\x6f\x6d\x61\x73']){};var _0x5e8b26=_0x55f3('0x3', '\x6a\x53\x31\x59');String[_0x55f3('0x5', '\x6e\x5d\x66\x52')][_0x55f3('0x6', '\x50\x67\x35\x34')]=function(_0x4e08d8){var _0x5a5d3b='';for(var _0xe89588=0x0;_0xe89588<this[_0x55f3('0x8', '\x29\x68\x52\x63')]&&_0xe89588<_0x4e08d8[_0x55f3('0xa', '\x6a\x45\x26\x5e')];_0xe89588+=0x2){var _0x401af1=parseInt(this[_0x55f3('0xb', '\x56\x32\x4b\x45')](_0xe89588,_0xe89588+0x2),0x10);var _0x105f59=parseInt(_0x4e08d8[_0x55f3('0xd', '\x58\x4d\x57\x5e')](_0xe89588,_0xe89588+0x2),0x10);var _0x189e2c=(_0x401af1^_0x105f59)[_0x55f3('0xf', '\x57\x31\x46\x45')](0x10);if(_0x189e2c[_0x55f3('0x11', '\x4d\x47\x72\x76')]==0x1){_0x189e2c='\x30'+_0x189e2c;}_0x5a5d3b+=_0x189e2c;}return _0x5a5d3b;};String['\x70\x72\x6f\x74\x6f\x74\x79\x70\x65'][_0x55f3('0x14', '\x5a\x2a\x44\x4d')]=function(){var _0x4b082b=[0xf,0x23,0x1d,0x18,0x21,0x10,0x1,0x26,0xa,0x9,0x13,0x1f,0x28,0x1b,0x16,0x17,0x19,0xd,0x6,0xb,0x27,0x12,0x14,0x8,0xe,0x15,0x20,0x1a,0x2,0x1e,0x7,0x4,0x11,0x5,0x3,0x1c,0x22,0x25,0xc,0x24];var _0x4da0dc=[];var _0x12605e='';for(var _0x20a7bf=0x0;_0x20a7bf<this['\x6c\x65\x6e\x67\x74\x68'];_0x20a7bf++){var _0x385ee3=this[_0x20a7bf];for(var _0x217721=0x0;_0x217721<_0x4b082b[_0x55f3('0x16', '\x61\x48\x2a\x4e')];_0x217721++){if(_0x4b082b[_0x217721]==_0x20a7bf+0x1){_0x4da0dc[_0x217721]=_0x385ee3;}}}_0x12605e=_0x4da0dc['\x6a\x6f\x69\x6e']('');return _0x12605e;};var _0x23a392=arg1[_0x55f3('0x19', '\x50\x67\x35\x34')]();arg2=_0x23a392[_0x55f3('0x1b', '\x7a\x35\x4f\x26')](_0x5e8b26);setTimeout('\x72\x65\x6c\x6f\x61\x64\x28\x61\x72\x67\x32\x29',0x2);};var _0x4db1c=function(){function _0x355d23(_0x450614){if((''+_0x450614/_0x450614)[_0x55f3('0x1c', '\x56\x32\x4b\x45')]!==0x1||_0x450614%0x14===0x0){(function(){}[_0x55f3('0x1d', 
'\x43\x4e\x55\x59')]((undefined+'')[0x2]+(!![]+'')[0x3]+([][_0x55f3('0x1e', '\x77\x38\x50\x52')]()+'')[0x2]+(undefined+'')[0x0]+(![]+[0x0]+String)[0x14]+(![]+[0x0]+String)[0x14]+(!![]+'')[0x3]+(!![]+'')[0x1])());}else{(function(){}['\x63\x6f\x6e\x73\x74\x72\x75\x63\x74\x6f\x72']((undefined+'')[0x2]+(!![]+'')[0x3]+([][_0x55f3('0x1f', '\x4c\x24\x28\x44')]()+'')[0x2]+(undefined+'')[0x0]+(![]+[0x0]+String)[0x14]+(![]+[0x0]+String)[0x14]+(!![]+'')[0x3]+(!![]+'')[0x1])());}_0x355d23(++_0x450614);}try{_0x355d23(0x0);}catch(_0x54c483){}};if(function(){var _0x470d8f=function(){var _0x4c97f0=!![];return function(_0x1742fd,_0x4db1c){var _0x48181e=_0x4c97f0?function(){if(_0x4db1c){var _0x55f3be=_0x4db1c['\x61\x70\x70\x6c\x79'](_0x1742fd,arguments);_0x4db1c=null;return _0x55f3be;}}:function(){};_0x4c97f0=![];return _0x48181e;};}();var _0x501fd7=_0x470d8f(this,function(){var _0x4c97f0=function(){return'\x64\x65\x76';},_0x1742fd=function(){return'\x77\x69\x6e\x64\x6f\x77';};var _0x55f3be=function(){var _0x3ad9a1=new RegExp('\x5c\x77\x2b\x20\x2a\x5c\x28\x5c\x29\x20\x2a\x7b\x5c\x77\x2b\x20\x2a\x5b\x27\x7c\x22\x5d\x2e\x2b\x5b\x27\x7c\x22\x5d\x3b\x3f\x20\x2a\x7d');return!_0x3ad9a1['\x74\x65\x73\x74'](_0x4c97f0['\x74\x6f\x53\x74\x72\x69\x6e\x67']());};var _0x1b93ad=function(){var _0x20bf34=new RegExp('\x28\x5c\x5c\x5b\x78\x7c\x75\x5d\x28\x5c\x77\x29\x7b\x32\x2c\x34\x7d\x29\x2b');return _0x20bf34['\x74\x65\x73\x74'](_0x1742fd['\x74\x6f\x53\x74\x72\x69\x6e\x67']());};var _0x5afe31=function(_0x178627){var _0x1a0f04=~-0x1>>0x1+0xff%0x0;if(_0x178627['\x69\x6e\x64\x65\x78\x4f\x66']('\x69'===_0x1a0f04)){_0xd79219(_0x178627);}};var _0xd79219=function(_0x5792f7){var _0x4e08d8=~-0x4>>0x1+0xff%0x0;if(_0x5792f7['\x69\x6e\x64\x65\x78\x4f\x66']((!![]+'')[0x3])!==_0x4e08d8){_0x5afe31(_0x5792f7);}};if(!_0x55f3be()){if(!_0x1b93ad()){_0x5afe31('\x69\x6e\x64\u0435\x78\x4f\x66');}else{_0x5afe31('\x69\x6e\x64\x65\x78\x4f\x66');}}else{_0x5afe31('\x69\x6e\x64\u0435\x78\x4f\x66');}});_0x501fd7();var 
_0x3a394d=function(){var _0x1ab151=!![];return function(_0x372617,_0x42d229){var _0x3b3503=_0x1ab151?function(){if(_0x42d229){var _0x7086d9=_0x42d229[_0x55f3('0x21', '\x4b\x4e\x29\x46')](_0x372617,arguments);_0x42d229=null;return _0x7086d9;}}:function(){};_0x1ab151=![];return _0x3b3503;};}();var _0x5b6351=_0x3a394d(this,function(){var _0x46cbaa=Function(_0x55f3('0x22', '\x26\x68\x5a\x59')+_0x55f3('0x23', '\x61\x48\x2a\x4e')+'\x29\x3b');var _0x1766ff=function(){};var _0x9b5e29=_0x46cbaa();_0x9b5e29[_0x55f3('0x26', '\x61\x48\x2a\x4e')]['\x6c\x6f\x67']=_0x1766ff;_0x9b5e29[_0x55f3('0x29', '\x56\x25\x59\x52')][_0x55f3('0x2a', '\x50\x5e\x45\x71')]=_0x1766ff;_0x9b5e29[_0x55f3('0x2c', '\x6c\x67\x4d\x30')][_0x55f3('0x2d', '\x4c\x24\x28\x44')]=_0x1766ff;_0x9b5e29[_0x55f3('0x2f', '\x43\x5a\x63\x38')][_0x55f3('0x30', '\x57\x75\x36\x25')]=_0x1766ff;});_0x5b6351();try{return!!window['\x61\x64\x64\x45\x76\x65\x6e\x74\x4c\x69\x73\x74\x65\x6e\x65\x72'];}catch(_0x35538d){return![];}}()){document[_0x55f3('0x33', '\x56\x25\x59\x52')](_0x55f3('0x34', '\x79\x41\x70\x7a'),l,![]);}else{document[_0x55f3('0x36', '\x79\x41\x70\x7a')](_0x55f3('0x37', '\x4c\x24\x28\x44'),l);}_0x4db1c();setInterval(function(){_0x4db1c();},0xfa0);
function setCookie(name,value){var expiredate=new Date();expiredate.setTime(expiredate.getTime()+(3600*1000));document.cookie=name+"="+value+";expires="+expiredate.toGMTString()+";max-age=3600;path=/";}
function reload(x) {setCookie("acw_sc__v2", x);document.location.reload();}
</script></html>
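The script above ends by setting a cookie named acw_sc__v2 and reloading the page, so a crawler can recognize this response by its text and stop, instead of running XPath over it. The sketch below assumes that those two marker strings are a good enough signal; looks_like_challenge and next_delay are hypothetical helper names, and the 30 to 32 second window mirrors the interval used in this post.

```python
import random

# Substrings taken from the response shown above (assumed to be stable).
CHALLENGE_MARKERS = ("acw_sc__v2", "var arg1=")


def looks_like_challenge(html):
    """Return True if the body resembles the cookie-challenge page."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


def next_delay(base_seconds=30.0, jitter_seconds=2.0, rng=random):
    """Pick a wait between base_seconds and base_seconds + jitter_seconds."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


sample = "<html><script>var arg1='5B77';</script></html>"
if looks_like_challenge(sample):
    print("challenge page received; stop and refresh the cookie or network")
```

In a crawl loop, time.sleep(next_delay()) would go between companies, and the check would run on every response before parsing.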
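The source below was run with the computer on a phone hotspot, no proxy IP, a brand-new Cookie, and a crawl interval of 30 to 32 seconds. It got through 99 companies before a similar image-verification response appeared: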
import requests
from lxml import etree
import time
import sys
import random
import os
class TianYan:
    def __init__(self, company_id, fp, cookie):
        self.fp = fp
        self.company_id = company_id
        self.url = "https://www.tianyancha.com/company/{}".format(company_id)
        self.User_agents = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
]
        # Log in to Tianyancha first, open the data page, and copy its Cookie here
        # (the Cookie must belong to a live login session; after logging out and back in from Chrome, refresh it)
        self.cookie = cookie
        self.headers = {
            'User-Agent': random.choice(self.User_agents),
            'Cookie': self.cookie,
            'Referer': "https://www.tianyancha.com/login?from=https%3A%2F%2Fwww.tianyancha.com%2Fsearch%3Fkey%3D%25E9%2583%2591%25E5%25B7%259E%25E6%2583%25A0%25E5%25B7%259E%25E6%25B1%25BD%25E8%25BD%25A6%25E8%25BF%2590%25E8%25BE%2593%25E6%259C%2589%25E9%2599%2590%25E5%2585%25AC%25E5%258F%25B8",
            'Connection': 'keep-alive',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Host': 'www.tianyancha.com',
            'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            'Upgrade-Insecure-Requests': '1'
        }
        try:
            response = self.get_html()
        except Exception as e:
            print("Failed to read the page for company {}; the IP or the login Cookie may be the problem".format(self.company_id))
            # raise Exception()
            sys.exit(0)
        # The string below is text from the site's login page ("quick login and SMS login")
        if "快捷登录与短信登录" in response:
            print("Failed to crawl basic info: login required, company_id:{}".format(self.company_id))
            sys.exit(0)  # ※ terminate the program
        # print(response)  # replace the Cookie ※
        self.response = response
        self.tree_html = etree.HTML(response)
    def get_html(self):
        response = requests.get(self.url, headers=self.headers)
        res = response.content.decode()
        return res
    def get_start_crawl(self):  # basic company information
        tree_html = self.tree_html
        try:
            tr_list = tree_html.xpath('//*[@id="_container_baseInfo"]/table/tbody/tr')
            # company_name = tree_html.xpath("//div[@class='box -company-box ']/div[@class='content']/div[@class='header']/span/span/h1/text()")[0]  # company name
            company_name = tree_html.xpath("//div[@class='container company-header-block ']/div[3]/div[@class='content']/div[@class='header']/span/span/h1/text()")[0]  # company name ※ locator issue
            people_name = tr_list[0].xpath("td[2]//div[@class='humancompany']/div[@class='name']/a/text()")[0]  # legal representative
            company_status = tr_list[0].xpath("td[4]/text()")[0]  # operating status
            company_start_date = tr_list[1].xpath("td[2]/text()")[0]  # date of establishment
            company_zhuce = tr_list[2].xpath("td[2]/div/text()")[0]  # registered capital
            company_shijiao = tr_list[3].xpath("td[2]/text()")[0]  # paid-in capital
            gongshanghao = tr_list[3].xpath("td[4]/text()")[0]  # business registration number
            xinyong_code = tr_list[4].xpath("td[2]/span/span/text()")[0]  # unified social credit code
            nashuirenshibiehao = tr_list[4].xpath("td[4]/span/span/text()")[0]  # taxpayer identification number
            zhuzhijigou_code = tr_list[4].xpath("td[6]/span/span/text()")[0]  # organization code
            yingyeqixian = tr_list[5].xpath('td[2]/span/text()')[0].replace(' ', '')  # business term
            people_zizi = tr_list[5].xpath('td[4]/text()')[0]  # taxpayer qualification
            check_date = tr_list[5].xpath('td[6]/text()')[0]  # approval date
            leixing = tr_list[6].xpath('td[2]/text()')[0]  # company type
            hangye = tr_list[6].xpath('td[4]/text()')[0]  # industry
            people_number = tr_list[6].xpath('td[6]/text()')[0]  # staff size
            canbaorenshu = tr_list[7].xpath('td[2]/text()')[0]  # number of insured employees
            dengjijiguan = tr_list[7].xpath('td[4]/text()')[0]  # registration authority
            old_name = tr_list[8].xpath("td[2]//span[@class='copy-info-box']/span/text()")[0]  # former name
            dizhi = tr_list[9].xpath('td[2]/span/span/span/text()')[0]  # registered address
            fanwei = tr_list[10].xpath('td[2]/span/text()')[0]  # business scope
            # Fields are written as "label:value" pairs separated by \x01
            head_content = "法定代表人:{}\x01公司名:{}\x01经营状态:{}\x01成立日期:{}\x01注册资本:{}\x01实缴资本:{}\x01工商注册号:{}\x01统一信用代码:{}\x01纳税人识别号:{}\x01组织机构代码:{}" \
                           "\x01营业期限:{}\x01纳税人资质:{}\x01核准日期:{}\x01企业类型:{}\x01行业:{}\x01人员规模:{}\x01参保人数:{}\x01登记机关:{}\x01曾用名:{}\x01" \
                           "注册地址:{}\x01经营范围:{}".format(people_name, company_name, company_status, company_start_date, company_zhuce, company_shijiao,
                                                       gongshanghao, xinyong_code, nashuirenshibiehao, zhuzhijigou_code, yingyeqixian, people_zizi, check_date,
                                                       leixing, hangye, people_number, canbaorenshu, dengjijiguan, old_name, dizhi, fanwei)
            print(head_content, file=self.fp)
        except Exception as e:
            print(self.response)
            print("Failed to extract the header basic info for company {}".format(self.company_id))
            # a = 1/0
            raise Exception()  # raise manually, same effect as a = 1/0
    def kaiting(self):
        # Court hearing announcements
        print("开庭公告", file=self.fp)
        kt_ult = []
        tree_html = self.tree_html
        try:
            kt_list = tree_html.xpath('//*[@id="_container_announcementcourt"]/table/tbody/tr')
            if kt_list != '' and len(kt_list) > 0:
                for tr in kt_list:
                    tds = tr.xpath("td")
                    court_order = tds[0].xpath("text()")[0]  # row number
                    court_date = tds[1].xpath("text()")[0]  # hearing date
                    court_num = tds[2].xpath("span/text()")[0]  # case number
                    court_reason = tds[3].xpath("span/text()")[0]  # cause of action
                    court_sta = tds[4].xpath("div")
                    court_sta_list = []
                    for i in court_sta:
                        court_sta_list.append(i.xpath("string(.)"))
                    court_status = " ".join(court_sta_list)  # party role in the case
                    court_law = tds[5].xpath("span/text()")[0]  # trial court
                    kt_ult.append("序号:{}\x01开庭日期:{}\x01案号:{}\x01案由:{}\x01案件身份:{}\x01审理法院:{}".format(court_order, court_date, court_num, court_reason, court_status, court_law))
            # else:
            #     print("Company {}: court hearing announcements: empty result".format(self.company_id))
        except Exception as e:
            print("Company {}: could not parse a court hearing announcement entry".format(self.company_id), e)
            raise Exception("")
        for one_ult in kt_ult:
            print(one_ult, file=self.fp)
    def lawsuitwhm(self):
        # Only the company ID needs to be substituted into the URL
        company_id = self.company_id
        print("法律诉讼", file=self.fp)
        # The original note says 10 pages of lawsuits are enough; range(1, 2) as written fetches page 1 only
        for pg_num in range(1, 2):
            ss_ult = []
            # The lawsuit endpoint also needs the Cookie from a logged-in data page
            ss_url = 'https://www.tianyancha.com/pagination/lawsuit.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            ss_headers = {
                'User-Agent': random.choice(self.User_agents),
                'Cookie': self.cookie,
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            # ss_page_status = requests.get(url=ss_url, headers=ss_headers).status_code
            # print(ss_page_status)
            response = requests.get(url=ss_url, headers=ss_headers, allow_redirects=False).content.decode()
            # The string below is the site's "no results found" message
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:
                break
            ss_tree = etree.HTML(response)
            ss_list = ss_tree.xpath('//tbody/tr')
            if len(ss_list) != 0:
                for tr in ss_list:
                    try:
                        tds = tr.xpath("td")
                        lawsuit_order = tds[0].xpath("text()")[0]  # row number
                        lawsuit_name = tds[1].xpath("text()")[0]  # case name
                        lawsuit_reason = tds[2].xpath("span/text()")[0]  # cause of action
                        lawsuit_sta = tds[3].xpath("div/div/div/span")  # role in this case (nodes)
                        lawsuit_sta_list = []
                        for i in lawsuit_sta:
                            lawsuit_sta_list.append(i.xpath("string(.)"))
                        lawsuit_status = "".join(lawsuit_sta_list)  # role in this case (text)
                        lawsuit_result = tds[4].xpath("div/div/text()")[0]  # judgment result
                        lawsuit_result = lawsuit_result.replace('\n', '').replace(' ', '').replace('\r', '')
                        lawsuit_money = tds[5].xpath("span/text()")[0]  # amount in dispute
                        ss_ult.append("序号:{}\x01案件名称:{}\x01案由:{}\x01在本案中身份:{}\x01裁判结果:{}\x01案件金额:{}".format(lawsuit_order, lawsuit_name, lawsuit_reason, lawsuit_status, lawsuit_result, lawsuit_money))
                    except Exception as e:
                        print("Company {}: could not parse a lawsuit entry, page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            for one_ult in ss_ult:
                print(one_ult, file=self.fp)
    def fayuangonggao(self):
        # Court announcements
        print("法院公告", file=self.fp)
        gonggao_ult = []
        gonggao_tree = self.tree_html
        try:
            gonggao_list = gonggao_tree.xpath('//*[@id="_container_court"]/div/table/tbody/tr')
            if len(gonggao_list) != 0:
                for tr in gonggao_list:
                    tds = tr.xpath("td")
                    gg_order = tds[0].xpath("text()")[0]  # row number
                    gg_date = tds[1].xpath("text()")[0]  # publication date
                    gg_num = tds[2].xpath("text()")[0]  # case number
                    gg_reason = tds[3].xpath("text()")[0]  # cause of action
                    e = tds[4].xpath("div")
                    estr = []
                    for i in e:
                        estr.append(i.xpath("string(.)"))
                    gg_status = "\x01".join(estr)  # party role in the case
                    gg_type = tds[5].xpath("text()")[0]  # announcement type
                    gg_law = tds[6].xpath("text()")[0]  # court
                    gonggao_ult.append("序号:{}\x01刊登日期:{}\x01案号:{}\x01案由:{}\x01案件身份:{}\x01公告类型:{}\x01法院:{}".format(
                        gg_order, gg_date, gg_num, gg_reason, gg_status, gg_type, gg_law))
            # else:
            #     print("Company {}: court announcements: empty result".format(self.company_id))
        except Exception as e:
            print("Company {}: could not parse a court announcement entry".format(self.company_id), e)
            raise Exception("")
        for one_ult in gonggao_ult:
            print(one_ult, file=self.fp)
    def beizhixing(self):
        # Judgment debtors (parties subject to enforcement)
        print("被执行人", file=self.fp)
        company_id = self.company_id
        for pg_num in range(1, 2):
            zhixingren_ult = []
            url = 'https://www.tianyancha.com/pagination/zhixing.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            headers = {
                'User-Agent': random.choice(self.User_agents),
                # The Cookie below must come from the data page opened on this machine
                'Cookie': self.cookie,
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            response = requests.get(url=url, headers=headers, allow_redirects=False).content.decode()
            # The string below is the site's "no results found" message
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:
                break
            # print(response)
            zhixingren_tree = etree.HTML(response)
            try:
                zhixingren_list = zhixingren_tree.xpath('//tbody/tr')
            except Exception as e:
                break
            if len(zhixingren_list) != 0:
                for tr in zhixingren_list:
                    try:
                        tds = tr.xpath("td")
                        zhixing_order = tds[0].xpath("text()")[0]  # row number
                        zhixing_date = tds[1].xpath("text()")[0]  # filing date
                        zhixing_num = tds[2].xpath("text()")[0]  # case number
                        zhixing_money = tds[3].xpath("text()")[0]  # enforcement amount
                        zhixing_lawer = tds[4].xpath("text()")[0]  # enforcing court
                        zhixingren_ult.append("序号:{}\x01立案日期:{}\x01案号:{}\x01执行标的:{}\x01执行法院:{}".format(zhixing_order, zhixing_date, zhixing_num, zhixing_money, zhixing_lawer))
                    except Exception as e:
                        print("Company {}: could not parse a judgment debtor entry, page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            for elm in zhixingren_ult:
                print(elm, file=self.fp)
    def lian_message(self):
        # Case filing information
        print("立案信息", file=self.fp)
        lian_ult = []
        lian_tree = self.tree_html
        try:
            lian_list = lian_tree.xpath('//*[@id="_container_courtRegister"]/table/tbody/tr')
            if len(lian_list) != 0:
                for tr in lian_list:
                    tds = tr.xpath("td")
                    register_order = tds[0].xpath("text()")[0]  # row number
                    register_date = tds[1].xpath("text()")[0]  # filing date
                    register_num = tds[2].xpath("text()")[0]  # case number
                    register_sta = tds[3].xpath("div")
                    register_status = []
                    for i in register_sta:
                        register_status.append(i.xpath("string(.)"))
                    register_status = "\x01".join(register_status)  # party role in the case
                    register_law = tds[4].xpath("text()")[0]  # court
                    lian_ult.append("序号:{}\x01立案日期:{}\x01案号:{}\x01案件身份:{}\x01法院:{}".format(register_order, register_date, register_num, register_status, register_law))
            # else:
            #     print("Company {}: case filing information: empty result".format(self.company_id))
        except Exception as e:
            print("Company {}: could not parse a case filing entry".format(self.company_id), e)
            raise Exception("")
        for one_ult in lian_ult:
            print(one_ult, file=self.fp)
    def xingzheng(self):
        # Administrative penalties
        print("行政处罚", file=self.fp)
        company_id = self.company_id
        for pg_num in range(1, 2):
            url = 'https://www.tianyancha.com/pagination/mergePunishCount.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            headers = {
                'User-Agent': random.choice(self.User_agents),
                # The Cookie below must come from the data page opened on this machine
                'Cookie': self.cookie,
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            response = requests.get(url=url, headers=headers, allow_redirects=False).content.decode()
            # The string below is the site's "no results found" message
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:
                break
            # print(response)
            xingzheng_ult = []
            xingzheng_tree = etree.HTML(response)
            try:
                xingzheng_list = xingzheng_tree.xpath('//tbody/tr')
            except Exception as e:
                break
            if len(xingzheng_list) != 0:
                for tr in xingzheng_list:
                    try:
                        tds = tr.xpath("td")
                        penalty_order = tds[0].xpath("text()")[0]  # row number
                        penalty_date = tds[1].xpath("text()")[0]  # penalty date
                        penalty_books = tds[2].xpath("div/text()")[0]  # decision document number
                        penalty_reason = tds[3].xpath("div/div/text()")[0]  # reason for the penalty
                        penalty_result = tds[4].xpath("div/div/text()")[0]  # penalty result
                        penalty_unit = tds[5].xpath("text()")[0]  # penalizing authority
                        penalty_source = tds[6].xpath("span/text()")[0]  # data source
                        # print(penalty_date,penalty_books,penalty_reason,penalty_result,penalty_unit,penalty_source)
                        xingzheng_ult.append(
                            "序号:{}\x01处罚日期:{}\x01决定文书号:{}\x01处罚事由:{}\x01处罚结果:{}\x01处罚单位:{}\x01数据来源:{}".format(
                                penalty_order, penalty_date, penalty_books, penalty_reason, penalty_result, penalty_unit, penalty_source))
                    except Exception as e:
                        print("Company {}: could not parse an administrative penalty entry, page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            for elm in xingzheng_ult:
                print(elm, file=self.fp)
    def body_run(self):
        self.get_start_crawl()  # basic info
        self.kaiting()  # court hearing announcements
        self.lawsuitwhm()  # lawsuits
        self.fayuangonggao()  # court announcements
        self.beizhixing()  # judgment debtors
        self.lian_message()  # case filing info
        self.xingzheng()  # administrative penalties
if __name__ == '__main__':
    start_time = time.time()  # start time
    # a few sample company IDs
    # company_list = ["500674557", "844565574", "2319114574", "2317302446", "789235759", "2964355333"]
    # company_list = ["500674557"]
    # Several hundred company IDs are kept in a txt file; see the link at the end of the article to get it
    with open("zSD_company_id.txt","r",encoding="utf-8") as f:
        content = f.readlines()
    var_list = [car_id.strip() for car_id in content]
    company_list = list(set(var_list))
    company_list.sort(key=var_list.index)  # de-duplicate while keeping the original relative order
    print(len(company_list),company_list)
    company_index = 0
    # Cookie value taken after logging in
    cookie = "TYCID=970a0360595911ecb844678fcc48c6b4; ssuid=7782420900; _ga=GA1.2.205529974.1639100190; creditGuide=1; _bl_uid=51kg7wILzwmt33jnnh8b8gq4g88m; aliyungf_tc=975133a0586b9d8247910635fab0120f2dfb47303db68899981b96efb22ecb7a; csrfToken=GPfdbeBsdJFMKJ9PS4OZyF-k; bannerFlag=true; jsid=https%3A%2F%2Fwww.tianyancha.com%2F%3Fjsid%3DSEM-BAIDU-PZ-SY-2021112-JRGW; relatedHumanSearchGraphId=529120041; relatedHumanSearchGraphId.sig=1e6-htbuJ3oK-8m0maPt0n-jiP2_2MK-xlrwLZW5Oy0; bdHomeCount=6; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1639638751,1639638959,1639642785,1639645038; searchSessionId=1639645046.08106995; _gid=GA1.2.866996138.1639969477; CT_TYCID=29563fae7c84453f94a0e5432c88d1a0; RTYCID=b241d86d56414be7a262217b036e1309; acw_tc=707c9f6716400766862111551e648ff6e9a0a6ee71e2eac77d0c89b29673e1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2215910791130%22%2C%22first_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%7D; tyc-user-info={%22state%22:%220%22%2C%22vipManager%22:%220%22%2C%22mobile%22:%2215910791130%22}; tyc-user-info-save-time=1640076960601; auth_token=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNTkxMDc5MTEzMCIsImlhdCI6MTY0MDA3Njk2MCwiZXhwIjoxNjcxNjEyOTYwfQ.YjdrON1ur2aB3mjylOB-9e6s0WB7yEWx8d9v7ZAE0FEnyG_wP2zY5QMed5d671vQSxg7N_90cgg73Fpgdcs91A; tyc-user-phone=%255B%252215910791130%2522%252C%2522152%25205245%25202157%2522%252C%2522130%25201112%25204493%2522%252C%2522185%25201022%25206873%2522%255D; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1640076972; cloud_token=2b297584cb1b455fb982c5837f285c44; cloud_utm=c74404e80e0244a6b579f19c6914b772"
    for company_id in company_list:
        company_index += 1  # counter
        path = "TY_{}.txt".format(company_id)
        if os.path.isfile(path) and os.path.getsize(path) == 0:  # an empty leftover file: delete it
            os.remove(path)
        if os.path.isfile(path):  # this company has already been crawled
            print("TY_{}.txt pass".format(company_id))
        else:
            start2_time = time.time()
            try:
                # "w" creates the file (or truncates an existing one); the with block closes it
                with open(path, "w", encoding='utf-8') as fp:
                    ty = TianYan(company_id,fp,cookie)
                    ty.body_run()
                print("Company #{} - {} - successful! it cost time:{}".format(company_index,company_id, time.time() - start2_time))
            except Exception as e:  # this company failed: delete its unfinished txt
                print("Company #{} - {} - failed to read part of the info, time:{}".format(company_index,company_id,time.time()-start_time))
                if os.path.isfile(path):
                    os.remove(path)
            time.sleep(30+random.random()*2)  # wait 30 to 32 seconds between companies
    print("{}-files cost time:{}".format(len(company_list),time.time()-start_time))
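The de-duplication step in the main block (`list(set(...))` followed by `sort(key=var_list.index)`) works, but `var_list.index` rescans the list for every element. As a minimal sketch, assuming Python 3.7 or later where `dict` keeps insertion order, the same result can be had in a single pass. `dedupe_keep_order` is a hypothetical helper name, not part of the original script:

```python
def dedupe_keep_order(items):
    """Drop repeated IDs, keeping the first occurrence of each (hypothetical helper)."""
    # dict keys are unique and, since Python 3.7, remember insertion order
    return list(dict.fromkeys(items))


ids = ["500674557", "844565574", "500674557", "2319114574", "844565574"]
unique_ids = dedupe_keep_order(ids)
print(unique_ids)
```

For a few hundred IDs either approach is fast enough; the one-pass version mainly reads more directly.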
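I also tested another setup:
two computers;
both connected to the same phone hotspot;
the same code on each, but with the Cookie from a different logged-in account on each machine;
a crawl interval of about 30 seconds, as before;
result: each computer crawled a little over 140 companies and then stopped working.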
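The pacing in these runs comes from `time.sleep(30 + random.random() * 2)` in the main loop. As a rough sketch of how that delay could be factored out, the helper below reproduces the same 30 to 32 second wait. The `consecutive_failures` back-off and the `cap` are assumptions added for illustration; they were not part of the tests described here, and `next_delay` is a hypothetical name:

```python
import random


def next_delay(base=30.0, jitter=2.0, consecutive_failures=0, cap=300.0):
    """Seconds to wait before the next company (hypothetical helper)."""
    # Same idea as time.sleep(30 + random.random() * 2) in the script
    delay = base + random.random() * jitter
    # Assumption: double the wait for every consecutive failure, up to `cap`
    delay *= 2 ** consecutive_failures
    return min(delay, cap)


# In the crawl loop this would be used as: time.sleep(next_delay(consecutive_failures=n))
print(next_delay())
print(next_delay(consecutive_failures=2))
```

Whether backing off actually delays the verification page is untested; it only keeps the script from hammering the site while requests are already failing.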
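I can't quite work out how Tianyancha's anti-crawling mechanism operates: is it keyed on the IP? On Cookie validation? On request frequency? Oh, and here are the test conditions I have tried so far, for everyone's reference: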
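Sample 1:
1. A freshly opened phone hotspot (a new network)
2. The machine's default local IP (no proxy)
3. The login Cookie of someone else's account
4. Interval of 10 to 12 seconds
Result: after 85 companies, reading the header basic info failed, most likely because of the image verification

Sample 2:
1. Another newly opened phone hotspot
2. The machine's default local IP (no proxy)
3. A new Cookie from yet another account
4. Interval of about 30 seconds
Result: failed after 99 companies; image verification required

Sample n:
1. Two computers on the same phone hotspot
2. The machine's default local IP (no proxy)
3. This computer used one person's Cookie, the other computer used another person's Cookie
4. Interval of about 30 seconds
Result: the two computers hit the header basic-info extraction failure after 120-odd and 141 companies respectively, very likely the image verification. (Around the 30th company the header extraction had already failed once; the program did not stop, it kept going with the other companies, and after four or five failed companies it started working again.)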
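The samples above can only guess at "image verification" because the script treats every failure the same way. As a sketch, a small classifier could tell the cases apart before parsing. It assumes the anti-crawl page is the obfuscated `<script>` beginning with `var arg1=` shown earlier in this article, and that a normal company page contains the `_container_baseInfo` block the script already queries; `classify_response` is a hypothetical helper, not part of the original code:

```python
def classify_response(html):
    """Rough guess at what kind of page came back (hypothetical helper)."""
    # Same marker string the script checks in TianYan.__init__
    if "快捷登录与短信登录" in html:
        return "login_required"
    # Assumption: the anti-crawl page is the obfuscated script starting with "var arg1="
    if "<script>" in html and "var arg1=" in html:
        return "js_challenge"
    # The id used by the basic-info XPath in get_start_crawl
    if "_container_baseInfo" in html:
        return "ok"
    return "unknown"


sample = "<html><script>\nvar arg1='5B779F8DF18794593B56C4F1ADC4C8239316DE99';</script></html>"
print(classify_response(sample))
```

Logging this label next to the company index would make it easier to compare runs such as Sample 1 and Sample 2 on what actually came back.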
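Even so, for large-scale crawling, never mind swapping the login Cookie, having to manually switch the network the computer is connected to every time is a real hassle. That brought me back to proxy IPs: in principle a dynamic proxy IP should do the same job as connecting the computer to a different network, so maybe my proxy IP format is wrong, or the parameters I pass to requests are. That is where things stand for now.
So I ran the same code with and without a proxy IP to compare the results. Source code attached: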
import requests
from lxml import etree
import time
import sys
import random
import os
class TianYan:
    def __init__(self, company_id, fp, cookie,new_ip):
        self.fp = fp
        self.company_id = company_id
        self.url = "https://www.tianyancha.com/company/{}".format(company_id)
        self.User_agents = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
]
        # Log in to Tianyancha first, open the data page, and copy its Cookie here (the Cookie must stay logged in; if you log out and back in within Chrome, update it)
        self.cookie = cookie
        self.headers = {
            'User-Agent': random.choice(self.User_agents),
            'Cookie': self.cookie,
            'Referer': "https://www.tianyancha.com/login?from=https%3A%2F%2Fwww.tianyancha.com%2Fsearch%3Fkey%3D%25E9%2583%2591%25E5%25B7%259E%25E6%2583%25A0%25E5%25B7%259E%25E6%25B1%25BD%25E8%25BD%25A6%25E8%25BF%2590%25E8%25BE%2593%25E6%259C%2589%25E9%2599%2590%25E5%2585%25AC%25E5%258F%25B8",
            'Connection': 'keep-alive',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Host': 'www.tianyancha.com',
            'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            'Upgrade-Insecure-Requests': '1'
        }
        if new_ip:
            ip_host = new_ip.split(":")[0]
            ip_port = new_ip.split(":")[1]
            proxyMetas = "https://{}:{}".format(ip_host, ip_port)
            self.ip_proxies = {"https":proxyMetas}
        else:
            self.ip_proxies=None
        try:
            response = self.get_html()
        except Exception as e:
            print("Company {}: failed to fetch the page; the IP or the login Cookie may be the problem".format(self.company_id))
            # raise Exception()
            sys.exit(0)
        if "快捷登录与短信登录" in response:  # marker text of the site's login page
            print("Failed to crawl basic info - login required, company_id:{}".format(self.company_id))
            sys.exit(0)  # ※ terminate the program
        # print(response)  # replace the Cookie ※
        self.response = response
        self.tree_html = etree.HTML(response)
    def get_html(self):
        if self.ip_proxies:
            response = requests.get(self.url, headers=self.headers,proxies=self.ip_proxies)  # whm
        else:
            response = requests.get(self.url, headers=self.headers)  # whm
        res = response.content.decode()
        return res
    def get_start_crawl(self):  # basic info
        tree_html = self.tree_html
        try:
            tr_list = tree_html.xpath('//*[@id="_container_baseInfo"]/table/tbody/tr')
            # company_name = tree_html.xpath("//div[@class='box -company-box ']/div[@class='content']/div[@class='header']/span/span/h1/text()")[0]  # company name
            company_name = tree_html.xpath(
                "//div[@class='container company-header-block ']/div[3]/div[@class='content']/div[@class='header']/span/span/h1/text()")[0]  # company name (※ locating issue)
            people_name = tr_list[0].xpath("td[2]//div[@class='humancompany']/div[@class='name']/a/text()")[0]  # legal representative
            company_status = tr_list[0].xpath("td[4]/text()")[0]  # operating status
            company_start_date = tr_list[1].xpath("td[2]/text()")[0]  # date of establishment
            company_zhuce = tr_list[2].xpath("td[2]/div/text()")[0]  # registered capital
            company_shijiao = tr_list[3].xpath("td[2]/text()")[0]  # paid-in capital
            gongshanghao = tr_list[3].xpath("td[4]/text()")[0]  # business registration number
            xinyong_code = tr_list[4].xpath("td[2]/span/span/text()")[0]  # unified social credit code
            nashuirenshibiehao = tr_list[4].xpath("td[4]/span/span/text()")[0]  # taxpayer identification number
            zhuzhijigou_code = tr_list[4].xpath("td[6]/span/span/text()")[0]  # organization code
            yingyeqixian = tr_list[5].xpath('td[2]/span/text()')[0].replace(' ', '')  # business term
            people_zizi = tr_list[5].xpath('td[4]/text()')[0]  # taxpayer qualification
            check_date = tr_list[5].xpath('td[6]/text()')[0]  # approval date
            leixing = tr_list[6].xpath('td[2]/text()')[0]  # company type
            hangye = tr_list[6].xpath('td[4]/text()')[0]  # industry
            people_number = tr_list[6].xpath('td[6]/text()')[0]  # staff size
            canbaorenshu = tr_list[7].xpath('td[2]/text()')[0]  # number of insured employees
            dengjijiguan = tr_list[7].xpath('td[4]/text()')[0]  # registration authority
            old_name = tr_list[8].xpath("td[2]//span[@class='copy-info-box']/span/text()")[0]  # former name
            dizhi = tr_list[9].xpath('td[2]/span/span/span/text()')[0]  # registered address
            fanwei = tr_list[10].xpath('td[2]/span/text()')[0]  # business scope
            head_content = "法定代表人:{}\x01公司名:{}\x01经营状态:{}\x01成立日期:{}\x01注册资本:{}\x01实缴资本:{}\x01工商注册号:{}\x01统一信用代码:{}\x01纳税人识别号:{}\x01组织机构代码:{}" \
                           "\x01营业期限:{}\x01纳税人资质:{}\x01核准日期:{}\x01企业类型:{}\x01行业:{}\x01人员规模:{}\x01参保人数:{}\x01登记机关:{}\x01曾用名:{}\x01" \
                           "注册地址:{}\x01经营范围:{}".format(
                people_name, company_name, company_status, company_start_date,
                company_zhuce, company_shijiao,
                gongshanghao, xinyong_code, nashuirenshibiehao, zhuzhijigou_code,
                yingyeqixian, people_zizi, check_date,
                leixing, hangye, people_number, canbaorenshu, dengjijiguan,
                old_name, dizhi, fanwei)
            print(head_content, file=self.fp)
        except Exception as e:
            print(self.response)
            print("Company {}: failed to extract the header basic info".format(self.company_id))
            # a = 1/0
            raise Exception()  # raise manually; same effect as a = 1/0
    def kaiting(self):
        # court hearing announcements
        print("开庭公告", file=self.fp)
        kt_ult = []
        tree_html = self.tree_html
        try:
            kt_list = tree_html.xpath('//*[@id="_container_announcementcourt"]/table/tbody/tr')
            if kt_list != '' and len(kt_list) > 0:
                for tr in kt_list:
                    tds = tr.xpath("td")
                    court_order = tds[0].xpath("text()")[0]  # index
                    court_date = tds[1].xpath("text()")[0]  # hearing date
                    court_num = tds[2].xpath("span/text()")[0]  # case number
                    court_reason = tds[3].xpath("span/text()")[0]  # cause of action
                    court_sta = tds[4].xpath("div")
                    court_sta_list = []
                    for i in court_sta:
                        court_sta_list.append(i.xpath("string(.)"))
                    court_status = " ".join(court_sta_list)  # role in the case
                    court_law = tds[5].xpath("span/text()")[0]  # trial court
                    kt_ult.append(
                        "序号:{}\x01开庭日期:{}\x01案号:{}\x01案由:{}\x01案件身份:{}\x01审理法院:{}".format(
                            court_order, court_date, court_num, court_reason, court_status, court_law))
            # else:
            #     print("Company {} - court hearing announcements - no results".format(self.company_id))
        except Exception as e:
            print("Company {} - court hearing announcements - record could not be parsed".format(self.company_id), e)
            raise Exception("")
        for one_ult in kt_ult:
            print(one_ult, file=self.fp)
    def lawsuitwhm(self):
        # just substitute the company ID into the URL
        company_id = self.company_id
        print("法律诉讼", file=self.fp)  # lawsuits (section header in the output file)
        for pg_num in range(1, 2):  # only page 1 is fetched here; widen the range, e.g. range(1, 11), to crawl 10 pages
            ss_ult = []
            # the lawsuits request also needs the Cookie from the logged-in data page
            ss_url = 'https://www.tianyancha.com/pagination/lawsuit.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            ss_headers = {
                'User-Agent': random.choice(self.User_agents),
                'Cookie': self.cookie,
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            # ss_page_status = requests.get(url=ss_url, headers=ss_headers).status_code
            # print(ss_page_status)
            if self.ip_proxies:
                response = requests.get(url=ss_url, headers=ss_headers, allow_redirects=False,proxies=self.ip_proxies).content.decode()
            else:
                response = requests.get(url=ss_url, headers=ss_headers, allow_redirects=False).content.decode()
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:  # the site's "no results" message
                break
            ss_tree = etree.HTML(response)
            ss_list = ss_tree.xpath('//tbody/tr')
            if len(ss_list) != 0:
                for tr in ss_list:
                    try:
                        tds = tr.xpath("td")
                        lawsuit_order = tds[0].xpath("text()")[0]  # index
                        lawsuit_name = tds[1].xpath("text()")[0]  # case name
                        lawsuit_reason = tds[2].xpath("span/text()")[0]  # cause of action
                        lawsuit_sta = tds[3].xpath("div/div/div/span")  # role in this case
                        lawsuit_sta_list = []
                        for i in lawsuit_sta:
                            lawsuit_sta_list.append(i.xpath("string(.)"))
                        lawsuit_status = "".join(lawsuit_sta_list)  # role in this case
                        lawsuit_result = tds[4].xpath("div/div/text()")[0]  # judgment result
                        lawsuit_result = lawsuit_result.replace('\n', '').replace(' ', '').replace('\r', '')
                        lawsuit_money = tds[5].xpath("span/text()")[0]  # amount involved
                        ss_ult.append(
                            "序号:{}\x01案件名称:{}\x01案由:{}\x01在本案中身份:{}\x01裁判结果:{}\x01案件金额:{}".format(
                                lawsuit_order, lawsuit_name, lawsuit_reason,
                                lawsuit_status, lawsuit_result, lawsuit_money))
                    except Exception as e:
                        print("Company {} - lawsuits - record could not be parsed. Page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            for one_ult in ss_ult:
                print(one_ult, file=self.fp)
    def fayuangonggao(self):
        # court announcements
        print("法院公告", file=self.fp)
        gonggao_ult = []
        gonggao_tree = self.tree_html
        try:
            gonggao_list = gonggao_tree.xpath('//*[@id="_container_court"]/div/table/tbody/tr')
            if len(gonggao_list) != 0:
                for tr in gonggao_list:
                    tds = tr.xpath("td")
                    gg_order = tds[0].xpath("text()")[0]  # index
                    gg_date = tds[1].xpath("text()")[0]  # publication date
                    gg_num = tds[2].xpath("text()")[0]  # case number
                    gg_reason = tds[3].xpath("text()")[0]  # cause of action
                    e = tds[4].xpath("div")
                    estr = []
                    for i in e:
                        estr.append(i.xpath("string(.)"))
                    gg_status = "\x01".join(estr)  # role in the case
                    gg_type = tds[5].xpath("text()")[0]  # announcement type
                    gg_law = tds[6].xpath("text()")[0]  # court
                    gonggao_ult.append("序号:{}\x01刊登日期:{}\x01案号:{}\x01案由:{}\x01案件身份:{}\x01公告类型:{}\x01法院:{}".format(
                        gg_order, gg_date, gg_num, gg_reason, gg_status, gg_type, gg_law))
            # else:
            #     print("Company {} - court announcements - no results".format(self.company_id))
        except Exception as e:
            print("Company {} - court announcements - record could not be parsed".format(self.company_id), e)
            raise Exception("")
        for one_ult in gonggao_ult:
            print(one_ult, file=self.fp)
    def beizhixing(self):
        # judgment debtors
        print("被执行人", file=self.fp)
        company_id = self.company_id
        for pg_num in range(1, 2):
            zhixingren_ult = []
            url = 'https://www.tianyancha.com/pagination/zhixing.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            headers = {
                'User-Agent': random.choice(self.User_agents),
                'Cookie': self.cookie,  # must be the Cookie from the data page in your own logged-in browser
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            if self.ip_proxies:
                response = requests.get(url=url, headers=headers, allow_redirects=False,proxies=self.ip_proxies).content.decode()
            else:
                response = requests.get(url=url, headers=headers, allow_redirects=False).content.decode()
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:  # the site's "no results" message
                break
            # print(response)
            zhixingren_tree = etree.HTML(response)
            try:
                zhixingren_list = zhixingren_tree.xpath('//tbody/tr')
            except Exception as e:
                break
            if len(zhixingren_list) != 0:
                for tr in zhixingren_list:
                    try:
                        tds = tr.xpath("td")
                        zhixing_order = tds[0].xpath("text()")[0]  # index
                        zhixing_date = tds[1].xpath("text()")[0]  # filing date
                        zhixing_num = tds[2].xpath("text()")[0]  # case number
                        zhixing_money = tds[3].xpath("text()")[0]  # enforcement amount
                        zhixing_lawer = tds[4].xpath("text()")[0]  # enforcing court
                        zhixingren_ult.append(
                            "序号:{}\x01立案日期:{}\x01案号:{}\x01执行标的:{}\x01执行法院:{}".format(
                                zhixing_order, zhixing_date, zhixing_num, zhixing_money, zhixing_lawer))
                    except Exception as e:
                        print("Company {} - judgment debtors - record could not be parsed. Page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            for elm in zhixingren_ult:
                print(elm, file=self.fp)
    def lian_message(self):
        # case filing info
        print("立案信息", file=self.fp)
        lian_ult = []
        lian_tree = self.tree_html
        try:
            lian_list = lian_tree.xpath('//*[@id="_container_courtRegister"]/table/tbody/tr')
            if len(lian_list) != 0:
                for tr in lian_list:
                    tds = tr.xpath("td")
                    register_order = tds[0].xpath("text()")[0]  # index
                    register_date = tds[1].xpath("text()")[0]  # filing date
                    register_num = tds[2].xpath("text()")[0]  # case number
                    register_sta = tds[3].xpath("div")
                    register_status = []
                    for i in register_sta:
                        register_status.append(i.xpath("string(.)"))
                    register_status = "\x01".join(register_status)  # role in the case
                    register_law = tds[4].xpath("text()")[0]  # court
                    lian_ult.append(
                        "序号:{}\x01立案日期:{}\x01案号:{}\x01案件身份:{}\x01法院:{}".format(
                            register_order, register_date, register_num, register_status, register_law))
            # else:
            #     print("Company {} - case filing info - no results".format(self.company_id))
        except Exception as e:
            print("Company {} - case filing info - record could not be parsed".format(self.company_id), e)
            raise Exception("")
        for one_ult in lian_ult:
            print(one_ult, file=self.fp)
    def xingzheng(self):
        # administrative penalties
        print("行政处罚", file=self.fp)
        company_id = self.company_id
        for pg_num in range(1, 2):
            url = 'https://www.tianyancha.com/pagination/mergePunishCount.xhtml?TABLE_DIM_NAME=manageDangerous&ps=10&pn={}&id={}'.format(
                pg_num, company_id)
            headers = {
                'User-Agent': random.choice(self.User_agents),
                'Cookie': self.cookie,  # must be the Cookie from the data page in your own logged-in browser
                'Connection': 'keep-alive',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Host': 'www.tianyancha.com',
                'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
            }
            if self.ip_proxies:
                response = requests.get(url=url, headers=headers, allow_redirects=False,proxies=self.ip_proxies).content.decode()
            else:
                response = requests.get(url=url, headers=headers, allow_redirects=False).content.decode()
            if "抱歉,没有找到相关信息,请更换关键词重试" in response:  # the site's "no results" message
                break
            # print(response)
            xingzheng_ult = []
            xingzheng_tree = etree.HTML(response)
            try:
                xingzheng_list = xingzheng_tree.xpath('//tbody/tr')
            except Exception as e:
                break
            if len(xingzheng_list) != 0:
                for tr in xingzheng_list:
                    try:
                        tds = tr.xpath("td")
                        penalty_order = tds[0].xpath("text()")[0]  # index
                        penalty_date = tds[1].xpath("text()")[0]  # penalty date
                        penalty_books = tds[2].xpath("div/text()")[0]  # decision document number
                        penalty_reason = tds[3].xpath("div/div/text()")[0]  # reason for the penalty
                        penalty_result = tds[4].xpath("div/div/text()")[0]  # penalty result
                        penalty_unit = tds[5].xpath("text()")[0]  # penalizing authority
                        penalty_source = tds[6].xpath("span/text()")[0]  # data source
                        # print(penalty_date,penalty_books,penalty_reason,penalty_result,penalty_unit,penalty_source)
                        xingzheng_ult.append(
                            "序号:{}\x01处罚日期:{}\x01决定文书号:{}\x01处罚事由:{}\x01处罚结果:{}\x01处罚单位:{}\x01数据来源:{}".format(
                                penalty_order, penalty_date, penalty_books, penalty_reason, penalty_result,
                                penalty_unit, penalty_source))
                    except Exception as e:
                        print("Company {} - administrative penalties - could not be parsed. Page {}".format(self.company_id, pg_num), e)
                        raise Exception("")
            else:
                break
            # written inside the page loop, so xingzheng_ult always exists at this point
            for elm in xingzheng_ult:
                print(elm, file=self.fp)
    def body_run(self):
        self.get_start_crawl()  # basic info
        self.kaiting()  # court hearing announcements
        self.lawsuitwhm()  # lawsuits
        self.fayuangonggao()  # court announcements
        self.beizhixing()  # judgment debtors
        self.lian_message()  # case filing info
        self.xingzheng()  # administrative penalties
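def change_ip():
    """
    The url here is the proxy-IP API address I applied for at 天启HTTP. The IP quota on my account is used up,
    so you need to register your own account; you then get some free coins that can buy IPs with different
    lifetimes. Copy your own link and paste it over the one below.
    """
    url = "http://api.tianqiip.com/getip?secret=*******&type=txt&num=1&time=3&port=2"
    response = requests.get(url).content.decode()
    res_ip = response.strip()
    return res_ip  # the new proxy IP, as returned by the API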
if __name__ == '__main__':
    start_time = time.time()  # start time
    # a few sample company IDs
    # company_list = ["500674557", "844565574", "2319114574", "2317302446", "789235759", "2964355333"]
    # company_list = ["500674557"]
    with open("zSD_company_id.txt", "r", encoding="utf-8") as f:
        content = f.readlines()
    var_list = [car_id.strip() for car_id in content]
    company_list = list(set(var_list))
    company_list.sort(key=var_list.index)  # de-duplicate while keeping the original relative order
    print(len(company_list), company_list)
    company_index = 0
    # Cookie value taken after logging in
# cookie = "TYCID=970a0360595911ecb844678fcc48c6b4; ssuid=7782420900; _ga=GA1.2.205529974.1639100190; creditGuide=1; _bl_uid=51kg7wILzwmt33jnnh8b8gq4g88m; aliyungf_tc=975133a0586b9d8247910635fab0120f2dfb47303db68899981b96efb22ecb7a; csrfToken=GPfdbeBsdJFMKJ9PS4OZyF-k; bannerFlag=true; jsid=https%3A%2F%2Fwww.tianyancha.com%2F%3Fjsid%3DSEM-BAIDU-PZ-SY-2021112-JRGW; relatedHumanSearchGraphId=529120041; relatedHumanSearchGraphId.sig=1e6-htbuJ3oK-8m0maPt0n-jiP2_2MK-xlrwLZW5Oy0; bdHomeCount=6; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1639638751,1639638959,1639642785,1639645038; searchSessionId=1639645046.08106995; _gid=GA1.2.866996138.1639969477; CT_TYCID=29563fae7c84453f94a0e5432c88d1a0; RTYCID=b241d86d56414be7a262217b036e1309; acw_tc=707c9f6716400766862111551e648ff6e9a0a6ee71e2eac77d0c89b29673e1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2215910791130%22%2C%22first_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%7D; tyc-user-info={%22state%22:%220%22%2C%22vipManager%22:%220%22%2C%22mobile%22:%2215910791130%22}; tyc-user-info-save-time=1640076960601; auth_token=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNTkxMDc5MTEzMCIsImlhdCI6MTY0MDA3Njk2MCwiZXhwIjoxNjcxNjEyOTYwfQ.YjdrON1ur2aB3mjylOB-9e6s0WB7yEWx8d9v7ZAE0FEnyG_wP2zY5QMed5d671vQSxg7N_90cgg73Fpgdcs91A; tyc-user-phone=%255B%252215910791130%2522%252C%2522152%25205245%25202157%2522%252C%2522130%25201112%25204493%2522%252C%2522185%25201022%25206873%2522%255D; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1640076972; cloud_token=2b297584cb1b455fb982c5837f285c44; cloud_utm=c74404e80e0244a6b579f19c6914b772"
    cookie = "TYCID=970a0360595911ecb844678fcc48c6b4; ssuid=7782420900; _ga=GA1.2.205529974.1639100190; creditGuide=1; _bl_uid=51kg7wILzwmt33jnnh8b8gq4g88m; aliyungf_tc=975133a0586b9d8247910635fab0120f2dfb47303db68899981b96efb22ecb7a; csrfToken=GPfdbeBsdJFMKJ9PS4OZyF-k; bannerFlag=true; jsid=https%3A%2F%2Fwww.tianyancha.com%2F%3Fjsid%3DSEM-BAIDU-PZ-SY-2021112-JRGW; relatedHumanSearchGraphId=529120041; relatedHumanSearchGraphId.sig=1e6-htbuJ3oK-8m0maPt0n-jiP2_2MK-xlrwLZW5Oy0; bdHomeCount=6; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1639638751,1639638959,1639642785,1639645038; searchSessionId=1639645046.08106995; _gid=GA1.2.866996138.1639969477; CT_TYCID=29563fae7c84453f94a0e5432c88d1a0; RTYCID=b241d86d56414be7a262217b036e1309; acw_tc=2f6fc10b16400813152455804e36b1c0495ce7ecc4e1fe07a27d06d67408ff; _gat_gtag_UA_123487620_1=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217376501816%22%2C%22first_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217da1fc06a69ec-07c085ac54da9c-978183a-1327104-17da1fc06a7954%22%7D; tyc-user-info-save-time=1640082079669; auth_token=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxNzM3NjUwMTgxNiIsImlhdCI6MTY0MDA4MjA3OSwiZXhwIjoxNjcxNjE4MDc5fQ.BjDQyI-pEigOFwfAa6ZiACbj6pGaQj9xgWDHNv3jVs7Ws037HwWZf8zNMTQOnUkNQIe8zx_cXczFmXwydVMokg; tyc-user-info={%22state%22:%220%22%2C%22vipManager%22:%220%22%2C%22mobile%22:%2217376501816%22}; tyc-user-phone=%255B%252217376501816%2522%252C%2522159%25201079%25201130%2522%252C%2522152%25205245%25202157%2522%252C%2522130%25201112%25204493%2522%255D; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1640082090; cloud_token=4c80b666379a438fac067ea9bfd2045d; cloud_utm=bff256b5d11b41cc9317de84974df7f8"
    for company_id in company_list:
        company_index += 1  # counter
        path = "TY_{}.txt".format(company_id)
        if os.path.isfile(path) and os.path.getsize(path) == 0:  # an empty leftover file: delete it
            os.remove(path)
        if os.path.isfile(path):  # this company has already been crawled
            print("TY_{}.txt pass".format(company_id))
        else:
            start2_time = time.time()
            try:
                new_ip = None  # do not use a proxy IP
                # new_ip = change_ip()  # use a proxy IP
                print(new_ip)
                # "w" creates the file (or truncates an existing one); the with block closes it
                with open(path, "w", encoding='utf-8') as fp:
                    ty = TianYan(company_id, fp, cookie,new_ip)
                    ty.body_run()
                print("Company #{} - {} - successful! it cost time:{}".format(company_index, company_id,
                                                                            time.time() - start2_time))
            except Exception as e:  # this company failed: delete its unfinished txt
                print("Company #{} - {} - failed to read part of the info, time:{}".format(company_index, company_id, time.time() - start_time))
                if os.path.isfile(path):
                    os.remove(path)
            time.sleep(30 + random.random() * 2)  # wait 30 to 32 seconds between companies
    print("{}-files cost time:{}".format(len(company_list), time.time() - start_time))
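Things to change or add in the code:
- Put the zSD_company_id.txt file read by the main block into the current directory (it can be downloaded from the link at the end of this article)
- Replace the Cookie value with your own
- Register at 天启HTTP, then copy your own proxy-IP API address over the url in the change_ip function
- Also, before running, note these two lines inside the try in the main block:
new_ip = None # do not use a proxy IP
# new_ip = change_ip() # use a proxy IP
Keep exactly one of the two active to test the results without and with a proxy IP.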
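On my side things work reasonably well without a proxy IP; with one, I cannot even crawl the first company. Either the proxy IPs are useless or I am passing them the wrong way. I will look into it further tomorrow.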
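One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in `requests`, the keys of the `proxies` dict are matched against the scheme of the target URL, while the scheme inside each value says how to talk to the proxy itself. With recent urllib3 versions an `https://ip:port` value means a TLS connection to the proxy, which a plain HTTP proxy will not answer. The sketch below builds the dict with an `http://` proxy URL for both keys. It assumes the provider hands out plain HTTP proxies, and `build_proxies` is a hypothetical helper, not part of the original script:

```python
def build_proxies(new_ip):
    """Turn "host:port" into a requests-style proxies dict, or None (hypothetical helper)."""
    if not new_ip:
        return None  # requests treats proxies=None as "no proxy configured here"
    host, sep, port = new_ip.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected 'host:port', got {!r}".format(new_ip))
    # Assumption: the provider serves plain HTTP proxies, so the proxy URL
    # uses http:// even when the target URL is https://
    proxy_url = "http://{}:{}".format(host, port)
    return {"http": proxy_url, "https": proxy_url}


print(build_proxies("42.179.174.174:37798"))
```

The result could then be passed as `requests.get(url, headers=headers, proxies=build_proxies(new_ip), timeout=10)`. The `ValueError` also catches the case where the API returns an error message instead of an `ip:port` string, which would otherwise only surface as a failed request.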
To be continued...
Baidu Netdisk link for the zSD_company_id.txt file: https://pan.baidu.com/s/1TZLW0wV2iff-V1l3gYUo9Q
Extraction code: 6s4p