spider-flow in Practice: A Case Study

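I have been using spider-flow for over a year. Along the way I found that there are few tutorials online, and that the variables produced by the selenium node are awkward to work with: extraction often returns no data for no obvious reason.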

Today I'll analyze the anti-scraping measures on www.zoomlion.com/other/searc… and use spider-flow to implement a paginated crawl.

List page

Viewing the source of the list page shows that the news list is not in the HTML. Open Chrome DevTools and inspect the network requests.

[Screenshot: 2022-07-03, 8.47.49 AM]

Analysis

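The first request looks like the most likely candidate, but its response is full of sequences such as %3Cli that I did not recognize. My guess was some kind of Unicode escaping, or perhaps Base64. Pasting it into an online Unicode decoding tool decoded it cleanly: %3C is the percent-encoded form of <, so the response is an escaped HTML fragment.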
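
To make the decoding step concrete, here is a minimal Python sketch. It assumes the response uses the same escaping as JavaScript's legacy escape() function (%XX for single bytes and %uXXXX for other characters), which matches the %u8A79 style of the search URL in the XML at the end of this post. The function name js_unescape is my own, not part of spider-flow.

```python
import re

# Matches either %uXXXX (UTF-16 code unit) or %XX (single byte), as emitted
# by JavaScript's legacy escape(). Assumption: the site uses this scheme.
_ESCAPED = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(text: str) -> str:
    """Decode an escape()-style string back into readable text."""
    def replace(match: re.Match) -> str:
        hex_code = match.group(1) or match.group(2)
        return chr(int(hex_code, 16))

    return _ESCAPED.sub(replace, text)


print(js_unescape("%3Cli%3E"))  # <li>
```

Characters outside the Basic Multilingual Plane would arrive as two %uXXXX surrogate halves and need extra handling, which this sketch leaves out.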

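[Screenshot: 2022-07-03, 8.50.54 AM]

This is indeed the request. Digging into it further, it should carry variables such as the search keyword and the page number.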

[Screenshot: 2022-07-03, 8.52.10 AM]

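At first glance it is baffling: the parameters appear to be encrypted. That is not a problem, though, because spider-flow has a selenium node that loads the page in a real browser and works with the rendered result.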

Implementation

[Screenshot: 2022-07-03, 8.54.36 AM]

Enter the URL and name the node variable resp2. (Renaming it here keeps it from being overwritten by a later node.)

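[Screenshot: 2022-07-03, 8.59.17 AM]

Variable extraction. resp2 is a SeleniumResponse object, so it is first converted into a SpiderResponse object, respele (I found that extracting directly from a SeleniumResponse does not work well). urls is the array of links extracted from the list page. inext is the paging counter; paging on this site requires clicking the "next page" button at the bottom.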
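
For readers who want to see what the urls extraction amounts to outside spider-flow, here is a rough standard-library Python sketch. It approximates the selector ul.search_list li a (it does not check the li level) and prepends the site root to each href, as the request node in the XML below does. The sample HTML and its paths are invented for illustration; the real markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.zoomlion.com"


class SearchListLinks(HTMLParser):
    """Collect href values of <a> tags nested in <ul class="search_list">."""

    def __init__(self) -> None:
        super().__init__()
        self.ul_stack = []  # one flag per open <ul>: is it the search list?
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "ul":
            classes = (attr_map.get("class") or "").split()
            self.ul_stack.append("search_list" in classes)
        elif tag == "a" and any(self.ul_stack) and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_stack:
            self.ul_stack.pop()


def extract_detail_urls(html: str, base_url: str = BASE_URL) -> list:
    """Return absolute detail-page URLs found in the rendered list HTML."""
    parser = SearchListLinks()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


# Hypothetical rendered list markup, for illustration only.
sample_html = (
    '<ul class="search_list">'
    '<li><a href="/news/detail/1.html">A</a></li>'
    '<li><a href="/news/detail/2.html">B</a></li>'
    '</ul>'
    '<ul><li><a href="/other.html">C</a></li></ul>'
)
print(extract_detail_urls(sample_html))
```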

[Screenshot: 2022-07-03, 8.59.26 AM]

[Screenshot: 2022-07-03, 8.54.45 AM]

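The function node works together with the selenium node: the commands written in the function node operate on the page held by the selenium node.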

[Screenshot: 2022-07-03, 9.05.35 AM]

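The first function selects the a.next element and clicks it. The second sleeps for 10 s so the page can finish rendering. Setting the execution condition to ${inext < 10} makes the flow loop 10 times.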
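
The counter logic can be checked with a small Python simulation. This is only a model of the flow, not spider-flow code: it assumes the variable node re-evaluates ${inext==null?0:inext+1} every time the flow passes through it, and that the function node runs only when ${inext < 10} holds. The function name simulate_pagination is hypothetical.

```python
def simulate_pagination(limit: int = 10):
    """Model the loop: variable node -> (condition) -> function node -> variable node.

    Returns (pages_extracted, next_clicks).
    """
    inext = None
    pages_extracted = 0
    next_clicks = 0
    while True:
        # Variable node: ${inext==null?0:inext+1}, plus list extraction.
        inext = 0 if inext is None else inext + 1
        pages_extracted += 1
        # Execution condition on the way to the function node: ${inext < 10}.
        if not inext < limit:
            break
        # Function node: click a.next, then sleep while the page renders.
        next_clicks += 1
    return pages_extracted, next_clicks


print(simulate_pagination())  # (11, 10)
```

Under these assumptions the "next page" button is clicked 10 times, so the list is extracted from 11 pages in total: the first page plus one per click.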

Detail page

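The detail page is simpler: the title, body, date and other fields can all be found in the page source, so a network request node plus a variable node is enough.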

[Screenshot: 2022-07-03, 9.10.49 AM]

[Screenshot: 2022-07-03, 9.10.59 AM]
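
The date1 variable in the XML configuration at the end of this post uses the pattern (\d{1,2}\/\d{1,2}\.\d{4}). As a sketch of what that pattern accepts, here is the equivalent in Python. The sample text is made up, and I am assuming the expression yields the first captured group; the helper name extract_date is my own.

```python
import re
from typing import Optional

# Same pattern as in the flow: one or two digits, "/", one or two digits,
# ".", then a four-digit year. "/" needs no escaping in Python.
DATE_PATTERN = re.compile(r"(\d{1,2}/\d{1,2}\.\d{4})")


def extract_date(text: str) -> Optional[str]:
    """Return the first date-like token in text, or None if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical page text, for illustration only.
print(extract_date("Published 07/03.2022 by the news office"))  # 07/03.2022
```

The pattern does not say which of the two leading numbers is the month, so the matched string is best stored as-is unless the site's date format is confirmed.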

Source

  • Finally, here is the complete XML configuration.
<mxGraphModel>
  <root>
    <mxCell id="0">
      <JsonProperty as="data">
        {&quot;spiderName&quot;:&quot;中联重科&quot;,&quot;submit-strategy&quot;:&quot;child&quot;,&quot;threadCount&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="开始" style="start" parent="1" vertex="1">
      <mxGeometry x="80" y="80" width="24" height="24" as="geometry"/>
      <JsonProperty as="data">
        {&quot;shape&quot;:&quot;start&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="7" value="提取项目名、详情地址" style="variable" parent="1" vertex="1">
      <mxGeometry x="330" y="80" width="24" height="24" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;提取项目名、详情地址&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;respele&quot;,&quot;urls&quot;,&quot;inext&quot;],&quot;variable-description&quot;:[&quot;&quot;,&quot;&quot;,&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${resp2.html.element()}&quot;,&quot;${respele.selectors(&#39;ul.search_list li a&#39;)}&quot;,&quot;${inext==null?0:inext+1}&quot;],&quot;shape&quot;:&quot;variable&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="9" value="抓取详情页" style="request" parent="1" vertex="1">
      <mxGeometry x="450.16668701171875" y="80" width="24" height="24" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;抓取详情页&quot;,&quot;loopVariableName&quot;:&quot;index&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;sleep&quot;:&quot;300&quot;,&quot;timeout&quot;:&quot;120000&quot;,&quot;response-charset&quot;:&quot;&quot;,&quot;retryCount&quot;:&quot;3&quot;,&quot;retryInterval&quot;:&quot;&quot;,&quot;body-type&quot;:&quot;none&quot;,&quot;body-content-type&quot;:&quot;text/plain&quot;,&quot;loopCount&quot;:&quot;${urls.size()}&quot;,&quot;url&quot;:&quot;https://www.zoomlion.com${urls[index].attr(&#39;href&#39;)}&quot;,&quot;proxy&quot;:&quot;&quot;,&quot;request-body&quot;:&quot;&quot;,&quot;follow-redirect&quot;:&quot;1&quot;,&quot;tls-validate&quot;:&quot;1&quot;,&quot;cookie-auto-set&quot;:&quot;1&quot;,&quot;repeat-enable&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;request&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="10" value="" parent="1" source="7" target="9" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;condition&quot;:&quot;&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="12" value="提取详情页" style="variable" parent="1" vertex="1">
      <mxGeometry x="550" y="80" width="24" height="24" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;提取详情页&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;title&quot;,&quot;date1&quot;,&quot;content&quot;],&quot;variable-description&quot;:[&quot;&quot;,&quot;&quot;,&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${resp.selector(&#39;h1,h4&#39;).text()}&quot;,&quot;${resp.regx(&#39;(\\d{1,2}\\/\\d{1,2}\\.\\d{4})&#39;)}&quot;,&quot;${resp.selector(&#39;#main_content,div.details&#39;)}&quot;],&quot;shape&quot;:&quot;variable&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="13" value="" style="strokeWidth=2;sharp=1;" parent="1" source="9" target="12" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="14" value="输出" style="output" parent="1" vertex="1">
      <mxGeometry x="660.1666870117188" y="80" width="24" height="24" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;输出&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;datasourceId&quot;:&quot;da9568e3380ea467cc18817a76443f61&quot;,&quot;tableName&quot;:&quot;f_news&quot;,&quot;csvName&quot;:&quot;C:\\DataChange\\32.csv&quot;,&quot;csvEncoding&quot;:&quot;UTF-8&quot;,&quot;output-name&quot;:[&quot;title&quot;,&quot;edittime&quot;,&quot;content1&quot;,&quot;url&quot;,&quot;source&quot;,&quot;channel&quot;,&quot;nohtml&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;output-value&quot;:[&quot;${title}&quot;,&quot;${date1}&quot;,&quot;${content.pichtml()}&quot;,&quot;${resp.url}&quot;,&quot;中联重科&quot;,&quot;news&quot;,&quot;${content.text()}&quot;],&quot;output-all&quot;:&quot;0&quot;,&quot;output-database&quot;:&quot;0&quot;,&quot;output-csv&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;output&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="15" value="" parent="1" source="12" target="14" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;condition&quot;:&quot;&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="34" value="Selenium" style="selenium" parent="1" vertex="1">
      <mxGeometry x="190" y="64" width="32" height="32" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;Selenium&quot;,&quot;nodeVariableName&quot;:&quot;resp2&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;loopCount&quot;:&quot;&quot;,&quot;pageLoadTimeout&quot;:&quot;&quot;,&quot;implicitlyWaitTimeout&quot;:&quot;&quot;,&quot;driverType&quot;:&quot;chrome&quot;,&quot;window-size&quot;:&quot;&quot;,&quot;user-agent&quot;:&quot;&quot;,&quot;arguments&quot;:&quot;&quot;,&quot;url&quot;:&quot;https://www.zoomlion.com/other/search.html?key=%u8A79%u7EAF%u65B0&quot;,&quot;proxy&quot;:&quot;&quot;,&quot;cookie-auto-set&quot;:&quot;1&quot;,&quot;repeat-enable&quot;:&quot;0&quot;,&quot;headless&quot;:&quot;0&quot;,&quot;javascript-disabled&quot;:&quot;0&quot;,&quot;image-disabled&quot;:&quot;0&quot;,&quot;plugin-disable&quot;:&quot;1&quot;,&quot;java-disable&quot;:&quot;1&quot;,&quot;incognito&quot;:&quot;0&quot;,&quot;sandbox&quot;:&quot;0&quot;,&quot;hide-scrollbar&quot;:&quot;0&quot;,&quot;maximized&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;selenium&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="35" value="" style="strokeWidth=2;sharp=1;" parent="1" source="2" target="34" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="36" value="" style="strokeWidth=2;sharp=1;" parent="1" source="34" target="7" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="37" value="执行函数" style="function" parent="1" vertex="1">
      <mxGeometry x="222" y="154" width="32" height="32" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;执行函数&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;loopCount&quot;:&quot;&quot;,&quot;function&quot;:[&quot;${resp2.selector(&#39;a.next&#39;).click()}&quot;,&quot;${resp2.sleep(10000)}&quot;],&quot;shape&quot;:&quot;function&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="41" value="" style="strokeWidth=2;sharp=1;" parent="1" source="37" target="7" edge="1">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
    <mxCell id="45" value="" style="strokeWidth=2;strokeColor=blue;sharp=1;" edge="1" parent="1" source="7" target="37">
      <mxGeometry relative="1" as="geometry"/>
      <JsonProperty as="data">
        {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;blue&quot;,&quot;condition&quot;:&quot;${inext&lt;10}&quot;,&quot;transmit-variable&quot;:&quot;1&quot;}
      </JsonProperty>
    </mxCell>
  </root>
</mxGraphModel>