Looping over items to extract contents of HTML table


I'm new to Scrapy and keen to learn it. So far I can do very simple things, such as building the tutorial project, and I can start the Scrapy shell with scrapy shell "website".

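  • The table below is wrapped in higher-level elements such as divs. How do I extract the table with Scrapy? Do I need to read through the divs, or can I jump straight to the table and pull out the information?
  • I would like every item in the table returned as a dictionary, something like the following, ideally as a fully self-contained piece of code that I can run and learn from: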
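# Legacy selector API, as used in older Scrapy releases
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # '//' matches rows anywhere in the page, however deeply nested
    divs = hxs.select('//tr[@class="someclass"]')
    for div in divs:
        item = {}
        # './' keeps the query relative to the current row
        item['var1'] = div.select('./td[2]/p/span[2]/text()').extract()
        yield item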

Note: I have removed the repetitive tail end of the HTML below.

<div class="full arrangeable" data-id="calendar"> <div class="full row" data-row="0"> <div class="full column" data-column="0"> <div class="cell" data-cell="0" data-compid="Calendar"> <a name="Calendar" class="anchor"></a> <div class="flexShell"> <div class="flexBox calendar" id="flexBox_flex_calendar_mainCal" data-more="0" data-checkstate="0" data-initcallback="calendar" data-updatecallback="calendar" data-visiblejs="[]" data-disablejs="[]"> <form action="flex.php" method="post" onsubmit="return Flex.prepareSubmit(this);" data-submit="options"> <input name="s" value="" type="hidden"> <input name="securitytoken" value="guest" type="hidden"> <input name="do" value="saveoptions" type="hidden"> <input name="setdefault" value="no" type="hidden"> <input name="ignoreinput" value="no" type="hidden"> <input name="flex[Calendar_mainCal][idSuffix]" value="" type="hidden"> <input name="flex[Calendar_mainCal][_flexForm_]" value="flexForm" type="hidden"> <input name="flex[Calendar_mainCal][modelData]" value="YToxMDp7czoxMToicGFfY29udHJvbHMiO3M6MTc6ImNhbGVuZGFyfENhbGVuZGFyIjtzOjE2OiJwYV9pbmplY3RyZXZlcnNlIjtiOjA7czoxNDoidmlld2luZ0RlZmF1bHQiO3M6OToiVGhpcyBXZWVrIjtzOjExOiJwcmV2Q2FsTGluayI7czoxNDoiZGF5PW5vdjMwLjIwMTEiO3M6MTE6Im5leHRDYWxMaW5rIjtzOjEzOiJkYXk9ZGVjMi4yMDExIjtzOjc6InByZXZBbHQiO3M6MjY6Ik5vdjMzMCwgMjAxMSAtIERlYyAxLCAyMDExIjtzOjc6Im5leHRBbHQiO3M6MjU6IkRlYyAyLCAyMDExIC0gRGVjIDMsIDIwMTEiO3M6MTA6Im5leHRIaWRkZW4iO2I6MDtzOjEwOiJwcmV2SGlkZGVuIjtiOjA7czo5OiJyaWdodExpbmsiO047fQ==" type="hidden"> <div class="head"> <ul> <li class="left pagination"><a title="Nov 30, 2011 - Dec 1, 2011" class="prev" href="calendar.php?day=nov30.2011"><span>&lt;</span></a></li> <li class="left"><a class="highlight light options flexTitle"><span><strong>Dec 1, 2011</strong></span></a></li> <li class="left pagination shadow"><a title="Dec 2, 2011 - Dec 3, 2011" class="next" href="calendar.php?day=dec2.2011"><span>&lt;</span></a></li> <li class="loader"></li> <li class="right imagefade noborder"><a 
class="highlight noborder filters flexFilter"><div class="fade"></div><span>Filter</span></a></li> <li class="right"> <a class="highlight noborder menu"> <span>This Week</span> <span class="dropdown"></span> <div> <div class="title">Default View:</div> <div data-value="yesterday">Yesterday</div> <div data-value="today">Today</div> <div data-value="tomorrow">Tomorrow</div> <div data-value="thisweek">This Week</div> </div> </a> </li> <li class="right shadow"><a class="highlight noborder upnext"><span>Up Next</span></a></li> <li class="layoutcontrols"><div class="pagearrange_homepage_controls"> </div> <div class="pagearrange_controls"> <span data-registered="1" class="onHomepage" title="Copy Block to Your Homepage"></span> </div></li></ul> </div> <div class="options sidebyside"> <div class="half"> <div class="shell flexoptions"> <div class="frame"> <input name="flex[Calendar_mainCal][calendardefault]" id="flex[Calendar_mainCal][calendardefault]" value="thisweek" type="hidden"> <div class="half"> <div class="pad"> <p class="title"><strong>Begin Date</strong></p> <input data-enterhandled="1" data-pickerid="flexDatePicker_1" name="flex[Calendar_mainCal][begindate]" data-container="Calendar_mainCal_begindate" class="bginput flexDatePicker" value="December 1, 2011" data-range="2007,2015" type="text"> <div class="minicalendar" id="flexDatePicker_Calendar_mainCal_begindate"><div