from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
Writing a spider

The next step is to write a spider that specifies the start URLs, the rules for following links, and the rules for extracting data. For example, links whose URLs match /tor/\d+ lead to torrent detail pages and should be followed.
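As a standalone illustration of how such a pattern decides which links get followed, here is a small sketch using Python's re module (the URLs are made up for the example):

```python
import re

# Hypothetical mininova-style URLs, only for illustration.
TOR_PATTERN = re.compile(r"/tor/\d+")

urls = [
    "http://www.mininova.org/tor/2657665",  # contains /tor/<number>: followed
    "http://www.mininova.org/faq",          # no /tor/<number> part: skipped
]

followed = [u for u in urls if TOR_PATTERN.search(u)]
print(followed)
```

Only URLs matching the pattern are handed to the extraction callback; everything else is ignored.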
Using XPath, select the data to extract from the page's HTML source. Take one of the many detail pages and, based on its HTML source, build XPath expressions that select the torrent name, description, and size.

Looking at the page source, you can see:
<h1>Home[2009][Eng]XviD-ovd</h1>
The name is contained in an <h1> tag and can be extracted with the XPath expression:
//h1/text()
The description is inside a div with id="description":
<h2>Description:</h2> <div id="description"> "HOME" - a documentary film by Yann Arthus-Bertrand <br/> <br/> *** <br/> <br/> "We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate. ...
and can be extracted with the XPath expression:
//div[@id='description']
The size is in the second <p> tag inside the div with id="specifications":
<div id="specifications"> <p> <strong>Category:</strong> <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a> </p> <p> <strong>Total size:</strong> 699.79 megabyte</p>
and can be extracted with the XPath expression:
//div[@id='specifications']/p[2]/text()[2]
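To see these three expressions in action outside of Scrapy, here is a minimal sketch run against a simplified stand-in for the page (it assumes the third-party lxml package; the HTML is abbreviated from the snippets above):

```python
from lxml import html

# Abbreviated stand-in for the torrent detail page shown above.
PAGE = """
<html><body>
<h1>Home[2009][Eng]XviD-ovd</h1>
<div id="description">"HOME" - a documentary film by Yann Arthus-Bertrand</div>
<div id="specifications">
  <p> <strong>Category:</strong> <a href="/cat/4">Movies</a></p>
  <p> <strong>Total size:</strong> 699.79 megabyte</p>
</div>
</body></html>
"""

doc = html.fromstring(PAGE)

name = doc.xpath("//h1/text()")[0]
description = doc.xpath("//div[@id='description']")[0].text_content()
# p[2] selects the second <p>; text()[2] selects that <p>'s second text
# node, i.e. the text that follows the closing </strong> tag.
size = doc.xpath("//div[@id='specifications']/p[2]/text()[2]")[0].strip()
print(name, size)
```

Note why text()[2] is needed: the whitespace before the <strong> tag is the first text node of the <p>, so the size string is the second.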
To learn more about XPath, see the XPath reference.
The spider code looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MininovaSpider(CrawlSpider):
    # The name, domain, and start URL below follow the classic mininova
    # example from the Scrapy documentation; the values on this page were
    # garbled link placeholders.
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        # id='specifications' matches the page structure shown above
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        yield torrent
For the sake of simplicity, the important data definition (the item) was deliberately placed above.