from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
Writing a spider

The next step is to write a spider that specifies the start URLs, the rules for following links, and the rules for extracting data. For example, links whose URLs match /tor/\d+ lead to torrent detail pages and should be followed.
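As a standalone illustration of how such a pattern decides which links get followed, here is a small sketch using Python's re module (the URLs are made up for the example):

```python
import re

# Hypothetical mininova-style URLs, only for illustration.
TOR_PATTERN = re.compile(r"/tor/\d+")

urls = [
    "http://www.mininova.org/tor/2657665",  # contains /tor/<number>: followed
    "http://www.mininova.org/faq",          # no /tor/<number> part: skipped
]

followed = [u for u in urls if TOR_PATTERN.search(u)]
print(followed)
```

Only URLs matching the pattern are handed to the extraction callback; everything else is ignored.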
Using XPath, select the data to extract from the page's HTML source. Take one of the many detail pages and, based on its HTML source, build XPath expressions that select the torrent name, description, and size.

Looking at the page source, you can see:
<h1>Home[2009][Eng]XviD-ovd</h1>
The name is contained in an <h1> tag and can be extracted with the XPath expression:
//h1/text()
The description is inside a div with id="description":
<h2>Description:</h2> <div id="description"> "HOME" - a documentary film by Yann Arthus-Bertrand <br/> <br/> *** <br/> <br/> "We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate. ...
and can be extracted with the XPath expression:
//div[@id='description']
The size is in the second <p> tag inside the div with id="specifications":
<div id="specifications"> <p> <strong>Category:</strong> <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a> </p> <p> <strong>Total size:</strong> 699.79 megabyte</p>
and can be extracted with the XPath expression:
//div[@id='specifications']/p[2]/text()[2]
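To see these three expressions in action outside of Scrapy, here is a minimal sketch run against a simplified stand-in for the page (it assumes the third-party lxml package; the HTML is abbreviated from the snippets above):

```python
from lxml import html

# Abbreviated stand-in for the torrent detail page shown above.
PAGE = """
<html><body>
<h1>Home[2009][Eng]XviD-ovd</h1>
<div id="description">"HOME" - a documentary film by Yann Arthus-Bertrand</div>
<div id="specifications">
  <p> <strong>Category:</strong> <a href="/cat/4">Movies</a></p>
  <p> <strong>Total size:</strong> 699.79 megabyte</p>
</div>
</body></html>
"""

doc = html.fromstring(PAGE)

name = doc.xpath("//h1/text()")[0]
description = doc.xpath("//div[@id='description']")[0].text_content()
# p[2] selects the second <p>; text()[2] selects that <p>'s second text
# node, i.e. the text that follows the closing </strong> tag.
size = doc.xpath("//div[@id='specifications']/p[2]/text()[2]")[0].strip()
print(name, size)
```

Note why text()[2] is needed: the whitespace before the <strong> tag is the first text node of the <p>, so the size string is the second.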
To learn more about XPath, see the XPath reference.
The spider code looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MininovaSpider(CrawlSpider):
    # The name, domain, and start URL below follow the classic mininova
    # example from the Scrapy documentation; the values on this page were
    # garbled link placeholders.
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        # id='specifications' matches the page structure shown above
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        yield torrent
For the sake of simplicity, the important data definition (the item) was deliberately placed above.