Scrapy: How To Run Two Crawlers One After Another?
I have two spiders within the same project. One of them depends on the other running first. They use different pipelines. How can I make sure they are run sequentially?
Solution 1:
Straight from the docs (https://doc.scrapy.org/en/1.2/topics/practices.html):
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
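If it's not obvious why the chained deferreds run sequentially: each yield suspends crawl() until that crawl's deferred fires, so the second crawl cannot start before the first finishes. The same idea in plain asyncio, with no Scrapy or Twisted required (crawl_one is a hypothetical stand-in for runner.crawl):

```python
import asyncio

order = []

async def crawl_one(name):
    # stand-in for runner.crawl(SomeSpider): pretend to crawl
    await asyncio.sleep(0)
    order.append(name)

async def crawl_all():
    # awaiting each crawl before starting the next guarantees sequential order,
    # exactly like yielding each deferred in the inlineCallbacks version
    await crawl_one("MySpider1")
    await crawl_one("MySpider2")

asyncio.run(crawl_all())
```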
Solution 2:
Option 1:
[Spider2's list] --depends on--> [Spider1's list]
If the whole of Spider2 depends on Spider1's results, just make Spider2 run after Spider1 finishes successfully, from the shell:
scrapy crawl Spider1 && scrapy crawl Spider2
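If you'd rather drive this from Python than from the shell, the `&&` behaviour can be reproduced with `subprocess`: each command runs only if the previous one exited with status 0. A minimal sketch (the placeholder commands stand in for `["scrapy", "crawl", "Spider1"]` etc.):

```python
import subprocess
import sys

def run_sequentially(*commands):
    """Run each command in turn; stop early (like &&) if one fails."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0

# placeholder commands standing in for the real scrapy crawl invocations
code = run_sequentially(
    [sys.executable, "-c", "print('Spider1 done')"],
    [sys.executable, "-c", "print('Spider2 done')"],
)
```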
Option 2:
[individual Spider2 request] --depends on--> [individual Spider1 item]
If each Spider2 request depends on one Spider1 item, and you already know the URL Spider2 should scrape once you have scraped that item, how about merging the two spiders into one and passing the item along through the request's meta attribute?
spider.py
class MergedSpider(scrapy.Spider):
    # name, etc...

    def first_spider_parse(self, response):
        # your code...
        item = FirstSpiderItem()
        # yield the item first, and the pipeline will handle it
        yield item
        # then issue the spider2 request
        yield scrapy.Request(secondSpiderItemURL,
                             callback=self.second_spider_parse,
                             dont_filter=True,
                             meta={'firstItem': item})

    def second_spider_parse(self, response):
        item = SecondSpiderItem()
        firstItem = response.meta['firstItem']
        return item
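The key mechanism here is that whatever you put in the request's meta dict reappears on the response handed to the callback. A pure-Python sketch of that hand-off, with FakeResponse as a hypothetical stand-in for Scrapy's request/response plumbing:

```python
class FakeResponse:
    """Hypothetical stand-in: Scrapy copies request.meta onto response.meta."""
    def __init__(self, meta):
        self.meta = meta

def first_parse():
    first_item = {'id': 1}
    # the meta dict travels with the request and reappears on the response
    return FakeResponse(meta={'firstItem': first_item})

def second_parse(response):
    # the second callback can read the first spider's item back out
    return {'parent_id': response.meta['firstItem']['id']}

result = second_parse(first_parse())
```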
pipelines.py
class FirstPipeline(object):
    def process_item(self, item, spider):
        # or you can check isinstance of the spider instead
        if isinstance(item, FirstSpiderItem):
            # your code
            pass
        return item

class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondSpiderItem):
            # your code
            pass
        return item