
Scrapy: How To Run Two Crawlers One After Another?

I have two spiders within the same project. One of them depends on the other running first. They use different pipelines. How can I make sure they are run sequentially?

Solution 1:

Straight from the docs: https://doc.scrapy.org/en/1.2/topics/request-response.html

Same example but running the spiders sequentially by chaining the deferreds:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl's deferred to fire
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
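Since the question says the two spiders use different pipelines, note that each spider can carry its own pipeline configuration through custom_settings, which the crawler applies per spider. A minimal sketch, assuming the pipelines live at myproject.pipelines (that path is an assumption, point it at your own classes):

import scrapy

class MySpider1(scrapy.Spider):
    name = 'spider1'
    # pipeline path is an assumption; use your own project's module
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.FirstPipeline': 300},
    }

class MySpider2(scrapy.Spider):
    name = 'spider2'
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.SecondPipeline': 300},
    }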

Solution 2:

Option 1:

[Spider2 list] --depends on--> [Spider1 list]

How about just making Spider2 run after Spider1 finishes successfully, from the shell:

scrapy crawl Spider1 && scrapy crawl Spider2
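If you'd rather drive the same chaining from Python instead of the shell, a minimal sketch using the standard library (spider names are the ones from the command above):

import subprocess

# check=True raises if the first crawl exits with a non-zero status,
# so the second crawl only runs after a successful first one (like &&)
subprocess.run(['scrapy', 'crawl', 'Spider1'], check=True)
subprocess.run(['scrapy', 'crawl', 'Spider2'], check=True)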

Option 2:

[individual Spider2 request] --depends on--> [Spider1 item]

and you already know the URL that Spider2 needs to scrape once you have scraped the Spider1 item.

How about merging the two spiders into one, passing the first item along in the request's meta attribute?

spider.py

import scrapy

class MergedSpider(scrapy.Spider):
    # name, etc..

    def first_spider_parse(self, response):
        # your code...
        item = FirstSpiderItem()
        # yield the item first, and the pipeline will handle it
        yield item
        # then issue the second request (secondSpiderItemURL comes from the scraped data),
        # carrying the first item along in meta
        yield scrapy.Request(secondSpiderItemURL, callback=self.second_spider_parse,
                             dont_filter=True, meta={'firstItem': item})

    def second_spider_parse(self, response):
        item = SecondSpiderItem()
        # the first spider's item is available here via the request meta
        firstItem = response.meta['firstItem']
        return item

pipelines.py

class FirstPipeline(object):
    def process_item(self, item, spider):
        # or you can isinstance the spider instead
        if isinstance(item, FirstSpiderItem):
            # your code
            pass
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondSpiderItem):
            # your code
            pass
        return item
