
Not Able To Follow Link Using Scrapy

I am not able to follow a link and get back the values. With the code below I can crawl the first link, but after that it does not redirect to the second follow link (fun

Solution 1:

You forgot to return your Request in the parse() method. Try this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class ScrapyOrgSpider(BaseSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/abcd"]

    def parse(self, response):
        self.log('@@ Original response: %s' % response)
        req = Request("http://www.example.com/follow", callback=self.a_1)
        self.log('@@ Next request: %s' % req)
        return req

    def a_1(self, response):
        hxs = HtmlXPathSelector(response)
        self.log('@@ extraction: %s' %
            hxs.select("//a[@class='channel-link']").extract())

Log output:

2012-11-22 12:20:06-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Enabled item pipelines:
2012-11-22 12:20:06-0600 [example.com] INFO: Spider opened
2012-11-22 12:20:06-0600 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-22 12:20:06-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/abcd>
2012-11-22 12:20:07-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: None)
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Original response: <200 http://www.iana.org/domains/example/>
2012-11-22 12:20:07-0600 [example.com] DEBUG: @@ Next request: <GET http://www.example.com/follow>
2012-11-22 12:20:07-0600 [example.com] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://www.example.com/follow>
2012-11-22 12:20:08-0600 [example.com] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example/> (referer: http://www.iana.org/domains/example/)
2012-11-22 12:20:08-0600 [example.com] DEBUG: @@ extraction: []
2012-11-22 12:20:08-0600 [example.com] INFO: Closing spider (finished)
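
Note that this answer targets Scrapy 0.17; BaseSpider and HtmlXPathSelector were deprecated and later removed. On current Scrapy versions, a rough equivalent of the same spider (a sketch, not the original answer's code) would look like this:

import scrapy


class ScrapyOrgSpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/abcd"]

    def parse(self, response):
        self.log('@@ Original response: %s' % response)
        # Yielding (or returning) the Request hands it to the engine;
        # merely constructing it does nothing.
        yield scrapy.Request("http://www.example.com/follow",
                             callback=self.a_1)

    def a_1(self, response):
        self.log('@@ extraction: %s' %
                 response.xpath("//a[@class='channel-link']").extract())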

Solution 2:

The parse function must return the request, not just print it.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    res1 = Request("http://www.example.com/follow", callback=self.a_1)
    print res1  # printing only shows the request; it does not schedule it
    return res1  # returning it is what makes Scrapy follow the link
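
If parse needs to emit more than one request (say, one per extracted link), the usual pattern is to write it as a generator and yield each request. A minimal sketch in the same old-Scrapy API, with a hypothetical XPath you would adjust to the actual page:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Hypothetical selector; replace with the real link structure.
    for url in hxs.select("//a[@class='channel-link']/@href").extract():
        yield Request(url, callback=self.a_1)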
