Skip to content Skip to sidebar Skip to footer

Pin Down Exact Content Location In Html For Web Scraping Urllib2 Beautiful Soup

I'm new to web scraping, have little exposure to html file systems and wanted to know if there is a better more efficient way to search for a required content on the html version o

Solution 1:

I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.

Selenium uses a real browser - you don't have to deal with parsing the results of ajax calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:

from selenium.webdriver.firefox import webdriver


driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()

prints:

Renee Culver loves Clorox Wipes 5out of 5
Men at work 5out of 5
clorox wipes 5out of 5
...

Also, take into account that you can use a headless PhantomJS browser (example).


Another option is to make use of Walmart API.

Hope that helps.

Solution 2:

The reviews are loaded using AJAX call. You can not find those on the link that you provided. The reviews are loaded from the following link:

http://walmart.ugc.bazaarvoice.com/1336/29701960/reviews.djs?format=embeddedhtml&dir=desc&sort=relevancy

Here 29701960 is found from the html source of your current source like this way:

<meta property="og:url" content="http://www.walmart.com/ip/29701960" />
                                                           +------+ this one

or

trackProductId : '29701960',
                  +------+ or this one

And 1336 is from the source:

WALMART.BV.scriptPath =  'http://walmart.ugc.bazaarvoice.com/static/1336/';
                                                                    +--+ here

Using the values, build the Above url and parse the data from there using BeautifulSoup.

Post a Comment for "Pin Down Exact Content Location In Html For Web Scraping Urllib2 Beautiful Soup"