Pin Down Exact Content Location In Html For Web Scraping Urllib2 Beautiful Soup
Solution 1:
I understand that the question was initially about BeautifulSoup
, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.
Selenium uses a real browser - you don't have to deal with parsing the results of ajax calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver
driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
title = review.find_element_by_class_name('BVRRReviewTitle').text
rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
print title, rating
driver.close()
prints:
Renee Culver loves Clorox Wipes 5out of 5
Men at work 5out of 5
clorox wipes 5out of 5
...
Also, take into account that you can use a headless PhantomJS browser (example).
Another option is to make use of Walmart API.
Hope that helps.
Solution 2:
The reviews are loaded using AJAX call. You can not find those on the link that you provided. The reviews are loaded from the following link:
http://walmart.ugc.bazaarvoice.com/1336/29701960/reviews.djs?format=embeddedhtml&dir=desc&sort=relevancy
Here 29701960
is found from the html source of your current source like this way:
<meta property="og:url" content="http://www.walmart.com/ip/29701960" />
+------+ this one
or
trackProductId : '29701960',
+------+ or this one
And 1336
is from the source:
WALMART.BV.scriptPath = 'http://walmart.ugc.bazaarvoice.com/static/1336/';
+--+ here
Using the values, build the Above url and parse the data from there using BeautifulSoup
.
Post a Comment for "Pin Down Exact Content Location In Html For Web Scraping Urllib2 Beautiful Soup"