Web Scrape Page With Multiple Sections
Solution 1:
This happens with many web pages. It's because some of the content is downloaded by JavaScript code that is part of the initial download. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download.
When you want to scrape a page, the first thing you should do is examine its source code (often using Ctrl-U in a browser on Windows) to see if the content you require is there. If it isn't, you will need something beyond BeautifulSoup, such as a real browser driven by Selenium, which runs the page's JavaScript before you read the HTML.
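As a rough sketch of that first check, the snippet below tests whether a marker string is already present in the raw HTML the server returns, before any JavaScript has run. The helper names `fetch_static_html` and `appears_in_static_html` are made up for illustration. It uses the standard library's `urllib` rather than `requests`, and the User-Agent header is only an assumption about what a site may require.

```python
import urllib.request


def fetch_static_html(url, timeout=10):
    """Return the HTML exactly as the server sends it (no JavaScript is run)."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_static_html(html_text, marker):
    """True if the marker text is already present in the raw HTML."""
    return marker.lower() in html_text.lower()


# Typical use (performs a network request, so it is left commented out):
# html_text = fetch_static_html(getzlafURL)
# appears_in_static_html(html_text, 'Ryan Getzlaf Game Logs')
```

If the check comes back False for text you can see in the browser, the content is most likely being added by JavaScript after the initial download.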
>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import requests
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838
By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.
</div></li></ul><h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5><div id="gamelogsTable"><div class="responsive-datatable">
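Once the rendered HTML is in hand, pulling the rows out might look something like the sketch below. It assumes the game logs sit in an ordinary `<table>` somewhere inside the `gamelogsTable` div, which the fragment above does not actually show, so check temp.htm before relying on it. To keep it free of extra installs it uses the standard library's `html.parser` instead of lxml, and `TableInDivParser` and `extract_rows` are hypothetical names.

```python
from html.parser import HTMLParser


class TableInDivParser(HTMLParser):
    """Collect table rows found inside the <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.div_depth = 0        # greater than 0 while inside the target div
        self.in_cell = False      # True while inside a <td> or <th>
        self.current_row = None   # cells of the row being read, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside the target
            elif dict(attrs).get("id") == self.div_id:
                self.div_depth = 1           # entering the target div
        elif self.div_depth > 0:
            if tag == "tr":
                self.current_row = []
            elif tag in ("td", "th") and self.current_row is not None:
                self.in_cell = True
                self.current_row.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            self.rows.append([cell.strip() for cell in self.current_row])
            self.current_row = None

    def handle_data(self, data):
        # Append text to the most recently opened cell.
        if self.in_cell and self.current_row:
            self.current_row[-1] += data


def extract_rows(html_text, div_id="gamelogsTable"):
    """Return a list of rows, each a list of cell strings."""
    parser = TableInDivParser(div_id)
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

With the session above, `extract_rows(content)` would be the call to try; an empty list would suggest the table assumption does not hold for this page.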
I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.