Web Scrape Page With Multiple Sections

October 21, 2024 Post a Comment

Pretty new to python... and I'm trying to my hands at my first project. Been able to replicate few simple demo... but i think there are few extra complexities with what I'm trying

Solution 1:

This happens with many web pages. It's because some of the content is downloaded by Javascript code that is part of the initial download. By doing does this designers are able to show visitors the most important parts of a page without waiting for the entire page to download.

When you want to scrape a page the first thing you should do is to examine the source code for it (often using Ctrl-u in a Windows environment) to see if the content you require is available. If not then you will need to use something beyond BeautifulSoup.

>>>getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'>>>import requests>>>import selenium.webdriver as webdriver>>>import lxml.html as html>>>import lxml.html.clean as clean>>>browser = webdriver.Chrome()>>>browser.get(getzlafURL)>>>content = browser.page_source>>>cleaner = clean.Cleaner()>>>content = cleaner.clean_html(content)>>>doc = html.fromstring(content)>>>type(doc)
<class 'lxml.html.HtmlElement'>
>>>open('c:/scratch/temp.htm', 'w').write(content)
775838

By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.

</div></li></ul><h5class="statistics__subheading">Ryan Getzlaf Game Logs</h5><divid="gamelogsTable"><divclass="responsive-datatable">

I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.

Python Courses, Training, and Tutorials

Web Scrape Page With Multiple Sections

Solution 1:

Post a Comment for "Web Scrape Page With Multiple Sections"