Web Crawler To Get Links From New Website
Solution 1:
I believe you may want to try accessing the text inside the list item like so:
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string
Edited: General Comments on getting links from a page
Probably the easiest data type to use to gather a bunch of links and retrieve them later is a dictionary.
To get links from a page using BeautifulSoup, you could do something like the following:
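# Assumes Python 3 and the bs4 package; url_source is the page address.
from urllib.request import urlopen
from bs4 import BeautifulSoup

link_dictionary = {}
with urlopen(url_source) as f:
    soup = BeautifulSoup(f, 'html.parser')
# Map each link's text to its href attribute.
for link in soup.findAll('a'):
    link_dictionary[link.string] = link.get('href')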
This will provide you with a dictionary named link_dictionary, where every key is the text between the <a> and </a> tags and every value is the value of the href attribute.
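To see the shape of that dictionary without downloading anything, here is a minimal Python 3 sketch that builds the same text-to-href mapping with the standard library's html.parser instead of BeautifulSoup. The sample markup and the LinkCollector class are made up for illustration; this is a swapped-in technique, not the method used above.

```python
from html.parser import HTMLParser

# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li data-section="Business"><a href="/business/a1">Markets rally</a></li>
  <li data-section="Sport"><a href="/sport/s1">Cup final</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: maps each anchor's text to its href."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
link_dictionary = collector.links
```

Note that two links with identical text would overwrite each other in a plain dictionary, so a list of (text, href) pairs may be a safer choice on pages with repeated link text.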
How to combine this with your previous attempt
Now, if we combine this with the problem you were having before, we could try something like the following:
link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href')
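The href values collected this way are whatever the page author wrote, so some may be relative paths. A crawler usually needs absolute URLs before it can fetch them. The following Python 3 sketch resolves each value against the page address with urllib.parse.urljoin; the dictionary entries are made up, and page_url is assumed to be the address the HTML was downloaded from.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were scraped from.
page_url = 'http://www.thehindu.com/archive/web/2010/06/19/'

# Made-up entries shaped like the link_dictionary built above.
link_dictionary = {
    'Markets rally': '/business/a1.ece',         # root-relative path
    'Rupee steady': 'a2.ece',                    # relative to the page
    'Partner site': 'http://example.com/story',  # already absolute
    'No target': None,                           # anchor without an href
}

# Resolve every href against the page URL, skipping missing ones.
absolute_links = {
    text: urljoin(page_url, href)
    for text, href in link_dictionary.items()
    if href
}
```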
If this doesn't make sense, or you have more questions, experiment first and try to come up with a solution before asking a new, clearer question.
Solution 2:
You might want to use the powerful XPath query language with the faster lxml module. It is as simple as this:
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])
Update for @data-section='Chennai'
#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
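To check what the XPath expression selects without hitting the live site, here is a small Python 3 sketch that runs the same kind of query on an inline sample. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed XML, so it is a stand-in for lxml rather than a replacement. The sample markup and the links_for helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample shaped like the archive list.
SAMPLE = """
<ul>
  <li data-section="Business"><a href="/business/a1">Markets rally</a></li>
  <li data-section="Chennai"><a href="/chennai/c1">Metro update</a></li>
  <li data-section="Business"><a href="/business/a2">Rupee steady</a></li>
</ul>
"""

root = ET.fromstring(SAMPLE.strip())


def links_for(section):
    """Hypothetical helper: (text, href) pairs for one data-section value."""
    path = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get("href")) for a in root.findall(path)]


for text, href in links_for("Business"):
    print("{} ({})".format(text, href))
```

Changing the argument to "Chennai" selects the other section, mirroring the update above.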