Despite Utf8 Encoding Some Characters Fail To Be Recognized
Solution 1:
I believe I have finally found the problem. The characters above are escaped HTML entities inside XML. What a mess. If you look at the Independent's RSS feed, most titles are affected like that.
So this is not a UTF-8 problem. How can I decode the HTML entities in my title above before converting it to UTF-8?
head_line=i.title.text.encode('utf-8').strip(),
I solved it by unescaping the title with HTMLParser and then encoding it as UTF-8. Marco's answer does essentially the same, but the html library didn't work for me.
head_line=HTMLParser.HTMLParser().unescape(i.title.text).encode('utf-8').strip(),
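For readers on Python 3, where the standard library provides html.unescape (since 3.4) and HTMLParser.unescape is no longer available (removed in 3.9), the same unescape-then-encode step might look roughly like the sketch below. The helper name clean_headline and the sample title are assumptions made for illustration, not part of the original code.

```python
import html


def clean_headline(raw_title: str) -> bytes:
    """Unescape HTML entities, strip whitespace, then encode as UTF-8.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return html.unescape(raw_title).strip().encode("utf-8")


# Sample title fragment with escaped apostrophes (assumed input).
raw = " sold for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039; "
head_line = clean_headline(raw)
print(head_line.decode("utf-8"))
```

In the feed loop this would be called as clean_headline(i.title.text), assuming i.title.text is already a str.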
I don't recommend using from_encoding='latin-1' as it causes other problems. Unescaping and then calling encode('utf-8') is enough: the £ is decoded as \xa3 (U+00A3), which is the proper Unicode character.
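As an illustration of one way the latin-1 setting could go wrong, here is a small sketch. It assumes the feed bytes are actually UTF-8, which is an assumption; the answer above does not say which problems it ran into.

```python
# The pound sign as UTF-8 bytes (two bytes: 0xC2 0xA3).
pound_utf8 = "£".encode("utf-8")

# Decoding those bytes as Latin-1 treats each byte as its own character,
# so an extra "Â" appears in front of the pound sign.
wrong = pound_utf8.decode("latin-1")

# Decoding with the encoding the bytes were written in round-trips cleanly.
right = pound_utf8.decode("utf-8")

print(repr(wrong), repr(right))
```

If titles come out with a stray "Â" before the "£", a mismatch like this is a likely cause.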
Solution 2:
For the example you provide, this works for me:
from bs4 import BeautifulSoup
import html
xml='<title>Photo of iceberg that is believed to have sunk Titanic sold at auction for £21,000 alongside &#039;world&#039;s most valuable biscuit&#039;</title>'
soup = BeautifulSoup(xml, 'lxml')
print(html.unescape(soup.get_text()))
html.unescape handles the HTML entities. If Beautiful Soup is not handling the pound sign correctly, you may need to specify the encoding when creating the BeautifulSoup object, e.g.
soup = BeautifulSoup(xml, "lxml", from_encoding='latin-1')