Hi all, I'm trying to extract the meta description from a web page using lxml (libxml) for Python. When it encounters non-ASCII UTF-8 characters, it seems to choke and displays garbage characters. However, when I extract the same data with a regex, the Unicode characters come through just fine. Am I doing something wrong with lxml?

Thanks

''' test encoding issues with utf8 '''

from lxml.html import fromstring
from lxml.html.clean import Cleaner
import urllib2
import re

url = 'http://www.youtube.com/watch?v=LE-JN7_rxtE'
page = urllib2.urlopen(url).read()


xmldoc = fromstring(page)
desc = xmldoc.xpath('/html/head/meta[@name="description"]/@content')
meta_description = desc[0].strip()

print "**** LIBXML TEST ****\n" 
print meta_description


print "**** REGEX TEST ******"
reg = re.compile(r'<meta name="description" content="(.*)">')
for desc in reg.findall(page):
  print desc

OUTPUTS:

**** LIBXML TEST ****

My name is Hikakin.<br>I'm Japanese Beatboxer.<br><br>HIKAKIN Official Blog<br>http://ameblo.jp/hikakin/&lt;br&gt;&lt;br&gt;ãã³çã³ãã¥&lt;br&gt;http://com.nicovideo.jp/community/co313576&lt;br&gt;&lt;br&gt;â»å¾¡ç¨ã®æ¹ã¯Youtubeã®ã¡ãã»ã¼ã¸ã¾ã...
**** REGEX TEST ******
My name is Hikakin.&lt;br&gt;I'm Japanese Beatboxer.&lt;br&gt;&lt;br&gt;HIKAKIN Official Blog&lt;br&gt;http://ameblo.jp/hikakin/&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;ニコ生コミュ&amp;lt;br&amp;gt;http://com.nicovideo.jp/community/co313576&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;※御用の方はYoutubeのメッセージまた...
A: 

Quite possibly your console cannot display Unicode characters. Try redirecting the output to a file, then open that file in an editor that can display Unicode.

Stargazer712
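
One way to rule out the console, sketched here with the standard library only: write the string to a UTF-8 file and inspect that file in an editor instead of printing it. The helper name `dump_text`, the sample value, and the output path are made up for illustration; in the question's script the value would be `meta_description`.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def dump_text(text, path):
    """Write a unicode string to `path` as UTF-8, bypassing the console."""
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Hypothetical sample value and output path, for illustration only.
sample = u"ニコ生コミュ"
out_path = os.path.join(tempfile.gettempdir(), "meta_description.txt")
dump_text(sample, out_path)

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, "r", encoding="utf-8") as fh:
    round_trip = fh.read()
```

If the file shows the expected characters, the console is at fault; if the file is also garbled, the text was already wrong before it was printed.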
+1  A: 

Does this help?

xmldoc = fromstring(page.decode('utf-8'))
Daniel Newby
Worked! Thanks so much.
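
A likely explanation for why decoding first helps, assuming the page is served as UTF-8: when the parser receives raw bytes without a usable encoding hint, it may fall back to a single-byte encoding such as Latin-1, so each multi-byte character turns into several garbage characters. The sketch below reproduces that effect with the standard library only; the sample string is taken from the regex output in the question, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Bytes as they would arrive from the network for a UTF-8 page.
raw = u"ニコ生コミュ".encode("utf-8")

# A wrong single-byte guess yields one character per byte (mojibake).
garbled = raw.decode("latin-1")

# Decoding explicitly as UTF-8, as in the answer above, restores the text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(repr(correct))
```

Passing an already-decoded unicode string to `fromstring` means the parser no longer has to guess the encoding.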