theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please take a look at the unicode portion. I've tried both of those options, but neither works. The plain encode call fails with:

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

I also get this when I try the longer method of encoding:

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1
+3  A: 

Shouldn't it be decode rather than encode? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding" (bytes in, unicode text out)

encode means "encode htmlSource to the utf8 encoding" (unicode text in, bytes out)

Since you are extracting existing data (crawling it from a website), you need to decode it first. When you insert it into MySQL, you may then need to encode it as UTF-8, depending on the character set/collation of your MySQL database, tables, and fields.
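As a minimal sketch of the two directions, the snippet below takes the six bytes quoted in the MySQL warning from the question and assumes they are UTF-8. The variable names are illustrative only.

```python
# Bytes copied from the MySQL warning in the question; assumed to be UTF-8.
raw = b'\xe7\xb9\x81\xe9\xab\x94'

# decode: bytes (as received from the website) -> unicode text
text = raw.decode('utf-8')

# encode: unicode text -> bytes (for example, before writing to a UTF-8 column)
encoded = text.encode('utf-8')

# Six bytes decode to two characters, and the round trip is lossless.
print(len(raw), len(text), encoded == raw)
```

If the decode step raises UnicodeDecodeError, the page was probably not UTF-8 and a different source encoding has to be used.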

S.Mark
I want to encode it so that I can insert it into the database
TIMEX
I have updated it; please read it again.
S.Mark
+1  A: 

You probably want to decode UTF-8, not encode it:

htmlSource =  htmlSource.decode('utf8')
sth
+4  A: 

Your HTML data is a byte string that comes from the internet already encoded with some encoding. Before encoding it to UTF-8, you must decode it first.

Python is implicitly trying to decode it, using the default ascii codec (that's why you get a UnicodeDecodeError, not a UnicodeEncodeError).
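The hidden step can be made visible with a small sketch. In Python 2, calling .encode() on a byte string first decodes it with the ascii codec; the snippet below performs that decode explicitly on some sample non-ASCII bytes (chosen here for illustration) and captures the resulting error.

```python
# Sample non-ASCII bytes, standing in for the downloaded page.
raw = b'\xe7\xb9\x81'

try:
    # This is the decode that Python 2 performs implicitly
    # when .encode() is called on a byte string.
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
    print(message)
```

The error text names the ascii codec and the first offending byte, just like the traceback in the question.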

You can solve the problem by explicitly decoding your bytestring (using the appropriate encoding) before trying to re-encode it to UTF-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Replace 'some_encoding' with the encoding the page was actually encoded in.

You have to know which encoding a string is using before you can decode it.
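One common place to look for the encoding is the charset parameter of the Content-Type response header, which is what the commented-out lines in the question attempt. The following is only a sketch: charset_from_content_type and to_utf8 are hypothetical helper names, the utf-8 fallback is an assumption, and the header can be missing or wrong (some pages declare the encoding only in a meta tag).

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or default."""
    # Skip the media type itself and inspect each ';'-separated parameter.
    for param in content_type.split(';')[1:]:
        name, _, value = param.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"\'')
    return default  # assumption: fall back to UTF-8 when nothing is declared


def to_utf8(raw, content_type):
    """Decode raw bytes using the declared charset, then re-encode as UTF-8."""
    encoding = charset_from_content_type(content_type)
    return raw.decode(encoding).encode('utf-8')


# Illustrative data: a tiny "page" encoded as ISO-8859-1.
latin1_page = u'caf\xe9'.encode('iso-8859-1')
utf8_page = to_utf8(latin1_page, 'text/html; charset=ISO-8859-1')
```

In the question's code, the header value would come from urlResponse.headers['content-type'], and the result of to_utf8 would be what gets inserted into the database.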

nosklo