I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.

I have tried something like:

import urllib2

req = urllib2.urlopen('http://lenta.ru')
content = req.read()

The content is a byte stream, so I can't search it for a Unicode string directly.

I need a way, when I call urlopen and then read, to use the charset from the headers to decode the content and then encode it as UTF-8.

+15  A: 

After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

and so:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters) -- so for example to display a part of it, if your terminal is UTF-8:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

and you can search, etc, etc.
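
One caveat with the split('charset=') step above: when the Content-Type header carries no charset parameter, split('charset=')[-1] returns the whole header value (e.g. 'text/html'), and unicode() then fails with a LookupError. Below is a minimal sketch of a more defensive parse. It is written for Python 3 rather than the Python 2 used in this answer, and the helper names charset_from_content_type and decode_body, as well as the UTF-8 fallback, are assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumed choice) when no charset is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode response bytes using the charset named in the header value."""
    encoding = charset_from_content_type(content_type, default)
    # errors="replace" keeps going on stray bytes instead of raising.
    return raw_bytes.decode(encoding, errors="replace")
```

With a real response the header value would come from req.headers['content-type'], as in the session above; the sketch takes it as a plain string so it can be tried without a network connection.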

Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker), but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter, which is completely unrelated to the original question. Once a Unicode string IS correctly input (I'm doing it by codepoints -- goofy but not tricky;-), search is absolutely a no-brainer, so hopefully the original question has been thoroughly answered. Again assuming a UTF-8 terminal:

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
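
Putting the pieces together for the original goal (decode with the page's charset, search, then store as UTF-8), here is a small Python 3 sketch that uses a hard-coded byte string in place of the network response. The sample bytes and the names raw, needle, text and stored are made up for illustration.

```python
# Stand-in for req.read(): a fragment encoded as windows-1251 (assumed sample).
raw = "<title>Lenta.ru: \u0413\u043b\u0430\u0432\u043d\u043e\u0435: </title>".encode("cp1251")

needle = "\u0413\u043b\u0430\u0432\u043d\u043e\u0435"  # the word searched for above

# Searching the undecoded bytes for the UTF-8 form of the word finds nothing,
# because the page bytes are windows-1251, not UTF-8.
found_in_bytes = needle.encode("utf-8") in raw

# Decode first, then search the resulting text.
text = raw.decode("cp1251")
found_in_text = needle in text
position = text.find(needle)

# Re-encode as UTF-8 for storage; decoding it again gives back the same text.
stored = text.encode("utf-8")
```

The point of the sketch is the order of operations: bytes are decoded once at the boundary, all searching happens on text, and UTF-8 is only produced again when writing out.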
Alex Martelli
Hey Alex, thanks for the reply. But if I do: u'Главное' in ucontent it returns False. Is there a better way to do the search?
Vitaly Babiy
How are you inputting that u'...' string? Unicode I/O is tricky, as your terminal AND Python must be on identical wavelengths. Using explicit Unicode codepoints (boring but NOT tricky) works fine, let me edit my answer to show that.
Alex Martelli
I am inputting using the console. If I need to do this for a unit test, what should I set the coding: to at the top of the file?
Vitaly Babiy
Depends entirely on how your terminal/console's encoding is set up! See http://www.python.org/dev/peps/pep-0263/ -- e.g. for utf-8 use the comment # -*- coding: utf-8 -*- at file start.
Alex Martelli
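
For the unit-test case raised here, a minimal sketch of a source file carrying the PEP 263 declaration might look like the following. It assumes the editor really saves the file as UTF-8; the names LITERAL, ESCAPED and contains_word are hypothetical. In Python 3, UTF-8 is already the default source encoding, so the comment matters mainly for Python 2.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file.

LITERAL = u'Главное'  # typed directly; relies on the file being saved as UTF-8
ESCAPED = u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'  # same word by codepoints


def contains_word(text, word=LITERAL):
    """Return True if the Unicode string `word` occurs in `text`."""
    return word in text
```

If the declared encoding and the encoding the file is actually saved in disagree, the literal and the codepoint form will not compare equal, which is one plausible cause of the False result reported above.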
Thanks, Alex, for all your help. I have solved all my problems with Unicode. Thanks a lot for your help.
Vitaly Babiy
Solution is misleading. Your variable name "req" implies you're reading from the request, which is nonsensical. You want to read from the response object, not the request object.
Chris S
@Chris, I just used the same variable name `req` as Vitaly used in the Q -- no idea what it stands for in Vitaly's native language, and it doesn't really matter to me _what_ it stands for (obviously it matters to _you_, since you invested in a downvote, but if you're protesting against Vitaly's native language -- Russian, I imagine, but I don't speak Russian -- it would have been minutely less absurd to downvote the Q introducing the variable name you hate, rather than the A just using the same variable name to help the OP follow and use the solution!-).
Alex Martelli
@Alex Good code uses clear, descriptive, unambiguous words. Regardless of what the author's language is, it's silly to think "req" means something other than "request".
Chris S
@Chris, I'm amazed you have such a mastery of all the many languages spoken in the Russian Federation to know what "req" can possibly mean in each and every one of them (and a better one than somebody who's presumably a native speaker of such a language, too). Are you also one of those kindly native speakers of English who, when somebody else (who doesn't have the good fortune to have been born such a native speaker) appears to have a problem understanding an English sentence, just shouts it out again, but louder?-)
Alex Martelli
Oh and BTW, @Chris, again, why ever are you delivering your diatribe against me ("guilty", at worst, only of using exactly the same variable name as the original poster), and not at the original poster himself...?
Alex Martelli
@Alex, Dude, calm down. Just some friendly advice. Feel free to ignore it and write all the ambiguous and confusing code you want.
Chris S