views:

16

answers:

1

I am writing a simple search algorithm for wikipedia. I am having trouble when I send a query with characters that have accents and other characters that are not seen in regular english. Queries that return in error are:
http://en.wikipedia.org/w/api.php?action=query&titles=Albrecht%20Dürer&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Ancien%20Régime&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Feigenbaum-Cvitanović&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Banach–Tarski%20paradox&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20der%20Mengenlehre&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Grundzüge%20einer%20Theorie%20der%20geordneten%20Mengen&prop=links&pllimit=33&format=xml
http://en.wikipedia.org/w/api.php?action=query&titles=Karl%20Bögel&prop=links&pllimit=33&format=xml

But the query works fine if there are simple character such as "Fractals". How should I change the format of the query to make this work?

My code is open sourced at: http://code.google.com/p/wikipediafoundation/source/browse/. Please look at hg/src/list.py.

+1  A: 

I don't see any trace in your Python source of how you're encoding any non-ascii characters you're sending in the query. For URLs (including query strings in them) using anything beyond ascii, you need to (make them unicode if they already aren't, then) encode them in utf-8 and percent-escape the result (for the latter use function urllib.quote_plus from the standard Python library module urllib, and for encoding, of course, the unicode string's .encode('utf8') method -- if you need to make a unicode string from a differently-encoded byte string, use the byte string's .decode('latin-1') -- or whatever the name of the encoding it's in, of course;-).

Alex Martelli
Venkat S. Rao