Hi

I have an old C# program that is being ported to Python 3 for various reasons. Basically, the program fetches a website and searches its content (and processes it, but that is not really relevant). I never had any issues with the actual fetch-and-search routine in C#, but once I ported it to Python it started complaining about invalid Unicode at certain locations.

This is not really a problem, since the source webpage data is the same as in the old C# application, and the old program achieved its goal with the broken data. However, I want the Python 3 decode() method to behave as similarly as possible to the way C# handles such cases internally. Unfortunately, after reading the Python manual and looking into the 'ignore' and 'replace' error handlers, I still can't tell which one better mimics the C# behavior (which I have also failed to identify).

To add some code into the discussion, here is the C# code that handles everything transparently:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string html = reader.ReadToEnd();

The corresponding Python 3 code is as follows:

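from urllib.request import Request, urlopen

req = Request(url)
r = urlopen(req)
data = r.read().decode("utf_8")  # strict by default: raises UnicodeDecodeError on invalid bytes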

However, I want to find out which of the following pieces of code will best mimic the Unicode behavior of the C# code:

data = r.read().decode("utf_8", "replace")

or

data = r.read().decode("utf_8", "ignore")

Can anyone with in-depth Unicode experience give me some pointers on which method is better? The Python manual does describe the behavior, but not in a way that tells me which one I should use...

Thanks in advance for any help!

+2  A: 

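According to http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx the default UTF-8 decoder of C# does not throw on invalid bytes. It uses a replacement fallback, so each invalid sequence is decoded as a replacement character (U+FFFD) instead of being dropped.

Python's 'replace' option for decoding is the closest match to this, although the number of replacement characters emitted for a multi-byte invalid sequence may differ between the two.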
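
To see how the error handlers differ in practice, here is a minimal sketch. The sample bytes are made up for illustration (valid UTF-8 with one stray 0xFF byte) and are not taken from your page:

```python
# Hypothetical sample: valid UTF-8 text with one stray 0xFF byte in the middle.
raw = b"caf\xc3\xa9 \xff ok"

try:
    raw.decode("utf_8")  # default error handler is "strict"
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc   # exc.start is the byte offset of the invalid data

replaced = raw.decode("utf_8", "replace")  # invalid byte becomes U+FFFD
ignored = raw.decode("utf_8", "ignore")    # invalid byte is silently dropped

print(strict_error)
print(repr(replaced))
print(repr(ignored))
```

Running the same page through both the C# program and this snippet should settle the question empirically: U+FFFD characters at the bad spots in the C# string point to 'replace', while characters that simply vanish point to 'ignore'.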

Paul Hankin