Hi
I have an old C# program that I am porting to Python 3 for various reasons. Basically, the program fetches a web page and searches its content (it also processes it, but that is not really relevant here). I never had any issues with the fetch-and-search routine in C#, but once I ported it to Python, it started complaining about invalid Unicode at certain locations.
This is not really a problem in itself, since the source web page data is the same as in the old C# application, and the old program achieved its goal despite the broken data. However, I want the Python 3 decode() method to behave as similarly as possible to the way C# handles such cases internally. Unfortunately, after reading the Python manual and looking into the 'ignore' and 'replace' error handlers, I still can't tell which one best mimics the C# behavior (which I have also failed to identify).
To add some code to the discussion, here is the C# code that handles everything transparently:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
string html = reader.ReadToEnd();
The corresponding Python 3 code is as follows:
from urllib.request import Request, urlopen
req = Request(url)
r = urlopen(req)
data = r.read().decode("utf_8")  # strict decoding: raises on invalid bytes
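To show the kind of complaint I mean, here is a minimal sketch that reproduces it without touching the network. The byte string is made up for illustration (it is not the real page data), and `describe_decode_error` is just a hypothetical helper name I picked for this post:

```python
def describe_decode_error(raw: bytes) -> str:
    """Return a short description of the first invalid UTF-8 byte, or 'ok'."""
    try:
        raw.decode("utf_8")  # the default error handler is "strict"
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte in raw
        return f"invalid byte {raw[exc.start]:#04x} at offset {exc.start}"
    return "ok"


# Made-up sample: the byte 0xff can never appear in valid UTF-8.
sample = b"abc\xffdef"
print(describe_decode_error(sample))
```

Running the same helper on `r.read()` should point at the first location where the real page data is broken.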
What I want to find out is which of the following two lines will best mimic the Unicode handling of the C# code:
data = r.read().decode("utf_8", "replace")
or
data = r.read().decode("utf_8", "ignore")
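To make the difference between the two concrete, here is a small sketch run on a made-up byte string (again, not the real page data). As far as I understand the manual, "replace" substitutes U+FFFD (the official replacement character) for each invalid sequence, while "ignore" silently drops it:

```python
# Made-up sample: 0xff is never valid in UTF-8.
sample = b"abc\xffdef"

replaced = sample.decode("utf_8", "replace")  # invalid byte becomes U+FFFD
ignored = sample.decode("utf_8", "ignore")    # invalid byte is dropped

# ascii() escapes non-ASCII characters so the difference is visible.
print(ascii(replaced))  # 'abc\ufffddef'
print(ascii(ignored))   # 'abcdef'
print(len(replaced), len(ignored))  # 7 6
```

So the two variants give strings of different length and content wherever the data is broken, which is why I would like to know which one corresponds to what the C# StreamReader does.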
Can anyone with in-depth Unicode experience give me some pointers on which method is the better match? The Python manual does describe the behavior, but not in a way that tells me which one I should use.
Thanks in advance for any help!