I am trying to clean all of the HTML out of a string so the final output is a text file. I have some some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying html. To begin comparing the speed of my solution and one of the alternatives for example pyparsing I decided to test replace of \xa0 using the string method replace. I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
The actual line of code was
s=unicodestring.replace('\xa0','')
Anyway-I decided that I needed to preface it with an r so I ran this line of code:
s=unicodestring.replace(r'\xa0','')
It runs without error but I when I look at a slice of s I see that the \xaO is still there