I have a text file with ; used as the delimiter. The problem is that it contains some HTML text formatting, such as &gt;
Obviously the ; in this causes problems.
The text file is large and I don't have a list of these HTML strings; there are many different examples, such as &amp;. How can I remove all of them using Python?
The file is a list of names, addresses, phone numbers and a few more fields. I am looking for something like a crap.html.remove(textfile) module.
+3
A:
Take a look at the code from here:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
Of course, this only takes care of HTML entities. You may have other semicolons in the text that mess with your CSV parser. But I guess you already know that...
UPDATE: added a catch for a possible OverflowError.
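For readers on Python 3, where htmlentitydefs and unichr no longer exist, the same idea can be sketched roughly as follows. This is a port under assumptions, not the code from the linked page: it uses html.entities.name2codepoint and chr in their place, and the sample line is made up for illustration.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace HTML/XML character references and named entities in text."""

    def fixup(match):
        token = match.group(0)
        if token.startswith("&#"):
            # Numeric character reference, decimal (&#62;) or hex (&#x3e;).
            try:
                if token.startswith("&#x"):
                    return chr(int(token[3:-1], 16))
                return chr(int(token[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # Named entity such as &gt; or &amp;.
            try:
                return chr(name2codepoint[token[1:-1]])
            except KeyError:
                pass
        return token  # unknown reference: leave as is

    return re.sub(r"&#?\w+;", fixup, text)


# Hypothetical sample line; after unescaping, only delimiter semicolons remain.
line = "Smith &amp; Sons;12 High St;a &gt; b"
fields = unescape(line).split(";")
print(fields)
```

Unknown names such as &bogus; are left untouched, so a stray semicolon from one of those would still need separate handling.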
itsadok
2009-10-28 13:39:17
I get an error:
/Users/vmd/Dropbox/Marketing Material/Leads/formatleaddata.py in removehtml(text)
     40 pass
     41 return text # leave as is
---> 42 return re.sub("?\w+;", fixup, text)
/Library/Frameworks/Python.framework/Versions/5.1.0/lib/python2.5/re.pyc in sub(pattern, repl, string, count)
    148 if a callable, it's passed the match object and must return
    149 a replacement string to be used."""
--> 150 return _compile(pattern, 0).sub(repl, string, count)
    152 def subn(pattern, repl, string, count=0):
Vincent
2009-10-28 16:37:23
That is quite a mouthful, and it's not clear to me what the error is. Do you have an exception type? Maybe you should try posting your exception details in a separate answer, just so we can have proper formatting.
itsadok
2009-10-29 07:57:27
+3
A:
The quickest way is probably to use the undocumented but so far stable unescape method in HTMLParser:
import HTMLParser
s = HTMLParser.HTMLParser().unescape(s)
Note this will necessarily output a Unicode string, so if you have any non-ASCII bytes in there you will need to s.decode(encoding) first.
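On Python 3 the HTMLParser method is no longer the route to take (it was deprecated in 3.4 and removed in 3.9); the documented html.unescape function does the same job. A minimal sketch of the decode-then-unescape order described above, assuming the file is UTF-8 and using a made-up sample record:

```python
import csv
import html
import io

# Hypothetical raw bytes as they might come from the file; UTF-8 is assumed.
raw = "José &amp; Co;5 Main St;555&#45;1234\n".encode("utf-8")

# Decode bytes to text first, then resolve the entities.
text = html.unescape(raw.decode("utf-8"))

# With the entity semicolons gone, ';' can serve as the field delimiter.
rows = list(csv.reader(io.StringIO(text), delimiter=";"))
print(rows)
```

Note that an entity which itself decodes to a semicolon (for example &#59;) would reintroduce a delimiter, so this order only suits data where that does not occur.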
bobince
2009-10-28 13:41:44
+1
A:
On most Unix systems (including your Mac OS X), you can recode the input text file with:
recode html.. file_with_html.txt
This replaces &gt; by ">", etc.
You can call this through Python's subprocess module, for instance.
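A rough sketch of that subprocess call, written for Python 3. The helper names recode_command and recode_html_entities are made up for illustration, and the sketch assumes the recode program is installed and on PATH. When given a file name, recode rewrites that file in place, so it may be wise to work on a copy.

```python
import shutil
import subprocess


def recode_command(path):
    """Build the argument list for the recode call shown above."""
    return ["recode", "html..", path]


def recode_html_entities(path):
    """Run recode on path; raises if recode is missing or exits non-zero."""
    if shutil.which("recode") is None:
        raise FileNotFoundError("the 'recode' program was not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(recode_command(path), check=True)
```

Calling recode_html_entities("file_with_html.txt") would then be the equivalent of running the command by hand.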
EOL
2010-01-02 10:59:58