Hey guys, I've been parsing HTML files to scrape text from them, and every so often I get some really weird characters like à€œ. I determined that it's the "smart quotes" and other curly punctuation causing all of my problems, so my temporary fix has been to search for each of these characters individually and replace it with its corresponding HTML code. My question: is there a way to use one regular expression (or something else) to search through the string only once and replace whatever needs replacing based on what is there? My solution right now looks like this:
line = line.replaceAll( "“", "&ldquo;" ).replaceAll( "”", "&rdquo;" );
line = line.replaceAll( "–", "&ndash;" ).replaceAll( "—", "&mdash;" );
line = line.replaceAll( "‘", "&lsquo;" ).replaceAll( "’", "&rsquo;" );
Each replaceAll call scans the whole string again, so it seems like there should be a better and possibly more efficient way of doing this. Any input is greatly appreciated.
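To make the question concrete, here is a rough sketch of the kind of single-pass replacement I have in mind. The class name SmartPunctuation, the method name escape, and the use of named entities are just my own placeholders, and I have not measured whether this is actually faster than the chained replaceAll calls:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names here are illustrative, not from any library.
final class SmartPunctuation {
    // Curly punctuation mapped to named HTML entities.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("\u201C", "&ldquo;"); // left double quote
        ENTITIES.put("\u201D", "&rdquo;"); // right double quote
        ENTITIES.put("\u2018", "&lsquo;"); // left single quote
        ENTITIES.put("\u2019", "&rsquo;"); // right single quote
        ENTITIES.put("\u2013", "&ndash;"); // en dash
        ENTITIES.put("\u2014", "&mdash;"); // em dash
    }

    // One character class that matches any key in the map.
    private static final Pattern CURLY =
            Pattern.compile("[\u201C\u201D\u2018\u2019\u2013\u2014]");

    // Walks the input once and swaps each match for its mapped entity.
    static String escape(String line) {
        Matcher m = CURLY.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps the entity text literal in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(ENTITIES.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With something like this, the three lines above would collapse to line = SmartPunctuation.escape( line ); and adding another character would only mean one more map entry plus one more character in the pattern.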
Thanks,
-Brett