Hey guys, I've been parsing HTML files to scrape text from them, and every so often I get some really weird characters like à€œ. I determined that it's the "smart quotes" or curly punctuation that is causing all of my problems, so my temporary fix has been to search for each of these characters and replace it with its corresponding HTML code individually. My question: is there a way to use one regular expression (or something else) to search through the string only once and replace whatever needs replacing based on what is there? My solution right now looks like this:

line = line.replaceAll( "“", "&ldquo;" ).replaceAll( "”", "&rdquo;" );
line = line.replaceAll( "–", "&ndash;" ).replaceAll( "—", "&mdash;" );
line = line.replaceAll( "‘", "&lsquo;" ).replaceAll( "’", "&rsquo;" );

It just seems like there should be a better and possibly more efficient way of doing this. Any input is greatly appreciated.

Thanks,
-Brett

A: 

Don't use regular expressions for HTML. Use a real parser.

This will also help you get around any character encoding issues you might encounter.

Thorbjørn Ravn Andersen
Converting certain characters to entities isn't parsing HTML with regex.
eyelidlessness
The "weird characters" looked like handling UTF-8 incorrectly.
Thorbjørn Ravn Andersen
@Thorbjørn, I realize that. That's still not parsing HTML.
eyelidlessness
@eyelidlessness, op clearly said: "I've been trying to parse through HTML files to scrape text from them." He IS trying to parse HTML using regex and because of this, he is having problems such as this. Who knows what other problems could arise when the recommended way is to use an external library.
Coding District
@Coding, if you read the code (which, as we all know, is more correct than its comments), you can see that the OP is replacing characters in text, not parsing HTML. They are parsing characters, which happen to be in an HTML document, but literally none of the HTML parsing rules apply to the question and its solution. What does an external library have to do with the fact that the question doesn't have anything to do with parsing HTML markup?
eyelidlessness
@eyelidlessness, the phrasing *the "smart quotes" ... is causing all my problems* is a dead give-away that the regexps are just about to break.
Thorbjørn Ravn Andersen
@Thorbjørn, why? Regex is perfectly capable of detecting single characters like curly quotes. Please explain to me how this has anything whatsoever to do with parsing HTML or any other irregular language.
eyelidlessness
+2  A: 

There's a huge thread over here that shows you why it is a bad idea to use regex to parse HTML.

Look for an external library to do this task. One example is JSoup. There's also a tutorial on their website that you can use.

Coding District
Converting certain characters to entities isn't parsing HTML with regex.
eyelidlessness
The regex was for the special multi-byte characters and not to parse my HTML, but thanks a ton for the JSoup reference--hands down, tons better than the Java API HTMLEditorKit.
Brett
+2  A: 

Your file appears to be UTF-8 encoded, but you're reading it as though it were in a single-byte encoding like windows-1252. UTF-8 uses three bytes to encode each of those characters, but when you decode it as windows-1252, each byte is treated as a separate character.
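
To see that effect in isolation, here is a minimal sketch that encodes a left curly quote as UTF-8 and then decodes the bytes as windows-1252. The class and method names are made up for illustration, and it assumes your JVM ships the windows-1252 charset (very common, but not one of the charsets the Java spec guarantees).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: windows-1252 is available on this JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode as UTF-8, then decode with the wrong (single-byte) charset.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Reverse the mistake: recover the original bytes, then decode as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String quote = "\u201C";            // left curly quote, 3 bytes in UTF-8
        String garbled = misdecode(quote);  // one char per byte
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(quote));
    }
}
```

The repair step is only there to show where the garbage comes from. It is not a general fix, since bytes that windows-1252 leaves undefined may not survive the round trip; reading the file with the right encoding in the first place is the real cure.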

When working with text, you should always specify an encoding if possible; don't let the system use its default encoding. In Java, that means using InputStreamReader and OutputStreamWriter instead of FileReader and FileWriter. Any reasonably good text editor should let you specify an encoding as well.
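
As a rough sketch of that advice, the snippet below wraps file streams in InputStreamReader and OutputStreamWriter with UTF-8 given explicitly. The class name, method names, and the temp file are only for illustration; swap in whatever encoding your files actually use.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingDemo {
    // Read a whole file as UTF-8, regardless of the platform default encoding.
    static String readUtf8(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(path.toFile()), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Write text as UTF-8, again without relying on the default encoding.
    static void writeUtf8(Path path, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("encoding-demo", ".html");
        try {
            String html = "<p>\u201Cquoted\u201D</p>\n";
            writeUtf8(tmp, html);
            // The curly quotes come back intact because both sides agree on UTF-8.
            System.out.println(readUtf8(tmp).equals(html));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```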

As for your actual question, no, Java doesn't have a built-in facility for dynamic replacements (unlike most other regex flavors). But it's not too difficult to write your own, or even better, use one that someone else wrote. I posted one from Elliott Hughes in this answer.
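
If you want to roll your own, one way to do it is a lookup table plus Matcher.appendReplacement, which makes a single pass over the string. This is just a sketch, not the Elliott Hughes code mentioned above; the class name, the table, and the choice of named entities are my own assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityReplacer {
    // Hypothetical lookup table: curly punctuation -> HTML entity.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("\u201C", "&ldquo;");
        ENTITIES.put("\u201D", "&rdquo;");
        ENTITIES.put("\u2018", "&lsquo;");
        ENTITIES.put("\u2019", "&rsquo;");
        ENTITIES.put("\u2013", "&ndash;");
        ENTITIES.put("\u2014", "&mdash;");
    }

    // One character class covering every key in the table.
    private static final Pattern PUNCT =
            Pattern.compile("[\u201C\u201D\u2018\u2019\u2013\u2014]");

    static String toEntities(String text) {
        Matcher m = PUNCT.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Pick the replacement based on what was actually matched.
            String entity = ENTITIES.get(m.group());
            // quoteReplacement guards against '$' or '\' in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(entity));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toEntities("\u201Chi\u201D \u2013 it\u2019s"));
    }
}
```

On Java 9 and later, Matcher.replaceAll also accepts a function, which would shorten the loop, but the version above works on older JVMs too.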

One last thing: In your sample code you use replaceAll() to do the replacements, which is overkill and a possible source of bugs. Since you're matching literal text and not regexes, you should be using replace(CharSequence,CharSequence) instead. That way you never have to worry about accidentally including a regex metacharacter and going blooey.
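
A quick illustration of the difference, using a made-up class name and a period as the metacharacter:

```java
class ReplaceVsReplaceAll {
    // Literal replacement: only actual periods are touched.
    static String literalReplace(String s) {
        return s.replace(".", "-");
    }

    // Regex replacement: "." matches any character, so everything is replaced.
    static String regexReplace(String s) {
        return s.replaceAll(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(literalReplace("a.b")); // a-b
        System.out.println(regexReplace("a.b"));   // ---
    }
}
```

With a target like "(" it gets worse: replaceAll throws a PatternSyntaxException, while replace just works.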

Alan Moore
That bit of advice went a long way last night. After a bit of digging on readers vs. input streams, I determined that it would be better if I backed off the input/output streams in favor of readers and writers. Thanks.
Brett
+2  A: 

As stated by others, the recommended way to take care of those characters is to configure your encoding settings.

For comparison, here is a method to re-code UTF-8 sequences as HTML entities using regex:

import java.util.regex.*;

public class UTF8Fixer {
    static String fixUTF8Characters(String str) {
        // Pattern to match most UTF-8 sequences:
        Pattern utf8Pattern = Pattern.compile("[\\xC0-\\xDF][\\x80-\\xBF]{1}|[\\xE0-\\xEF][\\x80-\\xBF]{2}|[\\xF0-\\xF7][\\x80-\\xBF]{3}");

        Matcher utf8Matcher = utf8Pattern.matcher(str);
        StringBuffer buf = new StringBuffer();

        // Search for matches
        while (utf8Matcher.find()) {
            // Decode the character
            String encoded = utf8Matcher.group();
            int codePoint = encoded.codePointAt(0);
            if (codePoint >= 0xF0) {
                codePoint &= 0x07;
            }
            else if (codePoint >= 0xE0) {
                codePoint &= 0x0F;
            }
            else {
                codePoint &= 0x1F;
            }
            for (int i = 1; i < encoded.length(); i++) {
                codePoint = (codePoint << 6) | (encoded.codePointAt(i) & 0x3F);
            }
            // Recode it as an HTML entity
            encoded = String.format("&#%d;", codePoint);
            // Add it to the buffer
            utf8Matcher.appendReplacement(buf,encoded);
        }
        utf8Matcher.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
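        // Each \u00XX char below stands for one raw UTF-8 byte of the curly
        // quotes U+201C (E2 80 9C) and U+201D (E2 80 9D).
        String subject = "String with \u00E2\u0080\u009Cstrange\u00E2\u0080\u009D characters";
        String result = UTF8Fixer.fixUTF8Characters(subject);
        System.out.printf("Subject: %s%n", subject);
        System.out.printf("Result: %s%n", result);
    }
}

Output (the Subject line is omitted here: it contains non-printing control characters, so its appearance depends on your console):

Result: String with &#8220;strange&#8221; characters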

MizardX