ansaurus

Question

One regular expression to rule them all (efficiently)?

Answer 1

A:

Don't use regular expressions for HTML. Use a real parser.

This will also help you get around any character encoding issues you might encounter.

Thorbjørn Ravn Andersen 2010-09-02 05:24:11

Converting certain characters to entities isn't parsing HTML with regex.

eyelidlessness 2010-09-02 05:52:50

The "weird characters" looked like handling UTF-8 incorrectly.

Thorbjørn Ravn Andersen 2010-09-02 06:21:59

@Thorbjørn, I realize that. That's still not parsing HTML.

eyelidlessness 2010-09-03 02:17:21

@eyelidlessness, op clearly said: "I've been trying to parse through HTML files to scrape text from them." He IS trying to parse HTML using regex and because of this, he is having problems such as this. Who knows what other problems could arise when the recommended way is to use an external library.

Coding District 2010-09-04 05:48:44

@Coding, if you read the code (which, as we all know, is more correct than its comments), you can see that the OP is replacing characters in text, not parsing HTML. They are parsing characters, which happen to be in an HTML document, but literally none of the HTML parsing rules apply to the question and its solution. What does an external library have to do with the fact that the question doesn't have anything to do with parsing HTML markup?

eyelidlessness 2010-09-04 07:44:31

@eyelidlessness, the phrasing *the "smart quotes" ... is causing all my problems* is a dead give-away that the regexps are just about to break.

Thorbjørn Ravn Andersen 2010-09-05 19:47:08

@Thorbjørn, why? Regex is perfectly capable of detecting single characters like curly quotes. Please explain to me how this has anything whatsoever to do with parsing HTML or any other irregular language.

eyelidlessness 2010-09-05 23:01:11

Answer 2

+2 A:

There's a huge thread over here that shows you why it is a bad idea to use regex to parse HTML.

Look for external libraries to do this task. An example would be: JSoup. There's also a tutorial included in their webpage that you can use.

Coding District 2010-09-02 05:29:10

Converting certain characters to entities isn't parsing HTML with regex.

eyelidlessness 2010-09-02 05:53:06

The regex was for the special multi-byte characters and not to parse my HTML, but thanks a ton for the JSoup reference--hands down, tons better than the Java API HTMLEditorKit.

Brett 2010-09-04 22:56:37

Answer 3

+2 A:

Your file appears to be UTF-8 encoded, but you're reading it as though it were in a single-byte encoding like windows-1252. UTF-8 uses three bytes to encode each of those characters, but when you decode it as windows-1252, each byte is treated as a separate character.
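To make the effect concrete, here is a minimal sketch that encodes a curly quote as UTF-8 and then decodes the bytes with the wrong charset. The class and method names are made up for illustration, and it assumes the JRE ships the windows-1252 charset (common, but not guaranteed by the Java spec).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode those bytes with a different charset.
    // Hypothetical helper, only here to reproduce the symptom.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left curly double quote
        byte[] utf8 = quote.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 3 (bytes E2 80 9C)

        // Assumption: windows-1252 is available in this JRE.
        String garbled = misdecode(quote, Charset.forName("windows-1252"));
        System.out.println(garbled.length()); // prints 3: one char per byte
    }
}
```

One character goes in and three come out, which is the "weird characters" symptom.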

When working with text, you should always specify an encoding if possible; don't let the system use its default encoding. In Java, that means using InputStreamReader and OutputStreamWriter instead of FileReader and FileWriter. Any reasonably good text editor should let you specify an encoding as well.
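A rough sketch of what that could look like, assuming the files really are UTF-8. The class name, method names and temp-file round trip are only for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Write text with an explicit charset instead of the platform default.
    static void writeUtf8(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, again naming the charset explicitly.
    static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("encoding-demo", ".txt");
        tmp.deleteOnExit();
        String text = "\u201Csmart quotes\u201D";
        writeUtf8(tmp, text);
        System.out.println(text.equals(readUtf8(tmp))); // prints true
    }
}
```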

As for your actual question, no, Java doesn't have a built-in facility for dynamic replacements (unlike most other regex flavors). But it's not too difficult to write your own, or even better, use one that someone else wrote. I posted one from Elliott Hughes in this answer.
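For illustration, a hand-rolled version can be sketched with Matcher.appendReplacement and a lookup table. This is not the Elliott Hughes code mentioned above. The class name, the map and the choice of characters are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityReplacer {
    // Hypothetical mapping; extend it with whatever characters need converting.
    private static final Map<String, String> ENTITIES = new HashMap<String, String>();
    static {
        ENTITIES.put("\u2018", "&lsquo;");
        ENTITIES.put("\u2019", "&rsquo;");
        ENTITIES.put("\u201C", "&ldquo;");
        ENTITIES.put("\u201D", "&rdquo;");
    }

    // One character class covering every key in the map.
    private static final Pattern SPECIAL = Pattern.compile("[\u2018\u2019\u201C\u201D]");

    static String toEntities(String input) {
        Matcher m = SPECIAL.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the replacement for whichever character matched.
            String replacement = ENTITIES.get(m.group());
            // quoteReplacement guards against '$' or '\' in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toEntities("\u201Chi\u201D")); // prints &ldquo;hi&rdquo;
    }
}
```

This makes a single pass over the input rather than one pass per character. Java 9 and later also have a Matcher.replaceAll overload that takes a function, which postdates this answer.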

One last thing: In your sample code you use replaceAll() to do the replacements, which is overkill and a possible source of bugs. Since you're matching literal text and not regexes, you should be using replace(CharSequence,CharSequence) instead. That way you never have to worry about accidentally including a regex metacharacter and going blooey.
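A small sketch of the difference, using a dot as the metacharacter (the class and method names are just for illustration):

```java
public class ReplaceVsReplaceAll {
    // Literal replacement: "." means a dot and nothing else.
    static String literal(String s) {
        return s.replace(".", "-");
    }

    // Regex replacement: "." matches any character.
    static String regex(String s) {
        return s.replaceAll(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b")); // prints a-b
        System.out.println(regex("a.b"));   // prints ---
    }
}
```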

Alan Moore 2010-09-02 06:45:59

That bit of advice went a long way last night. After a bit of digging into readers vs. input streams, I determined that it would be better if I backed off the input/output streams in favor of readers and writers. Thanks.

Brett 2010-09-04 22:59:25

Answer 4

+2 A:

As others have stated, the recommended way to handle those characters is to configure your encoding settings.

For comparison, here is a method to re-code UTF-8 sequences as HTML entities using regex:

import java.util.regex.*;

public class UTF8Fixer {
    static String fixUTF8Characters(String str) {
        // Match most 2-, 3- and 4-byte UTF-8 sequences whose bytes were each
        // decoded as a separate char (as happens with ISO-8859-1):
        Pattern utf8Pattern = Pattern.compile("[\\xC0-\\xDF][\\x80-\\xBF]|[\\xE0-\\xEF][\\x80-\\xBF]{2}|[\\xF0-\\xF7][\\x80-\\xBF]{3}");

        Matcher utf8Matcher = utf8Pattern.matcher(str);
        StringBuffer buf = new StringBuffer();

        // Search for matches
        while (utf8Matcher.find()) {
            // Decode the character
            String encoded = utf8Matcher.group();
            int codePoint = encoded.codePointAt(0);
            if (codePoint >= 0xF0) {
                codePoint &= 0x07;
            }
            else if (codePoint >= 0xE0) {
                codePoint &= 0x0F;
            }
            else {
                codePoint &= 0x1F;
            }
            for (int i = 1; i < encoded.length(); i++) {
                codePoint = (codePoint << 6) | (encoded.codePointAt(i) & 0x3F);
            }
            // Recode it as an HTML entity
            encoded = String.format("&#%d;", codePoint);
            // Add it to the buffer
            utf8Matcher.appendReplacement(buf,encoded);
        }
        utf8Matcher.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        String subject = "String with \u00E2\u0080\u0092strange\u00E2\u0080\u0093 characters";
        String result = UTF8Fixer.fixUTF8Characters(subject);
        System.out.printf("Subject: %s%n", subject);
        System.out.printf("Result: %s%n", result);
    }
}

Output:

Subject: String with “strange” characters
Result: String with &#8210;strange&#8211; characters
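If the string only needs repairing rather than converting to entities, a shorter route is to turn the chars back into bytes and decode them again as UTF-8. This is a sketch under the assumption that the text was mis-decoded as ISO-8859-1, where each byte maps to exactly one char. It will not undo a windows-1252 mis-decode as written, and any char above U+00FF is lost. The class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

public class Redecode {
    // Recover the original bytes (one per char), then decode them as UTF-8.
    // Assumes every char in the input is in the range U+0000..U+00FF.
    static String redecode(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String subject = "String with \u00E2\u0080\u0092strange\u00E2\u0080\u0093 characters";
        String fixed = redecode(subject);
        // The two 3-char runs collapse into U+2012 and U+2013.
        System.out.println(fixed.equals("String with \u2012strange\u2013 characters")); // prints true
    }
}
```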

MizardX 2010-09-02 10:30:11
