ansaurus

Question

Answer 1

+1 A:

Don't use regular expressions to parse HTML! Try JTidy, or any other open-source HTML parser for Java that fits your requirements.

Using these libraries will make your life much easier because they are designed to parse HTML. You can grab the nodes you want and extract text from them.
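As a rough illustration of the parser approach, here is a minimal sketch that pulls the title text out of a page. It does not use JTidy; it swaps in the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) so that it runs without extra dependencies. The class name `TitleParserSketch` and the method `extractTitle` are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
class TitleParserSketch {

    /** Returns the text of the first <title> element, or null if none was seen. */
    static String extractTitle(String html) {
        final StringBuilder title = new StringBuilder();
        // state[0] = currently inside <title>, state[1] = a title has been closed
        final boolean[] state = new boolean[2];

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE && !state[1]) {
                    state[0] = true;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (state[0]) {
                    title.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE && state[0]) {
                    state[0] = false;
                    state[1] = true;
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return state[1] ? title.toString() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>test title</title></head>"
                + "<body><p>blabla bla more text</p></body></html>";
        System.out.println(extractTitle(html));
    }
}
```

The callback only reacts to the tags it cares about, so tag case, attributes, and line breaks inside the markup are the parser's problem rather than yours. A dedicated library such as JTidy would give you a DOM tree instead of callbacks, but the idea is the same.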

Vivin Paliath 2010-10-29 22:20:17

Don't you think you're being a little bit harsh?

tchrist 2010-10-30 04:18:21

@tchrist How am I being harsh?

Vivin Paliath 2010-10-30 15:57:05

You're being harsh because you do not know the poster's exact circumstances. There are limited situations where patterns *can* be used to match HTML to good effect. The standard SO dogma, while well-intentioned, can be unkind overkill. If you know your input set, it's not too hard. If you do not, then it is. See my other, longer comment.

tchrist 2010-10-30 16:12:41

@tchrist, my position is not dogma. Regardless of circumstances, I honestly believe that using regexes to parse HTML is a bad idea. While I agree with you that they may be useful for limited input sets (I myself have used them in perl one-liners or in sed), I like to err on the side of caution and mention why using regexes in this case is not such a good idea. In particular, I have never used regexes to parse HTML in production code. I prefer to go with a less error-prone, more maintainable, and more reliable method (an HTML parser).

Vivin Paliath 2010-10-30 16:37:15

Vivin, I have on occasion used pattern matching on discrete HTML **that I myself programmatically generated**. It's like the old Outer Limits refrain: *“We control the vertical; we control the horizontal.”* That is perfectly safe, but I'm probably more careful about this than 99.9% of programmers. Otherwise, I do make use of HTML parsing classes. Couldn't imagine using patterns on unknown HTML. I just reject the standard SO dogma on this as being overly fastidious and facile. I suppose I should go find the sources of all this dogma and amend them with strong provisos.

tchrist 2010-10-30 18:03:43

@tchrist I agree. As long as the problem domain is small then a regular expression suffices; I usually find that to be the exception rather than the rule though :)

Vivin Paliath 2010-10-30 19:45:53

Answer 2

+1 A:

This is how you would use a regex to extract the text between the title tags:

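    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String s = "<title>test title</title>";
    // The reluctant (.*?) stops at the first closing tag instead of the last one
    Pattern p = Pattern.compile("<title>(.*?)</title>");
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println(m.group(1)); // prints: test title
    }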

dogbane 2010-10-29 23:35:24

This is the gentleman's way!

Csaryus 2010-10-29 23:46:50

Errors in that answer: (1) HTML is not case sensitive; (2) dot will not match line terminators; (3) you forgot to account for standard attributes; (4) you should not match within comments or script tags; (5) a minimal match doesn't guarantee that it will not contain a duplicate open tag on malformed input; (6) you should not match within quoted attributes. There are probably more errors, but those are just off the top of my head. Nevertheless, this will probably solve his problem. So what does that tell you?
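For what it's worth, points (1) through (3) can be patched with flags and a slightly wider pattern. The following is only a sketch under that assumption; the class name `TitleRegexSketch` and the method `firstTitle` are invented for illustration, and points (4) through (6) remain unsolved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class TitleRegexSketch {

    // (1) CASE_INSENSITIVE accepts <TITLE> as well as <title>.
    // (2) DOTALL lets '.' match line terminators inside the title.
    // (3) [^>]* tolerates attributes on the opening tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title match, or null if there is none. */
    static String firstTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><TITLE lang=\"en\">\n  test title\n</TITLE></HEAD></HTML>";
        System.out.println(firstTitle(html)); // prints: test title
    }
}
```

This still matches inside comments and script blocks, can swallow a second opening tag on malformed input, and breaks if an attribute value contains `>`, so it is only reasonable for input you control.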

tchrist 2010-10-30 04:26:45

Answer 3

A:

It is inadvisable to parse XML/HTML with regular expressions. However, if you absolutely must do this thing that you have asked, try this:

    package org.apache.people.mclark.examples.regex;
    import java.util.regex.*;
    public class Regex1 {
        public static void main(String[] args) {
            final String subjectString = "<title>test title</title>\n" +
              "blabla bla more text";
            Pattern regex = Pattern.compile("<title>(.*?)</title>(.*)",
                    Pattern.DOTALL);
            Matcher regexMatcher = regex.matcher(subjectString);
            if (regexMatcher.find()) {
                String pageTitle = regexMatcher.group(1);
                String leftOvers = regexMatcher.group(2);
                System.out.println("pageTitle[" + pageTitle + "]");
                System.out.println("leftOvers[" + leftOvers + "]");
            } else {
                System.out.println("no match");
            }
        }
    }

I wash my hands of any misbehavior!

Mike Clark 2010-10-30 00:27:49

You don't mean not possible; you mean incredibly difficult to get right in the general case. (Or you're talking only about textbook regular expressions, not modern patterns.) It may be somewhat easier than impossible if we're talking about a rigged demo with a known, finite input set. Perhaps he has one of those. Perhaps he doesn't.

tchrist 2010-10-30 04:57:29

tchrist, regex is not recursive and so it cannot, for example, match nested balanced tags. There are some flavors of regex that have very recently added recursive constructs, but they are difficult to use. Perhaps "impossible" is too strong a word; many things are possible (but not advisable) with regex. In fact, I was merely quoting the standard SO dogma from the regex tag's wiki @ http://stackoverflow.com/tags/regex/info. The level of difficulty for some problems is so high as to be practically impossible for novices to implement correctly.
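To make the nesting problem concrete, here is a small sketch using `java.util.regex`, which has no recursive construct. The class name `NestedTagSketch` and its methods are hypothetical. Neither the reluctant nor the greedy quantifier pairs up the tags correctly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class NestedTagSketch {

    /** Captures the body of the first <div> using a reluctant quantifier. */
    static String reluctantBody(String html) {
        Matcher m = Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Captures the body of the first <div> using a greedy quantifier. */
    static String greedyBody(String html) {
        Matcher m = Pattern.compile("<div>(.*)</div>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Nested input: the reluctant match stops at the inner closing tag.
        System.out.println(reluctantBody("<div>a<div>b</div>c</div>")); // prints: a<div>b
        // Sibling input: the greedy match runs through to the last closing tag.
        System.out.println(greedyBody("<div>a</div><div>b</div>"));     // prints: a</div><div>b
    }
}
```

Getting the right answer for both inputs requires counting open and close tags, which is exactly the bookkeeping a parser does for you.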

Mike Clark 2010-10-30 06:09:18

Modern patterns certainly *are* [recursive](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386). But using them for [matching HTML](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840) is so error-prone and difficult in the general case as not to be worth the effort. It works easily only for fully restricted input sets of known characteristics, although in those cases it can often do a good job. The rub is that input is seldom as limited as people believe.

tchrist 2010-10-30 16:07:43

http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386 << Good info, thanks.

Mike Clark 2010-10-30 16:59:00


Java String Manipulating HTML Tags
