I have a string with some markup which looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I'm trying to strip away everything except the anchor elements with "entry://id=" inside. Thus the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

Writing this match, the closest I've come so far is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (apart from the "why don't you use a parser" :) would be greatly appreciated!

+6  A: 

I would really not use regexps for parsing HTML. HTML isn't regular and there is no end of edge cases to trip you up.

Check out JTidy instead.

Brian Agnew
+1. Questions like this are posted several times a day to SO. Believe it or not, you simply can't parse [X][HT]ML with regex, and trying to do so sets you up for weird errors, confusion and security holes. Don't do it. There are HTML parsers.
bobince
Excuse me, you seem to be using "regular" as a technical term. If you are, could you point me to a reference?
Beta
I'm now going to show my ignorance, and say that regexps won't handle arbitrarily nested structures (you can nest <,> via CDATA sections etc.). I'm not *totally* familiar with the proper definition of 'regular' in this scenario and would welcome comments from a more qualified SOer!
Brian Agnew
He is not looking to validate the HTML or to understand its semantics; he is only trying to remove the tags. The tag structure itself is regular. The only cases I can think of where you can actually embed '>' or '<' in something matched by '<.*>' are CDATA and <!-- -->, which can be handled by the regular expression. If he were trying to do something more complex on the page, JTidy would be the way to go, but for stripping tags I think a regex can work (see my answer).
Jean
@Beta: *regular* in this context means *regular language*, as used in computer science and explained here: http://en.wikipedia.org/wiki/Regular_language.
Otto Allmendinger
@Otto - thanks for that.
Brian Agnew
Jean: Tag structures are not regular. You can perfectly legitimately put a ‘>’ character inside an attribute value and many pages do; it is ‘<’ that you aren't allowed to use (although, some bad pages still even do that...). And technically in SGML-oriented HTML (as opposed to how most browsers actually implement it, and also to what XHTML allows), you can also cause endless havoc with custom entity references in an attribute value.
bobince
Many thanks, Otto!
Beta
bobince: interesting. Hopefully he comes back and reads this. But then his HTML may come from a controlled source, in which case he can make assumptions we can't. He seems happy with it :)
Jean
+1  A: 

Not easily possible with regex. I recommend a parser that understands the semantics of HTML/XML.

If you insist, you could do a multi-step approach, something like:

  • Replace "<(a\s*href="entry:.*?/a)>" with "{{{{\1}}}}"
  • Replace "<(?!/a}}}})[^>]*>" with ""
  • Replace "{{{{" with "<"
  • Replace "}}}}" with ">"
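
The four steps above can be sketched in Java roughly as follows. This is only an illustration of the placeholder trick, not a robust solution: the class and method names (`PlaceholderStrip`, `strip`) are made up for the example, and it assumes the text never contains a literal `{{{{` or `}}}}` of its own.

```java
import java.util.regex.Pattern;

final class PlaceholderStrip {
    // Step 1: anchors to keep, captured without their outer < and >.
    private static final Pattern KEEP =
            Pattern.compile("<(a\\s*href=\"entry:.*?/a)>");
    // Step 2: any remaining tag, except the protected "</a}}}}" closer.
    private static final Pattern OTHER_TAG =
            Pattern.compile("<(?!/a\\}\\}\\}\\})[^>]*>");

    static String strip(String html) {
        // Hide the outer brackets of the kept anchors behind placeholders.
        String hidden = KEEP.matcher(html).replaceAll("{{{{$1}}}}");
        // Remove every tag that is still visible.
        String stripped = OTHER_TAG.matcher(hidden).replaceAll("");
        // Steps 3 and 4: turn the placeholders back into brackets (literal replace).
        return stripped.replace("{{{{", "<").replace("}}}}", ">");
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

Note that the space in front of the removed `<img>` tag survives, so the result ends in `</a> .` rather than `</a>.`.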

Be warned that the above is error-prone and will fail at some point. Consider it an ugly hack, not a real solution. Something like the above is okay for a one-off edit of some text file in a regex-aware text editor, but for repeated, real-world use as part of data processing in an app - not so much.

Tomalak
+1  A: 

Using this:

((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)

and combining it with a replaceAll that uses $2 as the replacement would work for your example. The code below demonstrates it:

import java.util.regex.Pattern;

import static org.junit.Assert.assertEquals;
import org.junit.Test;


public class TestStack1305864 {

    @Test
    public void matcherWithCdataAndComments() {
        String s = "The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        // Two spaces remain after "lazy": the removed CDATA section sat between two spaces.
        String r = "The quick brown fox jumped over the lazy  <a href=\"entry://id=6000009\">dog</a> .";
        // Group 2 captures only the anchors to keep; the other alternatives match markup to discard.
        String pattern = "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)";

        // "$2" re-inserts a kept anchor and expands to nothing for every other match.
        String t = Pattern.compile(pattern).matcher(s).replaceAll("$2");
        System.out.println(t);
        System.out.println(r);
        assertEquals(r, t);
    }
}

The idea is to capture the elements you want to keep in a specific group so you can insert them back into the string.
That way a single replace-all handles both cases:
For every element that doesn't match the interesting ones, the group is empty and the element is replaced with "".
For the interesting elements, the group is not empty and its content is appended to the result string.
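
To make the two cases visible, here is a small sketch that lists every match of the pattern together with what group 2 captured. The class and method names (`GroupTrace`, `trace`) are only illustrative; the pattern is the one from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupTrace {
    static final Pattern TAGS = Pattern.compile(
            "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)");

    /** Returns one line per match: the matched text and the content of group 2. */
    static List<String> trace(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = TAGS.matcher(html);
        while (m.find()) {
            // group(2) is null when the "keep" alternative did not take part in the match.
            String kept = m.group(2) == null ? "(empty)" : m.group(2);
            lines.add(m.group() + " -> " + kept);
        }
        return lines;
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        trace(s).forEach(System.out::println);
    }
}
```

On the string from the question this gives four matches, and only the `entry://` anchor has a non-empty group 2; that is the only piece `$2` puts back.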

edit: handle nested < or > in CDATA and comments
edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern, designed to make regex more readable for maintenance purposes.
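
As a rough sketch of what composing the regex could look like here, the pattern can be built from named pieces. The constant names are my own invention, and since the outer group is dropped in this variant the kept anchor becomes group 1 instead of group 2.

```java
import java.util.regex.Pattern;

final class ComposedTagPattern {
    // Each alternative gets a name that says what it matches.
    static final String KEPT_ANCHOR = "<a href=\"entry://id=\\d+\">.*?</a>";
    static final String CDATA = "<!\\[CDATA\\[.*?\\]\\]>";
    static final String COMMENT = "<!--.*?-->";
    static final String ANY_TAG = "<.*?>";

    // Order matters: the kept anchor must be tried before the catch-all ANY_TAG.
    static final Pattern TAGS = Pattern.compile(
            "(" + KEPT_ANCHOR + ")|" + CDATA + "|" + COMMENT + "|" + ANY_TAG);

    static String strip(String html) {
        // Only the kept anchor is captured, so "$1" is empty for everything else.
        return TAGS.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

The behaviour is meant to be the same as the single-string pattern; the only gain is that each part can be read and changed on its own.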

Jean
Thank you so much! This made my day, and yesterday as well :-)
thomax