Greetings!

I'm working on a regular expression in a .NET project to get a specific tag. I would like to match the entire DIV tag and its contents:

<html>
   <head><title>Test</title></head>
   <body>
     <p>The first paragraph.</p>
     <div id='super_special'>
        <p>The Store paragraph</p>
     </div>
   </body>
</html>

Code:

    Regex re = new Regex("(<div id='super_special'>.*?</div>)", RegexOptions.Multiline);


    if (re.IsMatch(test))
        Console.WriteLine("it matches");
    else
        Console.WriteLine("no match");

I want to match this:

<div id="super_special">
   <p>Anything could go in here...doesn't matter.  Let's get it all</p>
</div>

I thought . was supposed to match all characters, but it seems to be having trouble with the carriage returns. What is my regex missing?

Thanks.

+1  A: 

Depends what language you're working in. For example, in Perl you'd use the regex modifier s:

m{<div id="super_special">.*?</div>}s
mopoke
+1  A: 

What language are you using? In .NET, the dot does not match newlines unless you set the RegexOptions.Singleline option.

Mitchel Sellers
A: 

. (dot) matches any single character except line break characters such as \r and \n (the exact set varies by flavor). Most regex flavors have an option to make the dot match line break characters too.
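
To make that concrete, here is a minimal sketch in Python (not .NET; the sample string is just a trimmed copy of the HTML from the question). It only illustrates the default dot behaviour versus the "dot matches newline" option, which Python calls re.DOTALL:

```python
import re

# Trimmed copy of the div from the question, with real line breaks in it.
html = "<div id='super_special'>\n   <p>The Store paragraph</p>\n</div>"
pattern = r"<div id='super_special'>.*?</div>"

# By default the dot stops at "\n", so the pattern cannot span lines.
default_match = re.search(pattern, html)
print(default_match)

# re.DOTALL (alias re.S) lets the dot match "\n" as well.
dotall_match = re.search(pattern, html, re.DOTALL)
print(dotall_match.group(0) == html)
```

The flag name differs between engines, so check the documentation of the one you are using.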

Nescio
A: 

maybe: .[\r\n].[\r\n]

dimarzionist
+1  A: 

Depends on the language. In Python, you are missing the re.S flag, like this (to remove the match):

re.compile('<div id="super_special">.*?</div>', re.S).sub('', your_html)

Similar flags exist in other regex implementations; they are usually called "single line" or "dot all" (not to be confused with "multi line", which typically only changes where ^ and $ match).

But DO NOT USE REGEXPS TO PARSE HTML. It's a direct path to maintenance hell. Use an HTML parser like Beautiful Soup. Check these links for useful resources in that direction.
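
For a rough idea of what the parser route can look like without installing anything, here is a sketch using html.parser from the Python standard library. DivGrabber is a hypothetical helper written for this example, not part of any library, and it rebuilds the div from parser events rather than slicing the original source (so comments are dropped and character references come back decoded):

```python
from html.parser import HTMLParser


class DivGrabber(HTMLParser):
    """Hypothetical helper: rebuilds the first <div> that has the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # how many <div>s deep we are inside the target
        self.parts = []     # rebuilt markup of the target div
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Not inside the target yet: look for the div with the wanted id.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
        else:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


html = """<body>
<p>The first paragraph.</p>
<div id="super_special">
  <div>Nothing</div>
  <p>Anything could go in here</p>
</div>
<div id="not_special">unwanted</div>
</body>"""

grabber = DivGrabber("super_special")
grabber.feed(html)
grabber.close()
result = "".join(grabber.parts)
print(result)
```

Because the parser counts nested divs, the inner `<div>Nothing</div>` does not end the capture early, which is exactly the case a regex gets wrong.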

Vinko Vrsalovic
+6  A: 

Please, pretty please, do yourself a huge favor: use an HTML parser for parsing HTML. Seriously. That's what they are there for.

HTML is a very complex language. No matter how long you will be tweaking, fiddling, fixing, honing your Regexp, there will always be a case you're missing.

Anyway, you have to tell your Regexp engine to let the dot match across line breaks instead of stopping at the end of each line. In some of the most popular engines you do that by applying a modifier: /s in Perl, /m in Ruby.

But let me repeat: please use an HTML parser. Every time someone uses a Regexp to parse HTML, a kitten dies ...

Jörg W Mittag
That might make me revisit my approach. I hate kittens!
Vinko Vrsalovic
+1  A: 

The problem is that the . metacharacter doesn't match newlines by default. You have to use the single-line modifier to achieve this. In .NET, you can either pass RegexOptions.Singleline as the options argument to the constructor or method you're using, or use the modifier directly in the pattern, e.g.:

(?s)(<div id="super_special">.*?</div>)
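
The same inline syntax happens to exist in Python's re module, so here is a small Python sketch (not .NET code; the test string is a shortened copy of the question's HTML) contrasting the inline (?s) modifier with (?m), the inline form of the Multiline option the question used:

```python
import re

test = "<div id='super_special'>\n   <p>The Store paragraph</p>\n</div>"

# (?s) at the start of the pattern turns on "dot matches newline".
inline = re.search(r"(?s)(<div id='super_special'>.*?</div>)", test)

# (?m) is the multi-line modifier: it only changes where ^ and $ match,
# so the dot still stops at the first line break and nothing is found.
multiline = re.search(r"(?m)(<div id='super_special'>.*?</div>)", test)

print(inline is not None)
print(multiline is None)
```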
Bennor McCarthy
+1  A: 

Most languages have some way to make . match newlines:

  • In Java: Pattern.compile("pattern", Pattern.DOTALL);
  • In Perl: /pattern/s; in Ruby: /pattern/m
  • In VB: Regex.IsMatch(s, "pattern", RegexOptions.Singleline)

In general it's not a good idea to use regexp to match XML/HTML, because XML/HTML tags can be nested, for example:

  <div id="super_special">
     <div>Nothing</div>
     <p>Anything could go in here...doesn't matter.  Let's get it all</p>
  </div>

... here you could easily end up matching:

  <div id="super_special">
     <div>Nothing</div>

On the other hand, if you know for sure that the HTML you are matching will always be safe for your regexp, then don't let me stop you (although, even then you should think twice about saving your future self from a potential debugging headache).
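
The nested case above is easy to reproduce. This is a small Python sketch (any engine with a lazy quantifier and a "dot matches newline" flag should behave the same way) run against the nested sample:

```python
import re

html = """<div id="super_special">
   <div>Nothing</div>
   <p>Anything could go in here...doesn't matter.  Let's get it all</p>
</div>"""

# The lazy .*? stops at the first </div> it can reach,
# which here is the closing tag of the inner div.
match = re.search(r'<div id="super_special">.*?</div>', html, re.DOTALL)
captured = match.group(0)
print(captured)
```

The paragraph and the real closing tag of the outer div never make it into the match.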

fd
+1  A: 

Out of the box, without special modifiers, the dot in most regex implementations doesn't match past the end of a line. You should probably look in the documentation of the regex engine you're using for such a modifier.

One other piece of advice: beware of greed! By default, regex quantifiers are greedy, which means that your regex would probably match this:

<div id="super_special">
  I'm the wanted div!
</div>
<div id="not_special">
  I'm not wanted, but I've been caught too :(
</div>

You should check for a "non-greedy" modifier (often a ? after the quantifier, as in .*?), so that your regex stops matching text at the first occurrence of </div>, not at the last one.

Also, as others have said, consider using an HTML parser instead of regexes. It will save you a lot of headache.

Edit: even a non-greedy regex wouldn't work as expected if <div>s are nested! Another reason to consider using an HTML parser.
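
To see the difference between the two modes, here is a quick Python sketch using the two sibling divs from above (re.DOTALL is Python's name for the "dot matches newline" flag):

```python
import re

html = """<div id="super_special">
  I'm the wanted div!
</div>
<div id="not_special">
  I'm not wanted, but I've been caught too :(
</div>"""

# Greedy .* runs to the LAST </div> in the input.
greedy = re.search(r'<div id="super_special">.*</div>', html, re.DOTALL).group(0)

# Non-greedy .*? stops at the FIRST </div> after the opening tag.
lazy = re.search(r'<div id="super_special">.*?</div>', html, re.DOTALL).group(0)

print("not_special" in greedy)
print("not_special" in lazy)
```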

André Neves
A: 

None of these regex suggestions will work. Depending on whether they're greedy or not, they will match either the very last </div> in the document, or the very first </div> after your starting string, which may be a div nested inside the one you're interested in.

Regular expressions are not really the ideal tool for this purpose, but if your situation is simple enough that you don't really want to parse the HTML, you can do this using a Microsoft-proprietary regex extension available in .NET (balancing groups). For an explanation, see this article by Morten Maate.

Mike Kantor
A: 

Regular expressions alone are simply not powerful enough to solve your problem. You need something more powerful, such as context-free grammars. See Chomsky hierarchy at Wikipedia.

In other words (as has been said before), don't use regex to parse HTML.

Martijn