I have a string that contains ordinary characters, whitespace and newline characters between `<div>` and `</div>`. This regular expression doesn't work: `/<div>(.*)<\/div>/`. That is because `.*` doesn't match newline characters. How can I make it match across newlines?

+7  A: 

You need to use the DOTALL modifier.

'/<div>(.*)<\/div>/s'

This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:

'/<div>(.*?)<\/div>/s'
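The question doesn't name a language, so here is a minimal sketch in Python (an assumption; the patterns above look like PCRE with delimiters). Python's `re.DOTALL` plays the role of the `s` modifier, and the sample string is made up for illustration. It shows how the three variants behave on the same input:

```python
import re

# Hypothetical sample input: two divs, the first one spanning two lines.
html = "<div>first\nline</div>\n<div>second</div>"

# No flag: the dot stops at "\n", so the multi-line div is skipped.
no_flag = re.findall(r"<div>(.*)</div>", html)

# Greedy with DOTALL: runs from the first <div> to the very last </div>.
greedy = re.findall(r"<div>(.*)</div>", html, re.DOTALL)

# Non-greedy with DOTALL: each match stops at the nearest </div>.
lazy = re.findall(r"<div>(.*?)</div>", html, re.DOTALL)

print(no_flag)  # ['second']
print(greedy)   # ['first\nline</div>\n<div>second']
print(lazy)     # ['first\nline', 'second']
```

Python patterns have no delimiters, so the `/` in `</div>` needs no escaping there.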

You could also solve this by matching everything except '<' if there aren't other tags:

'/<div>([^<]*)<\/div>/'
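As a rough illustration of the trade-off (again in Python, which is an assumption about the language), the negated class matches newlines without any flag, but it stops matching as soon as the div contains another tag. The second input is the case raised in the comments:

```python
import re

pattern = re.compile(r"<div>([^<]*)</div>")

# [^<] matches any character except "<", newlines included, so no flag is needed.
plain = pattern.findall("<div>line one\nline two</div>")

# Any inner tag introduces a "<", so this div is not matched at all.
with_inner_tag = pattern.findall("<div>some <b>bold</b> text</div>")

print(plain)           # ['line one\nline two']
print(with_inner_tag)  # []
```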

Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, which improves readability. This applies to all the above regular expressions. Here's how it would look if you used '#' instead of '/':

'#<div>([^<]*)</div>#'

However, all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with regular expressions, so you should consider using an HTML parser instead.
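For comparison, here is a small sketch of the parser approach using Python's standard-library `html.parser` (the language choice and the `DivTextCollector` class are assumptions made for illustration, not something from the question). It tracks nesting depth and collects the text of each outermost div, so nested divs, inner tags and newlines are all handled:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text inside each outermost <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <div> elements currently open
        self.parts = []     # text pieces of the div being read
        self.results = []   # finished text, one entry per outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost div just closed: store its text.
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        # Keep text only while we are inside at least one div.
        if self.depth > 0:
            self.parts.append(data)


collector = DivTextCollector()
collector.feed("<div>a<div>b</div>c</div>\n<div>some <b>bold</b>\ntext</div>")
collector.close()
print(collector.results)  # ['abc', 'some bold\ntext']
```

Note that this sketch keeps only the text and drops the inner markup. If you need the inner HTML itself, a DOM-style library is likely a better fit.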

Mark Byers
Can I ask you why there is [^<] in '/<div>[^<]*<\/div>/m'? I know what it means but I don't understand why you are using it. I think that it can cause problems for example with <div>some <b>bold</b> text</div>
Gaim
@Gaim, yes but the original also has problems with `<div>a<div>b</div>c</div>`. Both solutions are wrong, they're just wrong in different ways. Regex can't parse HTML correctly.
Mark Byers
+1 for the regex vs. HTML insight. If you need to work with HTML, you probably want some sort of DOM.
Williham Totland
OK, I know it is subjective, but I think that having problems only with `<div>a<div>b</div>c</div>` is a lesser evil than having problems with all nested tags. By the way, I think you are missing `/m` in the last expression, because it will still be single-line only.
Gaim
@Gaim: Well spotted with the missing m, but it's not an error. The [^<] construct matches everything except `<` - including new lines, so the m modifier is no longer necessary.
Mark Byers
`/s` is the modifier that lets the dot match newlines (single-line or DOTALL mode); `/m` changes the behavior of `^` and `$` (multiline mode). Unless you're working with Ruby, where multiline mode is always on and `/m` turns on DOTALL mode. Or JavaScript, which has no DOTALL mode.
Alan Moore
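To make the difference between the two modifiers concrete, here is a short Python sketch (Python is an assumption; there the modes are called `re.DOTALL` and `re.MULTILINE`):

```python
import re

text = "one\ntwo"

# DOTALL (the "s" modifier): the dot also matches "\n".
dot_default = re.findall(r"o.*o", text)
dot_dotall = re.findall(r"o.*o", text, re.DOTALL)

# MULTILINE (the "m" modifier): ^ and $ also match at line boundaries.
anchors_default = re.findall(r"^\w+$", text)
anchors_multiline = re.findall(r"^\w+$", text, re.MULTILINE)

print(dot_default, dot_dotall)             # [] ['one\ntwo']
print(anchors_default, anchors_multiline)  # [] ['one', 'two']
```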
Oops! I concentrated on everything but the question. ;) Thanks for the correction, and I've included a link to the docs.
Mark Byers
@Mark, if you're using the negated character class `[^<]*`, which may be a good idea depending on the nature of the subject string, I'd make it possessive to prevent possible needless backtracking: `#<div>([^<]*+)</div>#`.
Geert
A: 

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.
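In Python, for example (an assumed language, since the question names none), the flag can be passed to the compiler or written inline at the start of the pattern. A minimal sketch with a made-up input string:

```python
import re

text = "<div>line one\nline two</div>"

# Flag passed to the regular expression compiler.
compiled = re.compile(r"<div>(.*?)</div>", re.DOTALL)

# Equivalent inline flag at the start of the pattern.
inline = re.compile(r"(?s)<div>(.*?)</div>")

by_flag = compiled.search(text).group(1)
by_inline = inline.search(text).group(1)
print(by_flag == by_inline)  # True
```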

pau.estalella