ansaurus

Question

How to make dot match newline characters using regular expressions

Answer 1

+7 A:

You need to use the DOTALL modifier.

'/<div>(.*)<\/div>/s'

This might not give you exactly what you want because you are greedy matching. You might instead try a non-greedy match:

'/<div>(.*?)<\/div>/s'
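To make the difference between these two patterns concrete, here is a minimal sketch in Python. Using Python is an assumption, since the patterns above are written with PCRE-style delimiters; `re.DOTALL` plays the role of the `s` modifier, and the sample string `html` is made up for illustration.

```python
import re

# Hypothetical sample input: the first div spans two lines.
html = "<div>first\nblock</div>\n<div>second</div>"

# No flag: the dot stops at "\n", so only the single-line div matches.
no_flag = re.findall(r"<div>(.*)</div>", html)

# DOTALL + greedy: a single match from the first <div> to the last </div>.
greedy = re.findall(r"<div>(.*)</div>", html, re.DOTALL)

# DOTALL + non-greedy: each div is matched on its own.
lazy = re.findall(r"<div>(.*?)</div>", html, re.DOTALL)

print(no_flag)  # ['second']
print(greedy)   # ['first\nblock</div>\n<div>second']
print(lazy)     # ['first\nblock', 'second']
```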

You could also solve this by matching everything except '<' if there aren't other tags:

'/<div>([^<]*)<\/div>/'

Another observation is that you don't need to use / as your regular expression delimiter. Using another character means that you don't have to escape the / in </div>, which improves readability. This applies to all of the above regular expressions. Here's how it would look if you use '#' instead of '/':

'#<div>([^<]*)</div>#'
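As a rough illustration of how the negated character class behaves, the sketch below uses Python (an assumption; Python patterns have no delimiters, so nothing needs escaping). The input strings are invented for the example.

```python
import re

# [^<] matches any character except "<", including newlines,
# so no DOTALL flag is needed here.
pattern = re.compile(r"<div>([^<]*)</div>")

multi_line = pattern.findall("<div>line one\nline two</div>")

# The trade-off: any inner tag stops [^<]* before </div>, so nothing matches.
with_inner_tag = pattern.findall("<div>some <b>bold</b> text</div>")

print(multi_line)      # ['line one\nline two']
print(with_inner_tag)  # []
```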

However all these solutions can fail due to nested divs, extra whitespace, HTML comments and various other things. HTML is too complicated to parse with Regex, so you should consider using an HTML parser instead.
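For comparison, here is a minimal sketch of the parser approach, assuming Python and its standard-library `html.parser` module. The class name `DivTextCollector` and the sample markup are hypothetical, and a real project might prefer a full DOM library instead.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text inside each outermost <div>, nested tags included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.current = []   # text chunks of the div being read
        self.results = []   # one string per outermost div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


collector = DivTextCollector()
collector.feed("<div>a<div>b</div>c</div><div>some <b>bold</b>\ntext</div>")
collector.close()
print(collector.results)  # ['abc', 'some bold\ntext']
```

Nested divs, inner tags and newlines are all handled by tracking the nesting depth rather than by pattern matching.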

Mark Byers 2009-12-31 16:04:24

Can I ask you why there is [^<] in '/<div>[^<]*<\/div>/m'? I know what it means but I don't understand why you are using it. I think that it can cause problems for example with <div>some <b>bold</b> text</div>

Gaim 2009-12-31 16:10:56

@Gaim, yes but the original also has problems with `<div>a<div>b</div>c</div>`. Both solutions are wrong, they're just wrong in different ways. Regex can't parse HTML correctly.

Mark Byers 2009-12-31 16:12:58

+1 for the regex vs. HTML insight. If you need to work with HTML, you probably want some sort of DOM.

Williham Totland 2009-12-31 16:16:07

Ok, I know that it is subjective but I think that problems only with `<div>a<div>b</div>c</div>` are lesser evil than problems with all nested tags. Btw I think that you are missing `/m` in the last expression because it will be still only single-line.

Gaim 2009-12-31 16:21:20

@Gaim: Well spotted with the missing m, but it's not an error. The [^<] construct matches everything except `<` - including new lines, so the m modifier is no longer necessary.

Mark Byers 2009-12-31 16:31:07

`/s` is the modifier that lets the dot match newlines (single-line or DOTALL mode); `/m` changes the behavior of `^` and `$` (multiline mode). Unless you're working with Ruby, where multiline mode is always on and `/m` turns on DOTALL mode. Or JavaScript, which has no DOTALL mode.

Alan Moore 2009-12-31 16:39:04
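The distinction between the two modifiers can be sketched in Python (an assumption; there the flags are spelled `re.DOTALL` and `re.MULTILINE`, corresponding to `s` and `m`). The sample string is made up.

```python
import re

text = "one\ntwo"

# DOTALL only affects what the dot can match.
dot_default = re.search(r"one.two", text)             # dot refuses "\n"
dot_dotall = re.search(r"one.two", text, re.DOTALL)   # dot accepts "\n"

# MULTILINE only affects where ^ and $ can match.
anchors_default = re.findall(r"^\w+$", text)
anchors_multiline = re.findall(r"^\w+$", text, re.MULTILINE)

print(dot_default)        # None
print(dot_dotall.group()) # one, newline, two
print(anchors_default)    # []
print(anchors_multiline)  # ['one', 'two']
```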

Oops! I concentrated on everything but the question. ;) Thanks for the correction, and I've included a link to the docs.

Mark Byers 2009-12-31 16:48:40

@Mark, if you're using the negated character class `[^<]*`, which may be a good idea depending on the nature of the subject string, I'd make it possessive to prevent possible needless backtracking: `#<div>([^<]*+)</div>#`.

Geert 2010-01-01 06:35:46

Answer 2

A:

There is usually a flag in the regular expression compiler to tell it that dot should match newline characters.

pau.estalella 2009-12-31 16:05:56
