ansaurus

Question

Regex detect linebreak inside a XML node

Answer 1

A:

I'm not sure, but I think this should work:

<item name='GenMsgText'>(<textlist>)?<text>(.*\n.*)</text>

Max 2008-12-17 10:11:26

Nope, this catches way more than what I need.

Antoine 2008-12-17 10:48:32

Answer 2

+3 A:

You should consider using a SAX parser for that. Regex is too fragile to handle XML input.
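
As a rough illustration of the SAX approach, here is a minimal sketch using Python's standard xml.sax module. The document layout (sibling, non-nested item elements that each contain a text element) is an assumption based on the regexes in this thread, and the names LineBreakHandler and find_multiline_texts are made up for this example.

```python
import xml.sax


class LineBreakHandler(xml.sax.ContentHandler):
    """Collect the text of <text> nodes inside <item name='GenMsgText'>
    whose character data contains a line break."""

    def __init__(self):
        super().__init__()
        self.in_item = False   # inside a matching <item>
        self.in_text = False   # inside a <text> of a matching <item>
        self.buffer = []       # character chunks of the current <text>
        self.matches = []      # collected multi-line texts

    def startElement(self, name, attrs):
        if name == "item" and attrs.get("name") == "GenMsgText":
            self.in_item = True
        elif name == "text" and self.in_item:
            self.in_text = True
            self.buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self.in_text:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "text" and self.in_text:
            text = "".join(self.buffer)
            if "\n" in text or "\r" in text:
                self.matches.append(text)
            self.in_text = False
        elif name == "item":
            # Assumes <item> elements are not nested inside each other.
            self.in_item = False


def find_multiline_texts(xml_string):
    """Return the text of every matching <text> node that spans lines."""
    handler = LineBreakHandler()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return handler.matches
```

Because the parser tracks element boundaries itself, a line break between two elements can never be mistaken for one inside a text node.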

Eider Oliveira 2008-12-17 10:24:05

It's not that regex is fragile; it's that it can't parse nested structures in a sensible way.

Tomalak 2008-12-17 12:58:43

Answer 3

+5 A:

Regex is not the right tool for this task; it simply can't handle nested structures very well. If you have a DOM API at your disposal, an XPath expression will find the right nodes.

If you are looking for <break/> elements, as your example suggests:

//item[@name='GenMsgText']/text[break]

For "real" line breaks, being CR (0xD) or LF (0xA):

//item[@name='GenMsgText']/text[contains(., '&#xD;') or contains(., '&#xA;')]
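
For anyone who wants to try these queries, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document and the function names are assumptions made up for illustration. ElementTree only implements a subset of XPath: the first expression works as written (with a leading dot), but contains() is not supported, so the second check is done on the node's string value in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real document layout may differ.
SAMPLE = """<root>
<item name='GenMsgText'><text>first line<break/>second line</text></item>
<item name='GenMsgText'><text>line one
line two</text></item>
<item name='GenMsgText'><text>no break here</text></item>
</root>"""


def texts_with_break_element(root):
    """<text> nodes that have a <break/> child element."""
    return root.findall(".//item[@name='GenMsgText']/text[break]")


def texts_with_line_break(root):
    """<text> nodes whose string value contains CR or LF."""
    hits = []
    for node in root.findall(".//item[@name='GenMsgText']/text"):
        # itertext() yields all descendant text, like XPath's string value.
        value = "".join(node.itertext())
        if "\r" in value or "\n" in value:
            hits.append(node)
    return hits


root = ET.fromstring(SAMPLE)
print(len(texts_with_break_element(root)))
print(len(texts_with_line_break(root)))
```

A full XPath 1.0 engine should accept both expressions from this answer unchanged.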

Tomalak 2008-12-17 13:03:32

Answer 4

A:

The problem is that your s-mode .*? can match angle brackets as well as newlines. If the regex starts matching at an element that can't satisfy the pattern, there's nothing to stop it from continuing the match attempt into the next element. If you know there will never be angle brackets in the text, you can confine the match to a single element like this:

<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>

EDIT: It's important to note that the regexes offered by Max and Kibbee should not be applied in s-mode (/s, single-line, DOTALL...). That's what keeps them from matching beyond the end of the "item" element: in order to reach the next one they would have to match the line separators between the elements.

But even without the /s modifier, both regexes can fail if there are two elements without internal linefeeds on successive lines (i.e., with only one linefeed between them). For example, these two lines would be matched as one:

<item name='GenMsgText'><text>foo</text></item>
<item name='GenMsgText'><text>bar</text></item>

On the other hand, what if there are more than two lines in text? The other regexes match exactly one linefeed, so they would fail. In my regex, I explicitly match the first linefeed to make sure there is one, but if there are any more linefeeds, they'll be matched by the second character class: [^<>]*
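
Both cases can be checked with a short Python sketch. The names STRICT, LAZY and captured are made up for this example, and the input strings are minimal assumptions modeled on the lines above. Neither pattern is compiled with re.DOTALL, so the dot does not match a linefeed.

```python
import re

# Character-class pattern: neither class can cross a tag boundary.
STRICT = re.compile(
    r"<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>"
)
# Lazy-dot pattern, without DOTALL ('.' does not match '\n').
LAZY = re.compile(
    r"<item name='GenMsgText'><text>(.*?\n.*?)</text></item>"
)

# Two single-line elements separated by exactly one linefeed.
two_elements = (
    "<item name='GenMsgText'><text>foo</text></item>\n"
    "<item name='GenMsgText'><text>bar</text></item>"
)
# One element whose text spans three lines.
three_lines = "<item name='GenMsgText'><text>a\nb\nc</text></item>"


def captured(pattern, data):
    """Return group 1 of the first match, or None if there is no match."""
    m = pattern.search(data)
    return m.group(1) if m else None


print(repr(captured(LAZY, two_elements)))    # false positive across elements
print(repr(captured(STRICT, two_elements)))  # None
print(repr(captured(LAZY, three_lines)))     # None: only one linefeed allowed
print(repr(captured(STRICT, three_lines)))   # 'a\nb\nc'
```

The lazy pattern reports a match that runs from the first element into the second and misses the three-line text, while the character-class pattern gets both cases right.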

This kind of thing is why I tend to avoid using .* or .*?.

Alan Moore 2008-12-17 13:56:45

Answer 5

A:

Along the same lines as what Alan mentioned, you can use a lazy quantifier to capture only as much as necessary before matching the closing </text> tag:

<item name='GenMsgText'><text>(.*?\n.*?)</text></item>

But again, regex is probably completely the wrong tool for the job, and you should be using a proper XML parser.

Kibbee 2008-12-17 14:36:28
