I'm having trouble with a regexp. I'm looking through a set of XML files and trying to detect text inside specific nodes, but only when that text contains a line break.

Here is some sample data:

<item name='GenMsgText'><text>The signature will be discarded.</text></item>

<item name='GenMsgText'><text>The signature will be discarded.<break/>
Do you want to continue?</text></item>

In that sample, I want to catch only the text in the second node. I've come up with the solution below, which uses a second regexp, but I'd like to know whether I can do the same thing with only one.

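# Capture the text of a GenMsgText item; /s lets . match newlines.
if ($content =~ m{<item name='GenMsgText'>(<textlist>)?<text>(.*?)</text>}si)
{
    my $t = $2;
    # Second test: keep the text only if it contains a line break.
    if ($t =~ m{\n})
    {
        print G $t . "\n\n";
    }
}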


This is for a one-shot tool that isn't meant to be reused, so I'd like to avoid writing any parsing code that's more than a few lines. Besides, the code above already works; I'm asking for personal knowledge more than for real use.

A: 

I'm not sure, but I think this should work:

<item name='GenMsgText'>(<textlist>)?<text>(.*\n.*)</text>
Max
Nope, this catches way more than what I need.
Antoine
+3  A: 

I would consider using a SAX parser for that. Regex is too fragile to handle XML input.

Eider Oliveira
It's not that regex is fragile; it's that it can't parse nested structures in a sensible way.
Tomalak
+5  A: 

Regex is not the right tool for this task; it simply can't handle nested structures very well. If you have a DOM API at your disposal, one of these XPath expressions would find the right nodes.

If you are looking for <break/> elements, as your example suggests:

//item[@name='GenMsgText']/text[break]

For "real" line breaks, being CR (0xD) or LF (0xA):

//item[@name='GenMsgText']/text[contains(., '&#xD;') or contains(., '&#xA;')]
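
For what it's worth, here is a minimal Perl sketch of running both expressions. It assumes the CPAN module XML::LibXML is installed, and the <items> root element and the third item are invented for the demo. As far as I know, the &#xD; / &#xA; notation is only resolved when the expression is embedded in XML (for example in XSLT), so the sketch lets Perl interpolate the raw control characters into the XPath string instead.

```perl
use strict;
use warnings;
use XML::LibXML;   # assumption: this CPAN module is installed

# Hypothetical wrapper document; <items> is an invented root element.
my $xml = <<'XML';
<items>
<item name='GenMsgText'><text>The signature will be discarded.</text></item>
<item name='GenMsgText'><text>The signature will be discarded.<break/>
Do you want to continue?</text></item>
<item name='Other'><text>Ignored<break/>
entirely.</text></item>
</items>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# <text> elements that have a <break/> child.
my @with_break = $doc->findnodes(q{//item[@name='GenMsgText']/text[break]});

# <text> elements whose string value contains a CR or LF. The control
# characters are interpolated by Perl (qq{}), so the XPath engine sees them raw.
my @with_newline = $doc->findnodes(
    qq{//item[\@name='GenMsgText']/text[contains(., "\r") or contains(., "\n")]}
);

print scalar(@with_break), " node(s) with <break/>\n";
print $_->textContent, "\n" for @with_newline;
```

With this made-up input, both queries should select only the second item.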
Tomalak
A: 

The problem is that your s-mode .*? can match angle brackets as well as newlines. If the match attempt starts in an element that can't satisfy the pattern, there's nothing to stop it from continuing into the next element. If you know there will never be angle brackets in the text, you can confine the match to a single element like this:

<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>
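
A quick way to check this is the Perl sketch below. The three items are made-up test data, not taken from any real file. Note the third one: because the pattern assumes no angle brackets in the text, an item that contains a <break/> tag (as the question's sample does) is skipped even though it has a linefeed.

```perl
use strict;
use warnings;

# Hypothetical input: only the second item has a linefeed in bracket-free text.
# The third has a linefeed too, but also a <break/> tag inside the text.
my $content = join "\n",
    "<item name='GenMsgText'><text>foo</text></item>",
    "<item name='GenMsgText'><text>first line\nsecond line\nthird line</text></item>",
    "<item name='GenMsgText'><text>before<break/>\nafter</text></item>",
    "";

# No /s modifier needed: negated character classes match newlines on their own.
my $re = qr{<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>};

# In list context, /g returns the capture from every match.
my @hits = $content =~ /$re/g;

print "$_\n--\n" for @hits;
```

Only the three-line text is captured here; the second character class absorbs the extra linefeed.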

EDIT: It's important to note that the regexes offered by Max and Kibbee should not be applied in s-mode (/s, single-line, DOTALL...). That's what keeps them from matching beyond the end of the "item" element: in order to reach the next one they would have to match the line separators between the elements.

But even without the /s modifier, both regexes can fail if there are two elements without internal linefeeds on successive lines (i.e., with only one linefeed between them). For example, these two lines would be matched as one:

<item name='GenMsgText'><text>foo</text></item>
<item name='GenMsgText'><text>bar</text></item>
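
The failure can be reproduced with a small Perl sketch. The variable names are mine, and the patterns are copied from the answers in this thread and applied without /s:

```perl
use strict;
use warnings;

# The two successive elements from the example; neither contains a linefeed.
my $content = "<item name='GenMsgText'><text>foo</text></item>\n"
            . "<item name='GenMsgText'><text>bar</text></item>\n";

# Dot-based patterns from the other answers, without the /s modifier.
my $greedy = qr{<item name='GenMsgText'>(<textlist>)?<text>(.*\n.*)</text>};
my $lazy   = qr{<item name='GenMsgText'><text>(.*?\n.*?)</text></item>};
# Character-class pattern, which cannot cross a tag boundary.
my $strict = qr{<item name='GenMsgText'><text>([^<>\n]*\n[^<>]*)</text></item>};

my $greedy_hit = $content =~ $greedy ? $2 : undef;
my $lazy_hit   = $content =~ $lazy   ? $1 : undef;
my $strict_hit = $content =~ $strict ? $1 : undef;

printf "%-6s %s\n", 'greedy', defined $greedy_hit ? 'matched across both elements' : 'no match';
printf "%-6s %s\n", 'lazy',   defined $lazy_hit   ? 'matched across both elements' : 'no match';
printf "%-6s %s\n", 'strict', defined $strict_hit ? 'matched' : 'no match';
```

Both dot-based patterns use the linefeed between the elements as their required \n, so the capture runs from "foo" in the first element to "bar" in the second.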

On the other hand, what if the text contains more than two lines? Without /s, the other regexes match exactly one linefeed, so they would fail. In my regex, I explicitly match the first linefeed to make sure there is one, but if there are any more linefeeds, they'll be matched by the second character class: [^<>]*

This kind of thing is why I tend to avoid using .* or .*?.

Alan Moore
A: 

Along the same lines as what Alan mentioned, you can use a lazy capture to capture only as much as necessary before matching the closing text tag:

<item name='GenMsgText'><text>(.*?\n.*?)</text></item>

But again, regex is probably completely the wrong tool for the job, and you should be using a proper XML parser.

Kibbee