Hi guys!

I need help with regular expressions. I'm trying to use preg_match_all to find <character>...</character> blocks. Here is what my data looks like:

<character>
杜塞尔多夫
杜塞爾多夫
    <div class="hp">dùsàiěrduōfū<div class="hp">dkfjdkfj</div></div>
    <div class="tr"><span class="green"><i>г.</i></span> Duesseldorf (<i>Deutschland</i>)</div>
    <div class="tr"></div>
</character>

<character>
    我, 是谁
    <div class="hp">текст</div>
    <div class="tr">some text in different languages</div>
</character>

I tried \<character\>.*\<\/character> but unfortunately it didn't work. Any suggestions?

A: 

Your pattern uses a greedy quantifier, so .* runs from the first <character> all the way to the last </character>. Use a non-greedy (lazy) quantifier such as .*? instead.

Lyubomyr Shaydariv
A: 

You may need to use the "/u" modifier to correctly process UTF-8 text.

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
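
A minimal sketch of what the u modifier changes, assuming the script file itself is saved as UTF-8. The variable names are only illustrative. Without u the dot consumes one byte at a time; with u it consumes one whole code point.

```php
<?php
// One Chinese character; it occupies three bytes in UTF-8.
$text = '杜';

// Without u, the dot matches a single byte, so there are three matches.
$bytes = preg_match_all('/./', $text, $m1);

// With u, pattern and subject are treated as UTF-8, so there is one match.
$chars = preg_match_all('/./u', $text, $m2);

echo $bytes, ' vs ', $chars, "\n"; // 3 vs 1
```

For a pattern like the one in the question, which only looks for ASCII tag names, the modifier is probably not what makes or breaks the match, but it matters as soon as the pattern itself contains non-ASCII characters or character classes.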

Tim Sylvester
+2  A: 

Try

<character>(.*?)<\/character>

The question mark makes the quantifier ungreedy (lazy), meaning it will match as short a string as possible. Also, < and > don't need escaping.
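
To illustrate the difference, here is a small sketch on a hypothetical one-line input (not the data from the question), comparing the greedy and the ungreedy form of the same pattern:

```php
<?php
// Hypothetical single-line input containing two blocks.
$line = '<character>a</character><character>b</character>';

// Greedy: .* runs to the last closing tag, so both blocks end up in one match.
preg_match_all('/<character>(.*)<\/character>/', $line, $greedy);

// Ungreedy: .*? stops at the first closing tag, so each block matches on its own.
preg_match_all('/<character>(.*?)<\/character>/', $line, $lazy);

print_r($greedy[1]); // one element: a</character><character>b
print_r($lazy[1]);   // two elements: a and b
```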

Jonas
I just wanted to say the same, but I have lost my sample source code that was pretty easy to find. )))
Lyubomyr Shaydariv
+3  A: 

If using the preg family of functions, your regular expression should be:

/\<character>(.*?)\<\/character>/s

The non-greedy operator ? will prevent you from getting only one match that starts at the first <character> and ends at the last </character>. The /s flag will allow your dot to match line breaks.
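
As a rough sketch of why the s flag matters here, the snippet below runs the pattern with and without it on a shortened copy of the question's data. The variable names are assumptions for illustration, and the unnecessary backslash before < is dropped.

```php
<?php
// Shortened copy of the sample data; every block spans several lines.
$data = <<<'XML'
<character>
杜塞尔多夫
    <div class="hp">dùsàiěrduōfū</div>
</character>

<character>
    我, 是谁
    <div class="tr">some text in different languages</div>
</character>
XML;

// Without s, the dot cannot cross a line break, so nothing matches.
$without = preg_match_all('/<character>(.*?)<\/character>/', $data, $m1);

// With s, the dot also matches newlines and each block is found.
$with = preg_match_all('/<character>(.*?)<\/character>/s', $data, $m2);

echo $without, ' vs ', $with, "\n"; // 0 vs 2

foreach ($m2[1] as $block) {
    echo trim($block), "\n---\n";
}
```

Note that this only works as long as <character> blocks are never nested inside each other.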

BipedalShark
`<` needs no escaping.
Bart Kiers
+5  A: 

Unless you're required at gunpoint to use regular expressions to do this, DOMDocument will be far more accurate.

<?php

$dom = new DOMDocument;
$dom->loadXML($data);

$character_nodes = $dom->getElementsByTagName('character');

// use $character_nodes...
?>
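
One caveat: the data in the question has several top-level <character> elements, and loadXML expects a single root. A possible workaround, sketched here under the assumption that the markup is otherwise well-formed, is to wrap the input in a hypothetical <root> element and collect parse errors instead of printing warnings:

```php
<?php
// Shortened sample input: two sibling blocks, no single root element.
$data = '<character>杜塞尔多夫<div class="hp">dùsàiěrduōfū</div></character>'
      . '<character>我, 是谁<div class="tr">some text</div></character>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// <root> is a made-up wrapper so the input has exactly one root element.
$ok = $dom->loadXML('<root>' . $data . '</root>');

if ($ok) {
    foreach ($dom->getElementsByTagName('character') as $node) {
        // textContent is the concatenated text of the element and its children.
        echo $node->textContent, "\n";
    }
} else {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
}
```

If the real input is not well-formed, loadXML will still fail and the else branch lists what the parser objected to.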
seanmonstar
even at gunpoint there's no good reason to use regexes for parsing xml, but it remains possible that the data just looks like xml, but isn't quite valid xml...
Kris
@Kris, I think "not getting shot" remains a good reason to do something when at gunpoint. ;)
BipedalShark
+1 for giving a proper answer. There are DOM parsers for HTML, too. RegEx is a great tool... for other tasks.
TrueWill
My document isn't valid HTML, so I got a lot of errors...
Anthony
Anthony, I asked a similar question before. The aim of the question was loading a not-strict XML/HTML document into DOM. Check this: http://stackoverflow.com/questions/1473214/how-to-parse-not-strict-html-documents-indulgently
Lyubomyr Shaydariv
This is a knee-jerk, pat answer as far as I'm concerned. Regexing through HTML can be perilous, sure, but for limited cases like Anthony's, regular expressions are perfectly fine.
BipedalShark