




Hey there, I'm trying to write a regular expression for code like this:

    <td>I'm some text</td>
    <td>I'm some text</td>

(I know, bad html code... )

Now I want an expression that looks for every table row and can handle dynamic numbers of ([0-9]{4}). So if there are two cells, I'd like to get an array with the two values, if there are three, there should be all three values inside my array.

My regexp HAS TO start and end with:

!<tr> ..... </tr>!sU

Is that possible?


+1  A: 

Regular expressions are notoriously bad at evaluating hierarchical structures, and especially so with XML. You are much better off using SimpleXML, or DOMDocument with DOMXPath.

See http://www.php.net/manual/en/simplexmlelement.xpath.php for how to use XPath with SimpleXML, and http://www.php.net/manual/en/domxpath.evaluate.php for how it can be done with DOMXPath.

Note that if your case is as simple as given in the question, then SimpleXML is the better choice. There are some cases where DOMDocument would be more appropriate, so it would be good to have more information before making that decision.

For example:

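// <tr> wrapper assumed from the pattern in the question; SimpleXML needs a single root element
$string = <<<XML
<tr>
    <td>I'm some text</td>
    <td>I'm some text</td>
</tr>
XML;

$xml = new SimpleXMLElement($string);

/* Select every <td> inside a <tr> whose text is numeric */
$result = $xml->xpath('//tr/td[text() = number(text())]');

foreach ($result as $node) {
    echo $node, "\n";
}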
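
For real-world HTML that is not well-formed XML, the DOMDocument/DOMXPath route mentioned above may be more forgiving. The following is only a minimal sketch: the sample markup and the variable names ($html, $values, $row) are made up for illustration, and it assumes the wanted cells contain nothing but a four-digit number.

```php
<?php
// Hypothetical sample input; the markup in the question is truncated.
$html = '<table>
<tr><td>I\'m some text</td><td>1234</td><td>5678</td></tr>
<tr><td>4321</td><td><b>8765</b></td><td>0001</td></tr>
</table>';

$doc = new DOMDocument();
// Collect libxml warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$values = [];
foreach ($xpath->query('//tr') as $tr) {
    $row = [];
    // "td" is evaluated relative to the current <tr> node.
    foreach ($xpath->query('td', $tr) as $td) {
        // textContent ignores nested formatting tags such as <b>.
        $text = trim($td->textContent);
        if (preg_match('/^[0-9]{4}$/', $text)) {
            $row[] = $text;
        }
    }
    $values[] = $row;
}
```

Each entry of $values is then an array with as many elements as the row has four-digit cells, which is what the question asks for.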

Jonathan Fingland
Actually, using an object tree representation to extract `<td>` elements within `<tr>` is overkill. I would suggest an event-based API like SAX or PHP's expat-based XML parser.
Ferdinand Beyer
As I'm trying to extract some values out of HTML source code that is much bigger than my example, regex is (imo) the appropriate and fastest (considering programming time) way to get this done.
+2  A: 

This should help you get started:

$html = ...as above
preg_match_all('~<tr>.+?(\d+).+?</tr>~si', $html, $matches);
Add a U to the pattern modifiers list as per the asker's requirement.
Ollie Saunders
Then he only finds the first column of numbers.

Now I want an expression that looks for every table row and can handle dynamic numbers of ([0-9]{4}). So if there are two cells, I'd like to get an array with the two values, if there are three, there should be all three values inside my array. (...) Is that possible?

No, it's not. You cannot write a pattern with a dynamic number of sub-patterns.
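
To illustrate the point, here is a small sketch with a made-up sample row (the markup in the question is truncated, so $row is an assumption). The number of capture groups is fixed by the pattern itself, so a group that is repeated with + is overwritten on every repetition and only the last value is kept.

```php
<?php
// Hypothetical sample row with three numeric cells.
$row = '<tr><td>1111</td><td>2222</td><td>3333</td></tr>';

// One capture group, repeated once per cell.
preg_match('!<tr>(?:<td>([0-9]{4})</td>)+</tr>!', $row, $m);

// $m[0] is the whole match, $m[1] the single capture group.
$groupCount = count($m) - 1;
$captured = $m[1]; // only the last repetition survives
```

So however many cells the row has, a single preg_match() call with this pattern hands back one captured value, not one per cell.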

My regexp HAS TO start and end with:
!<tr> ..... </tr>!sU

Why is that?

If you really want to use regular expressions instead of using an XML parser or something more forgiving like Tidy, I suggest a two-step approach.

First step: Find <tr> rows:


Second step: Iterate over the results and look for <td>s:


This will find sequences of 4 decimal digits (0-9) within <td> and will also match cells in which the number is wrapped in nested formatting tags.
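
A minimal sketch of the two-step approach could look like the following. The sample markup, the variable names ($html, $rows, $cells, $values) and the exact cell pattern are assumptions for illustration only; the row pattern uses the delimiters and modifiers from the question. The cell pattern assumes no whitespace around the number.

```php
<?php
// Hypothetical sample input; the markup in the question is truncated.
$html = '<table>
<tr><td>1234</td><td><b>5678</b></td></tr>
<tr><td>I\'m some text</td><td>4321</td><td>8765</td><td>0001</td></tr>
</table>';

// Step 1: find every <tr>...</tr> row. The U modifier makes .* ungreedy,
// and s lets the dot match newlines.
preg_match_all('!<tr>(.*)</tr>!sU', $html, $rows);

$values = [];
foreach ($rows[1] as $row) {
    // Step 2: inside one row, capture four digits per cell, allowing
    // formatting tags such as <b>...</b> around the number.
    preg_match_all(
        '!<td>(?:<[^>]+>)*([0-9]{4})(?:<[^>]+>)*</td>!',
        $row,
        $cells
    );
    // $cells[1] has one entry per matching cell in this row.
    $values[] = $cells[1];
}
```

Because preg_match_all() is called once per row, the number of values per row is not limited by the number of groups in the pattern.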

Ferdinand Beyer