Hey guys,

For the last couple of hours I've been messing around with regex. I've never dared to touch it before, so please bear with me.

Basically, I'm trying to extract some info from the following source:

<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?  Wijzig=true&amp;lcSchermTitel=&amp;zoekPK=+++140+12++8',false,true); ">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true&amp;lcSchermTitel=&amp;zoekPK=+++140+12++8',false,true); ">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
<random htmlcode here>

What I've come up with so far is the following (thanks to rubular.com):

<?php
$bestand = 'input.htm';
$fd = fopen($bestand, "r");
$message = fread($fd, filesize($bestand));
$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)";
if (preg_match_all("#$regexp#siU", $message, $matches))
{
    print_r($matches);
}
?>

This actually seems to put the first and second lines I need into a multidimensional array. So far so good, because a multidimensional array is what I want. However, it doesn't capture the third line, and somehow it creates an extra array [4]:

[1] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B ) 
[2] => Array ( [0] =>  (SECOND LINE A) [1] => (SECOND LINE B) ) 
[3] => Array ( [0] => [1] => ) 
[4] => Array ( [0] => [1] => )

What I'm looking for is this:

[0] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B ) 
[1] => Array ( [0] =>  (SECOND LINE A) [1] =>  (SECOND LINE B) ) 
[2] => Array ( [0] => THIRD LINE A [1] => THIRD LINE B )

As you might have noticed, I'm lost! Any help would be greatly appreciated.

A: 
$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)</td>"; 
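
For comparison, here is a minimal sketch of a variant that replaces the lazy `.*` groups with negated character classes and drops the full-match entry so the array starts at index 0. The sample input is copied from the question with the `title` attribute removed and the `href` query string shortened (an assumption for brevity); `$rows` is a hypothetical variable name. It is only as robust as the markup is regular.

```php
<?php
// Sample input based on the question; title attribute dropped and href shortened (assumption).
$html = <<<'HTML'
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
HTML;

// [^>]*>   skips to the end of the opening <a ...> tag
// ([^<]*)  captures text up to the next tag; \s* absorbs surrounding whitespace
$pattern = '#FrmKlant\.aspx[^>]*>([^<]*)</a>\s*([^<]*?)\s*<br>\s*([^<]*?)\s*</td>#i';

$rows = [];
if (preg_match_all($pattern, $html, $matches)) {
    // $matches[0] holds the full matches; keep only the three capture groups.
    $rows = array_slice($matches, 1);
}
print_r($rows);
```

Here `$rows[0]` holds the first lines, `$rows[1]` the second lines and `$rows[2]` the third lines, each trimmed of the surrounding whitespace.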
amphetamachine
A: 

It is usually not a good idea to try to extract information from HTML/XML using regular expressions. They are not well suited to dealing with nested structures. Anything you try will break horribly if your "random html" parts are evil enough, so use them only if you have very good control over the HTML.

Try a parser instead. (Google found me http://simplehtmldom.sourceforge.net/; I have not tried it, though.)

Jens
+3  A: 

Use PHP's DOM parser

Incomplete example, but something to get you started:

$dom = new DOMDocument();
$dom->loadHTML($yourHtmlDocument);

$xPath = new DOMXPath($dom);
$elements = $xPath->query('//random/td/a'); // XPath uses forward slashes; substitute your real path

foreach($elements as $node) {
  echo $node->nodeValue;
}
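
A fuller sketch of the same idea, applied to the markup from the question, might look like this. The sample cells are wrapped in a `<table><tr>` so the fragment is valid HTML, and the `href` values are shortened to `#` (both assumptions); `$result` and `$texts` are hypothetical names. It reads the link text, then walks the link's following siblings and keeps the non-empty text nodes on either side of the `<br>`.

```php
<?php
// Sample cells from the question, wrapped in <table><tr> and with href shortened (assumptions).
$html = <<<'HTML'
<table><tr>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" class="wl" href="#">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" class="wl" href="#">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
</tr></table>
HTML;

libxml_use_internal_errors(true); // keep warnings about sloppy HTML out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$result = [[], [], []];

foreach ($xPath->query('//td/a[@class="wl"]') as $a) {
    // Collect the non-empty text nodes that follow the link inside the same cell.
    $texts = [];
    for ($n = $a->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n->nodeType === XML_TEXT_NODE && trim($n->nodeValue) !== '') {
            $texts[] = trim($n->nodeValue);
        }
    }
    $result[0][] = trim($a->textContent); // first line: the link text
    $result[1][] = $texts[0] ?? '';       // second line: text before <br>
    $result[2][] = $texts[1] ?? '';       // third line: text after <br>
}
print_r($result);
```

Selecting on `class="wl"` is a guess at what identifies these links; an `id` prefix or the `title` attribute would work just as well if the class turns out to be used elsewhere.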

By the way, look at this.

Ivar Bonsaksen