Hello guys,

I am trying to extract just the names from the hypothetical HTML file below.

<ul class="cat">
<li>sport</li>
<li>movie</li>
</ul>
<ul class="person-list">
<li>name 1</li>
<li>name 2</li>
<li>name 3</li>
<li>name 4</li>
<li>name 5</li>
<li>name 6</li>
</ul>

Ideally, the result should be an array like this: Array(name 1, name 2, name 3, ...)

OK, I can easily do this with two regex matches, but I was wondering whether I can do it with just one.

Thanks in advance!

A: 

This would be far easier and far more robust using an HTML parser like DOMDocument. Regexes are a poor tool for parsing HTML because HTML is not a regular language. Try something like:

$html = <<<END
<ul class="cat">
<li>sport</li>
<li>movie</li>
</ul>
<ul class="person-list">
<li>name 1</li>
<li>name 2</li>
<li>name 3</li>
<li>name 4</li>
<li>name 5</li>
<li>name 6</li>
</ul>
END;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select the text nodes of every <li> whose content starts with "name ".
$items = $xpath->query("//li[starts-with(.,'name ')]/text()");
foreach ($items as $item) {
  echo $item->wholeText . "\n";
}

Output:

name 1
name 2
name 3
name 4
name 5
name 6
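
If the real names do not all start with "name ", a variation is to select by the list's class instead and collect the results into an array, which is the format the question asks for. This is only a minimal sketch: the function name personNames is made up for illustration, and the query assumes the class attribute is exactly "person-list" with no other classes on the element.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of every
// <li> directly inside <ul class="person-list">.
function personNames(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $names = [];
    // Assumption: the class attribute is exactly "person-list".
    foreach ($xpath->query("//ul[@class='person-list']/li") as $li) {
        $names[] = trim($li->textContent);
    }
    return $names;
}

// Shortened version of the sample markup from the question.
$html = '<ul class="cat"><li>sport</li><li>movie</li></ul>'
      . '<ul class="person-list"><li>name 1</li><li>name 2</li><li>name 3</li></ul>';

print_r(personNames($html));
```

Selecting by the parent list rather than by the text of each item means the "cat" entries are skipped no matter what they contain.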
cletus
A: 

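$pattern = '/<ul class="person-list">\s*(<li>(.*?)<\/li>)*\s*<\/ul>/ms';
preg_match_all($pattern, $TXT, $array);
// Caveat: a repeated group keeps a single capture per match, so $array[2] will not hold one entry per <li>.
echo '<pre>', print_r($array, true), '</pre>';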
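
A repeated capture group in PCRE keeps only one captured value per match, so a pattern that wraps the <li> part in (...)* cannot return each name separately. One commonly used way to stay with a single pattern is the \G anchor, which lets every match continue where the previous one ended. The sketch below is hedged: the function name personNamesRegex is hypothetical, and it assumes the opening tag is written exactly as <ul class="person-list"> and that the <li> elements follow each other with only whitespace in between.

```php
<?php
// Hypothetical helper (not from the original answer).
function personNamesRegex(string $html): array
{
    // First alternative: start at the opening tag of the target list.
    // Second alternative: \G(?!\A) resumes exactly where the previous match
    // ended (but never at the very start of the input), so each further <li>
    // has to follow the previous one directly.
    $pattern = '~(?:<ul class="person-list">|\G(?!\A))\s*<li>(.*?)</li>~s';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Sample markup from the question.
$html = <<<END
<ul class="cat">
<li>sport</li>
<li>movie</li>
</ul>
<ul class="person-list">
<li>name 1</li>
<li>name 2</li>
<li>name 3</li>
<li>name 4</li>
<li>name 5</li>
<li>name 6</li>
</ul>
END;

print_r(personNamesRegex($html));
```

The chain of matches stops at </ul> because no <li> follows it, so items in other lists are not picked up. It is still more fragile than an HTML parser: extra attributes, different quoting, or nested markup will break it.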

Alex
A: 

Here is a sample Perl script that does this, assuming your HTML is in my.html:

use strict;
use warnings;

open(my $fh, '<', 'my.html') or die "Cannot open my.html: $!";
my @arr;
while (my $line = <$fh>) {
  # Capture the text of any <li> whose content starts with "name".
  if ($line =~ /<li>\s*(name[^<]+)<\/li>/) {
     push(@arr, $1);
  }
}
close($fh);
print "Array (@arr)\n";

Explanation: each line of the HTML file is read into $line, and then we use the regex

/<li>\s*(name[^<]+)<\/li>/

to check whether the current line matches what we want (i.e. the string "name" followed by some characters other than "<", enclosed in li tags). At the same time, that substring is captured into $1. If we find a match, the captured string is appended to the array. This relies on each <li> being on its own line, as in the sample.

Jasmeet