Hi,

I have 800 entries that are very similar, but each of them needs some processing. The format is like this:

<td class="description">
Describing text.
Might very well be 2 paragraphs
</td>

I need to modify the text inside the cell. I've tried preg_replace with the pattern '/(.+)<\/td>/', but I run into two problems.

  1. I can't capture only what's inside the parentheses; the match also includes the cell tags.
  2. It matches everything up to the last </td> in the document. I just want it to stop at the first occurrence of </td>.

Thanks in advance

+1  A: 

First of all, .+ will grab everything; it won't just start at <td>. You will want to add a pattern that matches the opening tag of the table cell:

<td[^>]*?>

(Note: [^>]* matches any run of characters other than >, so it consumes the rest of the opening tag up to its closing >.)

Also, .+ and .* are greedy, meaning that they will grab as much as possible. To change this behavior, add a ? after the quantifier, like so: .+?. This makes it match only as much as it needs to.

So, you will have

<td[^>]*>(.*?)<\/td>
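To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample input in $html is made up for illustration, and the s modifier is added (it is not part of the pattern above) so that . also matches the newlines inside a cell.

```php
<?php
// Hypothetical sample input: two description cells in one document.
$html = "<td class=\"description\">\nFirst cell.\n</td>\n"
      . "<td class=\"description\">\nSecond cell.\n</td>";

// Greedy: .+ runs to the LAST </td>, so both cells end up in one match.
preg_match_all('/<td[^>]*>(.+)<\/td>/s', $html, $greedy);

// Lazy: .*? stops at the FIRST </td>, giving one match per cell.
preg_match_all('/<td[^>]*>(.*?)<\/td>/s', $html, $lazy);

$greedyCount = count($greedy[1]);          // number of greedy matches
$lazyCells   = array_map('trim', $lazy[1]); // captured text of each cell

echo $greedyCount, "\n";
print_r($lazyCells);
```

With this input the greedy pattern should produce a single match spanning both cells, while the lazy one produces one match per cell.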

This was a lesson on regex, but I really think you shouldn't be using regex for this. Regex can break pretty easily once you start having nested tables or anything more complicated than simple HTML.

orangeoctopus
A: 

If you're certain that there is no HTML in the table cells, the following non-regex code may help:

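// $entries contains all of the table cell entries.
$newentries = "";
// split() and each() have been removed from PHP; use explode() and foreach.
$cells = explode("</td>", $entries);
foreach ($cells as $data) {
    // Skip the fragment after the last </td>, which contains no cell.
    if (strpos($data, "<td") === false) {
        continue;
    }
    $newentries .= "<td class=\"description\">";
    // Everything after the first ">" is the cell text.
    $text = substr($data, strpos($data, ">") + 1);
    // perform modifications on $text
    // e.g. $text = "<b>" . $text . "</b>";
    $newentries .= $text;
    $newentries .= "</td>";
}

// $newentries now contains the modified cell entries.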

This probably isn't 100% what you're looking for, but maybe it will help.

Fosco
A: 

You may use:

$data = preg_replace(
  '/<td[^>]*>(.*?)<\/td>/s',
  '<td class="description"><strong>$1</strong></td>',
  $data
);

If what you are trying to do with the text inside is complicated, use a callback function with preg_replace_callback.
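A minimal sketch of the callback approach, assuming the modification is something a plain replacement string can't express. The function name transformCell, the sample $data and the ucfirst/trim step are placeholders for illustration, not part of the original question.

```php
<?php
// Hypothetical helper: whatever needs doing to the cell text goes here.
function transformCell(array $m): string
{
    // $m[1] is the opening <td ...> tag, $m[2] is the cell text.
    $text = ucfirst(trim($m[2]));
    return $m[1] . $text . '</td>';
}

// Made-up sample cell in the format from the question.
$data = "<td class=\"description\">\ndescribing text.\n</td>";

// The callback is invoked once per matched cell and returns its replacement.
$result = preg_replace_callback(
    '/(<td[^>]*>)(.*?)<\/td>/s',
    'transformCell',
    $data
);

echo $result;
```

Capturing the opening tag as its own group lets the callback put it back unchanged, so any attributes on the cell survive the rewrite.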

narcisradu
A: 

As the others have said: regex is a poor fit, at least here!

So, basic Regex is

#<td[^>]*>(.*?)</td>#s

(Note that I used the s modifier. It makes . match newlines too; without it the regex wouldn't match cells whose text spans several lines.)

Now, this regex is not strictly correct, even though it may be okay for your purposes. To be stricter, you have to know that > is allowed inside attribute values, so [^>]* can stop too early and break things. A stricter version matches the attributes explicitly:

#<td(?:\s+[\w-]+="[^"]*")*\s*>(.*?)</td>#s

I think this will be fairly safe if you're dealing with XML. But sure, it may still break on rare occasions that I can't think of right now.
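The point about > inside attributes can be illustrated with a small sketch. The sample $cell is invented, and the strict pattern used here is a variant in which the attribute group is non-capturing and repeatable, which is an assumption about the intent rather than a quote.

```php
<?php
// Hypothetical cell whose title attribute contains a literal ">".
$cell = '<td title="a>b" class="description">Describing text.</td>';

// Simple pattern: [^>]* stops at the ">" inside the attribute value.
preg_match('#<td[^>]*>(.*?)</td>#s', $cell, $simple);

// Stricter pattern: zero or more name="value" attributes, then the tag's own ">".
preg_match('#<td(?:\s+[\w-]+="[^"]*")*\s*>(.*?)</td>#s', $cell, $strict);

echo $simple[1], "\n"; // includes leftover attribute text
echo $strict[1], "\n"; // only the cell text
```

Even the stricter pattern only handles double-quoted attributes, so single-quoted or unquoted values would still need a real parser.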

nikic
A: 
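$d = new DOMDocument();
$d->loadHTML($htmlstring);
$x = new DOMXPath($d);
// Select every text node inside <td class="description"> cells.
$tds = $x->query("//td[@class='description']//text()");
// Node lists are zero-indexed, so iterate with foreach instead of counting from 1 to length.
foreach ($tds as $node) {
    // Replace the node's own text (upper-casing is just an example modification).
    $node->replaceData(0, $node->length, strtoupper($node->data));
}
var_dump($d->saveHTML());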
Wrikken