The site I'm working on has a database table of glossary terms. I am building a function that takes some HTML and replaces the first instance of each glossary term with a tooltip link.

I am running into a problem, though. Because the function runs one replacement per term, later iterations replace text that earlier iterations inserted (for example, inside the title attribute of a link that was just added), so the HTML gets mangled.

I guess the bottom line is, I need to ignore text if it:

  • Appears within the < and > of any HTML tag, or
  • Appears within the text of an <a></a> tag.

Here's what I have so far. I was hoping someone out there would have a clever solution.

function insertGlossaryLinks($html)
{
    // Get glossary terms from database, once per request
    static $terms;
    if (is_null($terms)) {
        $query = Doctrine_Query::create()
            ->select('gt.title, gt.alternate_spellings, gt.description')
            ->from('GlossaryTerm gt');
        $glossaryTerms = $query->rows();

        // Create whole list in $terms, including alternate spellings
        $terms = array();
        foreach ($glossaryTerms as $glossaryTerm) {

            // Initialize with title
            $term = array(
                'wordsHtml' => array(
                    h(trim($glossaryTerm['title']))
                    ),
                'descriptionHtml' => h($glossaryTerm['description'])
                );

            // Add alternate spellings
            foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
                $alternateSpelling = h(trim($alternateSpelling));
                if (empty($alternateSpelling)) {
                    continue;
                }
                $term['wordsHtml'][] = $alternateSpelling;
            }

            $terms[] = $term;
        }
    }

    // Do replacements on this HTML
    $newHtml = $html;
    foreach ($terms as $term) {
        // Closure instead of create_function(), which was removed in PHP 8
        $callback = function ($m) use ($term) {
            return '<a href="javascript:void(0);" class="glossary-term" title="'.$term['descriptionHtml'].'"><span>'.$m[0].'</span></a>';
        };
        // Quote each word for use inside a '/'-delimited pattern
        $wordsHtmlPreg = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $term['wordsHtml']);
        $pattern = '/\b('.implode('|', $wordsHtmlPreg).')\b/i';
        $newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
    }

    return $newHtml;
}
A: 

Using regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your patterns so that they capture only text that sits outside tags, and never tag names or attributes. My recommendation is to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
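To make the parser idea concrete, here is a rough sketch. It uses PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than the library linked above, and the function name linkTermsViaDom, the wrapper id glossary-root, and the word => description shape of $terms are all assumptions made for illustration. It walks only text nodes that are not inside an existing link, so tag names, attributes, and link text are never touched.

```php
<?php
// Hypothetical helper: links the first occurrence of each term, working on DOM text nodes.
// $terms is assumed to map a plain-text word to its plain-text description.
function linkTermsViaDom(string $html, array $terms): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrap the fragment in a known root; the XML prolog hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);

    foreach ($terms as $word => $description) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        // Re-query per term so links added for earlier terms are skipped as well.
        foreach ($xpath->query('.//text()[not(ancestor::a)]', $root) as $text) {
            $value = $text->nodeValue;
            if (!preg_match($pattern, $value, $m, PREG_OFFSET_CAPTURE)) {
                continue;
            }
            list($matched, $offset) = $m[0];

            // Build <a ...><span>term</span></a>; the DOM escapes attribute values itself.
            $link = $doc->createElement('a');
            $link->setAttribute('href', 'javascript:void(0);');
            $link->setAttribute('class', 'glossary-term');
            $link->setAttribute('title', $description);
            $span = $doc->createElement('span');
            $span->appendChild($doc->createTextNode($matched));
            $link->appendChild($span);

            // Replace the text node with: text before, the link, text after.
            $parent = $text->parentNode;
            $parent->insertBefore($doc->createTextNode(substr($value, 0, $offset)), $text);
            $parent->insertBefore($link, $text);
            $parent->insertBefore($doc->createTextNode(substr($value, $offset + strlen($matched))), $text);
            $parent->removeChild($text);
            break; // only the first occurrence of this term
        }
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Note that the parser normalizes the markup when it serializes it again, so the output may differ cosmetically from the input even where no term was linked.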

SimpleCoder
I couldn't figure out how that library you mentioned would help me with this specific problem.
mattalexx
You would have used it to parse the HTML and access the DOM. There, you could perform the operations you want on the DOM explicitly.
SimpleCoder
A: 

I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
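For anyone landing here later, this is roughly what those three steps can look like. It is a minimal sketch rather than my exact code: the function name replaceOutsideLinks, the word => description shape of $terms, and the placeholder format are assumptions, and it assumes the input HTML contains no control characters. This version also stashes every other tag, which covers the "inside < and >" case from the question.

```php
<?php
// Hypothetical helper illustrating the placeholder approach.
// $terms is assumed to map a plain-text word to its plain-text description.
function replaceOutsideLinks(string $html, array $terms): string
{
    $stash = [];
    // Stores a chunk of markup and returns an opaque placeholder for it.
    $protect = function (array $m) use (&$stash): string {
        // The index is written with control characters so that no
        // \b-delimited term can ever match inside a placeholder.
        $index = strtr((string) count($stash), '0123456789', "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19");
        $key = "\x01" . $index . "\x02";
        $stash[$key] = $m[0];
        return $key;
    };

    // 1. Replace existing <a>...</a> elements, and any other tag, with placeholders.
    $work = preg_replace_callback('~<a\b[^>]*>.*?</a>|<[^>]+>~is', $protect, $html);

    // 2. Link the first occurrence of each term. Each new link is stashed
    //    immediately, so later terms cannot match inside it.
    foreach ($terms as $word => $description) {
        $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
        $work = preg_replace_callback($pattern, function (array $m) use ($description, $protect): string {
            $link = '<a href="javascript:void(0);" class="glossary-term" title="'
                . htmlspecialchars($description, ENT_QUOTES) . '"><span>' . $m[0] . '</span></a>';
            return $protect([$link]);
        }, $work, 1);
    }

    // 3. Put the stashed markup back in a single pass.
    return strtr($work, $stash);
}
```

The tag regex is deliberately simple and will misfire on attribute values that contain a ">" character, so treat it as a starting point.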

It's working great!

mattalexx