Hi, I am trying to extract text from HTML stored in a database.

This is an example:

<P style="FONT-SIZE: 13px; MARGIN-LEFT: 6px"><FONT color=#073b66><STRONG><A 
href="/generic.asp?page_id=p00497">Practice Exams</A> - </STRONG><FONT 
color=#000000>ours are the most realistic exam simulations, and the best way to 
prepare for your exams. Get detailed correct and incorrect answers and 
explanations. Free Flash Cards are included.</FONT></FONT> </P>

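If I search for "generic", the regexp must determine whether that text appears outside the HTML tags (in the example above it occurs only inside the href attribute, so it should not count as a match).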

Please help

A: 

I suggest parsing the HTML using a proper parser in the language you're programming in before injecting it into your database.

If you post which language you're working in, perhaps I, or someone else, can make a recommendation.
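As an illustration of the parser approach, here is a minimal sketch in Python, assuming Python is available; the question does not say which language is in use. It relies on the standard-library `html.parser` module, and the names `TextExtractor` and `html_to_text` are made up for this example. Only text nodes are collected, so attribute values such as the `href` never reach the output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs left over from the original line breaks.
    return " ".join("".join(parser.parts).split())


sample = (
    '<P style="FONT-SIZE: 13px"><STRONG>'
    '<A href="/generic.asp?page_id=p00497">Practice Exams</A> - </STRONG>'
    'ours are the most realistic exam simulations.</P>'
)
text = html_to_text(sample)
print(text)
```

Searching `text` for "generic" then finds nothing, while "Practice" is found, which is the behaviour the question seems to be after.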

Bart Kiers
A: 

The following MySQL regex string will match every HTML tag, so you can strip them out:

"<" +       -- Match the character “<” literally
"[^>]" +    -- Match any character that is NOT a “>”
"*" +       -- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
">"         -- Match the character “>” literally
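To show the same pattern in use, here is a rough Python sketch (Python is an assumption; the pattern `<[^>]*>` is the one described above). The helper names `strip_tags` and `text_contains` are hypothetical. Note that this pattern is only an approximation: a `>` inside an attribute value or an HTML comment will cut a tag short.

```python
import re

# Same pattern as above: "<", then any run of non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove everything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub("", html)


def text_contains(html, needle):
    """Case-insensitive search in the visible text only."""
    return needle.lower() in strip_tags(html).lower()


sample = '<A href="/generic.asp?page_id=p00497">Practice Exams</A> - the best way to prepare'

print(strip_tags(sample))
print(text_contains(sample, "generic"))
print(text_contains(sample, "practice"))
```

With the sample row, "generic" is not found because it lives only inside the `href`, while "practice" is.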

OR

I know this does not answer your question directly, but if you have access to a scripting language, it will normally have built-in functions for stripping HTML tags from text.

E.g. in PHP you can do this:

$htmltext = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
// strip_tags() removes HTML tags and comments from the string
$plaintext = strip_tags($htmltext);

// or use regex...
$result = preg_replace('/<[^>]*>/i', '', $htmltext);

http://php.net/manual/en/function.strip-tags.php

rikh
A: 

I'd suggest adding another column to the table with a text-only copy of the HTML column, and using that column for full-text queries. Regular expressions are the wrong tool for this.
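A small sketch of the extra-column idea, using Python with an in-memory SQLite database purely for illustration; the original data is evidently in some other database, and the table `pages` and columns `body_html` and `body_text` are invented names. The plain-text copy is produced once at insert time (here with a crude tag strip, though a real parser would be more robust), and searches only ever touch that copy.

```python
import re
import sqlite3


def to_plain_text(html):
    # Crude tag stripping for the sketch; replace with a real HTML parser if possible.
    return " ".join(re.sub(r"<[^>]*>", " ", html).split())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, body_html TEXT, body_text TEXT)"
)


def insert_page(html):
    # Store the original HTML alongside its text-only copy.
    conn.execute(
        "INSERT INTO pages (body_html, body_text) VALUES (?, ?)",
        (html, to_plain_text(html)),
    )


def search(term):
    # Query the text-only column; wildcards in `term` are not escaped in this sketch.
    rows = conn.execute(
        "SELECT id FROM pages WHERE body_text LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


insert_page('<A href="/generic.asp?page_id=p00497">Practice Exams</A> - free flash cards')
insert_page("<P>A generic study guide</P>")

print(search("generic"))
print(search("flash"))
```

Only the second row matches "generic", because in the first row the word appears solely inside the tag and never makes it into `body_text`.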

For large amounts of text, you might also consider Sphinx (http://www.sphinxsearch.com), which has a built-in option to ignore HTML while searching.

stereofrog