ansaurus

Question

Answer 1

A:

I suggest parsing the HTML using a proper parser in the language you're programming in before injecting it into your database.

If you post which language you're working in, perhaps I or someone else can make a recommendation.
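
As an illustration of the parser approach, here is a minimal sketch in Python using the standard library's `html.parser`. Python is only an assumption here, since the question does not say which language is in use, and `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment; tags and comments are dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of `html` (hypothetical helper name)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>'
    print(html_to_text(sample))
```

The resulting plain text is what you would then insert into the database.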

Bart Kiers 2009-10-11 10:45:41

Answer 2

A:

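The following MySQL regex will match any HTML tag, so you can strip the tags out (note that the stripping itself needs a regex replace function, which MySQL only provides as REGEXP_REPLACE from version 8.0 on):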

"<" +       -- Match the character “<” literally
"[^>]" +    -- Match any character that is NOT a “>”
   "*" +       -- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
">"         -- Match the character “>” literally
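
Joined into a single string, the pieces above give the pattern `<[^>]*>`. The following is a rough Python sketch of what that pattern does and does not handle. The choice of Python and the helper name `strip_tags_regex` are assumptions made for illustration only.

```python
import re

# The pattern from above: "<", then any run of non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_regex(text):
    """Remove everything that looks like a tag (hypothetical helper name)."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    # Simple markup, including a comment, is removed as expected.
    print(strip_tags_regex('<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>'))
    # A ">" inside an attribute value ends the match too early,
    # so part of the tag leaks into the output.
    print(strip_tags_regex('<a title="a > b">x</a>'))
```

The second call shows the main limitation: the pattern has no notion of quoted attribute values, so it is only safe on markup you know to be simple.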

OR

I know this does not answer your question directly, but if you have access to a scripting language, it will normally have built-in functions for stripping HTML tags from text.

E.g. in PHP you can do this:

$htmltext = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
$plaintext = strip_tags($htmltext);

// or use a regex...
$result = preg_replace('/<[^>]*>/', '', $htmltext);

http://php.net/manual/en/function.strip-tags.php

rikh 2009-10-11 10:47:09

Answer 3

A:

I'd suggest adding another column to the database that holds a text-only copy of the HTML column, and using that column for full-text queries. Regular expressions are the wrong tool for this.
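
A rough sketch of the extra-column idea in Python follows. To keep it runnable without a server, SQLite (via the standard `sqlite3` module) stands in for MySQL, and the table `pages`, the columns `body_html` and `body_text`, and the helper names are all hypothetical. In MySQL the text-only column would be the one carrying the FULLTEXT index.

```python
import sqlite3
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def fill_text_column(conn):
    """Copy a tag-free version of body_html into body_text for every row."""
    rows = conn.execute("SELECT id, body_html FROM pages").fetchall()
    conn.executemany(
        "UPDATE pages SET body_text = ? WHERE id = ?",
        [(strip_html(html), row_id) for row_id, html in rows],
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, body_html TEXT, body_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages (body_html) VALUES (?)",
        [('<p class="paragraph">Hello</p>',), ("<p>A paragraph of text</p>",)],
    )
    fill_text_column(conn)
    # Searching body_html for "paragraph" would also hit the class attribute
    # in row 1; searching body_text only sees the visible text.
    query = "SELECT id FROM pages WHERE body_text LIKE '%paragraph%' ORDER BY id"
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    print(demo())
```

In a real application the text-only column would be filled whenever the HTML column is written, so the two never drift apart.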

For large amounts of text you might also consider Sphinx (http://www.sphinxsearch.com), which has a built-in option to ignore HTML while searching.

stereofrog 2009-10-11 11:10:45

Mysql text extract with Regex