Processing like this should not be done in the database. I would recommend creating a separate field containing only the text contents.
In response to @Nissan Fan's comment: Extracting text from HTML is not the database's job IMO. It's too complex a job for it, and it has too many variables. I'm not well versed in reading stored procedures, but if I read the code correctly, it will have problems with an (invalid but still often occurring) unencoded < in the source code. And it will most likely break for invalid HTML.
Or imagine one day, the customer comes and wants img elements' ALT attributes indexed too. Or titles. Start building that with a "start position, end position" algorithm. You will go crazy.
I say, if you need to process HTML from varying sources outside your control on a day-to-day basis, leave this to a layer above the DB that is better equipped to handle this stuff. A DOM-based approach - perhaps using BeautifulSoup to be able to deal with invalid HTML - parsing out all nodeValues would be the most reliable thing.
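To give an idea of what "a layer above the DB" could look like, here is a minimal sketch in Python. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that nothing needs installing; with BeautifulSoup, something like BeautifulSoup(html, "html.parser").get_text() would cover the plain-text part. The names TextExtractor, html_to_text and indexed_attrs are made up for this illustration, and the choice to skip script and style contents is my assumption, not something from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects readable text, plus selected attribute values, from HTML."""

    # Assumption: script/style contents are code, not text worth indexing.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, indexed_attrs=("alt", "title")):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.indexed_attrs = indexed_attrs
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        # Supporting a new attribute is one more entry in indexed_attrs,
        # not a new start/end position scan.
        for name, value in attrs:
            if name in self.indexed_attrs and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html, indexed_attrs=("alt", "title")):
    """Return whitespace-normalised text suitable for a plain-text column."""
    parser = TextExtractor(indexed_attrs)
    parser.feed(html)
    parser.close()
    # Pieces are joined with spaces, so words split by inline tags
    # (e.g. un<b>believable</b>) end up as separate tokens.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    sample = '<p>1 < 2 <img src="a.png" alt="A chart"><p>never closed'
    print(html_to_text(sample))
```

The application would call html_to_text once when the HTML is saved and store the result in the separate text-only field, so the database only ever has to search plain text.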
Maybe this is overkill, and the stored procedure will work just fine in the OP's case - it looks like it from his comment, and that's perfectly all right. I'm just saying, if you can't control the incoming HTML, don't strip HTML with the limited means the database offers for the job.