ansaurus

Question

How do I query a database field but ignore the HTML markup?

Answer 1

+3 A:

Here's a User Defined Function for just that which you could leverage:

http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

Nissan Fan 2010-02-11 18:32:11

I was thinking a function like this would be much harder, but reading through it makes me realize that, assuming I have well-formed HTML, this should always work. We are testing this out now. Thanks!

JoshBaltzell 2010-02-11 18:58:25

Every time you try to parse HTML with Regular Expressions, God kills a kitten.

Aaronaught 2010-02-11 19:26:11

Luckily there's no regular expression involved :)

Nissan Fan 2010-02-11 19:55:19

While parsing a field that contains HTML can never be perfect, this simple function works perfectly almost always. Thanks a lot!

JoshBaltzell 2010-02-11 20:27:38

Answer 2

+4 A:

Processing like this should not be done in the database. I would recommend creating a separate field containing only the text contents.

In response to @Nissan Fan's comment: Extracting text from HTML is not the database's job, IMO. It's too complex a job for it, and it has too many variables. I'm not well versed in reading stored procedures, but if I read the code correctly, it will have problems with an (invalid but still often occurring) unencoded < in the source code. And it will most likely break for invalid HTML.
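To make the unencoded < concern concrete, here is a rough Python rendering of a "start position, end position" scan. It is only an illustration of the general technique, not a port of the linked UDF, and the name naive_strip is made up for this sketch.

```python
def naive_strip(html):
    """Remove everything between each '<' and the next '>'.

    Hypothetical helper for illustration; it assumes every '<'
    opens a tag, which is exactly the assumption that breaks.
    """
    text = html
    start = text.find("<")
    while start != -1:
        end = text.find(">", start)
        if end == -1:
            # No closing bracket left: give up and keep the rest as-is.
            break
        text = text[:start] + text[end + 1:]
        start = text.find("<")
    return text


# Well-formed markup comes out as expected.
print(naive_strip("<p>Hello <b>world</b></p>"))  # Hello world

# A literal, unencoded '<' is mistaken for a tag start,
# so the text up to the next '>' is silently lost.
print(naive_strip("1 < 2 and <b>bold</b>"))      # 1 bold
```

In the second call the words "< 2 and" disappear along with the tag, so a search for them would no longer find the row.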

Or imagine one day, the customer comes and wants img elements' ALT properties indexed too. Or titles. Start building that with a "start position, end position" algorithm. You will go crazy.

I say, if this is needed to process HTML from varying sources outside your control on a day-to-day basis, leave this to a layer above the DB that is better equipped to handle this stuff. A DOM-based approach (perhaps using BeautifulSoup, to be able to deal with invalid HTML) that parses out all nodeValues would be the most reliable thing.
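A minimal sketch of what such a layer above the DB might look like in Python. Note the substitution: instead of BeautifulSoup, it uses the standard library's html.parser so that it runs without extra packages. That parser is event-based rather than a full DOM, and BeautifulSoup is likely more forgiving with badly broken markup. The names TextExtractor and html_to_text are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and img alt text; skips script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # Indexing ALT text is a small change here, unlike in a
            # bracket-scanning routine.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; the trade-off is
    # that inline tags inside a word would also introduce a space.
    return " ".join(" ".join(parser.parts).split())


print(html_to_text('<p>Logo: <img src="x.png" alt="Acme logo"></p>'))
```

The result of html_to_text is what would be written to the separate text-only field.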

Maybe this is overkill, and the stored procedure will work just fine in the OP's case - it looks like it from his comment, and that's perfectly all right. I'm just saying, if you can't control the incoming HTML, don't strip HTML with the limited means the database offers for the job.

Pekka 2010-02-11 18:33:17

Duplicating the data because you have to query a subset of it seems irregular. That would be like breaking a date down into each component because someone wants to query on only the month. If this is not an extremely large-scale database, it shouldn't be an issue.

Nissan Fan 2010-02-11 18:37:31

This is a good idea; stripping out HTML every time you query is bound to be slow.

HLGEM 2010-02-11 18:41:04

It's also doubling the space used to store the same information. There are many other things to consider ... perhaps this search is a feature used once for every 5,000 times someone uses the HTML data? Imagine an app that's displaying a list of data with tags, but allows a textual search that is rarely, if ever, used. There's not enough context to warrant breaking this out. And besides, my disagreement has less to do with his statement about a separate field and more to do with the fact that querying data like this is EXACTLY what a database is for.

Nissan Fan 2010-02-11 18:46:47

@Nissan Fan, you make valid points. Still, in this case I think the DB is not the right place for this. See my updated answer.

Pekka 2010-02-11 19:18:29

I think you are absolutely right that this is too much complicated processing to impose on the database. Were I searching a lot more rows, and if the content of those rows were more important than it is, I would do this. In my case, though, this is a description field that contains dubious data, and after we tested a simple function watching for brackets, everything worked fine. Therefore you get a +1 for this answer, but we used the simple function linked to by @Nissan Fan.

JoshBaltzell 2010-02-11 20:26:51

Answer 3

A:

If you can run regular expressions in your query, you can strip out the HTML and return only the text using the examples here: http://www.regular-expressions.info/examples.html

iKnowKungFoo 2010-02-11 18:35:31

Answer 4

A:

If you attempt to index one of these columns and access it by removing the html:

WHERE dbo.anyRemoveHtml(yourColumn)='your search text'

the index will not be used and you will get a table scan. This might not be a problem when the application has little data, but it will result in slower and slower SELECTs as more data is added to the table.

Note: dbo.anyRemoveHtml is just a made-up name representing the function that you select to remove the HTML; it does not really exist.
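The effect can be observed with a small Python sketch. Several assumptions apply: it uses SQLite (only because it ships with Python) rather than SQL Server, the table and index names are invented, and strip_html is a deliberately crude placeholder. Query planners differ in detail, but the idea that a function wrapped around a column prevents an index seek on that column is the same one described above.

```python
import re
import sqlite3


def strip_html(value):
    # Crude placeholder; how the HTML is removed is irrelevant to the plan.
    return re.sub(r"<[^>]*>", "", value or "")


conn = sqlite3.connect(":memory:")
conn.create_function("strip_html", 1, strip_html)
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, body_html TEXT, body_text TEXT);
    CREATE INDEX idx_docs_html ON docs (body_html);
    CREATE INDEX idx_docs_text ON docs (body_text);
""")


def plan(sql):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("hello",)).fetchall()
    return [row[3] for row in rows]  # the detail text is the fourth column


# Function applied to the indexed column: every row must be visited.
wrapped = plan("SELECT id FROM docs WHERE strip_html(body_html) = ?")

# Plain comparison against an indexed plain-text column: index lookup.
direct = plan("SELECT id FROM docs WHERE body_text = ?")

print(wrapped)
print(direct)
```

The first plan should begin with SCAN and the second with SEARCH; the exact wording varies between SQLite versions.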

KM 2010-02-11 18:44:59

Answer 5

+1 A:

I agree with Pekka; this isn't something that your database should be dealing with.

Cons to doing this parsing in the DB:

Performance issues. Using UDFs can degrade performance and lead to table scans. And even if you avoid table scans, you're still asking the DB to do a bunch of stuff (string manipulation) that it wasn't designed to do.
Harder to get right. Correctly parsing HTML is a tough job. True, you can get 95% of the way there with a UDF, but handling this in the application layer might get you 100% of the way there.
Harder to test. I'd much rather write unit tests for HTML stripping code that execute in C# against string literals, rather than having to round-trip to the DB.

If you must do this in the DB...

If doing this in the DB is a requirement, consider this approach:

Add a second field to your DB to hold the plain text version of the contents.
Add a trigger so that each time the HTML value is changed, the text version is regenerated.
Write your queries against the plain text field.

You'll get better performance because you're only doing the parsing at write time, rather than on every search, and your DB will make better use of any indexes you define on the plain text field.
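A rough sketch of the "parse at write time, query the plain text" idea in Python with SQLite. One deliberate difference from the steps above: the regeneration happens in an application-side save function instead of a database trigger, purely to keep the example self-contained. The table, column and function names are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Gathers the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_text(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        description_html TEXT,
        description_text TEXT
    );
    CREATE INDEX idx_items_text ON items (description_text);
""")


def save_description(conn, item_id, html):
    """Write the HTML and its plain-text twin together, so they stay in sync."""
    conn.execute(
        "INSERT OR REPLACE INTO items (id, description_html, description_text) "
        "VALUES (?, ?, ?)",
        (item_id, html, to_text(html)),
    )


def find_by_text(conn, text):
    """Search only the plain-text column; no parsing happens at query time."""
    rows = conn.execute(
        "SELECT id FROM items WHERE description_text = ?", (text,)
    )
    return [row[0] for row in rows]


save_description(conn, 1, "<p>Red <b>bike</b></p>")
print(find_by_text(conn, "Red bike"))
```

With a trigger, the same regeneration would happen inside the database whenever the HTML column changes, which also covers writes that bypass the application.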

Seth Petry-Johnson 2010-02-11 20:41:11
