views:

249

answers:

4

I have a table in SQL Server 2005 with hundreds of rows with HTML content. Some of the content has HTML like:

<span class=heading-2>Directions</span>

where "Directions" changes depending on page name.

I need to change all the <span class=heading-2> and </span> tags to <h2> and </h2> tags.

I wrote this query to do content changes in the past, but it doesn't work for my current problem because of the ending HTML tag:

Update  ContentManager
Set ContentManager.Content = replace(Cast(ContentManager.Content AS NVARCHAR(Max)), 'old text', 'new text')

Does anyone know how I could accomplish the span to h2 replacing purely in T-SQL? Everything I found showed I would have to do CLR integration. Thanks!

A: 

Gah, use jquery! Don't make life difficult.. there's some sample replace code on the jquery homepage and you can just include the jquery-1.4.2.js in the <head> section

Stiggz
A: 

Indeed T-SQL does not natively support regular expressions and this is the sort of problem in which regular expressions would be the tool of choice. First, I'll say that the level of complication in the solution depends greatly on how consistent your data is. For example, suppose we search for items with the heading:

Select ..
From ...
Where HtmlContent Like '<span class="heading-2">%'

This assumes no additional spacing between span and class as well as no additional spacing after the final double quote before the end bracket. We could write '%<span%class="heading-2"%>%' to account for the spaces but that would also find div tags marked as heading-2 in the same content as any span tag. If this later scenario shouldn't happen but you might have varying spaces, then use this revised pattern. Where we will really run into troubles is the closing tag. Suppose our content looks like so:

<span class="heading-2"> Foo <span class="heading-3">Bar</span> And Gamma Too</span> .... <span class="heading-4">Fubar Is the right way!</span>...

It is not so simple to find the correct closing span tag to change to </h2>. You cannot simply find the first </span> and change it to </h2>. If you knew that you had no nested span tags, then you could write a user-defined function that would do it:

Create Function ReplaceSpanToH2( @HtmlContent nvarchar(max) )
Returns nvarchar(max)
As
Begin
    Declare @StartPos int
    Declare @EndBracket int

    Set @StartPos = CharIndex('<span class="heading-2">', @HtmlContent)
    If @StartPos = 0
        Return @HtmlContent

    Set @HtmlContent = Replace(@HtmlContent, '<span class="heading-2">', '<h2>')

    -- find next </span>
    Set @StartPos = CharIndex('</span>', @HtmlContent, @StartPos)

    Set @HtmlContent = Stuff(@HtmlContent, @StartPos, 7, '</h2>')
    Return @HtmlContent
End
Thomas
Please consider revising the line "this is the sort of problem in which regular expressions would be the tool of choice". Regular expressions are ill-suited to parsing HTML or XML. A DOM parser (such as the Html Agility Pack) is a much better choice. Again, that's not supported by T-SQL.
TrueWill
@TrueWill - For a simple replace such as being requested, which yes is parsing some Html/XML, regular expressions are a far more appropriate tool than T-SQL. For extensive, or in depth parsing of Html, something like the Agility Pack would be more appropriate. However, if all you want to do is find a single tag in a blob of text, an Html parser is likely to be overkill.
Thomas
@Thomas - That way lies madness. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
TrueWill
@TrueWill - Sorry I disagree. You are ignoring the context of the problem. First, this is a one time update. Second, he wants to do this in a simple SQL statement if possible which knocks the Agility Pack out of the equation. Third, the pattern is somewhat well defined. He might be able to accomplish his goal using T-SQL which in and of itself means that a regex solution would provide even more flexibility. I am well aware the *generally* trying to parse *large* amounts of Html on an ongoing basis using regex is poor choice but that does not fit the parameters of this problem.
Thomas
@TrueWill - Perhaps you haven't read the entire article. To wit: "So, while I may attempt to parse HTML using regular expressions in **certain situations**, I go in knowing that: A It's generally a bad idea.B. Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it..." In this problem, we DO have strict conditions on what is being done and thus using regex is not necessarily a bad choice.
Thomas
A: 

I am not real strong in SQL Server but here is how I would try to do this:

UPDATE TableName SET FieldName = REPLACE(FieldName ,'<span class=heading-2>', '<h2>') SET FieldName = REPLACE(FieldName, '</span>', '</h2>')

There might need to be 2 UPDATE statements issued, I am not sure if you can operate on the same field this way. The OP did say ALL occurences of the text. Set me straight if I am missing something.

Of course if there are other <span class=heading-2> or </span> text that you don't want to change, this won't work.

DaveB
+1  A: 

If you are positive that all of the HTML is (and will continue to be) valid XHTML and you're using SQL Server 2005 or later, you might be able to cast the columns to an XML data type and use XQuery. See http://msdn.microsoft.com/en-us/library/ms345117%28SQL.90%29.aspx

(Caveat: I haven't tried this.)

I think the best answer, though, is Michael Petito's comment. I would write an application to do this and use the Html Agility Pack. That will provide a permanent, maintainable solution that will work in nearly all cases.

(If this is a one-shot and you don't care about accuracy, then pick your poison.)

TrueWill
How can you claim that using regular expressions is "madness" and turn around and suggest parsing Html as XML?! If the Html is not well formed, it will not parse into XML! The odds of this working are orders of magnitude lower if there is even one small problem in the markup than simply using standard search or regex. Sorry, but this is a fragile solution.
Thomas
TBH, I'd take back my downvote, if you simply suggested using the Agility Pack as you did your comments. While I think it is overkill, it is a significantly better solution than even attempting to use the XML data type.
Thomas
@Thomas: Please note that my answer starts with "**If the HTML is valid XHTML**". The OP didn't mention what the origin of the HTML was. If it's known to be XHTML or well-formed, this may be worth considering. If not, it won't work. In the latter case I would write an application (as Mr. Petito suggested) and use the Html Agility Pack. I've had to maintain 95% solutions and "temporary" quick-and-dirty code that lived on; I'd rather not put others through that.
TrueWill
@TrueWill - Even if you *thought* it was valid XHTML, one bad record screws the whole update. If you think writing an app using Agility Pack is the right solution then say that *instead* of proposing a solution you clearly know to be bad.
Thomas
@TrueWill - If there was even a remote possibility that the solution *would* live on after it was written, then I absolutely agree that an Html parser is the right choice. However, it should be clear that a solution that lives on is not a one-time solution.
Thomas
Revised my answer.
TrueWill