views:

234

answers:

3

I recently tried to import a bunch of blog posts from an old blog (SharePoint) to my current blog (WordPress). When the import completed, a lot of nasty <div> tags and other HTML made it in to the content of the post, which screwed up the way my site was rendering.

I'm able to view the offending rows in the MySQL database and want to know if there's a way to selectively remove the HTML text that may be causing problems. I could probably hack this in C# by parsing through the text, but I'd like to figure out how I can do this using SQL if I can.

If you want to see a full text sample of what one of these files looks like as it exists in the database text field, I uploaded a full sample file to my web site.

Here's want I want to do:

  • Remove <![CDATA[<div><b>Body:</b> from the beginning of every file
  • Remove the meta information at the end of every file, which might look like this:

    <div><b>Category:</b> SharePoint</div>
    <div><b>Published:</b> 11/12/2007 11:26 AM</div>
    ]]>
    
  • Remove every <div> and closing </div> tag, which might have a class attribute like:

    <div class=ExternalClass6BE1B643F13346DF8EFC6E53ECF9043A>
    

    Note: The hex string at the end of the ExternalClass can be different

I haven't used an Update statement in MySQL before and I'm at a loss for where to begin to selectively replace text within a text field. Would I use regex from within a SQL statement to help? How would I execute a statement against the remote DB?

+1  A: 

There is no simple way of doing this without utilizing the back-end platform which you are using to serve your website or are most acustomed to. Myself, I would use PHP or Perl to clean the data up which will could be tricky at best. So the answer is, it can be done, but you must use some type of programming/processing language to do so, MySQL on its own won't be able to clean the data.

drlouie - louierd
+1  A: 

What about cleaning up the posts before you import them? Seems like working with a local file that you can treat as a text file would be far easier. Then you could use Perl or Python to bear down on the problem to your liking before importing.

This assumes that you still have access to the data that was over in SharePoint.

John at CashCommons
I like your idea. You can save SharePoint blogs as a giant RSS XML file. I think I'll try parsing that with LINQ-to-XML and C# to see if I can't re-save the file without all of the junk in the posts.
Ben McCormack
+1  A: 

Assuming you are determined to use SQL like you said in your question, If you have the skill to hack it with C# you should be able to figure out how to create a stored procedure that uses a cursor in a repeat/fetch loop to select the rows, string functions to massage the data, and an update to update the row. Check this out:

http://dev.mysql.com/doc/refman/5.0/en/cursors.html

fupsduck
hmmm...I may not be so determined to use SQL as I originally was =)
Ben McCormack