views:

194

answers:

2

I am converting about 4000 text files that contain HTML into MySQL database entries. Seems like an easy way to do this would be to add a couple lines to the HTML to make them appear as XML files, then XSLT the XML into MySQL INSERT statements. (Building a CSV would also work, but less optimal IMHO). I've tried doing this but am having little luck getting my XSL to play nice.

I'm on a Windoze box but can SSH into my webhost and run PHP, maybe Perl. Would love to automate this as much as possible. I can build a list of the files and feed that into a script easily enough.

Filename pattern: ab12345.html (numeric portion varies from 3-6 digits)

Filename content sample--this is the entirety of the file, there are not HTML footers/headers:

<div class="abEntry"><a name="top"><img width="1" height="1" src="images/common/blank.gif"/></a><div id="abEntryTitle"><div id="abEntryTitleText">What does error note "90210 Cannot Do This Thing" mean?</div></div>
      <div class="abEntryParagraph">This error means your McWhopper drive is frazzled. Read me the number off the modem--thats the little boxy thing attached to the big boxy thing--thanks.</div>
     <div class="abEntryDocumentNumber">ab90210</div>

MySQL columns and how I'd like them to map back to the content above

EntryID = auto increment
title = contents of #abEntryTitleText
content = contents of #abEntryParagraph
lastupdated = curdate
related = "1"
visible = "1"
sortorder = "0"
userid = "1"
views = "0"
posvotes = "1"
negvotes = "0"
score = null
emailed = null
detectrelated = "1"
metakeywords = null
metadescription = contents of #abEntryDocumentNumber
startdate = curdate
enableexpiry = "0"
expirydate = null
featured = "0"
workflowstatus = "auto_approved"

XSL that I've tried:

<xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform ">
<xsl:output method="html" indent="no"/>
<xsl:template match="/"><xsl:apply-templates/></xsl:template>
<xsl:template match="content">
<xsl:text>INSERT INTO questions (approved, title, description, publishDate) VALUES </xsl:text><xsl:text>(1, </xsl:text><xsl:value-of select="id(abEntryTitleText)"/><xsl:text>, </xsl:text>
<xsl:copy-of select="node()|@*"/>
<xsl:text>, </xsl:text>TODAY<xsl:text>,1, 1)</xsl:text>
</xsl:template>
</xsl:transform>
A: 

I'm not familiar with xsl, so I would tackle this problem using php's DOM. IIRC it could parse html without being proper xml.

Tutortial at www.phpro.org

Bob Fanger
A: 

The xslt you are looking for to create an insert statement from the <div> element would be

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 xmlns:msxsl="urn:schemas-microsoft-com:xslt" 
 exclude-result-prefixes="msxsl">

 <xsl:output method="text" indent="no"/>

 <xsl:template match="div[@class='abEntry']">
INSERT INTO questions (approved, title, content, metadescription, publishDate)
VALUES (1, '<xsl:value-of select="normalize-space(*/div[@id='abEntryTitleText']/text())" />', '<xsl:value-of select="normalize-space(div[@class='abEntryParagraph']/text())" />', '<xsl:value-of select="normalize-space(div[@class='abEntryDocumentNumber']/text())" />', TODAY)
 </xsl:template>

</xsl:stylesheet>

You can modify this further to include the other constant column values.

After that you obviously need a script or application to run the xstl on each file. I could write something quickly in .Net if you like, but if you have some other tools/scripting capability handy it might be quicker to use that.

bstoney