So, I have this Word document that has a whole bunch of tables, some of which are pretty long and span many pages. I need to programmatically convert it to XML.

I was initially told we could just copy and paste into Excel and save it as a CSV, and I could convert from there, which would be pretty easy. However, because of the formatting of some of the fields, the spreadsheet would need a lot of extra manipulation after copying to Excel to get it to look right and to have the CSV come out correctly.

I should note that this is an add-on for an old app written in VB.Net 1.1 (cue frowny face) :(. However, I'm debating just writing a separate command line tool in C# on .NET 3.5 if that'll make it easier. It seems like C# has some Word interop stuff that I doubt was in the 1.1 framework, but I haven't investigated that too far.

So, I'm just looking for the best/quickest way this can be achieved. It doesn't matter so much how it's achieved as long as it's done programmatically. Some of the steps could be done manually if they aren't too tough. For example, if getting it into some other format first would save a bunch of coding and isn't too difficult, that would be fine.

Has anyone done anything like this before? Any ideas?

Update: OK, so here is an example of exactly what I'd need to do.

I have a Word doc that looks something like this...

PROTOCOL:  BIRDS           

Field Name      Data Type  Required  Length  Total Digits  Fraction Digits  ValidValues/Comparison  Description
OBSERVATION_ID  Text       Yes       16      n/a           n/a                                      Unique observation identification.  Primary key.

So, there's the table with its vendor and name (PROTOCOL and BIRDS in this case). As an example it has just one field. The valid values/comparisons column can hold multiple items separated by commas, and each item would be enclosed in its own value tag inside the XML.
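The comma rule could be sketched roughly like this. It is written in Python rather than VB.Net or C#, purely to show the logic, which should port directly. The `Value` element name and the `ValidValues` wrapper are assumptions, since the target XML above doesn't show a field with valid values, and `"M, F, U"` is a made-up sample cell.

```python
import xml.etree.ElementTree as ET


def split_valid_values(cell):
    """Split a ValidValues/Comparison cell on commas, dropping empty items."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def valid_values_element(cell):
    """Wrap each item in its own <Value> tag (element names are assumed)."""
    parent = ET.Element("ValidValues")  # hypothetical wrapper element
    for value in split_valid_values(cell):
        ET.SubElement(parent, "Value").text = value
    return parent


if __name__ == "__main__":
    # Made-up sample cell; a real cell would come from the Word table.
    print(ET.tostring(valid_values_element("M, F, U"), encoding="unicode"))
```

An empty cell, as in the OBSERVATION_ID row, yields no `Value` children, so the caller could skip the wrapper entirely in that case.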

Now what I basically need to do is get that to convert to this XML...

<?xml version="1.0" encoding="utf-8"?>
<Formats xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="Formats.xsd">
  <VendorFormats Vendor="PROTOCOL" LastModified="2005-9-13">
    <Format Name="BIRDS" Version="3" VersionDate="2005-9-10">
      <BaseTable>BIRDS</BaseTable>
      <StageTable>STAGE_BIRDS</StageTable>
      <Fields>
        <Text Name="OBSERVATION_ID" Required="Y">
          <NullValue />
          <Description>Unique observation identification.  Primary key.</Description>
          <Length>16</Length>
        </Text>
      </Fields>
    </Format>
  </VendorFormats>
</Formats>

There will always be a base table and a stage table. The base table has the same name as whatever follows the colon in the table heading (PROTOCOL: BIRDS, so it would be BIRDS), and the stage table is always STAGE_ followed by that same name. You'll also notice the version, last-modified date, and version date in the XML. Those can be dealt with later and perhaps added manually.
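A minimal sketch of that mapping, assuming the table rows have already been pulled out of Word as dictionaries keyed by column header (the extraction step itself is not shown). It is in Python only to illustrate the rules. `parse_heading` and `build_format` are hypothetical helper names, writing `Required="N"` for anything other than "Yes" is a guess, and the namespace, version, and date attributes are left out since they can be added later.

```python
import xml.etree.ElementTree as ET


def parse_heading(line):
    """Split a heading such as 'PROTOCOL:  BIRDS' into (vendor, format_name)."""
    vendor, _, name = line.partition(":")
    return vendor.strip(), name.strip()


def build_format(heading, rows):
    """Build a <VendorFormats> element from a heading line and row dicts."""
    vendor, name = parse_heading(heading)
    vendor_el = ET.Element("VendorFormats", Vendor=vendor)
    fmt = ET.SubElement(vendor_el, "Format", Name=name)
    # Base table is the name after the colon; stage table is STAGE_ + name.
    ET.SubElement(fmt, "BaseTable").text = name
    ET.SubElement(fmt, "StageTable").text = "STAGE_" + name
    fields = ET.SubElement(fmt, "Fields")
    for row in rows:
        # The field's element name comes from the Data Type column, e.g. <Text>.
        field = ET.SubElement(
            fields,
            row["Data Type"],
            Name=row["Field Name"],
            Required="Y" if row["Required"] == "Yes" else "N",  # "N" is assumed
        )
        ET.SubElement(field, "NullValue")
        ET.SubElement(field, "Description").text = row["Description"]
        if row["Length"] != "n/a":
            ET.SubElement(field, "Length").text = row["Length"]
    return vendor_el


if __name__ == "__main__":
    sample_row = {
        "Field Name": "OBSERVATION_ID",
        "Data Type": "Text",
        "Required": "Yes",
        "Length": "16",
        "Description": "Unique observation identification.  Primary key.",
    }
    element = build_format("PROTOCOL:  BIRDS", [sample_row])
    print(ET.tostring(element, encoding="unicode"))
```

Keeping the heading parsing separate from the XML building means the same `build_format` step would work whether the rows come from Word interop, a CSV export, or some other intermediate format.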

A: 

This link appears to detail a variety of possible solutions. Probably worth taking a look.

Brian Agnew
Thanks for the link. Can't really use the upCast thing which most of this article refers to since it's not free, but the VB.Net link in there might provide some insight.
Carter