Hi.

I have to extract metadata from a large number of Microsoft Office files, mostly Word documents. My small working sample has hundreds; the full set will probably be thousands.

The files range from Word 2.0 to Word 2007.

I have to do it in .NET 3.5 (using C#), and it's a local WinForms application.

I think I can extract metadata from the most recent ones with OLE Automation (DsoFile.dll); I did it successfully with some of them.

The problem is that the older formats aren't supported by DsoFile. They probably don't use OLE.
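One way to find out which files are OLE-based before handing them to DsoFile is to look at the first bytes of each file. The sketch below is only an illustration: the class and method names are hypothetical, and the version ranges in the comments are assumptions, not something confirmed for this file set. It checks for the OLE compound file signature (`D0 CF 11 E0 A1 B1 1A E1`) and for the ZIP signature that `.docx` files start with; anything else is reported as `other`.

```csharp
using System;
using System.IO;

static class OfficeFileSniffer
{
    // Signature of an OLE compound file (assumed: Word 6.0 through Word 2003 .doc files).
    static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP archive (Word 2007 .docx files are ZIP packages).
    static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns "ole", "zip" or "other" based on the first bytes of the file.
    public static string Classify(string path)
    {
        byte[] header = new byte[8];
        int read = 0;
        using (FileStream stream = File.OpenRead(path))
        {
            // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
            int n;
            while (read < header.Length &&
                   (n = stream.Read(header, read, header.Length - read)) > 0)
            {
                read += n;
            }
        }

        if (StartsWith(header, read, OleSignature)) return "ole";
        if (StartsWith(header, read, ZipSignature)) return "zip";
        return "other"; // possibly a pre-OLE Word format, or not a Word file at all
    }

    static bool StartsWith(byte[] data, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

Files classified as `ole` could then go to DsoFile, and the rest to whatever fallback tool ends up being used.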

I did a lot of googling and found that the best (and probably the only) way to get the data I want is antiword (http://www.winfield.demon.nl/). I can run antiword as a separate process and collect its output. It extracts some of the data, but not everything I need. For example, antiword gives me only one of the stored dates, and I need two of them.
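Running an external tool and collecting its output, as described above, can be sketched with `System.Diagnostics.Process`. This is a minimal illustration, not the exact code in use: the `ToolRunner` class and `RunAndCapture` method are hypothetical names, and the executable path and arguments are whatever the tool requires.

```csharp
using System;
using System.Diagnostics;

static class ToolRunner
{
    // Starts a console program, waits for it to finish and returns its standard output.
    public static string RunAndCapture(string exePath, string arguments)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath, arguments);
        info.UseShellExecute = false;        // required for output redirection
        info.RedirectStandardOutput = true;  // capture stdout instead of printing it
        info.CreateNoWindow = true;          // no console window popping up in a WinForms app

        using (Process process = Process.Start(info))
        {
            // Read the output before WaitForExit so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

A call might look like `ToolRunner.RunAndCapture("antiword.exe", "\"" + docPath + "\"")`, assuming `antiword.exe` is on the path and `docPath` holds the document's location. Only standard output is captured here; error output would need `RedirectStandardError` as well.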

There's also wvWare, but I guess it's Linux-only.

Another option would be GNU libextractor, but I can't find a way to use it on .NET.

Office Interop would be a desperate last resort. I haven't tested it, but I'm guessing it's not an option when one wants to process a huge number of files with decent performance.

Can anyone help? If you need more data, just ask.

Sorry for my English; I'm not a native speaker.