I have a friend who is writing a 400-page book in Microsoft Word 2007.
Throughout the book he has 200 stories each which consist of numerous paragraphs.
When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:
Title, varchar(200)
Description, text
Content, text
We do not want to have to copy and paste each story into the database but want to have a program automatically pull the marked up data from the Word file into the appropriate fields in the database.
What does he have to do in Microsoft Word to denote each group of paragraphs as "story content" and each title as a "story title" etc. A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files so I assume this is possible and I assume that stylesheets are what we need, but how do I need to prepare the Word document precisely so that as he adds stories they are properly marked up?
I assume that the new COM Interop features of C# 4.0 is what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?
Does anyone have experience doing a project like this (reading Microsoft Word as a semnatic data file) that they could share?