I have a friend who is writing a 400-page book in Microsoft Word 2007.

Throughout the book he has 200 stories, each of which consists of numerous paragraphs.

When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:

Title, varchar(200)
Description, text
Content, text
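For concreteness, here is a minimal sketch of that table using Python's built-in sqlite3 module. The table name `stories`, the helper names, and the choice of SQLite are assumptions made for illustration; the real target database is not specified.

```python
import sqlite3


def create_story_table(conn):
    """Create the three-column table described above (table name is hypothetical)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stories (
            title       VARCHAR(200) NOT NULL,
            description TEXT,
            content     TEXT
        )
        """
    )


def insert_story(conn, title, description, content):
    """Insert one story; parameters are bound so quotes in the text are safe."""
    conn.execute(
        "INSERT INTO stories (title, description, content) VALUES (?, ?, ?)",
        (title, description, content),
    )


# In-memory database, used here only to show the round trip.
conn = sqlite3.connect(":memory:")
create_story_table(conn)
insert_story(conn, "A Sample Title", "Short description",
             "First paragraph.\nSecond paragraph.")
conn.commit()
rows = conn.execute(
    "SELECT title, description, content FROM stories"
).fetchall()
```

Whatever extracts the stories from the Word file only needs to produce (title, description, content) tuples to feed `insert_story`.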

We do not want to copy and paste each story into the database; we want a program to automatically pull the marked-up data from the Word file into the appropriate fields in the database.

  1. What does he have to do in Microsoft Word to denote each group of paragraphs as "story content", each title as a "story title", etc.? A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files, so I assume this is possible, and I assume that stylesheets are what we need. But how exactly do I need to prepare the Word document so that, as he adds stories, they are properly marked up?

  2. I assume that the new COM Interop features of C# 4.0 are what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?

Does anyone have experience doing a project like this (reading Microsoft Word as a semantic data file) that they could share?

A: 

Use Bookmarks for Start and Stop of Each Story

I strongly suggest this technique.

Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.

Then just go through the document collecting the content between the bookmarks.

This is fairly easy, and it is a technique I have been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet this may be an advantage, because the bookmark name could be migrated to a "name" field in the database.
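As a rough illustration of collecting the content between bookmarks, here is a sketch in Python that works on the raw XML of a .docx file rather than through Word itself. It relies on bookmarks being stored as `w:bookmarkStart` (with `w:id` and `w:name`) and `w:bookmarkEnd` (with `w:id`) elements. The function name and the sample bookmark name `Story001` are made up for the example, and skipping names that begin with an underscore is an assumption meant to ignore Word's own hidden bookmarks such as `_GoBack`.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def text_by_bookmark(document_xml):
    """Map each bookmark name to the text found between its start and end."""
    root = ET.fromstring(document_xml)
    open_marks = {}  # bookmark id -> name, for bookmarks currently open
    collected = {}   # bookmark name -> list of text fragments
    for elem in root.iter():  # document order
        if elem.tag == W + "bookmarkStart":
            name = elem.get(W + "name", "")
            if not name or name.startswith("_"):
                continue  # assumed to be a hidden, Word-generated bookmark
            open_marks[elem.get(W + "id")] = name
            collected[name] = []
        elif elem.tag == W + "bookmarkEnd":
            open_marks.pop(elem.get(W + "id"), None)
        elif elem.tag == W + "p":
            # A new paragraph starts: separate it from text already collected.
            for name in open_marks.values():
                if collected[name]:
                    collected[name].append("\n")
        elif elem.tag == W + "t":
            for name in open_marks.values():
                collected[name].append(elem.text or "")
    return {name: "".join(parts) for name, parts in collected.items()}


# Hand-written sample: one bookmark spanning two paragraphs, one paragraph outside.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:p><w:bookmarkStart w:id="0" w:name="Story001"/><w:r><w:t>Title</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Content</w:t></w:r><w:bookmarkEnd w:id="0"/></w:p>'
    "<w:p><w:r><w:t>Outside</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
stories = text_by_bookmark(SAMPLE)
```

Here `stories` maps `Story001` to the two bookmarked paragraphs joined by a newline; the third paragraph is ignored because no bookmark is open when it is reached.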

Using Styles to Mark Story Content

Another technique is to define a specific style (or set of styles) for the paragraphs that make up a story. You then extract the text by style. This is a little harder and can be error-prone if the author is not disciplined about applying the styles.

Using Text Boxes That Contain Story Content

Lastly, if these "stories" can be placed into text boxes, you can simply extract each text box's content. The problems with this approach are the limitations of text boxes and the document layout changes, which the author may not want to apply.

Notes

There are other ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments or questions you have.

AMissico
A: 

Following is the XML for a .docx document that contains a heading with the word "Title" and two paragraphs with the word "Content". Study a sample file of the novel while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped .docx file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339"><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
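To give an idea of what the parsing could look like, here is a minimal Python sketch that opens the .docx as a zip archive, reads word/document.xml, and returns each paragraph's style name (the `w:val` of `w:pStyle`, or `None` when no style is set) together with its text. The function name `read_paragraphs` is hypothetical, and the tiny in-memory archive stands in for a real Word file; it only mirrors the heading and two paragraphs shown above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_file):
    """Return a (style_name, text) pair for every paragraph in the document."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        style = p.find(W + "pPr/" + W + "pStyle")
        style_name = style.get(W + "val") if style is not None else None
        # A paragraph's text may be split across several runs (w:r / w:t).
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        paragraphs.append((style_name, text))
    return paragraphs


# Stand-in for a real file: an archive holding only word/document.xml.
MINIMAL_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>Content</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Content</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", MINIMAL_XML)
buffer.seek(0)
paragraphs = read_paragraphs(buffer)
```

With a real document you would pass the path to the .docx file instead of `buffer`. Every paragraph styled `Heading1` would then mark the start of a new story.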
abel
+1  A: 

Okay, this can be resolved in numerous ways.

First of all, I would suggest that you save the file as a *.txt file, to have some plain text to parse.

Then, your friend will have to be really consistent while writing, because what you will create (a text parser) depends on that consistency.

Make some rules like:

  1. Title on first line, then 2 linebreaks;
  2. All the paragraphs separated with 1 linebreak;
  3. Then 3 linebreaks after the last paragraph;

After that, load the file, and parse it using the rules above.
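A parser for those rules could be sketched like this in Python. It assumes each "linebreak" is a single newline character, so two newlines follow the title, one newline separates paragraphs, and three newlines end a story. The function name and the sample text are invented for the example.

```python
def parse_stories(text):
    """Split plain text into stories using the three line-break rules."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    stories = []
    for chunk in text.split("\n\n\n"):  # rule 3: three breaks end a story
        chunk = chunk.strip("\n")
        if not chunk:
            continue
        # Rule 1: the title is everything before the first double break.
        title, _, body = chunk.partition("\n\n")
        # Rule 2: paragraphs are separated by a single break.
        paragraphs = [p for p in body.split("\n") if p]
        stories.append({"title": title.strip(), "paragraphs": paragraphs})
    return stories


sample = (
    "First Story\n\n"
    "Paragraph one.\n"
    "Paragraph two.\n\n\n"
    "Second Story\n\n"
    "Only paragraph.\n\n\n"
)
stories = parse_stories(sample)
```

Note how fragile this is: a single stray blank line inside a story would be read as the end of the title or of the story, which is why the consistency matters so much.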

{enjoy}

Micael Bergeron
+1 nice suggestion, I'm pursuing it. I suppose I could also create some kind of service that regularly converts the .doc file to .txt, so that he could simply keep his .doc file in some accessible directory and, every 10 minutes or so, the service would convert it to text so it could be parsed by various applications as a text file.
Edward Tanguay
That could be an idea. Take care to check first whether the file is currently open in Word, as that will probably make it read-only. I would suggest just saving the file in both formats whenever he saves.
Micael Bergeron
+2  A: 

What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.
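The traversal logic might look something like the sketch below. It is written in Python rather than as a Word macro, and it assumes the paragraphs have already been read out as (style name, text) pairs. The style names `StoryTitle`, `StoryDescription`, and `StoryContent` are hypothetical; use whatever styles the author actually defines.

```python
def group_stories(paragraphs,
                  title_style="StoryTitle",
                  description_style="StoryDescription",
                  content_style="StoryContent"):
    """Group (style, text) pairs into story records, one per title paragraph."""
    stories = []
    current = None
    for style, text in paragraphs:
        if style == title_style:
            # A title paragraph always starts a new story.
            current = {"title": text, "description": [], "content": []}
            stories.append(current)
        elif current is None:
            continue  # text before the first title belongs to no story
        elif style == description_style:
            current["description"].append(text)
        elif style == content_style:
            current["content"].append(text)
        # Paragraphs in any other style are ignored.
    return [
        {
            "title": s["title"],
            "description": "\n".join(s["description"]),
            "content": "\n".join(s["content"]),
        }
        for s in stories
    ]


sample = [
    ("Normal", "Preface text that is not part of any story."),
    ("StoryTitle", "The First Story"),
    ("StoryDescription", "A short summary."),
    ("StoryContent", "Paragraph one."),
    ("Normal", "An aside in the book, outside the story."),
    ("StoryContent", "Paragraph two."),
    ("StoryTitle", "The Second Story"),
    ("StoryContent", "Only paragraph."),
]
records = group_stories(sample)
```

Because anything not in one of the three styles is skipped, ordinary book text can sit between and even inside stories without ending up in the output.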

Jonathan Yee