ansaurus

Question

How to prepare a Word 2007 document so that C# can pull data out of it semantically?

Answer 1

A:

Use Bookmarks for Start and Stop of Each Story

I strongly suggest this technique.

Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.

Then just go through the document collecting the content between the bookmarks.

This is fairly easy, and it is a technique I have been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet this may be an advantage, because each bookmark name could then be migrated to a "name" field in the database.
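As a rough sketch of "collecting the content between the bookmarks" without automating Word itself: a .docx file is a zip package, and word/document.xml marks each bookmark with a w:bookmarkStart element (carrying w:id and w:name) and a matching w:bookmarkEnd (same w:id). The class and method names below are made up for illustration, and the code assumes .NET 4.5 or later (for ZipFile) and that the document is saved as .docx rather than .doc.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: the names are illustrative, not part of any library.
public static class BookmarkStories
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Loads word/document.xml from the .docx package (a zip archive).
    public static XDocument LoadMainPart(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        using (Stream part = zip.GetEntry("word/document.xml").Open())
        {
            return XDocument.Load(part);
        }
    }

    // Returns bookmark name -> text between its bookmarkStart and bookmarkEnd.
    public static Dictionary<string, string> Read(XDocument doc)
    {
        var open = new Dictionary<string, string>();           // bookmark id -> name
        var stories = new Dictionary<string, StringBuilder>(); // name -> collected text

        foreach (XElement el in doc.Descendants())             // document order
        {
            if (el.Name == W + "bookmarkStart")
            {
                string id = (string)el.Attribute(W + "id");
                string name = (string)el.Attribute(W + "name");
                // Names starting with "_" (such as _GoBack) are Word's hidden bookmarks.
                if (id == null || name == null || name.StartsWith("_")) continue;
                open[id] = name;
                stories[name] = new StringBuilder();
            }
            else if (el.Name == W + "bookmarkEnd")
            {
                open.Remove((string)el.Attribute(W + "id") ?? "");
            }
            else if (el.Name == W + "p")
            {
                // A paragraph that starts inside an open bookmark becomes a line break.
                foreach (string name in open.Values)
                    if (stories[name].Length > 0) stories[name].Append('\n');
            }
            else if (el.Name == W + "t")
            {
                foreach (string name in open.Values) stories[name].Append(el.Value);
            }
        }
        return stories.ToDictionary(s => s.Key, s => s.Value.ToString().TrimEnd('\n'));
    }
}
```

Usage would be something like BookmarkStories.Read(BookmarkStories.LoadMainPart("novel.docx")), where novel.docx is a placeholder file name. The dictionary keys are the bookmark names, which is what makes them usable as the "name" field mentioned above.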

Using Styles to Mark Story Content

Another technique is to define a specific style, or set of styles, for the story content. You then extract the text formatted with those styles. This is a little harder and can be error-prone if the author is not disciplined.
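A minimal sketch of the style-based extraction, reading word/document.xml directly: each paragraph (w:p) records its style ID in w:pPr/w:pStyle, and paragraphs in the default style have no w:pStyle element at all. Note that the value is the style ID (for example "Heading1"), not the display name shown in Word. The class name and the "StoryBody" style ID are hypothetical, and the code assumes .NET 4.5+ and C# 6.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: the names are illustrative, not part of any library.
public static class StyledParagraphs
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Loads word/document.xml from the .docx package (a zip archive).
    public static XDocument LoadMainPart(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        using (Stream part = zip.GetEntry("word/document.xml").Open())
        {
            return XDocument.Load(part);
        }
    }

    // Returns the text of every paragraph whose style ID equals styleId.
    public static List<string> Read(XDocument doc, string styleId)
    {
        return doc.Descendants(W + "p")
            .Where(p => (string)p.Element(W + "pPr")
                                 ?.Element(W + "pStyle")
                                 ?.Attribute(W + "val") == styleId)
            // A paragraph's text is split across runs, so join all w:t pieces.
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
            .ToList();
    }
}
```

For example, StyledParagraphs.Read(StyledParagraphs.LoadMainPart("novel.docx"), "StoryBody") would return only the paragraphs the author formatted with that style, which is why an undisciplined author breaks this approach.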

Using Text Boxes That Contain Story Content

Lastly, if these "stories" can be placed into text boxes, you can simply extract each text box's content. The problems with this approach are the limitations of text boxes and the document layout changes, which the author may not want to apply.
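For the text-box route, here is a hedged sketch that again reads word/document.xml: the paragraphs of a text box sit inside a w:txbxContent element. Newer Word versions may store each text box twice (a DrawingML version plus a VML copy under mc:Fallback), so the sketch skips the fallback copies; for a file saved by Word 2007 that filter should simply match nothing. The class name is hypothetical and the code assumes .NET 4.5+.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: the names are illustrative, not part of any library.
public static class TextBoxStories
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace Mc =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Loads word/document.xml from the .docx package (a zip archive).
    public static XDocument LoadMainPart(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        using (Stream part = zip.GetEntry("word/document.xml").Open())
        {
            return XDocument.Load(part);
        }
    }

    // Returns one string per text box, with paragraphs separated by line breaks.
    public static List<string> Read(XDocument doc)
    {
        return doc.Descendants(W + "txbxContent")
            // Ignore the compatibility copy of a text box, if there is one.
            .Where(box => !box.Ancestors(Mc + "Fallback").Any())
            .Select(box => string.Join("\n",
                box.Descendants(W + "p").Select(p =>
                    string.Concat(p.Descendants(W + "t").Select(t => t.Value)))))
            .ToList();
    }
}
```

Unlike bookmarks, text boxes carry no author-chosen name in this sketch, so the stories come back only in document order.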

Notes

There are other ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments/questions you have.

MSDN Search for "vsto word bookmark" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%20bookmark&refinement=-112&ac=3
MSDN Search for "vsto word 2007" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%202007&refinement=-112&ac=3

AMissico 2010-08-07 18:31:18

Answer 2

A:

Following is the XML for a docx document that contains a heading with the word "Title" and two paragraphs with the word "Content". Study a sample file of the novel while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped docx file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Title</w:t></w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r><w:t>Content</w:t></w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r><w:t>Content</w:t></w:r>
    </w:p>
    <w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
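To show how such a file could be parsed, here is a small sketch that walks the body paragraphs of word/document.xml and starts a new story at every paragraph styled "Heading1", treating the paragraphs that follow as that story's content. The Story and StorySplitter names are invented for this example, and it assumes the author keeps one heading per story, .NET 4.5+ and C# 6.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical types: the names are illustrative, not part of any library.
public sealed class Story
{
    public string Title;
    public List<string> Paragraphs = new List<string>();
}

public static class StorySplitter
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Loads word/document.xml from the .docx package (a zip archive).
    public static XDocument LoadMainPart(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        using (Stream part = zip.GetEntry("word/document.xml").Open())
        {
            return XDocument.Load(part);
        }
    }

    // Each heading paragraph opens a story; later paragraphs become its content.
    public static List<Story> Split(XDocument doc, string headingStyleId = "Heading1")
    {
        var stories = new List<Story>();
        XElement body = doc.Root.Element(W + "body");
        foreach (XElement p in body.Elements(W + "p"))
        {
            string text = string.Concat(p.Descendants(W + "t").Select(t => t.Value));
            string style = (string)p.Element(W + "pPr")
                                    ?.Element(W + "pStyle")
                                    ?.Attribute(W + "val");
            if (style == headingStyleId)
                stories.Add(new Story { Title = text });
            else if (stories.Count > 0 && text.Length > 0)
                stories[stories.Count - 1].Paragraphs.Add(text); // skips empty paragraphs
        }
        return stories;
    }
}
```

Run against the sample above, this should yield a single story titled "Title" with two "Content" paragraphs. Text that appears before the first heading is ignored in this sketch.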

abel 2010-08-07 18:46:41

Answer 3

+1 A:

Okay, this can be resolved in numerous ways.

First of all, I would suggest that you save the file as *.txt, to have some plain text to parse.

Then your friend will have to be really consistent while writing, because what you will create (a text parser) needs consistent input.

Make some rules like:

Title on first line, then 2 linebreaks;
All the paragraphs separated with 1 linebreak;
Then 3 linebreaks after the last paragraph;

After that, load the file, and parse it using the rules above.
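A minimal sketch of such a parser, under one reading of the rules above: "2 linebreaks" after the title means one blank line, paragraphs are on consecutive lines, and "3 linebreaks" means two blank lines between stories. The type and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types: the names are illustrative, not part of any library.
public sealed class TextStory
{
    public string Title;
    public List<string> Paragraphs;
}

public static class PlainTextStories
{
    public static List<TextStory> Parse(string text)
    {
        text = text.Replace("\r\n", "\n"); // Windows line endings -> "\n"
        var stories = new List<TextStory>();

        // Stories are split first, because "\n\n\n" also contains "\n\n".
        foreach (string chunk in text.Split(new[] { "\n\n\n" },
                                            StringSplitOptions.RemoveEmptyEntries))
        {
            string story = chunk.Trim('\n');
            if (story.Length == 0) continue;

            // The first blank line separates the title from the paragraphs.
            int split = story.IndexOf("\n\n", StringComparison.Ordinal);
            string title = split < 0 ? story : story.Substring(0, split);
            string body = split < 0 ? "" : story.Substring(split + 2);

            stories.Add(new TextStory
            {
                Title = title.Trim(),
                Paragraphs = body.Split('\n')
                                 .Select(p => p.Trim())
                                 .Where(p => p.Length > 0)
                                 .ToList()
            });
        }
        return stories;
    }
}
```

It could be called as PlainTextStories.Parse(System.IO.File.ReadAllText("novel.txt")), where novel.txt is a placeholder name. If Word saved the text in an encoding other than UTF-8, pass the matching Encoding to ReadAllText.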

{enjoy}

Micael Bergeron 2010-08-12 19:44:30

+1 Nice suggestion, I'm pursuing it. I suppose I could also create some kind of service that regularly converts the .doc file to .txt. He could simply keep his .doc file in some accessible directory, and every 10 minutes or so the service would convert it to text so that various applications could parse it as a text file.

Edward Tanguay 2010-08-13 10:15:26

That could be an idea. Take care to check first whether the file is currently open in Word, as that will probably make it read-only. I would suggest just saving the file in both formats whenever it is saved.

Micael Bergeron 2010-08-13 13:01:25

Answer 4

+2 A:

What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.

Jonathan Yee 2010-08-13 21:36:33
