Hello everyone. I want to traverse all the elements of a Word document one by one and process each element according to its type (header, sentence, table, image, text box, shape, etc.). I searched the Office interop API for an enumerator or object that represents the elements of a document, but could not find one. The API offers Sentences, Paragraphs, and Shapes collections, but it doesn't provide a generic object that can point to the next element. For example:

<header of document>
<plain text sentences>
<table with many rows,columns>
<text box>
<image>
<footer>

(Please imagine it as a word document)


So, I want an enumerator that first gives me <header of document>, then on the next iteration gives me <plain text sentences>, then <table with many rows,columns>, and so on. Does anyone know how we can achieve this? Is it possible?

I am using C#, Visual Studio 2005, and Word 2003.

Thanks a lot

+2  A: 

The reason that you don't have a simple iterator is that Word documents can be far more complex than the simple structure outlined in your question.

For example, a document may have multiple headers and footers (for the first page as well as for even and odd pages), contain more than one section with different header and footer setups, and contain footnotes, comments, and revisions. Objects such as tables, text boxes, images, and shapes may appear inline with text or floating. In short, there is no fixed sequence of elements.

You would have to check how complex your input documents are and, based on the result of that analysis, decide how to iterate over paragraphs, attached images, shapes, and so on.
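
As a rough illustration of what such an iteration could look like, here is a minimal sketch and not a definitive implementation. It assumes a reference to the Word primary interop assembly and an already opened `Word.Document`. The class name `DocumentWalker`, the method name `Walk`, and the three-way "table / inline image / text" classification are hypothetical choices made for this example. It walks every story (main text, headers, footers, footnotes, text frames) paragraph by paragraph, then lists the floating shapes separately, because those are anchored to the text rather than being part of it.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper for illustration only.
static class DocumentWalker
{
    public static void Walk(Word.Document doc)
    {
        // StoryRanges yields the first range of each story type
        // (main text, headers, footers, footnotes, text frames, ...).
        foreach (Word.Range firstStory in doc.StoryRanges)
        {
            // Further stories of the same type (for example the headers of
            // later sections) are chained through NextStoryRange.
            for (Word.Range story = firstStory; story != null; story = story.NextStoryRange)
            {
                Console.WriteLine("== Story: " + story.StoryType);

                foreach (Word.Paragraph para in story.Paragraphs)
                {
                    Word.Range r = para.Range;

                    // Written for C# 2.0 (Visual Studio 2005); newer compilers
                    // also accept r.Information[...] for this indexed property.
                    bool inTable = (bool)r.get_Information(Word.WdInformation.wdWithInTable);

                    string kind;
                    if (inTable)
                        kind = "table";          // one hit per paragraph inside a cell
                    else if (r.InlineShapes.Count > 0)
                        kind = "inline image/object";
                    else
                        kind = "text";

                    Console.WriteLine("[" + r.Start + "] " + kind + ": " + (r.Text ?? "").Trim());
                }
            }
        }

        // Floating shapes are not part of the text flow; each one is anchored
        // to a position in the text. doc.Shapes covers the main document;
        // shapes in headers and footers live in their own Shapes collections.
        foreach (Word.Shape shape in doc.Shapes)
        {
            Console.WriteLine("floating shape '" + shape.Name + "' anchored at " + shape.Anchor.Start);
        }
    }
}
```

Sorting the main-story paragraphs and the shape anchors by their start position would give an approximate reading order for the body text, but how well that matches what a reader sees depends on how complex the document is.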

0xA3
@0xA3, thanks a lot for explaining. So basically we cannot enumerate through document elements. I asked this question because I am facing problems while processing Word tables and different shapes. Sometimes the Word API skips a few words from the document, and hence my program fails. Also, the API gives wrong sentences: if the sentence is 'Worked as Sr. Programmer', I get 'Worked as Sr.' as one sentence and 'Programmer' as a second. It should have been a single sentence. I want to avoid these kinds of problems. There are also many more problems with the interop API.
Shekhar
@shekhar: Sure, you can iterate over the contents, but not in a simple manner. Word gives you full access to all objects. Regarding sentence segmentation, you need to consider that this is a non-trivial research topic in natural language processing.
0xA3
@0xA3, is it possible to iterate over the contents? How do I do that?
Shekhar