I'm working with .docx documents, and I need to split a document into sections wherever a paragraph is styled with the "heading 1" style. So if I had a doc like this (markup is pseudocode):

<doc>
<title style>Doc Title</title style>
<heading1>First Section</heading1>
...
<heading1>Second Section</heading1>
...
<heading1>Third Section</heading1>
...
</doc>

I'd want to break this into four sections, the first being the content that precedes the first heading. I figure that this is probably pretty simple once you're familiar with Open XML, but I am not.

TIA.

+1  A: 

Wow...not even any views on this question all day. Well, I figured it out and thought I'd share the wealth. I can't share the code directly, but it's just three nested loops: one looping through the paragraphs, then each paragraph's properties element, then the style inside it. The XPath for each of those is:

.//w:p
./w:pPr
./w:pStyle

Once you find a paragraph whose style is the one you want, you pop back up to the paragraph (w:p) and get its first run (w:r), which will contain the styled text. From there on, it's just Comp Sci 101 stuff. I think the real breakthrough was to not even try to mess with the Open XML SDK (aside from the IO Packaging stuff), and go straight to XML manipulation.
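For anyone who wants something concrete, here is a minimal sketch of that approach in Python using only the standard library. It is not the original code, just an illustration under a few assumptions: the heading style's ID in the document is "Heading1" (the usual ID for the built-in style in English Word, but check your own file), and the helper names read_document_xml, paragraph_style, paragraph_text, and split_sections are made up for this example. It also joins the text of every run in a paragraph rather than taking only the first run, since Word can split a heading's text across several runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, bound to the conventional "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_document_xml(docx_path):
    """Return the raw word/document.xml part of a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read("word/document.xml")


def paragraph_style(paragraph):
    """Return the w:pStyle value of a w:p element, or None if unstyled."""
    style = paragraph.find("./w:pPr/w:pStyle", NS)
    if style is None:
        return None
    return style.get(f"{{{W_NS}}}val")


def paragraph_text(paragraph):
    """Concatenate the text of every w:t element inside the paragraph."""
    return "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS))


def split_sections(document_xml, heading_style="Heading1"):
    """Group paragraph text into sections that start at each heading.

    The first section has heading None and holds whatever precedes the
    first heading. "Heading1" is an assumed style ID, not a guarantee.
    """
    root = ET.fromstring(document_xml)
    sections = [{"heading": None, "paragraphs": []}]
    for paragraph in root.iterfind(".//w:p", NS):
        if paragraph_style(paragraph) == heading_style:
            # A heading paragraph opens a new section.
            sections.append(
                {"heading": paragraph_text(paragraph), "paragraphs": []}
            )
        else:
            # Anything else belongs to the section currently open.
            sections[-1]["paragraphs"].append(paragraph_text(paragraph))
    return sections
```

With a real file you would call something like split_sections(read_document_xml("report.docx")), where "report.docx" is a placeholder path. This only collects plain text per section; if you need to preserve formatting, you would keep the w:p elements themselves instead of their text.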

Chris B. Behrens
You can go ahead and accept your own answer as the correct one.
Otaku