I need to determine which pages of a Word document a keyword occurs on. I have some tools that can extract the document's text, but nothing that tells me which page a given piece of text falls on. Does anyone have a good starting point for me? I'm using .NET.

Thanks!

edit: Additional constraint: I can't use any of the Interop stuff.

edit2: If anybody knows of stable libraries that can do this, that'd also be helpful. I use Aspose, but as far as I know it doesn't offer anything for this.

+1  A: 

This is how I get the text out. I believe you can set the selection range to a single page and then test that text. It might be a little backwards from what you need, but it could be a place to start.

Microsoft.Office.Interop.Word.Application wordApplication = new Microsoft.Office.Interop.Word.Application();
object missing = Type.Missing;
object fileName = @"c:\file.doc";
object objFalse = false;

wordApplication.DisplayAlerts = Microsoft.Office.Interop.Word.WdAlertLevel.wdAlertsNone;
Microsoft.Office.Interop.Word.Document doc = wordApplication.Documents.Open(ref fileName, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,ref objFalse, ref missing, ref missing, ref missing, ref missing);

// I believe you can define a selection range and use it here instead of WholeStory()
doc.ActiveWindow.Selection.WholeStory();
doc.ActiveWindow.Selection.Copy();

IDataObject data = Clipboard.GetDataObject();
string text = data.GetData(DataFormats.Text).ToString();

doc.Close(ref missing, ref missing, ref missing);
doc = null;

wordApplication.Quit(ref missing, ref missing, ref missing);
wordApplication = null;
Douglas Anderson
Thanks! I definitely appreciate the answer. I guess I should have mentioned my constraints: I can't use Interop.
Adam A
I'm marking this as the best answer I could get. Hopefully it'll help someone else in the future.
Adam A
A: 

How are you defining a page?

If you only count section breaks and hard page breaks, it is complex but doable. If you want to count soft page breaks, the task becomes very, very difficult and somewhat meaningless. Consider that soft page breaks are computed dynamically at run time and are not stored in the file itself. Where they land depends on a huge number of factors, including the active printer driver (yes, it can change for the same file on a different computer), fonts, kerning, line spacing, margins, and so on.
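
For the hard-break-only case, here is a minimal sketch of the idea. It assumes your text extraction tool emits a form feed character ('\f') wherever the document has a hard page break; that is an assumption to verify against your own tool's output, and section breaks are not handled. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class HardBreakPages
{
    // Returns the 1-based numbers of the "pages" that contain the keyword.
    // Assumption: each hard page break appears in the text as '\f'.
    // Soft page breaks are invisible here, so these are not printed pages.
    public static List<int> PagesContaining(string documentText, string keyword)
    {
        var pages = new List<int>();
        string[] chunks = documentText.Split('\f');
        for (int i = 0; i < chunks.Length; i++)
        {
            if (chunks[i].IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                pages.Add(i + 1);
            }
        }
        return pages;
    }

    static void Main()
    {
        string text = "alpha\fbeta keyword\fgamma KEYWORD";
        // The keyword appears in the second and third chunks.
        Console.WriteLine(string.Join(", ", PagesContaining(text, "keyword")));
    }
}
```

The numbers this gives only match printed page numbers if every page in the document ends with a hard break, which is rarely true.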

JohnFx
Unfortunately I want the soft, very very difficult version. I wouldn't say it's meaningless though. It's the only definition of a page that actually means anything in the real world (anything else isn't really WYSIWYG). I definitely appreciate the clarification, and thanks a lot for your response.
Adam A
In that case you are likely going to have to "print" to a fixed-page format such as TIFF or PDF to accomplish this. Make sure you always do the operation on the same machine with the same printer driver selected, so that you get consistent results.
JohnFx
A: 

One crappy way to do this with Aspose is to convert the Word file to a PDF and then grab text on each page.

I don't know anything about the Aspose internals or how they define their soft pages when converting, but this is the best I've got so far.
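
Once the text of each PDF page is available as a string, the search itself is simple. The sketch below assumes the per-page extraction has already been done by whatever PDF tool you use; it makes no Aspose calls, and the names PageSearch and FindPages are made up for illustration. Whitespace is collapsed first because PDF text extraction often breaks a phrase across lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PageSearch
{
    // pageTexts: one string per PDF page, in page order (assumed to be
    // produced elsewhere by a PDF text extractor).
    // Returns the 1-based page numbers on which the keyword occurs.
    public static List<int> FindPages(IEnumerable<string> pageTexts, string keyword)
    {
        var hits = new List<int>();
        int pageNumber = 0;
        foreach (string pageText in pageTexts)
        {
            pageNumber++;
            // Collapse line breaks and runs of spaces into single spaces.
            string normalized = Regex.Replace(pageText ?? "", @"\s+", " ");
            if (normalized.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(pageNumber);
            }
        }
        return hits;
    }

    static void Main()
    {
        var pages = new[] { "Intro text", "The page\nlayout model", "Appendix" };
        // Only the second page contains the phrase once whitespace is normalized.
        Console.WriteLine(string.Join(", ", FindPages(pages, "page layout")));
    }
}
```

A keyword that straddles a page boundary will be missed by this approach, since each page is searched on its own.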

Adam A
A: 

Thank you for using Aspose.Words.

In the public API we currently expose only the "flow-document" information, e.g. paragraphs, tables, lists, etc. Internally, we build a page layout model that has classes like page, block of text, line of text and so on. There are of course internal links between the document model and the layout model, so it is possible to find out where each page ends and similar details. Making this information available via the public API is (well, still) high on our priority list.

Have you logged your request in the Aspose.Words support forums? We use this info to maintain a voting system and will work on features that get more votes first.

romeok
Not sure about logging the request, but I did ask about it on an existing thread there. I was told it was coming up and then never heard back. Converting to a PDF and then using GetNextPageText worked for me, but probably won't scale well in the future. If you guys did end up making it public that'd definitely be great. Thanks for taking interest!
Adam A