I have a project that requires me to load an HTML document as a string and parse it. I am trying to determine which HTML node will exceed the height of a page (8.5 x 11 inches) so I can insert a ‘page-break-after’ before it. This will be done in a .NET DLL I am producing.

I have tried using the mshtml DOM. It’s not easy to load a string value into it, and when I did manage to, the offsetHeight (etc.) properties always returned zero. The only way I have found to make this work is to save the HTML to disk, load it via SHDocVw.InternetExplorer, and then pass that to the mshtml DOM.

I’m assuming that unless the HTML is ‘rendered’ by SHDocVw, I have no offsetHeight information for mshtml to report, as this is based on screen pixels. I could be wrong.

My current code is as follows:

Dim myIE As New SHDocVw.InternetExplorer
myIE.Navigate("D:\Temp\Test.HTML")

' Navigate is asynchronous, so wait until the document has finished loading.
Do While myIE.Busy OrElse myIE.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE
    System.Threading.Thread.Sleep(50)
Loop

Dim myDoc As mshtml.HTMLDocument = CType(myIE.Document, mshtml.HTMLDocument)

Dim divTag As mshtml.IHTMLElement = myDoc.getElementById("someID")

For Each childNode As mshtml.IHTMLElement In TryCast(divTag.children, mshtml.IHTMLElementCollection)
    If childNode.offsetTop + childNode.offsetHeight > 750 Then ' assuming 72 pixels = 1 inch
        childNode.insertAdjacentHTML("beforeBegin", "<DIV style='page-break-after:always'></DIV>")
    End If
Next
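
One thing to watch in the loop above: comparing against a fixed 750 flags every child that sits below the first page, not only the first child on each page that overflows. Below is a minimal sketch of tracking where the current page starts instead. It works on plain numbers so it runs without a browser. `FindBreakIndexes`, the sample offsets, and the 750-pixel page height are illustrative assumptions, and it assumes all offsets are measured once, before any break is inserted.

```vb
Imports System
Imports System.Collections.Generic

Module PageBreakSketch
    ' Hypothetical helper: given each child's top offset and height in pixels,
    ' return the indexes of the children that should start a new page.
    Function FindBreakIndexes(tops As IList(Of Integer),
                              heights As IList(Of Integer),
                              pageHeight As Integer) As List(Of Integer)
        Dim breaks As New List(Of Integer)
        Dim pageStart As Integer = 0   ' top offset of the current page

        For i As Integer = 0 To tops.Count - 1
            Dim bottom As Integer = tops(i) + heights(i)
            ' Break only when the child overflows the current page and is not
            ' already the first thing on it (an oversized child stays put).
            If bottom - pageStart > pageHeight AndAlso tops(i) > pageStart Then
                breaks.Add(i)
                pageStart = tops(i)
            End If
        Next

        Return breaks
    End Function

    Sub Main()
        ' Made-up sample measurements, standing in for offsetTop / offsetHeight.
        Dim tops As Integer() = {0, 300, 600, 900, 1400}
        Dim heights As Integer() = {300, 300, 300, 500, 200}

        For Each index As Integer In FindBreakIndexes(tops, heights, 750)
            Console.WriteLine("Insert a page break before child " & index)
        Next
    End Sub
End Module
```

With real elements, the two lists would be filled from offsetTop and offsetHeight first, and the breaks inserted in a second pass so the collection is not modified while it is being enumerated.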

I have two goals. #1 is key, #2 ideal.

1) Load the HTML from a string, and have the above code still work.

2) Ideally, find a .NET component that will do the same thing. I don’t like relying on COM components in .NET unless I have no choice.

A: 

The WebBrowser control (I think, not certain) will take your HTML string and convert it to a navigable DOM. Reuse an existing HTML parser rather than reinventing one. You'll have more hair left at the end of your project.
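
A rough sketch of how that might look from a class library with no visible form. It assumes a reference to System.Windows.Forms, runs the control on its own STA thread, and pumps messages until DocumentCompleted fires. `MeasureElementHeight`, the 816-pixel width, and the 10-second timeout are made-up choices for illustration, not anything the control requires.

```vb
Imports System
Imports System.Diagnostics
Imports System.Threading
Imports System.Windows.Forms

Public Module HtmlMeasureSketch
    ' Hypothetical helper: renders the HTML string off-screen and returns the
    ' pixel height of the element with the given id, or -1 if unavailable.
    Public Function MeasureElementHeight(html As String, elementId As String) As Integer
        Dim height As Integer = -1

        Dim worker As New Thread(
            Sub()
                Using browser As New WebBrowser()
                    Dim loaded As Boolean = False
                    AddHandler browser.DocumentCompleted, Sub(sender, e) loaded = True

                    browser.ScriptErrorsSuppressed = True
                    browser.Width = 816          ' assumption: 8.5 in at 96 px per inch
                    browser.DocumentText = html  ' load from a string, no file on disk

                    ' Pump messages until loading finishes or the timeout expires.
                    Dim timer As Stopwatch = Stopwatch.StartNew()
                    While Not loaded AndAlso timer.ElapsedMilliseconds < 10000
                        Application.DoEvents()
                        Thread.Sleep(10)
                    End While

                    If loaded AndAlso browser.Document IsNot Nothing Then
                        Dim element As HtmlElement = browser.Document.GetElementById(elementId)
                        If element IsNot Nothing Then
                            height = element.OffsetRectangle.Height
                        End If
                    End If
                End Using
            End Sub)

        ' The WebBrowser control must be created on an STA thread.
        worker.SetApartmentState(ApartmentState.STA)
        worker.Start()
        worker.Join()
        Return height
    End Function
End Module
```

A call such as `MeasureElementHeight(myHtml, "someID")` would then stand in for the save-to-disk step. Whether the reported sizes match printed output depends on the rendering width and DPI, so treat the numbers as approximate.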

No Refunds No Returns
Not sure I follow. WebBrowser is a System.Windows.Forms class, and I have a formless .dll. Can that work? I'll tinker with this, but I'm curious whether there is a better way.
Brett
Well what do ya know! I added a Windows form with a WebBrowser control, and it works like a charm. I don't see the window appear when called, so that works for me! Thanks!
Brett
I used it in numerous unit tests a coupla years ago. Wasn't sure of the class but glad it worked.
No Refunds No Returns
A: 

Use the Html Agility Pack from CodePlex, which is the most thorough parser for HTML and creates a DOM tree based on the HTML structure.
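
A minimal sketch of loading a string and walking the children of an element, assuming the HtmlAgilityPack assembly is referenced. The sample markup and the module name are made up for illustration. Note that the library only parses the markup and does not render it, so the positions it reports are locations in the source text, not layout coordinates.

```vb
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        ' Made-up sample markup standing in for the real document string.
        Dim html As String = "<div id='someID'><p>One</p><p>Two</p></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' XPath lookup of the container element by its id attribute.
        Dim container As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@id='someID']")
        If container Is Nothing Then Return

        For Each child As HtmlNode In container.ChildNodes
            ' Skip text and comment nodes; only elements are of interest here.
            If child.NodeType = HtmlNodeType.Element Then
                ' Line and LinePosition describe where the tag sits in the source.
                Console.WriteLine("{0} at line {1}, column {2}",
                                  child.Name, child.Line, child.LinePosition)
            End If
        Next
    End Sub
End Module
```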

Hope this helps, Best regards, Tom.

tommieb75
I have tried this. Excellent parser, but there do not appear to be any positional properties to tell me where I am in relation to a page. StreamPosition might be it, but the documentation is not clear as to what this value is.
Brett
Use the usual XPath-like syntax to get at the child nodes of the element with the id 'someID'...
tommieb75
Used it ... has some flaws that make it unusable in some situations. If your HTML is complex, I'd avoid this. Worked OK for simple pages.
No Refunds No Returns