Hello,

I am planning to develop a web crawler that extracts the coordinates of HTML elements from web pages. I have found that it is possible to get HTML element coordinates using the "mshtml" assembly. Is it possible to fetch only the necessary information (HTML, CSS) from a web page and then use the appropriate mshtml classes to get the correct coordinates of all HTML elements? If so, how?

Thank you!

A: 

I am not sure how one could do this in C#, as it's not my language of choice, but it could be done in JavaScript, in particular with jQuery's offset() function.

Mathew Attlee
Could you explain this method a little more? What steps are needed to obtain the information I need?
spacemonkey
It's just a case of calling the function on the DOM elements you want the offset for. There are a few examples in that link.
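To make that a bit more concrete, here is a minimal sketch of the idea in plain JavaScript. It is not jQuery's implementation: it just sums offsetLeft/offsetTop up the offsetParent chain, which gives a roughly comparable document-relative position for simple layouts. The name pagePosition and the three stand-in objects are hypothetical, used only so the snippet runs outside a browser.

```javascript
// Sum offsetLeft/offsetTop up the offsetParent chain (plain-DOM approach).
// With jQuery loaded in a page, $(element).offset() returns a { top, left }
// object relative to the document; this sketch only approximates that.
function pagePosition(element) {
  let left = 0;
  let top = 0;
  for (let node = element; node; node = node.offsetParent) {
    left += node.offsetLeft;
    top += node.offsetTop;
  }
  return { left, top };
}

// Hypothetical stand-ins for DOM nodes so the sketch runs outside a browser.
const body = { offsetLeft: 0, offsetTop: 0, offsetParent: null };
const container = { offsetLeft: 20, offsetTop: 100, offsetParent: body };
const target = { offsetLeft: 5, offsetTop: 12, offsetParent: container };

const position = pagePosition(target);
console.log(position.left, position.top); // prints "25 112"
```

In a real page you would pass an actual element, for example the result of document.getElementById, instead of the stand-in objects.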
Mathew Attlee
+2  A: 

I use these C# functions to determine element positions. You need to pass in a reference to the HTML element in question.

// Sums offsetLeft up the offsetParent chain to get the element's X position.
// The loop runs until obj is null so the top-most ancestor's own offset is included.
public static int findPosX( mshtml.IHTMLElement obj )
{
  int curleft = 0;
  while (obj != null)
  {
    curleft += obj.offsetLeft;
    obj = obj.offsetParent;
  }

  return curleft;
}

// Same walk as findPosX, but accumulating offsetTop for the Y position.
public static int findPosY( mshtml.IHTMLElement obj )
{
  int curtop = 0;
  while (obj != null)
  {
    curtop += obj.offsetTop;
    obj = obj.offsetParent;
  }

  return curtop;
}

I get HTML elements from the current document like so:

// start an instance of IE (needs references to SHDocVw and mshtml, and using System.Threading)
SHDocVw.InternetExplorerClass ie = new SHDocVw.InternetExplorerClass();
ie.Visible = true;

// Load a url
Object Flags = null, TargetFrameName = null, PostData = null, Headers = null;
ie.Navigate( url, ref Flags, ref TargetFrameName, ref PostData, ref Headers );

// wait until the browser has finished loading the page
while( ie.Busy )
{
  Thread.Sleep( 500 );
}

// get an element from the loaded document and compute its position
mshtml.HTMLDocumentClass document = ((mshtml.HTMLDocumentClass)ie.Document);
mshtml.IHTMLElement element = document.getElementById("myelementsid");
int x = findPosX( element );
int y = findPosY( element );
Matthew Lock
A: 

The answer given here is quite good. However, opening up a full instance of Internet Explorer seems like overkill if you're trying to process a good deal of information from within your own spider.

Has anyone found, encountered, or thought of a way to do this using the underlying interface structures of IE (like IHTMLDocument2 and its brethren)?

Thanks in advance.

Dr.HappyPants
This is a question, not an answer.
Anonymous Type
Lame. Lame. Lame.
Dr.HappyPants