Hi,

Background: I have to download web pages with their resources for offline viewing. As part of this I have to "rewrite" the URLs for links within the HTML page so that they work. This is fine for the standard types of links, but I'm realizing now that some links are created dynamically by JavaScript.

Question: What approach (or even existing library) could I use to convert a web page with dynamically generated links (from JavaScript) into a web page with normal, non-dynamic links, so that I can then do the URL rewriting I need?

Notes:

  • It's almost as if I need a JavaScript interpreter library that I pass the page HTML to and that then spits out the generated HTML. Then I can rewrite the links as I wish (the result would then not use the dynamic JavaScript approach).
  • Context is a C# WinForms (3.5) application.

Thanks

PS. Some examples:

<script type="text/javascript">
        <!--
            document.write("<a href=\"/home.asp\" onMouseOver=\"MM_swapImage('tab_home','','/_includes/images/tab_home_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/includes/images/tab_home.gif\" alt=\"Home\" name=\"tab_home\" width=\"45\" height=\"18\" border=\"0\" id=\"tab_home\"><\/a>");

            if (window.document.location.pathname.indexOf("mysite.asp") != "-1") {
                document.write("<a href=\"/mysite.asp\" onMouseOver=\"MM_swapImage('tab_my_site','','/_includes/images/tab_my_site_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/_includes/images/tab_my_site_.gif\" alt=\"My Site\" name=\"tab_my_site\" width=\"76\" height=\"18\" border=\"0\" id=\"tab_my_site\"><\/a>");
            }
            else {
                document.write("<a href=\"/mysite.asp\" onMouseOver=\"MM_swapImage('tab_my_site','','/_includes/images/tab_my_site_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/_includes/images/tab_my_site.gif\" alt=\"My Site\" name=\"tab_my_site\" width=\"76\" height=\"18\" border=\"0\" id=\"tab_my_site\"><\/a>");
            }

and

<script type="text/javascript">
  var fo = new FlashObject("/homepage/ia/flash/hero/banner.swf?q=1", "hero", "642", "250", "8", "#ffffff");
  fo.addParam("wmode", "transparent");
  fo.addParam("allowScriptAccess", "always");
  fo.addParam("base", "/homepage/ia/flash/hero/");
  fo.write("flashContent");
</script>

and

<td width="1%">  
  <a href="javascript:checksubmit(this);" 
      onmouseover="MM_swapImage('but_srch_go','','/_includes/images/but_srch_go_.gif',1)"      
      onmouseout="MM_swapImgRestore()">        
      <img src="http://localhost:3000/sites/http://qheps.health.qld.gov.au/_includes/images/but_srch_go.gif" alt="Go" name="but_srch_go" width="57" height="40" border="0">   
   </a>
</td>
A:

If you're not using the WebBrowser control you might be able to use the JScriptEvaluate method in JScript.NET but chances are you'll need to evaluate more than just a simple expression. The WebBrowser control is certainly the easier route.

If you are using the WebBrowser control, you can invoke the "eval" method from C# pretty easily.

/// <summary>
/// Handles the Navigated event of the browser control.
/// </summary>
/// <param name="sender">The source of the event.</param>
/// <param name="e">The <see cref="T:WebBrowserNavigatedEventArgs"/> instance containing the
/// event data.</param>
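private void browser_Navigated( object sender, WebBrowserNavigatedEventArgs e )
{
    // Script to run inside the loaded page.
    string codeToEval = "window.alert('blah')";

    if ( browser.Document != null ) {
        // The underlying COM window object behind the managed wrapper.
        object window = browser.Document.Window.DomWindow;
        if ( window != null ) {
            // Late-bound COM call; BindingFlags needs "using System.Reflection;".
            Type windowType = window.GetType();
            BindingFlags flags = BindingFlags.InvokeMethod | BindingFlags.Instance;
            string[] args = { codeToEval, "JScript" };

            // Invoke the member by dispatch ID, passing the code and the script language.
            windowType.InvokeMember( "[DispID=1165]", flags, null, window, args );
        }
    }
}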

There is a third option too. You could download the HTML pages as-is, without rewriting the URLs. Then, in the code that presents the HTML to the user, you could trap the click on a link, cancel the navigation, and navigate to the corresponding "offline" link instead.

Josh Einstein
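
The third option in the answer above (trap the click, cancel, navigate to the offline copy) could be sketched roughly as follows. This is only an illustration, not the answerer's code: `OfflineViewerForm`, `AddOfflineCopy`, and the `offlineCopies` lookup are hypothetical names, and the sketch assumes that link clicks reach the handler as the original absolute http/https URLs. Pages opened from disk may resolve relative links to file: URLs instead, in which case the lookup key would need adjusting.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical viewer form: shows saved pages and redirects link clicks
// to their offline copies instead of rewriting the HTML up front.
public class OfflineViewerForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();

    // Assumed lookup built by the downloader: original URL -> local file path.
    private readonly Dictionary<string, string> offlineCopies =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public OfflineViewerForm()
    {
        browser.Dock = DockStyle.Fill;
        Controls.Add(browser);
        browser.Navigating += OnNavigating;
    }

    public void AddOfflineCopy(string originalUrl, string localPath)
    {
        offlineCopies[originalUrl] = localPath;
    }

    private void OnNavigating(object sender, WebBrowserNavigatingEventArgs e)
    {
        // Let local files and non-web schemes (about:, javascript:) through.
        if (e.Url.Scheme != Uri.UriSchemeHttp && e.Url.Scheme != Uri.UriSchemeHttps)
            return;

        // Never go to the network while viewing offline.
        e.Cancel = true;

        // Navigate to the saved copy if the downloader recorded one.
        string localPath;
        if (offlineCopies.TryGetValue(e.Url.AbsoluteUri, out localPath))
            browser.Navigate(localPath);
    }
}
```

With this design the saved HTML stays untouched, and links without an offline copy simply do nothing when clicked.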
Thanks - I'll look into this. So is the idea that I can use this control without having to display it in a form? It's for an under-the-bonnet routine that shouldn't be displayed to the user.
Greg
Well, threading concerns aside, yeah, you could use a WebBrowser control that is fully automated and hidden without any display to the user. It will of course be slower than a pure HTTP request because it'll go through the DOM, the rendering engine, and the scripting engine (which is what you wanted anyway). I suppose you could also try the MSHTML object model without WebBrowser. Both of these options would require you to have an STA thread and probably a message loop, so just be sure to do your work in the WinForms main thread.
Josh Einstein
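
A hidden, automated WebBrowser as described in the comment above might look something like this minimal sketch. It is an assumption-laden illustration rather than a tested solution: `RenderedLinkReader` and `GetRenderedLinks` are made-up names, and instead of the WinForms main thread it uses a dedicated STA thread with its own message loop, which is one way to meet the same requirement. It has no timeout, and it only sees links that exist once the page has finished loading, so anything created later by timers or user events would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Windows.Forms;

public static class RenderedLinkReader
{
    // Loads the page in an off-screen WebBrowser on its own STA thread and
    // returns the href of every link present after the page scripts have run.
    public static List<string> GetRenderedLinks(string url)
    {
        List<string> links = new List<string>();

        Thread worker = new Thread(delegate()
        {
            using (WebBrowser browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true; // no script error dialogs

                browser.DocumentCompleted +=
                    delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
                    {
                        // The event also fires for frames; wait for the whole page.
                        if (browser.ReadyState != WebBrowserReadyState.Complete)
                            return;

                        // Links covers elements with an href, including ones
                        // that document.write added while the page loaded.
                        foreach (HtmlElement link in browser.Document.Links)
                            links.Add(link.GetAttribute("href"));

                        Application.ExitThread(); // ends Application.Run below
                    };

                browser.Navigate(url);
                Application.Run(); // message loop the control needs
            }
        });

        worker.SetApartmentState(ApartmentState.STA);
        worker.Start();
        worker.Join(); // blocks the caller until the page has loaded
        return links;
    }
}
```

A call such as `List<string> links = RenderedLinkReader.GetRenderedLinks("http://example.com/");` would then return the link targets as the browser resolved them. Because `Join` blocks, calling it from the UI thread would freeze the form while the page loads.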
Ummm... what about just executing the JavaScript itself? I'll have the page HTML plus the other JS files/text the page references. In fact, you would think the JavaScript engine/library would only have to execute basic string manipulations?
Greg
Right, but what if the expressions that generate the links depend upon page events? For example, it's very common in jQuery to create elements in response to DOM events, which your client would need to replicate. This is why the WebBrowser control (or possibly MSHTML) would be much simpler: you would literally be loading the page in an actual browser with full automation capabilities.
Josh Einstein
Ugh, the plot keeps thickening. I've added some examples from web pages to the post.
Greg