Folks,

I need to accomplish some sophisticated web crawling.

The goal in simple words: log in to a page, enter values into some text fields, click Submit, then extract some values from the returned page.

What is the best approach?

  1. Some Unit testing 3rd party lib?
  2. Manual crawling in C#?
  3. Maybe there is a ready lib for that specifically?
  4. Any other approach?

This needs to be done within a web app.

Your help is highly appreciated.

+4  A: 

WatiN.

http://watin.sourceforge.net/

var browser = new IE();

browser.GoTo("http://www.mywebsite.com");

browser.TextField("username").TypeText("username goes here"); // alternatively, use .Value = if you don't need to simulate keystrokes.

browser.Button(Find.ById("submitButton")).Click();

and in your asserts on the return page:

Assert.AreEqual("You are logged in as Username.", browser.TextField("username").Value); // you can essentially check any HTML tag, I just used TextField for brevity.

Edit -

After reading the edit about doing this from within a web app, you might consider using WebRequest to fetch the pages and the HTML Agility Pack to validate what you get back:

WebRequest:

http://msdn.microsoft.com/en-us/library/debx8sh9.aspx

HTML Agility Pack:

http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack

Ian P
Oh sorry.. I forgot to mention: I need this to be done inside a web app. That's why I cannot use WatiN.
charlie
Ah, just saw the web app edit.. This won't help with that.. lol
Ian P
Thanks Ian for trying. I appreciate that. Any other direction?
charlie
See the edit I just made.
Ian P
I checked the HTML Agility Pack, and I'm not sure where to take it from there. I mean, I saw their example, and it lets you query the page structure with XPath, but I'm not sure how this takes me closer to the goal, which involves more stuff like HTTP POST, moving the CookieContainer around, and so forth...
charlie
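One way the two pieces suggested above could fit together: WebRequest performs the login POST and carries the session cookie in a shared CookieContainer, and the HTML string it returns is what you would hand to the HTML Agility Pack for XPath queries. What follows is only a minimal sketch. The class name `FormSession`, the helper names, and any URLs or form field names you pass in are assumptions for illustration, not part of either library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class FormSession
{
    // Encodes name/value pairs as an application/x-www-form-urlencoded body.
    // Uri.EscapeDataString writes spaces as %20, which servers accept in place of '+'.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts.ToArray());
    }

    // Sends a GET (body == null) or a form POST and returns the response HTML.
    // Passing the same CookieContainer to every call keeps the login session alive.
    public static string Send(string url, string body, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;

        if (body != null)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(body);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = bytes.Length;
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(bytes, 0, bytes.Length);
            }
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In use, you would create one CookieContainer, call Send with the login URL and the body from BuildFormBody, then call Send again with the results URL and a null body. The returned string can then be loaded with the HTML Agility Pack's HtmlDocument.LoadHtml and queried with SelectSingleNode.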
A: 

If you know what the form post values are supposed to be going in and coming out, you could create an app in C# that uses HttpWebRequest to post to the page and then parses the results. This code is highly specialized for my own use, but you should be able to tweak it to make it do what you want. It's actually part of a bigger class that lets you add post/get items to it and then submits an HTTP request for you.

// build the URI: POST sends the data in the body, GET appends it as a query string
// (TrimEnd drops the trailing '?' when there is no query data)
Uri uri = _type == PostType.Post ? new Uri( url ) : new Uri( ( url + "?" + postData ).TrimEnd( '?' ) );

// create the request
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( uri );

request.Accept = _accept;
request.ContentType = _contentType;
request.Method = _type == PostType.Post ? "POST" : "GET";
request.CookieContainer = _cookieContainer;
request.Referer = _referer;
request.AllowAutoRedirect = _allowRedirect;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3";

// set the timeout to a big value like 2 minutes
request.Timeout = 120000;

// set our credentials
request.Credentials = CredentialCache.DefaultCredentials;

// if we have a proxy set its creds as well
if( request.Proxy != null )
{
   request.Proxy.Credentials = CredentialCache.DefaultCredentials;
}


// add any custom headers before the body is written; changes made after
// the request stream has been opened may not be sent
if( _headers.Count > 0 )
{
  request.Headers.Add( _headers );
}

// append post items if we need to
if( !String.IsNullOrEmpty( _body ) )
{
  using( StreamWriter sw = new StreamWriter( request.GetRequestStream(), Encoding.ASCII ) )
  {
    sw.Write( _body );
  }
}

// for a POST without an explicit body, send the form data instead
if( _type == PostType.Post &&
    String.IsNullOrEmpty( _body ) )
{
  using( Stream writeStream = request.GetRequestStream() )
  {
    UTF8Encoding encoding = new UTF8Encoding();
    byte[] bytes = encoding.GetBytes( postData );

    writeStream.Write( bytes, 0, bytes.Length );
  }
}
// we want to keep this open for a bit
using( HttpWebResponse response = (HttpWebResponse)request.GetResponse() )
{
    // TODO: do something with the response
}//end using
Justin
Thanks Justin. I tried doing this, and for some reason I can't get past the first step. I always end up back at the first URL, and the post doesn't work, even though everything seems to work fine when I do it manually. Do you have any code example for this?
charlie
@charlie, code was added. Again, this is pretty specific to how we do things, but it shows you how to set up the request, and you can certainly change any values you need to. Many times the AllowAutoRedirect property needs to be set to false on posts to prevent automatic redirects, which turn into GETs so that you lose the post. I often find myself doing a post that gets redirected to another page, so you have to do one post, get the redirect URL, and post again to that page. HTTP submits can be a little tricky and can take some work to get right.
Justin
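The redirect handling described in the comment above can be sketched roughly as follows. This is an illustration under assumptions, not Justin's actual class: it assumes the server answers the POST with a 3xx status and a Location header, and the names `RedirectAwarePost`, `PostNoRedirect`, and `ResolveRedirect` are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class RedirectAwarePost
{
    // Resolves a Location header value (absolute or relative) against the URL that was posted to.
    public static Uri ResolveRedirect(Uri requestUri, string location)
    {
        return new Uri(requestUri, location);
    }

    // Posts a form body with automatic redirects disabled.
    // Returns the redirect target if the server sent one, otherwise null.
    public static Uri PostNoRedirect(Uri uri, string formBody, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;
        request.AllowAutoRedirect = false; // keep 3xx responses visible to the caller

        byte[] bytes = Encoding.UTF8.GetBytes(formBody);
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(bytes, 0, bytes.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            string location = response.Headers["Location"];
            if (status >= 300 && status < 400 && !string.IsNullOrEmpty(location))
            {
                return ResolveRedirect(uri, location);
            }
            return null;
        }
    }
}
```

When PostNoRedirect returns a non-null Uri, the caller decides whether to post again to that address or simply fetch it with a GET, reusing the same CookieContainer so the session carries over.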
+1  A: 

I was going to say Selenium, but if you are going to do it internally, I would probably use something like NUnit to write the tests and then run them from the web app.

http://www.nunit.org/

Why within the web-app though? That's like crash testing a car within the car.

Gus
+1 I like the analogy.
Ian P
Good question. The answer is this: I'm working for a bus agency. We work with many bus providers. When we get a service call, the agents need to check with many providers to find the cheapest price. For the providers that give us APIs, the solution is simple. However, for those that only offer websites that require us to log in to get prices, we need to write an app like this. Now, since our in-house system is written as a web app, this crawling is supposed to be part of that web app... Weird at first sight, but highly useful when you think it over...
charlie
Well, so you aren't really testing the web app? Perhaps you could change your question to reflect that you are looking at scraping data from an external website for use inside your web app?
Doon
A: 

You might look at NUnitAsp (it's dead and not maintained but it does pretty much exactly what you want, modulo the fact that it's designed to only deal with websites written in ASP.NET). It should be a good example, though.

Cole
+1  A: 

Not sure how well it would work within a web application, but did you consider giving HtmlUnit a try? I think it should work fine since it's basically a headless web browser.

Steven Sanderson has a blog post about using HtmlUnit in .NET code.

ShaderOp
Seems to bring me closer to our goal... I'll look into it for a few minutes and post the results here. Thanks! Talk to you soon.
charlie
This is where I got stuck: IKVM.OpenJDK.Security.dll from Java needs to be made available in .NET, but I have no clue how to do it...
charlie
No, you don't need anything from the Java SDK. If you look in your IKVM download directory, you'll find a bunch of DLLs, one of which is IKVM.OpenJDK.Security.dll. You'll need to add a reference to it in your project like you would for any .NET assembly. You'll also need to add references to IKVM.OpenJDK.Core, IKVM.OpenJDK.Text, IKVM.OpenJDK.Util, IKVM.OpenJDK.XML.API, IKVM.OpenJDK.XML.Parse, IKVM.OpenJDK.XML.Path, and IKVM.Runtime.dll.
ShaderOp
Steven Sanderson has all the other DLLs [except .Security] ready to be used in .NET in the download on his blog [link above]. But when I downloaded the .Security DLL from the IKVM download directory, it throws this: 'Could not load file or assembly 'IKVM.OpenJDK.Security, Version=0.42.0.3, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)' Hmm... maybe it is because the DLL is not prepared for .NET?
charlie
All the IKVM.OpenJDK.* files are .NET assemblies, so they don't need any preparation for .NET. I think the best thing to do is retrace your steps and make sure you're including all the assemblies mentioned above. And if the downloads from Sanderson's blog are working for you, then I don't think you need to fiddle with downloading IKVM yourself, since Sanderson's download is up to date.
ShaderOp
I see that more people experience the same issue: Sanderson's code fails when dealing with HTTPS pages... See http://blog.stevensanderson.com/2010/03/30/using-htmlunit-on-net-for-headless-browser-automation/ comment #6. Since Sanderson's example does NOT include the .Security DLL, I had to download it from IKVM. In any case, it still throws this .Security error... Maybe I'm including it incorrectly? All I did was add a reference to it in Solution Explorer > Add Reference.
charlie
I must confess that I haven't tried using HtmlUnit with HTTPS requests. But I do have a project that references IKVM.OpenJDK.Security with no issues. You can take a look at it at http://bitbucket.org/shaderop/htmlunitdriver
ShaderOp
Thanks so much. This is exactly what I needed.
charlie
OK. This solution works perfectly for me. The problem was that I had taken the .Security DLL from a newer version on IKVM's website. Once I used the older version, which matched the version in Sanderson's example, everything was fine. Thanks loads for the help!
charlie
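The manifest mismatch reported above (HRESULT 0x80131040) generally means the assembly found on disk has a different version from the one the project was compiled against, which matches the cause identified here. A quick way to spot this is to print the version of every IKVM DLL in the folder and check that they agree. This is a minimal sketch; the class name `AssemblyVersions`, the method name, and the folder you pass in are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection;

public static class AssemblyVersions
{
    // Returns "Name, Version=x.y.z.w" for every IKVM*.dll found in the given folder.
    // AssemblyName.GetAssemblyName reads the manifest without loading the assembly.
    public static List<string> ListIkvmVersions(string directory)
    {
        var result = new List<string>();
        foreach (string path in Directory.GetFiles(directory, "IKVM*.dll"))
        {
            AssemblyName name = AssemblyName.GetAssemblyName(path);
            result.Add(name.Name + ", Version=" + name.Version);
        }
        return result;
    }
}
```

Running this against the project's bin folder and comparing the output should show whether one DLL, such as IKVM.OpenJDK.Security, carries a different version number from the rest.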
My pleasure. Good luck with your project :)
ShaderOp