views: 85

answers: 3

I've been entrusted with an idiotic, pointless task by my boss.

The task is: given a web application that returns a table with pagination, write a piece of software that "reads and parses it", since there is no web service that exposes the raw data. It's essentially a "spider" or "crawler" application to steal data that was never meant to be accessed programmatically.

Now the catch: the application is built with the standard ASPX WebForms engine, so there are no clean URLs or plain form posts, just the dreadful postback engine crowded with JavaScript and inaccessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I doubt it would even work if I tried to simulate clicks on those links.

There are also inputs to filter the results, and they are part of the same postback mechanism, so I can't simulate a regular POST to get the results.
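For reference, the postback those links trigger is ultimately just an HTTP POST carrying a handful of hidden fields (__EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and usually __EVENTVALIDATION), so in principle it can be replayed by hand. The rough sketch below uses HttpWebRequest; the event target is whatever argument __doPostBack receives on the paging link, and the viewstate/validation values have to be scraped from the previously returned page, so treat this as an illustration rather than a working scraper.

```
// Rough sketch: replay a WebForms pagination postback by hand.
// The event target is the first argument __doPostBack is called with on the
// paging link; __VIEWSTATE and __EVENTVALIDATION must be copied verbatim
// from the hidden fields of the page that was just received.
using System.IO;
using System.Net;
using System.Text;
using System.Web; // reference System.Web.dll for HttpUtility

class PostbackSketch
{
    public static string PostPage(string url, string viewState, string eventValidation,
                                  string eventTarget, string eventArgument,
                                  CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies; // keeps the ASP.NET session cookie alive

        string body =
            "__EVENTTARGET=" + HttpUtility.UrlEncode(eventTarget) +
            "&__EVENTARGUMENT=" + HttpUtility.UrlEncode(eventArgument) +
            "&__VIEWSTATE=" + HttpUtility.UrlEncode(viewState) +
            "&__EVENTVALIDATION=" + HttpUtility.UrlEncode(eventValidation);

        byte[] bytes = Encoding.UTF8.GetBytes(body);
        using (Stream stream = request.GetRequestStream())
            stream.Write(bytes, 0, bytes.Length);

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd(); // HTML of the requested page of the grid
    }
}
```

Event validation, dynamically generated control names and any AJAX extensions can all break this approach, which is part of why browser-automation tools come up in the answers.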

I was forced to do something like this in the past, but that was on a more standard website with querystring parameters like pagesize and pagenumber, so I was able to sort it out.

Does anyone have any idea whether this is doable, or should I just tell my boss to stop asking me to do this kind of thing?

EDIT: Maybe I was a bit unclear about what I have to achieve. I have to parse, extract and convert that data to another format - let's say Excel - not just read it. And all of this must be automated, without user input. I don't think Selenium would cut it.
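For the export side, "Excel" can be as modest as a CSV file that Excel opens directly. A minimal sketch, assuming the parsing step produces one string array per table row:

```
// Sketch: write extracted rows out as CSV, which Excel opens directly.
// "rows" is whatever the parsing step produced (one string[] per table row).
using System.Collections.Generic;
using System.IO;

static class CsvExport
{
    public static void Write(string path, IEnumerable<string[]> rows)
    {
        using (StreamWriter writer = new StreamWriter(path))
        {
            foreach (string[] row in rows)
            {
                string[] quoted = new string[row.Length];
                for (int i = 0; i < row.Length; i++)
                    quoted[i] = "\"" + row[i].Replace("\"", "\"\"") + "\""; // escape embedded quotes
                writer.WriteLine(string.Join(",", quoted));
            }
        }
    }
}
```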

EDIT: I just blogged about this situation. Anyone who is interested can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment on it.

A: 

Already commented, but I think this is actually an answer.
You need a tool that can click client-side links and wait while the page reloads. Tools like Selenium can do that. Also (from the comments): WatiN, Watir.
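A minimal sketch of that approach with WatiN (its 1.x builds run on .NET 2.0, which should suit the VS2005 constraint mentioned further down). The URL, control IDs and page count are invented, and member names are from memory, so check them against the WatiN documentation:

```
// Sketch: drive the real page in Internet Explorer with WatiN, letting the
// browser execute __doPostBack, and hand each page's HTML to a parser.
using System;
using WatiN.Core;

class Scraper
{
    [STAThread] // WatiN drives IE over COM and needs an STA thread
    static void Main()
    {
        using (IE ie = new IE("http://intranet/report.aspx")) // placeholder URL
        {
            // Filling the filter controls fires the same postbacks a user would.
            ie.TextField(Find.ById("txtCustomer")).TypeText("ACME");
            ie.Button(Find.ById("btnSearch")).Click();

            int pageCount = 10; // assumed known, or read it from the pager row
            for (int page = 1; page <= pageCount; page++)
            {
                ParsePage(ie.Html); // current page's HTML; see the parsing sketch below

                if (page < pageCount)
                    ie.Link(Find.ByText((page + 1).ToString())).Click(); // pager link
            }
        }
    }

    static void ParsePage(string html)
    {
        // extract the table rows here, e.g. with a DOM parser
    }
}
```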

Sergey Mirvoda
A: 

WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
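One such parser that copes well with the tag soup WebForms emits is HTML Agility Pack. A minimal sketch, assuming the grid renders as a plain &lt;table&gt; with a hypothetical id of GridView1:

```
// Sketch: pull the grid rows out of the scraped HTML with HTML Agility Pack.
// "GridView1" is a placeholder for whatever id the page actually renders.
using System.Collections.Generic;
using HtmlAgilityPack;

static class GridParser
{
    public static List<string[]> Parse(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of non-XHTML markup

        List<string[]> rows = new List<string[]>();
        HtmlNodeCollection trs = doc.DocumentNode.SelectNodes("//table[@id='GridView1']//tr");
        if (trs == null)
            return rows; // grid not found on this page

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("td|th");
            if (cells == null)
                continue;

            string[] row = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                row[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            rows.Add(row);
        }
        return rows;
    }
}
```

The resulting rows can then be fed straight to the CSV writer sketched under the question's first edit.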

David
I have no problem writing code to parse HTML on my own - but have you ever seen the output generated by the ASPX WebForms engine?
Matteo Mosca
@Matteo: Yes, it's pretty awful. (We actually did screen scraping like this at a previous job of mine, tons of data aggregation.) But if it's at least valid HTML then a good DOM parser should be able to handle it. I was under the impression that the issue here was interacting with the page itself, navigating the UI and getting the data to display, which is where Selenium and WatiN may be able to help.
David
Valid HTML? From the WebForms engine? Are you trying to kill me with insane laughter?
Matteo Mosca
http://msdn.microsoft.com/en-us/library/exc57y7e.aspx ASP.NET allows you to create Web pages that are conformant with XHTML standards.
Sergey Mirvoda
The fact that it allows you to do that is clear to me - I accomplished it several times myself, even before MVC came along - but the chances of finding a web app that actually complies with the standards... ;)
Matteo Mosca
+1  A: 

Stop disregarding the tools suggested.

No, a parser you write yourself isn't the same thing as WatiN or Selenium; both of those will work in that scenario.

P.S. Had you mentioned anything about needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.


BTW, the reason to proceed or not is definitely not technical, but ethical and maybe even legal. See my comment on the question for my opinion on this.

eglasius
OK, I'll definitely try those tools and see if they can help. I'm limited to VS2005 for technical reasons on this "project", so I hope they will work with an IDE that outdated. Thanks :)
Matteo Mosca