views: 85

answers: 3

I've been entrusted with an idiotic, pointless task by my boss.

The task is: given a web application that returns a table with pagination, write a piece of software that "reads and parses it", since there is no web service that exposes the raw data. It's essentially a "spider" or "crawler" application to steal data that was never meant to be accessed programmatically.

Now the catch: the application is built with the standard ASPX WebForms engine, so there are no clean URLs or plain form posts, just the dreadful postback engine crowded with JavaScript and inaccessible HTML. The pagination links call the infamous javascript:__doPostBack(param, param), so I doubt it would even work if I tried to simulate clicks on those links.

There are also inputs to filter the results, and they are part of the same postback mechanism, so I can't simulate a regular POST to get the results.
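For reference, the postback those links trigger is ultimately just an HTTP POST carrying a handful of hidden fields (__EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and usually __EVENTVALIDATION), so in principle it can be replayed by hand. The rough sketch below uses HttpWebRequest; the event target is whatever argument __doPostBack receives on the paging link, and the viewstate/validation values have to be scraped from the previously returned page, so treat this as an illustration rather than a working scraper.

```
// Rough sketch: replay a WebForms pagination postback by hand.
// The event target is the first argument __doPostBack is called with on the
// paging link; __VIEWSTATE and __EVENTVALIDATION must be copied verbatim
// from the hidden fields of the page that was just received.
using System.IO;
using System.Net;
using System.Text;
using System.Web; // reference System.Web.dll for HttpUtility

class PostbackSketch
{
    public static string PostPage(string url, string viewState, string eventValidation,
                                  string eventTarget, string eventArgument,
                                  CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies; // keeps the ASP.NET session cookie alive

        string body =
            "__EVENTTARGET=" + HttpUtility.UrlEncode(eventTarget) +
            "&__EVENTARGUMENT=" + HttpUtility.UrlEncode(eventArgument) +
            "&__VIEWSTATE=" + HttpUtility.UrlEncode(viewState) +
            "&__EVENTVALIDATION=" + HttpUtility.UrlEncode(eventValidation);

        byte[] bytes = Encoding.UTF8.GetBytes(body);
        using (Stream stream = request.GetRequestStream())
            stream.Write(bytes, 0, bytes.Length);

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd(); // HTML of the requested page of the grid
    }
}
```

Event validation, dynamically generated control names and any AJAX extensions can all break this approach, which is part of why browser-automation tools come up in the answers.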

I was forced to do something like this in the past, but that was on a more standard website with querystring parameters like pagesize and pagenumber, so I was able to sort it out.

Does anyone have any idea whether this is doable, or should I just tell my boss to stop asking me to do this kind of thing?

EDIT: Maybe I was a bit unclear about what I have to achieve. I have to parse, extract and convert that data to another format - let's say Excel - not just read it. And all of this must be automated, without user input. I don't think Selenium would cut it.
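For the export side, "Excel" can be as modest as a CSV file that Excel opens directly. A minimal sketch, assuming the parsing step produces one string array per table row:

```
// Sketch: write extracted rows out as CSV, which Excel opens directly.
// "rows" is whatever the parsing step produced (one string[] per table row).
using System.Collections.Generic;
using System.IO;

static class CsvExport
{
    public static void Write(string path, IEnumerable<string[]> rows)
    {
        using (StreamWriter writer = new StreamWriter(path))
        {
            foreach (string[] row in rows)
            {
                string[] quoted = new string[row.Length];
                for (int i = 0; i < row.Length; i++)
                    quoted[i] = "\"" + row[i].Replace("\"", "\"\"") + "\""; // escape embedded quotes
                writer.WriteLine(string.Join(",", quoted));
            }
        }
    }
}
```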

EDIT: I just blogged about this situation. Anyone who is interested can check my post at http://matteomosca.com/archive/2010/09/14/unethical-programming.aspx and comment on it.

A: 

Already commented, but I think this is actually an answer.
You need a tool that can click client-side links and wait while the page reloads. Tools like Selenium can do that. Also (from the comments): WatiN, Watir.
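A minimal sketch of that approach with WatiN (its 1.x builds run on .NET 2.0, which should suit the VS2005 constraint mentioned further down). The URL, control IDs and page count are invented, and member names are from memory, so check them against the WatiN documentation:

```
// Sketch: drive the real page in Internet Explorer with WatiN, letting the
// browser execute __doPostBack, and hand each page's HTML to a parser.
using System;
using WatiN.Core;

class Scraper
{
    [STAThread] // WatiN drives IE over COM and needs an STA thread
    static void Main()
    {
        using (IE ie = new IE("http://intranet/report.aspx")) // placeholder URL
        {
            // Filling the filter controls fires the same postbacks a user would.
            ie.TextField(Find.ById("txtCustomer")).TypeText("ACME");
            ie.Button(Find.ById("btnSearch")).Click();

            int pageCount = 10; // assumed known, or read it from the pager row
            for (int page = 1; page <= pageCount; page++)
            {
                ParsePage(ie.Html); // current page's HTML; see the parsing sketch below

                if (page < pageCount)
                    ie.Link(Find.ByText((page + 1).ToString())).Click(); // pager link
            }
        }
    }

    static void ParsePage(string html)
    {
        // extract the table rows here, e.g. with a DOM parser
    }
}
```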

Sergey Mirvoda
A: 

WatiN will help you navigate the site from the perspective of the UI and grab the HTML for you, and you can find information on .NET DOM parsers here.
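One such parser that copes well with the tag soup WebForms emits is HTML Agility Pack. A minimal sketch, assuming the grid renders as a plain &lt;table&gt; with a hypothetical id of GridView1:

```
// Sketch: pull the grid rows out of the scraped HTML with HTML Agility Pack.
// "GridView1" is a placeholder for whatever id the page actually renders.
using System.Collections.Generic;
using HtmlAgilityPack;

static class GridParser
{
    public static List<string[]> Parse(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of non-XHTML markup

        List<string[]> rows = new List<string[]>();
        HtmlNodeCollection trs = doc.DocumentNode.SelectNodes("//table[@id='GridView1']//tr");
        if (trs == null)
            return rows; // grid not found on this page

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("td|th");
            if (cells == null)
                continue;

            string[] row = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                row[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            rows.Add(row);
        }
        return rows;
    }
}
```

The resulting rows can then be fed straight to the CSV writer sketched under the question's first edit.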

David
I have no problem writing code to parse HTML on my own - but have you ever seen the output generated by the ASPX WebForms engine?
Matteo Mosca
@Matteo: Yes, it's pretty awful. (We actually did screen scraping like this at a previous job of mine, tons of data aggregation.) But if it's at least valid HTML then a good DOM parser should be able to handle it. I was under the impression that the issue here was interacting with the page itself, navigating the UI and getting the data to display, which is where Selenium and WatiN may be able to help.
David
Valid HTML? From the WebForms engine? Are you trying to kill me with insane laughter?
Matteo Mosca
http://msdn.microsoft.com/en-us/library/exc57y7e.aspx ASP.NET allows you to create Web pages that are conformant with XHTML standards.
Sergey Mirvoda
The fact that it allows you to do that is clear to me - I accomplished it several times myself, even before MVC came along - but the chances of finding a web app that actually complies with the standards... ;)
Matteo Mosca
+1  A: 

Stop disregarding the tools suggested.

No, a parser you write yourself isn't the same thing as WatiN or Selenium; both of those will work in that scenario.

P.S. Had you mentioned anything about needing to extract the data from Flash/Flex/Silverlight or similar, this would be a different answer.


BTW, the reason to proceed or not is definitely not technical, but ethical and maybe even legal. See my comment on the question for my opinion on this.

eglasius
OK, I'll definitely try those tools and see if they can help. I'm limited to VS2005 for technical reasons on this "project", so I hope they will work with an IDE that outdated. Thanks :)
Matteo Mosca