A friend of mine is looking for a good way to generate ASPX pages from the HTML pages produced by a legacy ASP application.

The idea is to run the legacy app, capture the HTML output, clean it with a tool such as HTML Tidy, and then parse and transform it into ASPX (using XSLT or a custom tool), so that the existing HTML elements, divs, images, styles, etc. are converted neatly into an ASPX page (asking too much? ;) ).

Are there any existing tools, scripts, or utilities that do this?

+2  A: 

Here's what you do.

  1. Define what the legacy app is supposed to do. Write down the scenarios of getting pages, posting forms, navigating, etc.

  2. Write unit test-like scripts for the various scenarios.

  3. Use the Python HTTP client library to exercise the legacy app in your various scripts.

  4. If your scripts work, you (a) actually understand the legacy app, (b) can make it do the various things it's supposed to do, and (c) can reliably capture the HTML response pages.

  5. Update your scripts to capture the HTML responses.
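As a rough illustration of steps 2 through 5, here is a minimal sketch using `urllib.request` from the Python 3 standard library. The scenario names, paths, expected text, and the `capture_scenarios` helper are all hypothetical; the real scenarios have to come from your own write-up of what the legacy app does.

```python
import os
import urllib.request


def fetch(url):
    """Fetch a URL and return (status code, decoded body)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.status, response.read().decode(charset, errors="replace")


def capture_scenarios(base_url, scenarios, out_dir, fetch=fetch):
    """Run each scenario, check the response, and save the HTML to out_dir.

    scenarios maps a (hypothetical) scenario name to a tuple of
    (path on the legacy server, text expected somewhere in the page).
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for name, (path, expected_text) in scenarios.items():
        status, html = fetch(base_url.rstrip("/") + path)
        # Test-like checks: fail loudly if the legacy app misbehaves.
        assert status == 200, f"{name}: unexpected status {status}"
        assert expected_text in html, f"{name}: expected text not found"
        target = os.path.join(out_dir, name + ".html")
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(html)
        saved[name] = target
    return saved
```

Passing `fetch` in as a parameter lets you try the script against canned responses before pointing it at the real server. Form posts and cookies would need more than this (for example, an opener with a cookie jar), which is left out here.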

You have the pages. Now you can think about what you need for your ASPX pages.

  • Edit the HTML by hand to make it into ASPX.

  • Write something that uses Beautiful Soup to massage the HTML into a form suitable for ASPX. This might be some replacement of text or tags with <asp:... tags.

  • Create some other, more useful data structure out of the HTML -- one that reflects the structure and meaning of the pages, not just the HTML tags. Generate the ASPX pages from that more useful structure.
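To make the second option concrete, here is a small sketch of a tag replacement pass. It uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra packages, and it only handles one case: text inputs become `<asp:TextBox>` controls. The class name, the function name, and the mapping of `name` to `ID` and `value` to `Text` are assumptions for illustration, not a complete conversion.

```python
from html import escape
from html.parser import HTMLParser


class AspxRewriter(HTMLParser):
    """Re-emit HTML, replacing text inputs with asp:TextBox controls."""

    def __init__(self):
        # Keep entities such as &amp; as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and (attr_map.get("type") or "text").lower() == "text":
            # Assumed mapping: name -> ID, value -> Text.
            control_id = attr_map.get("name") or "TextBox1"
            value = attr_map.get("value")
            text_attr = f' Text="{escape(value)}"' if value else ""
            self.out.append(
                f'<asp:TextBox ID="{escape(control_id)}" runat="server"{text_attr} />'
            )
        else:
            # Every other tag is copied through exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>, <input ... />) go through the same path.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def html_to_aspx(html):
    """Return the HTML with text inputs rewritten as server controls."""
    rewriter = AspxRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

A real page would still need a page directive and a server-side form around the controls; this only shows the shape of the tag rewriting.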

S.Lott
A: 

Just found the HTML Agility Pack to be useful enough, since they know C# better than Python.

amazedsaint
A: 

I know this is an old question, but in a similar situation (50k+ legacy ASP pages that need to display in a .NET framework), I did the following.

  1. Created a rewrite engine (HttpModule) which catches all incoming requests and looks for anything that is from the old site.

  2. (In a separate class, to keep things organized!) Used WebClient or HttpWebRequest, etc., to open a connection to the old server and download the rendered HTML.

  3. Used the HTML Agility Pack (very slick) to extract the content that I'm interested in. In our case, this is always inside a div with the class "bdy".

  4. Threw this into a cache (a SQL table in this example).

Each hit checks the cache and either (a) retrieves the page and builds the cache entry, or (b) just gets the page from the cache.

  5. An ASPX page built specifically for displaying legacy content receives the rewritten request and displays the relevant content from the legacy page inside an ASP.NET Literal control.
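The extract-and-cache flow in the steps above can be sketched in Python. This is not the ASP.NET code itself (which would be an HttpModule using the HTML Agility Pack); it is a stand-in that uses `html.parser` for the extraction and an SQLite table for the cache. The table name `legacy_cache`, the `fetch` callable, and the function names are assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.depth = 0      # number of open <div>s, counted from the target div
        self.done = False   # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True   # closing tag of the target div itself
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def get_legacy_content(conn, url, fetch):
    """Return the cached fragment for url, fetching and caching it on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS legacy_cache (url TEXT PRIMARY KEY, content TEXT)"
    )
    row = conn.execute(
        "SELECT content FROM legacy_cache WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]                     # (b) cache hit
    extractor = DivExtractor("bdy")
    extractor.feed(fetch(url))            # (a) cache miss: ask the old server
    extractor.close()
    content = "".join(extractor.parts)
    conn.execute(
        "INSERT INTO legacy_cache (url, content) VALUES (?, ?)", (url, content)
    )
    conn.commit()
    return content
```

Here `fetch` is any function that takes a URL and returns the rendered HTML as a string, so the first call for a URL goes to the old server and later calls are served from the table.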

The cache is there for performance. The first request for a given page costs at least two hits (one from the browser to the new server, one from the new server to the old server), so I store cacheable data on the new server so that subsequent requests don't have to go back to the old server. We also cache images, CSS, scripts, etc.

It gets messy when you have to handle forms, cookies, etc., but these can all be stored in your cache and passed through to the old server with each request if necessary. I also store the content expiration dates and other headers that I get back from the legacy server, and make sure to pass those back to the browser when rendering the cached page. Just remember to take as content-agnostic an approach as possible. You're effectively building an in-page web proxy that lets IIS render the old ASP the way it wants, and then manipulating the output.
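For the stored expiration dates and headers, a sketch along these lines may help. It assumes the cached headers are kept as a plain dict of header name to value, and the `PASS_THROUGH` list of headers to replay is an arbitrary choice for illustration; pick whatever your legacy server actually sends.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed whitelist of headers worth replaying to the browser.
PASS_THROUGH = ("Content-Type", "Expires", "Cache-Control", "Last-Modified")


def is_fresh(cached_headers, now=None):
    """Return True if the cached entry's Expires header is still in the future."""
    expires = cached_headers.get("Expires")
    if not expires:
        return False    # no expiration stored: treat as stale and refetch
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False    # unparseable value such as "-1": treat as already expired
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now < expires_at


def headers_for_browser(cached_headers):
    """Pick the stored legacy headers to send back with the cached page."""
    return {name: cached_headers[name] for name in PASS_THROUGH if name in cached_headers}
```

A cache lookup would call `is_fresh` before serving an entry and go back to the old server when it returns False.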

Works very well: I have all of the old pages working seamlessly within our ASP.NET app. This saved us a solid year of development time that would have been required if we had had to touch every legacy ASP page.

Good luck!

David Lively