I'm working on building intelligence around link propagation, and because I need to deal with many short URL services where a reverse-lookup from an exact URL address is required, I need to be able to resolve multiple approximate versions of the same URL.
An example would be a URL like http://www.example.com?ref=affil&hl=en&ct=0
Of course, changing GET params in certain circumstances can refer to a completely different page, especially if the GET params in question refer to a profile or content ID.
But a quick parse of the page would quickly determine how similar the pages were to each other. Using a bit of machine learning, it could quickly become clear which GET params don't effect the content of the pages returned for a given site.
I'm assuming a service to send a URL and get a list of very similar URLs could only be offered by the likes of Google or Yahoo (or Twitter), but they don't seem to offer this feature, and I haven't found any other services that do.
If you know of any services that do cluster together groups of almost identical URLs in the aforementioned way, please let me know.
My bounty is a hug.