views:

165

answers:

6

My question is about verification more than anything else. What can be used to determine what is unique in an HTML document? (The document can have a degree of being dynamic.)

What can be used, or generated, to recognize that a page is the correct page to an accuracy of, say, 99%, taking into consideration that you can store a "fingerprint" of sorts of the page you are verifying?


For clarity, this is an added extra to encryption/HTTPS etc. The page can and will change with dynamic content according to specific users; so can the fingerprint, but a single fingerprint cannot match 100% of users because of the dynamic content. Therefore a hash cannot work here, at least not in a simplistic form.

A: 

You can't be even 1% sure unless you check the host's IP. The next thing is encryption; without it you can fall victim to ARP poisoning (on LAN networks only).

The key in HTTPS has to stay the same all the time.

If it changes, it means that either someone is cheating or the key got updated (keys have an expiration date).

oneat
A: 

The fingerprint of the page is the host name, port, and path. That is the only thing guaranteed to be unique across the web. I suppose you could also include the cache headers (Last-Modified) to see if it changed.

On top of this, if you hashed the HTML you could see whether the content changed independently of the Last-Modified header.
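Byron's scheme could be sketched in Python; the URL and HTML below are illustrative placeholders, and the function name is my own:

```python
# Sketch: fingerprint a page by its location (host, port, path)
# plus a hash of the HTML body, as Byron suggests.
import hashlib
from urllib.parse import urlparse

def fingerprint(url, html):
    parts = urlparse(url)
    # Location part of the fingerprint: host:port/path
    location = f"{parts.hostname}:{parts.port or 80}{parts.path}"
    # Content part: a digest of the exact HTML bytes
    body_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return location, body_hash

loc, digest = fingerprint("http://example.com:80/index.html", "<html>...</html>")
```

Comparing `digest` across fetches detects content changes even when Last-Modified is unreliable.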

Byron Whitlock
A: 

Assuming for a minute that you want to store a 'fingerprint' of an HTML page so you can recognise it later if it exactly matches, just use a simple hash digest of the HTML page.

Unless you clarify the question further, I can see no reason why it should matter that it is HTML or which browser it is in.

This won't tell you whether the page is at the same location, however. For that you would need to store additional details such as host/IP and path.

Dan McGrath
Clarified a bit in the question.
Kyle Rozendo
+2  A: 

A unique fingerprint of an HTML page is easy to calculate. Build a hash from the following:

  • protocol: http or https
  • URL: domain + URI
  • query string
  • the page's exact contents, down to the byte

Optionally some headers:

  • Server
  • Content-Type (this is important)
  • Content-Encoding (probably this too)
  • more ideas? Feel free to edit them in.

This assumes you're not POSTing any data to the pages.
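A minimal sketch of this combined fingerprint; the header names are the ones listed above, while the function name and separator scheme are my own assumptions:

```python
# Sketch: hash protocol, host, path, query string, selected
# headers, and the exact body bytes into one fingerprint.
import hashlib

def page_fingerprint(scheme, host, path, query, headers, body):
    h = hashlib.sha256()
    for part in (scheme, host, path, query,
                 headers.get("Server", ""),
                 headers.get("Content-Type", ""),
                 headers.get("Content-Encoding", "")):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so field boundaries can't collide
    h.update(body)  # the exact page contents, down to the byte
    return h.hexdigest()
```

Any change to the content, the location, or one of the chosen headers yields a different digest.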

Pekka
Sorry, I clarified during your post. The exact page contents are dynamic according to user prefs. Any further ideas?
Kyle Rozendo
If you can change those dynamic pages, the most reliable solution would be to mark the dynamic areas with something like `<dynamic>` `</dynamic>`. You would then MD5-compare only the areas outside those tags. With anything else, you would have to start building profiles from the contents and comparing their relative similarity - if it's security you're after, probably not a good approach.
Pekka
That's what I was thinking. I'm trying to find a reliable, robust and, most of all, secure way of comparing similarities.
Kyle Rozendo
I think comparing similarities has an inherent margin of error and will never be fully secure. How would you teach a program to tell apart "dynamic" changes (user name, greeting...) from legitimate differences in the code? It can't be done. I think if you want certainty, you need to tell your hashing program which parts of the page are dynamic and must be ignored when comparing. Feel free to prove me wrong, of course.
Pekka
Absolutely, I agree. I will continue trying to find an alternative to conditional hashing, as it's not really an option currently, but thanks a ton for the input.
Kyle Rozendo
Thanks for the answer, we've decided to go a different route as in the question, but you helped me to find it.
Kyle Rozendo
A: 

If you can get the text versions of the two pages, you could diff them and define a maximum acceptable range of differences between them.

There is a Unix utility called diff, and Win32 ports of the tool are floating around the net as well. Wikipedia has an article on diff: http://en.wikipedia.org/wiki/Diff.

The wiki article lists free file comparison tools and the "See also" section has links to other articles that discuss file comparison tools and delta encoding.

The "Levenshtein distance metric" may also be an interesting approach.
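As a sketch of the "maximum acceptable difference" idea, Python's standard difflib can score the similarity of two text versions; the 0.99 threshold below merely echoes the ~99% accuracy the question asks for and would need tuning:

```python
# Sketch: score how similar two page texts are and compare
# against an acceptance threshold.
import difflib

def similar_enough(old_text, new_text, threshold=0.99):
    # ratio() returns a similarity score in [0, 1];
    # 1.0 means the texts are identical.
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return ratio >= threshold
```

A stricter alternative is a true edit-distance metric such as Levenshtein, at a higher computational cost.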

There is a decent C# Difference engine on CodeProject. I can't post another link due to my low points but the article title is: "A Generic, Reusable Diff Algorithm in C#".

Brandan
A: 

Even if you had the exact hostname, port, and path, the content could still be different if an app server is serving the web pages or if the web server is inserting ad content.

If you could reliably identify the parts of the HTML that are dynamic (like ads or timestamps that keep updating), then I would normalize the data first: strip out all whitespace characters (spaces, tabs, newlines), then make a hash of that content.

I would not include the hostname-port-path in the hash because it wouldn't add anything to the "fingerprint". (That info is useful later, when you have to re-query the web server to compare the HTML.)
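The whitespace-stripping normalization could look like this in Python (a sketch; the function name is illustrative):

```python
# Sketch: remove all whitespace (spaces, tabs, newlines) before
# hashing, so formatting-only changes don't alter the fingerprint.
import hashlib
import re

def normalized_hash(html):
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Note this only absorbs whitespace churn; truly dynamic content (ads, timestamps) still has to be stripped out before hashing.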

Amy