views:

2879

answers:

15

I suppose, Gravatar generates the image from email address. If so, the reverse should be possible. How difficult would be to get the email associated with the image? Isnt it a potential spam threat?

What are your thoughts? (other than calling me paranoid ;) i did that already)

+6  A: 

Gravatar uses an md5 hash. Now, typically you'd run the risk of a brute force dictionary attack, but emails tend to content pretty abnormal words...for example, I'd be surprised if a dictionary of words contained "stackoverflow", and then to add the permutations of ".com", ".net" and ".org" plus the many possibilities of usernames (which may be an alias rather an a real name), and I think it's reasonably safe.

That said, they probably could have used something other than MD5. MD5 is a hashing algorithm meant for speed. This is a common mistake for people who hash passwords. You should pick something intentionally slow, so that brute force attacks are even less likely.

Karl Seguin
+14  A: 

@Karl: You are right. I tried a reverse MD5 lookup on a few of the gravatars on stackoverflow and didn't get any results so I wrongly assumed they were salting.

I did some checking (instead off faulty reasoning) and the gravatar FAQ explains their justification for using plain md5 hashes:

"MD5 is plenty good for obfuscating the email address of users across the wire. if you're thinking of rainbow tables, those are all geared at passwords (which are generally shorter, and less globally different from one another) and not email addresses, furthermore they are geared at generating anything that matches the hash, NOT the original data being hashed. If you are thinking about being able to reproduce a collision, you still don't necessarily get the actual email address being hashed from the data generated to create the collision. In either case the work required to both construct and operate such a monstrocity would be prohibitively costly. If we left your password laying around in the open as a plain md5 hash someone might be able to find some data (not necessarily your password) which they could use to log in as you... Leaving your email address out as an md5 hash, however, is not going to cause a violent upsurge in the number of fake rolex watch emails that you get. Lets face it there are far more lucrative, easier, ways of getting email address. I hope this helps ease your mind."

Peter Coulton
+3  A: 

@Peter: I'm pretty sure they aren't salting. Since you can drop a gravatar link on your website, the salt would need to be made public anyways.

Karl Seguin
A: 

peter, you could mark my answer up ;)

Karl Seguin
A: 

Hash Algorithms are one way, and it's absolutely impossible to get the e-Mail address back, because there is an unlimited amount of input data that will generate the same hash, which in terms means that you will get an unlimited number of output data from any hash.

Michael Stum
A: 

@Michael - It's not true that it's "absolutely impossible". It is difficult, it may take more time than is available, but it is not "absolutely impossible". For instance, see http://en.wikipedia.org/wiki/Rainbow_table

ICR
+34  A: 

EDIT: This answer was written over a year ago. See my addendum at the bottom.

@ICR: It IS absolutely impossible, no "but".

A Rainbow Table helps to get a match for a given Hash, but you will NEVER be able to get the exact input (well, there is indeed a chance of 1 to unlimited to get it, but you would not know whether or not that was the exact input). If you could get the exact Input from a 128-Bit Hash, you would just have invented the best compression algorhithm in the world.

There are unlimited possibilities that will all produce the exact same hash.

Let me explain a bit more in detail. Image a Hash algorithm where "foo" and "bar" will result in the same hash, let's say "HAsh1!".

Now, you get the Hash - HAsh1!. How do you know if the input data was "foo" or "bar"? You can't, and you will never ever be able to.

Now, what is the problem and what are rainbow tables used for?

Imagine you hack my user database, and you check the Users-Table. My Login is usually stored in cleartext, so you know to log in with "mstum". But my password is hashed. I used "trez" with incidentially also hashes to HAsh1!.

You check my hashed password, which us HAsh1!, but you do not know that my plaintext password is trez. But the point: You don't NEED to know that. You grab your Rainbow Table for HAsh1! and see "foo".

So you go to the page you just got the user data for, and you log in as "mstum" with "foo". Despite me having used "trez" as password, you will successfully log in with "foo" because the hash is exactly the same.

This is called a colision: 2 Input Values having the same hash. On MD5, there are only 128 Bit, whereas on SHA-512 there are 512 Bit, but you WILL still have infinite inputs which will result in the same hash.

So: Rainbow tables solve one of the two problems hackers want to solve - it allows them to authenticate as any user they got the hash for.

The second problem that hackers have is a bit off-topic: How do you give the user some file (for example, a manipulated Setup File that includes Malware or that is a modified version of a tool that does bad stuff) that will successfully have the same Hash? When you check some popular download pages or open source sites, they will have "Setup.exe, MD5 Hash: f387......". While there WILL BE infinite possibilities to produce something that hashes to f387..., the hacker needs to modify the Setup.exe in a way that it contains his Malware AND hashes to f387.... This hard, but possible. On MD5, this is now possible with a reasonable timeframe, which is why MD5 should not be used anymore unless a standard requires it or if security is not a concern.

For Gravatar, Security does not need to be a concern, so MD5 is fine and they want something that is fast since it's not security-related.

Edit: Here is a real world example, the one that was used to show collisions in MD5:

d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89 
55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b 
d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70

and

d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89 
55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b 
d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70 

both hash to 79054025255fb1a26e4bc422aef54eb4.

So, given the hash - how do you know which one of the two was used?

Addendum: I still stand by my original point: Decoding an e-Mail address from a Hash is impossible. You can brute force or use a rainbow table and you might get lucky to get an e-Mail address. As there are infinite inputs to a hash, there are consequently also infinite inputs that look like an e-mail address, but the chances of getting a valid one are really insignificant. So decoding from Hash is not an option. But there are some options that I haven't originally considered, outlined for example here. Basically, you can guess. I post with my real name, so you can try all sorts of combinations of michaelstum@, mstum@, michael.stum@ with all the big providers and run them through a hash function. This does not work in all cases, but as many people use GMail, Yahoo or Hotmail, the chance to hit is not that low. Would salting solve that? Maybe. Maybe not. You would have to figure out the salt, which almost certainly requires brute-force approaches. At least you have a known plaintext (your own e-Mail address), so you can run this brute force on your machine. Having a different salt per-user might be more secure, but I don't know if that could cause a significant risk of creating collisions.

Thinking about this, I think that salting with a long, random salt would indeed improve security as it defeats the one viable approach of guessing. I'd rather recommend using disposable/different e-Mail addresses as you never know what can happen (website databases get hacked every day, and those e-Mail addresses are certainly sold to spammers), but I realize that is impractical for people without a domain and a catchall-mailbox.

Michael Stum
Very nice response, but I almost feel like it's a little too detailed for the question asked.Maybe one of you guru-level guys can edit the QUESTION so that it explicitly asks something like "Given some hash, is it possible to work backwards and obtain the original input?"
Outlaw Programmer
Could you blod the differences between the 2 entries? I first thought you did a bad copy paste :)
Luk
I would love to, but Markup seems to be bugged, it leaves in the two ** instead of bolding them. See this link instead: http://www.mscs.dal.ca/~selinger/md5collision/
Michael Stum
Likelyhood of MD5 collision is incredibly small. Collision of two short strings that are valid e-mail addresses might just be absolutely impossible (number of plausible, valid e-mail addresses is finite). And even if collision happens, it doesn't matter - spammer will spam whatever he finds.
porneL
Even without a domain and catch-all mailbox, some providers deliver *user+tag@domain* to *user@domain* (notably gmail).
Roger Pate
jwanagel
+4  A: 

@Michael Stum, it is possible, because only one of the values to generate the hash would be a valid email address, so you could avoid collisions.

Xenph Yan
A: 

Hmm.. I don't think so. As there are infinite possible inputs, I believe there are also infinite valid e-Mail addresses in there. Since we are talking about a huge (as in "the size of a small supermassive black hold") result set here that would take some time (as in "years") to calculate, but you will never know if you hit the correct e-Mail address.

The chance of actually creating a collision is really small (1 : 2^128 if i'm not mistaken), but that does not make it much easier trying to find an e-Mail address for a given hash.

Michael Stum
+8  A: 

MicroID is basically a hash calculated using a user's profile page URL and registered email address, producing a token that makes the email address vulnerable to dictionary attacks. To see how easy it was to crack these tokens, I conducted a small study, choosing 56,775 random Digg users, and cracking the email addresses of 14,294 of them (25%) using just their MicroID, username, and a list of popular email domains.

http://yro.slashdot.org/article.pl?sid=08/08/28/2241238

Not sure if this relates to the Gravatar hash, exactly, but interesting.

Jeff Atwood
+5  A: 

The hash collision answer is a nice one, but we're forgetting something here: we know what to expect!. We know that we're looking for realistic email addresses.

This not only gives us a pattern to match all potential results against, but also simplifies the attack pattern for brute force attacks.

Given that we don't require everyones address, but just want to get lots of addresses quickly, we could go even further: 80% of all email accounts are supplied by a major provider. Just trying @gmail.com, @hotmail.com, ... with short word or dictionary attacks are feasable, and building a good database for looking up many addresses also becomes much easier.

wvdschel
+1  A: 

If you had enough time and incentive to crack a gravatar email address then I guess using wvdschel's method to building a hash table based on known domains (@gmail.com, @hotmail.com) would be the fastest approach, but than wouldn't you have a table with every possible gmail/ hotmail email address? from [email protected] to [email protected], so from a spammer point of view, why at that point would he/she need to crack the gravatar? as they could just spam every one under the sun, after all who cares if you get a few thousand bounce back when your a prince trying to get your gold out?

and on a side note, looking a cross section of people on stackoverflow (a site that uses gravatars), linking to your blog's, I can easily find email addresses for about 70ish% of users with a web site listed.

Re0sless
I believe wvdschel was suggesting you combine the methods - get the possible strings that hash to the MD5 and weed out all the ones that don't look like 'realistic' email addresses.
Bobby Jack
A: 

@Re0sless You would indeed have a database of all email adresses, but 90% of those would bounce. So mailing to all those takes time (given the speed limits of real world networks), and thus reduces effectiveness of your spamrol.

Following through to the sites is indeed a more effecient way to get around this, but defeats the purpose of this question. Also, it requires a notion of context and site structure.

Besides, only a fool would show his own email in an easily bot-readable form on their site. I know, because I used to. Spambots find you in no time, just crawling the web with a simple email regex.

wvdschel
+22  A: 

Michael Stum is way off here, but he has a big green check mark that means his answer will forever appear as the "correct" one despite the mistakes...

Let's handle the second part of his answer first, which is a little off-topic. He's just completely wrong here. Given a file from an "open source site" and the MD5sum of that file you can't make another different file with the same hash. There may some day be a breakthrough that makes this possible, but that day is not here and may be years or decades away.

Why does Michael think this is possible? A simple mistake. He's read that it's possible to create two different files with the same MD5sum by forcing a certain type of collision, this result was reported in 2004 and has been improved since. But, this is not the same problem. In cryptography the type of attack Michael claimed was possible is called a pre-image attack, and for MD5 there is as yet no practical pre-image attack, where as what was actually found is merely a collision, the easiest type of attack to find against a hash, and no doubt a reason not to choose MD5 for new applications, but not, and this is worth emphasising, NOT a reason not to trust MD5sums provided for software downloads, nor a reason (on its own) to avoid MD5 in password hashes.

OK, now the more relevant earlier part of Michael's answer. Here's the trouble: Spammers and similar Attackers don't care that one person in a billion has an email address that looks like an exercise in obfuscated Perl - they want the bulk of addresses. Given a list of 5 million Gravatar hashes, at least 4 million are hashes of fairly ordinary looking email addresses like "[email protected]". These are 128-bit hashes, so any arbitrary string has a one in 2^128 chance of hashing to the same value, but that's a very big number and there simply aren't 2^128 plausible stereotypical email addresses. So long as we only want a reasonable factor like say 50% chance of reversing, we can brute force this problem. Try the ten million most likely firstname / lastname combinations, with up to a million domain names taken from registrar's lists, and you have 10 billion values to try. For a one shot attack you just do it brute-force, 10 billion MD5 calculations is childs play. But if you were a spammer and had access to a crawler looking for fresh Gravatar URIs every day, you'd build a rainbow table so that you can spread the compute time over a potentially unlimited supply of hashes.

Now, is there a way around this? Well, yes and no. There's no way to fix this without changing something. But the minimal change is fairly painless. The way to defeat rainbow tables is to add salt and waste CPU cycles, just as with passwords. You can add salt by asking everyone to pick a favourite word, and type that in along with their email address when generating a Gravatar or an account that needs a Gravatar. It doesn't matter what word they pick, so long as its always the same they get the same Gravatar. Now an attacker must guess not only a plausible email address, but also a random word. That makes their chances of success orders of magnitude smaller. Meanwhile you also switch from MD5 (designed to be fast) to a password hash (designed to be slow). When calculating the hash you now work a little harder, but the attacker must work proportionally harder too. If your CPU takes 0.5 seconds to generate your gravatar when creating an account, the attacker must waste 0.5 seconds per guess, and that will soon exhaust him.

tialaramex
First off, I completely agree on the fact that an arbitrary MD5 collision is _possibly_ still far away, but in the past, computing power has grown a bit faster than expected.
Michael Stum
But I still do not agree on the practicability of Gravatar Hashes: To brute force billions of [email protected], I don't need Hashes, I can just grab a dictionary and just send billions of mails. Salting or using another Hash is not going to change this, it just shifts the problem a little bit.
Michael Stum
However, I'm not saying that you're wrong (hence the vote-up), I just don't agree on how practical the approach of "decoding" hashes is when you can just send billions of mail with virtually no effort through a bot net (that can also be used to "crack" a salted hash) or through the RBN etc.
Michael Stum
Arbitrary collision is here now. Pre-image attacks are not. One does not become the other by just throwing a bit more CPU at the problem, the only thing that will make a significant difference is further cryptographic breakthroughs.
tialaramex
Still, it's easier to just send an email to a million [email protected] addresses than harvesting gravatars and breaking them just to know if they are valid. They just discard bounces and be happy ever after.
Vinko Vrsalovic
@Vinko Vrsalovic: spammers would much rather send to a million valid email addresses in that same time, so being able to filter a list of randomly generated addresses for ones for which there's a gravatar link is a big win for them.
ysth
+6  A: 

10% of stackoverflow users e-mails were hacked due to Gravatar weakness: http://tech.slashdot.org/story/09/12/15/2352218/Gravatars-Can-Leak-Users-Email-Addresses?art%5Fpos=1&art%5Fpos=1

jwanagel