views:

547

answers:

12

Possible Duplicates:
Is it possible to decrypt md5 hashes?
Is it possible to reverse a sha1?

i asked this question: http://stackoverflow.com/questions/3143693/working-with-huge-spreadsheet

and got a great answer and i followed the advice. i used this: http://splinter.com.au/blog/?p=86

and i hashed about 300,000 different elements in a column in an excel spreadsheet

since you can do:

=SHA1HASH('The quick brown fox jumps over the lazy dog')

And you'd get back:

2fd4e1c67a2d28fced849ee1bb76e7391b93eb12

couldnt you go backwards as well?

im saying if it encrypts the same text the same way every single time, what is the point?

if you do know the hash algorithm, is it possible to go backwards?

can you please explain to me very simply how does hashing work? how can you convert a 20gb to a 40 character hash? does it take a long time to hash a 20gb hardrive?

+21  A: 

General answer

A cryptographic hash function cannot be easily reversed. This is why it is also sometimes called a one-way function. There is no going back.

You should also be careful about calling this 'decryption'. Hashing is not the same as encryption. The set of possible hash value is typically smaller than set of possible inputs so multiple inputs map to the same output.

For any hash function given the output you can't know which of the many inputs was used to generate this particular output.

For cryptographic hashes like SHA1 it is very difficult to even find one input that produces that output.

The simplest way to reverse a cryptographic hash is to guess the input and hash it to see if it gives the right output. If you are wrong, guess again. Another approach is to use rainbow tables.

Regarding using hashing to encrypt SSNs

With your use case of SSNs an attack is feasible due to the relatively small number of possible input values. If you are worried about people getting access to SSNs then it might be best to not store or use the SSN at all in your application, and in particular do not use them as an identifier. Instead you could find or create another identifier, for example an email address, a login name, a GUID or just an incrementing number. It can be tempting to use the SSN as it is already there and at first glance appears to be a unique unchanging identifier, but in practice using it just causes problems. If you absolutely need to store it for some reason then use strong non-deterministic encryption with a secret key and make sure you keep that key safe.

Mark Byers
if you do know the hash algorithm, is it possible to go backwards?
i am a girl
No, not possible with a hash algorithm.
blockhead
+1, once you go black there is no going back
VoodooChild
See my answer below. Yes, you can *easily* get a SSN number back from an SHA1 hash IF YOU DONT SALT IT! Just generate the SHA1 hash of all SSNs (about 10 million) then directly match the unsalted hash to the hash in the spreadsheet. What would it take, about 2 minutes to generate the whole list of 10 million SSN hashes???
Zak
sorry but can you please explain to me how to use rainbowtables?
i am a girl
@Zak that's assuming you know that the target here is a SSN
blockhead
@blockhead: No, it is possible to find a set of preimages. With simple hash functions e.g. `f(x) = x % 7`, if we knew `f(x) == 4`, we immediately can deduce `x in [4, 11, 18, ...]`, and the correct input is in one of these. This is the same for cryptographic hash functions, except it is designed to be extremely hard to compute the inverse.
KennyTM
@Blockhead: it's very risky business relying on security through obscurity. It's not a very big leap for someone to look at a set of data, remember that SSN was collected along with all the other information, then wonder, "oh I wonder if that is just the SSN number hashed". Same thing with Credit card numbers.
Zak
@user Rainbow tables are a bit too complex to give a complete tutorial on in a SO answer (but the Internet has plenty of info on them, so do a search). The general idea is that you pre-compute hashes for a whole bunch of candidate plaintexts and then use a clever means of doing a reverse lookup to find which plaintext corresponds to a given hash. This technique, however, is easily defeated by "salting" the plaintexts (modifying each of them in a unique yet deterministic way, before hashing).
Tyler McHenry
@Zak Salting alone will not suffice for a small domain such as SSNs. It'll make life more difficult by requiring the attacker to brute-force each one separately, but the domain is still small enough a brute-force attack is practical.
Nick Johnson
nick, then what is your suggestion?
i am a girl
@nick - see my answer below... with a big enough salt, and keeping your salt secret, you can first of all avoid a single pass mass crack off all SSN's, and make it a large number of times more difficult to brute force....
Zak
@user29823498750932874509823745: If you make a function that maps a SSN to a unique value deterministically then anyone who knows this function can call it for all possible SSN values and see which values exist in your database. The only two ways I can think of to prevent someone from doing this are: a) prevent people from knowing the function. But if you need to include the function in your Excel file then this won't work. b) Make your function so slow that brute force is infeasible. See 'key strengthening': en.wikipedia.org/wiki/Key_strengthening Neither option is pleasing...
Mark Byers
@Zak If it's secret, it's not a salt - it's a poorly implemented HMAC. @user29823498750932874509823745 If you have the capability to store a secret, use an HMAC, which will prevent anyone finding a preimage without the secret key. Better yet, don't give people access to this spreadsheet if they shouldn't know the SSN.
Nick Johnson
+2  A: 

A good hash is one way, meaning you shouldn't be able to go backwards. The point is to provide a key of a string without revealing the string. For instance, this is a good way to match passwords without storing a password. Instead, you store a hash and compare the resultant hash of inputs.

SB
if you do know the hash algorithm, is it possible to go backwards?
i am a girl
Possibly, but many are designed to make this very difficult. Some algorithms generate the same hash for different inputs, and many, if not all, rely on the length of the input. You need to know a lot about the data being hashed to go back to it.
SB
This is the difference between a "normal" hash function, like you'd use in a python map or java.util.HashMap and cryptographic hash function like SHA1. Normal hash functions are mostly optimized for spreading out bits in the input (coverage over the whole range of hash output) and efficiency. A cryptographic hash also adds the non-reversible constraint, with efficiency as a second-order constraint.
Paul Rubel
+5  A: 

Encrypting and hashing are two different things.

Hashing simply digests the string into a number. Encryption preserves the contents of the string so that it can later be decrypted. There is no method from getting the original string from a hash. The contents are just not there.

BC
+3  A: 

No. The point of a hash is that it's one way encryption (as other's have pointed out, its not really "encryption", but stay with me here). The downside is the, in theory, there is a small possibilities of "collisions", when two or more string return the same hash. But it's usually worth this downside.

blockhead
+16  A: 

The whole point of a cryptographic hash is that you can't decrypt it and that it does encrypt the same way every time.

A very common use case for cryptographic hashes is password validation. Imagine I have the password "mypass123", and the hash is "aef8976ea17371bbcd". Then a program or website wishing to validate my password can store the hash "aef8976ea17371bbcd" in their database, instead of the password, and every time I want to log in, the site or program re-hashes my password and makes sure that the hashes match. This allows the site or program to avoid storing my actual password, and so protects my password (in case it's a password I use elsewhere) in the case that the data is stolen or otherwise compromised -- a hacker would not be able to go backwards from the hash to the password.

Another common use of cryptographic hashes is integrity checking. Suppose a given file (e.g. an image of a Linux distribution CD) has a known, publicly available cryptographic hash. If you have a file which purports to be the same thing, you can hash it yourself and see if the hashes match. Here, the fact that it hashes the same way every time allows you to independently validate it, and the fact that it is cryptographically secure means that no one can feasibly create a different, fake file (e.g. with a trojan in it) that has the same hash.

Keep in mind the very important distinction between hashing and encryption, though: hashing loses information. This is why you can't go backwards (decrypt) the hash. You can hash a 20 GiB file and end up with a 40-some character hash. Obviously, this has lost a lot of information in the process. How could you possibly "decrypt" 40-some characters into 20GiB? There's no such thing as compression that works that well! But this is also an advantage, because in order to check the integrity of a 20 GiB file, you only have to distribute a 40-some character hash.

Because information is lost, many files will have the same hash, but the key feature of a cryptographic hash (which is what you're talking about) is that despite the fact that information is lost, it is computationally infeasible to start with a file and construct a second, slightly different file that has the same hash. Any other file with the same hash would be radically different, and not easily mistakable for the original file.

Tyler McHenry
this is a great answer! thank you. if you do know the hash algorithm, is it possible to go backwards?
i am a girl
No, you can't go backwards, that's the whole point. The reason is because hashing loses information. I edited that into my answer.
Tyler McHenry
+1 for "hashing loses information", that's why you can't go back!
FrustratedWithFormsDesigner
@user29823498750932874509823745- You can't "go backwards" in the sense of inverting the mathematical formula, applying your hash, and re-generating your original input. You can, however, create a table of all possible inputs, run them through the hash function, and store the input + hash result in a table. Use this table to reverse-lookup what possible inputs could have generated this output. As Tyler mentioned, several different inputs will give you the same output. Even if you had the time/storage to generate such a table, you would still have a number of possible inputs to choose from.
bta
The user is targeting SSN numbers though. Everyone here should be reminding him to salt his hashes..
Zak
+2  A: 

No. At least not easily.

SHA1 is still considered cryptographically secure. A hash algorithm is secure if it is easy to compute one way, but very hard (exhaustive search) to compute the other way. It is true that every time you encrypt a specific phrase, it will result in the same hash, but there are infinite phrases that will also hash to that same value. The security comes from not knowing what those other phrases are until you run them all through the SHA1 function.

Adam Shiemke
+6  A: 

No, you cannot go backwards because not enough information is preserved by the hashing function.

You can think of it as the hash function mapping the original text to a single, huge, number. This same number may also be mapped to other texts as well, although a good hash function will have few collisions:

alt text

If the original message were encrypted then yes, you could go back.

Justin Ethier
+1  A: 

That it encrypts the same text the same way every time is the whole point of a hash. It's a feature.

If I have a database of hashes of passwords, then I can check that you entered the correct password by hashing it and seeing if the hash matches what I have in the database for you. But if somebody stole my database of hashes, they won't be able to figure out what your password is unless they accidentally stumble upon some plain text that hashes to that value.

Paul Tomblin
+1  A: 

In cryptography it is called a digest. A cryptographically strong digest doesn't allow to get source text based on the digest value without some additional knowledge. A digest value is the same for the same text, so you can calculate digest of the text and compare it with a published digest. A popular application is the password verification, so you can save digest instead of the password. This is of course prone to a dictionary attack which you already explored, and that is why it is strongly recommended to not use dictionary words for passwords.

Eugene Kuleshov
+2  A: 

No, you cannot go back. Count how many different hashes you can have. Now count how many different strings you can have. The first is finite, the second is infinite. There are lots of (infinitely many, to be precise) strings which have the same SHA1 sum. The point is, however, it's very hard to find two texts, which have the same hash.

You can think of hashing as shortening something. For example take a hashing function which sums all the ASCII codes of the letters in a string. You can't tell what was before hashing, just knowing the sum of ASCII codes of the letters. It is similar with SHA1, but more complicated.

The point of hashing is not to encrypt something. The point of hashing is to shorten something, so that checking whether two things are the same takes less time. Now how can you tell that two things are indeed the same if you know that lots of things have the same hash? Well, you can't. You just assume that it's so rare that it won't happen.

But hashing is not just about checking, as checking equality using hashes is usually used just for confirmation/validation and it is not deterministic. If you see that hashes are the same, then basing on the parameters of a particular hashing function, you can estimate the probability that the hashed objects are indeed the same.

And that's why the fact that a hashing function always yields the same results for the same objects is the most important feature of a hashing function. It lets you validate and compare objects.

Michał Trybus
+1  A: 

The trick is finding what function does the reverse.

f(2fd4e1c67a2d28fced849ee1bb76e7391b93eb12) == 'The quick brown fox jumps over the lazy dog'

This is not so easy.

Bill the Lizard
+5  A: 

I see your point based on the fact that you are trying to hide Social security numbers. If someone knows you are using an SHA1HASH on the SSN to create a unique identifier, then can just generate a quick list of all SSN numbers, SHA1HASH them, then compare to automatically have the SSN of the person in the record. Even worse, they can pregenerate all these in a hash lookup table, and have a key of 1 hash for every SSN. This is called a hash lookup table, and more complex forms are called rainbow tables.

This is why a second feature of hashing was invented. It is called salting. Salting is basically this; you create a salt, then modify your data using the salt. For instance, say you had the SSN 123-45-6789 . You could salt it with the string "MOONBEAM". Your new string for hashing is "123-45-6789MOONBEAM"

Now, even if someone knows that you are hashing the SSN to generate your unique ID, they still don't know the salt you will be using, and so are unable to derive the original SSN by pre-hashing a list of all SSNs and comparing to your ID. You however, can always take the user's SSN, use the salt, and rehash the SSN+SALT to see if the user SSN matches up with their ID.

Finally, if you use just 1 salt for everything, and keep it secret, instead of being able to see the salt, and generate the corresponding SSN by running SSN increments + salt 100 million times and picking the match, they have to do a lot more work to retrieve SSN. This is because the 100 million SSN numbers have a relatively low amount of entropy. (10^9 combinations). By adding your salt and keeping it secret, instead of just running

SHA1HASH(111-11-1111) -> check hash match
SHA1HASH(111-11-1112) -> check hash match
SHA1HASH(111-11-1113) -> check hash match

They would have to run

SHA1HASH(111-11-1111a) -> check hash match
SHA1HASH(111-11-1111b) -> check hash match
SHA1HASH(111-11-1111c) -> check hash match
...
SHA1HASH(111-11-1111azdfg) -> check hash match
SHA1HASH(111-11-1111azdfh) -> check hash match
....
SHA1HASH(111-11-1111zzzzzzzzzzzzzzzz) -> check hash match
SHA1HASH(111-11-1112a) -> check hash match
SHA1HASH(111-11-1112b) -> check hash match

.. and so on until they finally get to

SHA1HASH(123-45-6789MOONBEAM) -> check hash match

at which point they finally did manage to crack the SSN + SALT

They don't even know how many characters long your salt is So that is 10^(number of characters of your salt) times more work for them to do just to get 1 SSN, let alone get the whole table.

Zak
wow this is a great point!!
i am a girl
A salt is _not_ a secret! Salts should be appended to the hashed value, and serve to prevent precomputation for dictionary attacks. What you are referring to is more like an HMAC - in which case you should use a proper HMAC rather than this ad-hoc scheme.
Nick Johnson
A salt *is not* a secret, but a salt *can be* secret. I'm actually trying to tailor a solution to this guy's particular problem. Take the SSN number. If you know it is an SSN number, and you see the salt right there in the spread sheet, you just run through the 100,000,000 SHA1 hashes for all SSNs using that salt and bingo, you have that guys SSN. Keep the salt secret and you won't have that problem.
Zak
I can see tht my answer makes it seem like a salt is a secret. Updating to correct that.
Zak
A possible attack on storing H(SSN || salt): If the attacker registers themselves (or someone else, or a fictitious SSN) into the system they can probably find the row corresponding to their registration and the hash of their SSN. Since they already know their own SSN so they can use this informatin to brute force the salt (MOONBEAM). Once they know the salt they can get the other SSNs relatively easily.
Mark Byers
@Zak A secret salt is just a poorly implemented HMAC. Salts are always stored alongside the hash. @Mark Salts, used properly, are generated randomly for each stored value - the salt on one entry bears no relationship to the salt on another.
Nick Johnson
@Nick Johnson: If you have to store the salts per row associated with the MAC then you could just make sure that the salts are unique and store the salt in the excel file without any concatenating or hashing at all. In fact, no need to use a salt, just use a number series 1,2,3...
Mark Byers
@Mark The point of a salt is to prevent precomputation. If it's predictable - a sequence, for example - it's possible for someone to precalculate a table of possible hashes, and save themselves a lot of work.
Nick Johnson
@Nick Johnson: Yes but you're missing my point. If you have to store a separate secret and secure mapping of SSNs to salts you might as well just use the "salts" as your unique random identifier in the excel file and forget the hash step altogether. In fact, don't even call them salts, just call them random numbers. The point is to make a deterministic mapping. If you have a secret database of SSN->some number, that is already a deterministic mapping and its provably irreversible *forever* regardless of computing power. The hash of the SSN is simply not needed in this scenario.
Mark Byers
@Nick Johnson: Unless you are talking about *non-secret* salts as they are traditionally used, in which case we are talking about two completely different things... maybe I didn't make that clear enough? I am replying to the answer and in particular to this part: "they still don't know the salt you will be using". In my opinion this answer is flawed and I was demonstrating an attack on it. If you want to propose a different scheme I suggest you post it as a separate answer so I can comment on your scheme there and not here and so that the two different schemes are not confused.
Mark Byers
@Nick Johnson: Another problem with the different salt per row that foils both secret and non-secret salts is how will you know which salt to use? You need another way to identify the row and that will have to not be based on the SSN which brings me back to the point I made in my answer (last paragraph). Basically - don't use SSNs as an identifier, whether they are hashed or not.
Mark Byers
@Mark If storing a meaningless identifier for each user were sufficient, then I doubt the OP would be asking about hashing SSNs in the first place. Presumably they have a compelling reason why they have to store a hash of the SSN. As I already said, a "secret salt" is not a salt - it's a poorly implemented HMAC. As to knowing the salt - the salt becomes part of the hashed value - HASH(value + salt) + salt. This is pretty basic crypto.
Nick Johnson
@Mark - a hashed SSN is not an SSN .. that's the whole point. To really design a good solution, OP should more clearly identify what the problem is. I created my answer based on an implied need that given an SSN, someone should be able to look up the user, but shouldn't store the SSN, and shouldn't be able to reverse engineer the SSN easily if you are an end user looking at the spreadsheet. My whole solution is predicated on the idea that there can even be someone with "admin access" and who knows the secret key for hashing. Saying don't store hahsed SSNs is like saying don't store hashed pws.
Zak
@Zak: Maybe I'm wrong and this is a good solution... I guess as you say it depends on the interpretation of the question. It's got quite a few upvotes and the accept so quite a lot of people here think it's a good solution. I'm still not totally convinced that it's a good idea but I don't think discussing it further is going to change my mind or yours, so I think I will take a look at some other questions now. Congratulations on getting the accept.
Mark Byers