tags:

views:

969

answers:

4

I'm trying to match SHA1's in generic text with a regular expression.

Ideally I want to avoid matching words.

It's safe to say that full SHA1's have a distinctive pattern (they're long and a consistent length) - so I can match these reliably - but what about abbreviated SHA1's?

Can I rely on the presence of numbers?

Looking at the SHA1's in my commit log - numbers always appear in the first 3 characters. But is this too short? How many characters of SHA1 do I need to consider before I can assume a number would have appeared?

This does not have to be 100% accurate - I just need to match an abbreviated SHA1 99% of the time.

+8  A: 

You can consider the SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given digit is not a number is 6/16, or 0.375. The probability that three SHA1 digits are all not numbers is 0.375 ** 3, or 0.0527 (5% ish). At six digits, this reduces again to 0.00278 (0.2%). At five digits, the probability of all letters drops below 1% (you said you wanted to match 99% of the time).

It's easy to craft a regular expression that always matches SHA1 values:

\b[0-9a-f]{5,40}\b

However, this may also match perfectly good five letter words, like "added" or "faded". In my /usr/share/dict/words file, there are several six letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed" which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what the likely words you will encounter actually are.

Greg Hewgill
Why the {5,40}, and not {40}?
sheepsimulator
@sheepsimulator: Presumably because it's common to abbreviate hashes - though the default abbreviation length in git output is 7, so you could pretty safely go to `{7,40}` and have far fewer false positives. @Greg Hewgill: my /usr/share/dict words also contains "acceded", "defaced", "effaced", and "facaded" - and the first three are common, at least relative to deedeed!
Jefromi
@Jefromi: weird, my `words` file contains "undefaced" but not "defaced"! On both OS X and FreeBSD, too.
Greg Hewgill
+10  A: 

What exactly are you trying to do? You shouldn't need to parse anything git outputs with heuristics -- you can always request exactly the data you need.

If you want to match a full hex representation of an SHA1 sum, try:

/\b([a-f0-9]{40})\b/

That is, a word consisting of 40 characters which are either digits or the letters a through f.

If you only have a few characters and don't know where they are, then you are pretty much out of luck. Is "e78fd98" an abbreviated commit ID? Maybe, but what about "1234567"? Is that a commit ID? A problem ticket number? A number that makes a test fail?

Without context, you can't really know what the data means.

To answer your direct question, there is no property of SHA1 that would make the first three characters (in hex form) digits. You are just lucky, or perhaps unlucky, depending on how you look at it.

jrockway
This is what you want, exactly 40 characters of hex digits is always going to match rather than the current accepted answer that may not work always.
Otto
+3  A: 

I'm going to assume you want to match against hexadecimal printed representation of a SHA1, and not against the equivalent 20 raw bytes. Furthermore, I'm going to assume that the SHA1's in question use only lower-case letters to represent hex digits. You'll have to adjust the regular expression if your requirements differ.

grep -o -E -e "[0-9a-f]{40}"

Will match such a SHA1. You'll need to translate the above regular expression from egrep's dialect to whatever tool you happen to be using. Since the match must be exactly 40 characters long I don't think you're in danger of accidentally matching words. I don't know of any 40-character words that consist only of the letters a through f.

edit:

Better yet: use http://stackoverflow.com/questions/468370/a-regex-to-match-a-sha1#468397 as his solution includes checking for word boundaries at both ends. I overlooked that above.

bendin
+1  A: 

If you have access to the repo, you can use git cat-file -e to check for sure that it represents an object in the repo. This is very fast, too. If you further want to restrict this to just commits and tags, you can use git cat-file -t to find out the type of the object.

This could be used, for example, to search human-generated text for mentions of git commits and generate hyperlinks to a git web interface.

Neil Mayhew