tags:

views:

1129

answers:

4

I've come up with:

re.findall("([a-fA-F\d]*)", data)

but it's not very fool proof, is there a better way to grab all MD5-hash codes?

+2  A: 

How about "([a-fA-F\d]{32})" which requires it to be 32 characters long?

Corey
+11  A: 

Well, since md5 is just a string of 32 hex digits, about all you could add to your expression is a check for "32 digits", perhaps something like this?

re.findall(r"([a-fA-F\d]{32})", data)
Marc Novakowski
+5  A: 

When using regular expressions in Python, you should almost always use the raw string syntax r"...":

re.findall(r"([a-fA-F\d]{32})", data)

This will ensure that the backslash in the string is not interpreted by the normal Python escaping, but is instead passed through to the re.findall function so it can see the \d verbatim. In this case you are lucky that \d is not interpreted by the Python escaping, but something like \b (which has completely different meanings in Python escaping and in regular expressions) would be.

See the re module documentation for more information.

Greg Hewgill
Thanks Greg - I edited my answer to include the "r", best not to introduce subtle bugs!
Marc Novakowski
+4  A: 

Here's a better way to do it than some of the other solutions:

re.findall(r'(?i)(?<![a-z0-9])[a-f0-9]{32}(?![a-z0-9])', data)

This ensures that the match must be a string of 32 hexadecimal digit characters, but which is not contained within a larger string of other alphanumeric characters. With all the other solutions, if there is a string of 37 contiguous hexadecimals the pattern would match the first 32 and call it a match, or if there is a string of 64 hexadecimals it would split it in half and match each half as an independent match. Excluding these is accomplished via the lookahead and lookbehind assertions, which are non-capturing and will not affect the contents of the match.

Note also the (?i) flag which will makes the pattern case-insensitive which saves a little bit of typing, and that wrapping the entire pattern in parentheses is superfluous.

ʞɔıu
Why not just add ^ and $ line anchors instead of using lookahead and lookbehind?
Imran
I assumed he was searching for md5s within a larger amount of text. If you're checking whether a single string entirely contains an md5, anchors would be the better way of doing that.
ʞɔıu