I've come up with:
re.findall("([a-fA-F\d]*)", data)
but it's not very foolproof. Is there a better way to grab all MD5 hash codes?
How about "([a-fA-F\d]{32})" which requires it to be 32 characters long?
Well, since an MD5 hash is just a string of 32 hex digits, about all you could add to your expression is a check for exactly 32 digits, perhaps something like this:
re.findall(r"([a-fA-F\d]{32})", data)
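To see why the length check matters, here is a small sketch comparing the original `*` pattern with the `{32}` version. The sample `data` string is made up for illustration; the hash in it is the well-known MD5 of the empty string.

```python
import re

# Hypothetical sample input: a file name followed by one MD5 hash.
data = "file.txt d41d8cd98f00b204e9800998ecf8427e"

# "*" allows zero repetitions, so the pattern also matches the empty string
# at every position where no hex digit is found, plus short fragments
# such as the "f" in "file".
loose = re.findall(r"([a-fA-F\d]*)", data)

# "{32}" requires exactly 32 consecutive hex digits.
strict = re.findall(r"([a-fA-F\d]{32})", data)

print(loose)
print(strict)
```

The loose result is mostly noise (empty strings and stray fragments), while the strict one contains only the 32-digit run.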
When using regular expressions in Python, you should almost always use the raw string syntax r"...":
re.findall(r"([a-fA-F\d]{32})", data)
This ensures that the backslash in the string is not interpreted by the normal Python escaping, but is instead passed through to the re.findall function so it can see the \d verbatim. In this case you are lucky that \d is not a recognized Python escape sequence, so it survives unchanged (although recent Python versions emit a warning for such unrecognized escapes). Something like \b, which has completely different meanings in Python escaping and in regular expressions, would not survive.
See the re module documentation for more information.
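A short sketch of the \b pitfall, using a made-up sample string: in a normal string literal "\b" is a single backspace character, while in a raw string it stays a backslash followed by "b", which the re module reads as a word boundary.

```python
import re

# In a normal string literal, "\b" is the single backspace character (U+0008).
assert len("\b") == 1
# In a raw string literal, r"\b" stays as two characters: a backslash and "b".
assert len(r"\b") == 2

# Hypothetical sample text for illustration.
text = "id deadbeef end"

# The raw pattern reaches the re module intact, so \b means "word boundary".
raw_matches = re.findall(r"\b[a-f]{8}\b", text)

# The non-raw pattern asks re to match literal backspace characters instead,
# and the text contains none.
plain_matches = re.findall("\b[a-f]{8}\b", text)

print(raw_matches)    # ['deadbeef']
print(plain_matches)  # []
```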
Here's a better way to do it than some of the other solutions:
re.findall(r'(?i)(?<![a-z0-9])[a-f0-9]{32}(?![a-z0-9])', data)
This ensures that the match is a string of exactly 32 hexadecimal digits that is not contained within a larger run of alphanumeric characters. With all the other solutions, a string of 37 contiguous hexadecimal digits would have its first 32 characters reported as a match, and a string of 64 hexadecimal digits would be split in half, with each half reported as an independent match. Excluding these cases is accomplished via the lookahead and lookbehind assertions, which are zero-width: they consume no characters and do not affect the contents of the match.
Note also the (?i) flag, which makes the pattern case-insensitive and saves a little typing, and that wrapping the entire pattern in parentheses is superfluous.
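As a rough illustration of the difference, the sketch below runs the plain {32} pattern and the lookaround pattern over the same input. The names NAIVE and BOUNDED and the sample data are assumptions made up for this example.

```python
import re

# Plain length check, as in the other answers.
NAIVE = re.compile(r"[a-fA-F\d]{32}")
# Length check plus lookbehind/lookahead guards against longer runs.
BOUNDED = re.compile(r"(?i)(?<![a-z0-9])[a-f0-9]{32}(?![a-z0-9])")

md5_like = "d41d8cd98f00b204e9800998ecf8427e"  # exactly 32 hex digits
run_of_37 = "a" * 37                           # too long to be an MD5 hash
run_of_64 = "0123456789abcdef" * 4             # e.g. a SHA-256-sized string

data = f"ok: {md5_like} x: {run_of_37} y: {run_of_64}"

naive_hits = NAIVE.findall(data)
bounded_hits = BOUNDED.findall(data)

# The naive pattern also reports the first 32 characters of the 37-digit
# run and both halves of the 64-digit run.
print(len(naive_hits))  # 4
# The bounded pattern reports only the standalone 32-digit string.
print(bounded_hits)     # ['d41d8cd98f00b204e9800998ecf8427e']
```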