Hi, I'm using re.findall() to extract some version numbers from an HTML file:

>>> import re
>>> text = "<table><td><a href=\"url\">Test0.2.1.zip</a></td><td>Test0.2.1</td></table> Test0.2.1"
>>> re.findall("Test([\.0-9]*)", text)
['0.2.1.', '0.2.1', '0.2.1']

but I would like to only get the ones that do not end in a dot. The filename might not always be .zip so I can't just stick .zip in the regex.

I wanna end up with:

['0.2.1', '0.2.1']

Can anyone suggest a better regex to use? :)

+3  A: 
re.findall("Test([0-9.]*[0-9]+)", text)

or, a bit shorter:

re.findall("Test([\d.]*\d+)", text)

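By the way, you don't need to escape the dot inside a character class. In Python's re module both forms match exactly the same characters:

[\.0-9]  // matches: 0 1 2 3 4 5 6 7 8 9 .
[.0-9]   // matches: 0 1 2 3 4 5 6 7 8 9 .

Some other flavors (POSIX bracket expressions, for example) treat a backslash inside a class as a literal character, so the unescaped form is the more portable one.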
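
For reference, here is a small sketch that runs the pattern against the sample text from the question and compares the two character classes. The variable names and the `sample` string are only illustrative. Note that the filename match is kept with its trailing dot trimmed, so three versions come back.

```python
import re

# Sample text from the question.
text = ('<table><td><a href="url">Test0.2.1.zip</a></td>'
        '<td>Test0.2.1</td></table> Test0.2.1')

# Raw string so the backslash in \d reaches the regex engine untouched.
pattern = re.compile(r"Test([\d.]*\d+)")
versions = pattern.findall(text)
print(versions)  # ['0.2.1', '0.2.1', '0.2.1']

# Illustrative string containing a dot, a backslash and a digit.
sample = r"a.b\c9"

# In Python's re module, the escaped and unescaped dot behave the same
# inside a character class; neither matches the backslash.
escaped = re.findall(r"[\.0-9]", sample)
unescaped = re.findall(r"[.0-9]", sample)
print(escaped)    # ['.', '9']
print(unescaped)  # ['.', '9']
```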
Tomalak
Works great, thanks a lot!
Ashy
It should probably be \d+ if numbers can be greater than 9
unbeknown
True. I'll add that, thanks.
Tomalak
It should be r"Test([\d.]*\d+)". \d doesn't mean anything in an ordinary string literal, so it works, but it's generally good practice not to rely on that. You could use r"Test(\d+(?:\.\d+)*)" if you want to be slightly more restrictive (so that 1..2 is not accepted as a single version, for instance).
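
A quick sketch of the stricter idea, assuming the intent is "groups of digits separated by single dots". The name `STRICT` and the exact pattern are my own illustration, not necessarily what the commenter had in mind. The strict pattern does not skip the malformed input entirely; it stops at the first doubled dot.

```python
import re

# Hypothetical name; one or more digit groups joined by single dots.
STRICT = re.compile(r"Test(\d+(?:\.\d+)*)")
# The looser pattern from the answer, for comparison.
LOOSE = re.compile(r"Test([\d.]*\d+)")

strict_zip = STRICT.findall("Test0.2.1.zip")
strict_bad = STRICT.findall("Test1..2")
loose_bad = LOOSE.findall("Test1..2")

print(strict_zip)  # ['0.2.1']
print(strict_bad)  # ['1']     (stops before the doubled dot)
print(loose_bad)   # ['1..2']  (accepts the doubled dot)
```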
ianb
A: 

Nice answer! I learned a lot from that. As an alternative, fetch all the matches and use an "if" clause to filter out those that end with a dot.
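
Something along these lines should do it. This is a minimal sketch using the sample text from the question, with illustrative variable names. Unlike the regex-only fix, it drops the filename match entirely, which gives the two-element list the question asked for.

```python
import re

# Sample text from the question.
text = ('<table><td><a href="url">Test0.2.1.zip</a></td>'
        '<td>Test0.2.1</td></table> Test0.2.1')

# Grab every candidate, including the one with a trailing dot.
matches = re.findall(r"Test([.0-9]*)", text)
print(matches)   # ['0.2.1.', '0.2.1', '0.2.1']

# Keep only non-empty candidates that do not end with a dot.
versions = [m for m in matches if m and not m.endswith(".")]
print(versions)  # ['0.2.1', '0.2.1']
```

If you would rather keep the filename match and just trim it, `m.rstrip(".")` on each candidate is an option instead of the filter.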

dgg32