Hi All,

I'm trying to create a regular expression that extracts part of the following text:

amd64 build of software 1:0.98.10-0.2svn20090909 in archive

What I want to extract is:

software 1:0.98.10-0.2svn20090909

How can I do this? This is what I have so far:

import re

p = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9\:\.\-]+)')
iterator = p.finditer("amd64 build of software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())

with result:

software 1:0.98.10-0.2

(svn20090909 is missing)

Thanks a lot.

A: 

Don't use a capturing group if you want everything in one piece.
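
For reference, here is a minimal sketch of how groups behave in Python's re module, using the pattern and text from the question. The names capturing, non_capturing, m1 and m2 are only illustrative. It suggests that group() with no argument returns the whole match whether or not the pattern contains capturing groups, so the missing suffix comes from the character class rather than from the groups themselves.

```python
import re

text = "amd64 build of software 1:0.98.10-0.2svn20090909 in archive"

# Pattern from the question, written as a raw string.
capturing = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9\:\.\-]+)')
# Same pattern with non-capturing groups, written (?:...).
non_capturing = re.compile(r'(?:[a-zA-Z0-9\-\+\.]+)\ (?:[0-9\:\.\-]+)')

m1 = capturing.search(text)
m2 = non_capturing.search(text)

print(m1.group())    # whole match: software 1:0.98.10-0.2
print(m1.group(1))   # first group: software
print(m1.group(2))   # second group: 1:0.98.10-0.2
print(m2.group())    # same whole match as m1.group()
print(m2.groups())   # no groups were captured: ()
```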

Azeem.Butt
I want the capturing group :)
Then you should know how they work :)
Azeem.Butt
+3  A: 

This will work:

import re

p = re.compile(r'([a-zA-Z0-9\-\+\.]+)\ ([0-9][0-9a-zA-Z\:\.\-]+)')
iterator = p.finditer("amd64 build of dvdrip software 1:0.98.10-0.2svn20090909 in archive")
for match in iterator:
    print(match.group())
# Prints: software 1:0.98.10-0.2svn20090909

That works by allowing the second captured group to contain letters while still insisting that it starts with a digit.

Without seeing all the other strings it needs to match, I can't be sure whether that's good enough.
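
If only the first name/version pair on a line is wanted, one possible way to wrap this is sketched below. The helper name extract_package is made up for illustration, and using search() instead of finditer() is my assumption about what is needed. It returns the two groups separately and returns None when nothing matches.

```python
import re

# Same idea as the pattern above: the version must start with a digit
# but may contain letters afterwards.
pattern = re.compile(r'([a-zA-Z0-9\-+.]+) ([0-9][0-9a-zA-Z:.\-]+)')

def extract_package(line):
    """Return (name, version) for the first match, or None if there is none."""
    match = pattern.search(line)
    if match is None:
        return None
    return match.groups()

print(extract_package("amd64 build of software 1:0.98.10-0.2svn20090909 in archive"))
# ('software', '1:0.98.10-0.2svn20090909')
print(extract_package("no version here"))
# None
```

Whether the first match is the right one still depends on what the other input lines look like.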

RichieHindle
+3  A: 

If your lines are consistent, that is, if each entry is on one line and the word you want always comes right before the version part (the 1:0.98 ... part), you don't need a regexp. Try this:

>>> s = 'amd64 build of software 1:0.98.10-0.2svn20090909 in archive'
>>> match = [s.split()[3], s.split()[4]]
>>> print(match)
['software', '1:0.98.10-0.2svn20090909']
>>> # alternatively
>>> match = s.split()[3:5] # for same result

Here is what it does: it splits the line s on whitespace (using the string method split()) and selects the fourth and fifth elements of the resulting list (indices 3 and 4); both are stored in the variable match.

Again, this only works if you have one entry per line and if the 'software' part is always the fourth word, immediately followed by the 1:0.98.10-0.2svn20090909 part.
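
A small sketch of the same idea as a function, with a length check so that short lines don't raise an IndexError. The name split_name_version and the assumed line shape '<arch> build of <name> <version> in <archive>' are mine, not something the question guarantees. The second call shows the caveat: an extra word shifts the fields.

```python
def split_name_version(line):
    """Return [name, version] from the 4th and 5th whitespace-separated
    fields, or None if the line has fewer than five fields."""
    fields = line.split()
    if len(fields) < 5:
        return None
    return fields[3:5]

print(split_name_version('amd64 build of software 1:0.98.10-0.2svn20090909 in archive'))
# ['software', '1:0.98.10-0.2svn20090909']

# An extra word before the name moves everything one position to the right.
print(split_name_version('amd64 build of dvdrip software 1:0.98.10-0.2svn20090909 in archive'))
# ['dvdrip', 'software']
```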

I often avoid regexps when I can make do with split lists. If the parsing becomes a nightmare, I use pyparsing.

Arrieta
Awesome!! This also helps me :)