ansaurus

Question

python regular expression across multiple lines

Answer 1

+1 A:

x="""Top Assembly Part Number        : 800-25858-06
Top Assembly Revision Number    : A0
Version ID                      : V08
CLEI Code Number                : COMDE10BRA
Hardware Board Revision Number  : 0x01


Switch   Ports  Model              SW Version              SW Image
------   -----  -----              ----------              ----------
*    1   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M
     2   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M
     3   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M
     4   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M


Switch 02
---------
Switch Uptime                   : 11 weeks, 2 days, 16 hours, 27 minutes
Base ethernet MAC Address       : 00:26:52:96:2A:80
Motherboard assembly number     : 73-9675-15"""

>>> import re
>>> re.findall(r"^\*?\s*(\d)\s*\d+\s*([A-Z\d-]+)", x, re.MULTILINE)
[('1', 'WS-C3750-48P'), ('2', 'WS-C3750-48P'), ('3', 'WS-C3750-48P'), ('4', 'WS-C3750-48P')]

UPDATE: the OP edited the question, and thanks to Tom for suggesting \s+ instead of \s*:

>>> re.findall(r"^(\*?)\s+(\d)\s+\d+\s+([A-Z\d-]+)", x, re.MULTILINE)
[('*', '1', 'WS-C3750-48P'), ('', '2', 'WS-C3750-48P'), ('', '3', 'WS-C3750-48P'), ('', '4', 'WS-C3750-48P')]
>>>
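As a rough illustration of what re.MULTILINE contributes here, the sketch below runs the same pattern with and without the flag. The `sample` string is a shortened copy of the text above and the variable names are only for illustration.

```python
import re

# Shortened copy of the sample text from this answer.
sample = """Hardware Board Revision Number  : 0x01

Switch   Ports  Model              SW Version              SW Image
------   -----  -----              ----------              ----------
*    1   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M
     2   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M

Switch 02
---------
"""

pattern = r"^(\*?)\s+(\d)\s+\d+\s+([A-Z\d-]+)"

# Without re.MULTILINE, ^ matches only at the very start of the string,
# so none of the table rows can match.
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, ^ also matches right after every newline,
# so each table row gets its own chance to match.
with_flag = re.findall(pattern, sample, re.MULTILINE)

print(without_flag)
print(with_flag)
```

So the flag only matters because the pattern is anchored with ^; an unanchored pattern would not need it.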

S.Mark 2009-12-09 00:57:41

+1 because I think you did a fine job of answering the question quickly :-). But to clean this up... I would use \s+ instead of \s*. Also, re.MULTILINE does nothing of importance in this case. I believe your solution will work without it :-).

Tom 2009-12-09 01:12:18

@Tom, well, you need the multi-line IF that `^` is to match the start-of-line, as I elaborated on in my answer -- I'm just not sure whether it's actually necessary to sync up with start-of-line, it depends on how the "model" can be identified.

Alex Martelli 2009-12-09 01:14:54

Hmm, what you do above looks OK, but you are only parsing the table. In my data the table sits in the middle of a whole bunch of other text (see original post), with maybe 50 lines above and below.

2009-12-09 01:23:32

BTW, the model will always start with WS-.

2009-12-09 01:24:49

OK, I was not sure it always starts with WS-, going to update it.

S.Mark 2009-12-09 01:25:56

Ah, it would be exactly the same as Alex's answer, so I won't update mine; please accept Alex's answer instead. Thanks.

S.Mark 2009-12-09 01:27:07

Bingo. I am in the presence of heroes. Many thanks, chaps.

2009-12-09 12:42:53

Answer 2

+1 A:

To have . match any character, including a newline, compile your RE with re.DOTALL among the options (remember, if you have multiple options, use |, the bit-or operator, between them, in order to combine them).
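A minimal sketch of the difference, assuming a small made-up `text` string modelled on the switch output (the variable names are illustrative only):

```python
import re

text = "Switch 02\n---------\nSwitch Uptime : 11 weeks"

# By default "." does not match a newline, so this cannot span two lines.
default = re.search(r"Switch 02.*Uptime", text)

# With re.DOTALL, "." also matches "\n", so the match can cross lines.
dotall = re.search(r"Switch 02.*Uptime", text, re.DOTALL)

# Several options are combined with the bitwise-or operator.
# re.MULTILINE makes ^ and $ work per line; re.DOTALL lets ".*?" cross lines.
combined = re.compile(
    r"^Switch 02$.*?^Switch Uptime\s*:\s*([^\n]+)",
    re.DOTALL | re.MULTILINE,
)
uptime = combined.search(text).group(1)

print(default)
print(dotall.group(0))
print(uptime)
```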

In this case I'm not sure you actually need this -- why not something like this (text being the string that holds your data)

re.findall(r'(\d+)\s+\d+\s+(WS-\S+)', text)

assuming for example that the way you identify a "model" is that it starts with WS-? The fact that there will be newlines between one result of findall and the next one is not a problem here. Can you explain exactly how you identify a "model" and why "multiline" is an issue? Maybe you want the re.MULTILINE to make ^ match at each start-of-line, to grab your data with some reference to the start of the lines...?
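Here is a small sketch of that idea run against a shortened copy of the switch output, with unrelated lines above and below the table. The `text` variable is an assumption standing in for the real data.

```python
import re

# Shortened stand-in for the real output: the table is surrounded by other text.
text = """Version ID                      : V08
Hardware Board Revision Number  : 0x01

Switch   Ports  Model              SW Version              SW Image
------   -----  -----              ----------              ----------
*    1   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M
     2   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M

Switch 02
---------
Motherboard assembly number     : 73-9675-15
"""

# No ^ anchor and no flags: the "WS-" prefix is what pins the match to table rows.
rows = re.findall(r"(\d+)\s+\d+\s+(WS-\S+)", text)

for switch, model in rows:
    print(switch, model)
```

Because the pattern is not anchored to a line start, the surrounding lines are simply skipped and no flag is required.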

Alex Martelli 2009-12-09 01:01:34

Alex, once again, you beat me to it :-). The key to doing multiline regexes is really re.DOTALL (which is confusing, because you would think it's re.MULTILINE). BUT, as he pointed out, you don't need it in this case, since each piece of data you want to extract sits on its own line :-). Also, I like that Alex used \s+, meaning one or more whitespace characters. One thing I might have added: I usually like to name my groups, e.g. (?P<model>WS-\S+).

Tom 2009-12-09 01:10:25
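A quick sketch of the named-group variant mentioned in the comment above. Only the `model` group name comes from that comment; the `switch` and `ports` names are made up for illustration.

```python
import re

row = "*    1   52     WS-C3750-48P       12.2(35)SE5             C3750-IPBASE-M"

# (?P<name>...) gives each group a name instead of only a position.
pattern = re.compile(r"(?P<switch>\d+)\s+(?P<ports>\d+)\s+(?P<model>WS-\S+)")

match = pattern.search(row)
info = match.groupdict()  # maps each group name to the text it captured

print(info["switch"], info["model"])
```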

Hmm, you are probably right there. I will try that and report back, but as I'm in the UK it will be tomorrow. Many thanks for your time.

2009-12-09 01:12:35
