answers:

3

I'm having trouble writing the regular expression I need. I suspect I need some combination of lookaround or conditional expressions, but I'm at a loss.

I have a data string like:

pattern1 pattern2 pattern3 unwanted-groups pattern4 random number of tokens pattern5 optional1 optional2 more unknown unwanted junk separated with white spaces optional3 optional4 etc

where I have a matching expression for each of the 'pattern#' and 'optional#' groups (optional groups are not required in the data and therefore not always present). For the other sections I have no pattern (the text is free-form) and no group count to skip; all I know is that every 'token' is separated by white space.

I've managed to figure out how to skip the unwanted stuff between the required groups, but when I hit the optional groups I'm lost. Any suggestions on where I should be looking for hints or help?

Thanks

this is what I currently have:

pattern = re.compile(r'(?:(METAR|SPECI)\s*)*(?P<ICAO>[\w]{4}\s)*'
          r'(?P<NIL>(NIL)\s)*(?P<UTC>[\d]{6}Z\s)*(?P<AUTOCOR>(AUTO|COR)*\s)*'
          r'(?P<WINDS>[\w]{5,6}G*[\d]{0,2}(MPS|KT|KMH)\s)\s*'
          r'.*?\s' #skip miscellaneous between winds and thermal data
          r'(?P<THERM>[\d]{2}/[\d]{2}\s)\s*(?P<PRESS>A[\d]{4}\s)\s*'
          r'(?:RMK\s)\s*(?P<AUTO>AO\d\s)*'
          r'(?P<PEAK>(PK\sWND\s[\d]{5,6}/[\d]{2,4}))*'
          r'(?P<SLP>SLP[\d]{3}\s)*'
          r'(?P<PRECIP>P[\d]{4}\s)*'   
          r'(?P<remains>.*)'
          )

example = "METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 T02780189 10389 20272 53007="

data = pattern.match(example)

It seems to work for the first 10 groups, but that is about it....

Again, thanks everybody.

+1  A: 

You need to use the | operator and findall:

>>> import re
>>> regex = re.compile(r"(regex\d+|optregex\d+)")
>>> regex.findall(string)
[u'regex1', u'regex2', u'regex3', u'regex4', u'regex5', u'optregex1', u'optregex2', u'optregex3', u'optregex4']
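As a rough sketch of the same idea applied to the placeholder string from the question: naming each alternative lets `finditer` report which branch matched, so required and optional fields can be separated afterwards. The names `token`, `required` and `optional` are made up for illustration, and `pattern\d+` / `optional\d+` only stand in for the real per-field expressions.

```python
import re

# Placeholder data string from the question; "pattern\d+" and "optional\d+"
# are stand-ins for the real per-field expressions (an assumption).
data = ("pattern1 pattern2 pattern3 unwanted-groups pattern4 random number "
        "of tokens pattern5 optional1 optional2 more unknown unwanted junk "
        "separated with white spaces optional3 optional4 etc")

# One alternation with a named group per kind of field.
token = re.compile(r"\b(?:(?P<required>pattern\d+)|(?P<optional>optional\d+))\b")

# m.lastgroup is the name of the group that matched for this token.
matches = [(m.lastgroup, m.group()) for m in token.finditer(data)]
required = [text for kind, text in matches if kind == "required"]
optional = [text for kind, text in matches if kind == "optional"]

print(required)
print(optional)
```

Anything that matches neither alternative is simply skipped, so no expression for the free-form junk is needed.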

A piece of advice: there are several GUI tools that let you experiment with (and actually help you write) regular expressions. For Python, I'm quite fond of Kodos.

Paolo Tedesco
Thanks, I'll have to play with findall. I'm not sure it solves my problem, but I may find a solution faster that way.
+4  A: 

If all the data is in that format I'd go with split instead. I think it will be faster.


data = "regex1 regex2 regex3 unwanted-regex regex4 random number of tokens regex5 optregex1 optregex2 more unknown unwanted junk separated with white spaces optregex3 optregex4 etc"
parts = data.split()  # each whitespace-separated token is now a list element
for index, item in enumerate(parts):
    if index == 3:
        continue  # this is unwanted-regex
    # do what you want with the information here
    print(item)
Geo
Yeah, that was my initial approach, but some of my fields include white space. That said, I may just have to go that route. Thanks.
You can use the string `join` method to merge some of your data.
Geo
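A minimal sketch of the split-then-join idea, using the METAR example from the question. It assumes the only multi-token field of interest is the three-token peak wind group (`PK WND` plus its value); the helper name `merge_peak_wind` is hypothetical.

```python
example = ("METAR KCSM 162353Z AUTO 07011KT 10SM TS SCT100 28/19 A3000 RMK AO2 "
           "PK WND 06042/2325 WSHFT 2248 LTG DSNT ALQDS PRESRR SLP135 "
           "T02780189 10389 20272 53007=")

tokens = example.split()


def merge_peak_wind(tokens):
    """Rejoin 'PK', 'WND' and the token after them into a single field."""
    merged = []
    i = 0
    while i < len(tokens):
        # Merge only when a value token actually follows "PK WND".
        if tokens[i:i + 2] == ["PK", "WND"] and i + 2 < len(tokens):
            merged.append(" ".join(tokens[i:i + 3]))
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


fields = merge_peak_wind(tokens)
print(fields)
```

Each element of `fields` can then be checked against its own small expression, rather than fitting one large pattern to the whole line.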
+1 This is not a regex problem.
hughdbrown
A: 

If all of your targets consist of things like "foo1", "bar22" etc (in other words a sequence of letters followed by a sequence of digits) and everything else (sequences of digits, "words" without numeric suffixes, etc) is "junk" then the following seems to be sufficient:

re.findall(r'[A-Za-z]+\d+', targetstr)

(We can't use just r'\w+\d+' because \w matches digits and _ (underscores) as well as letters).

If you're looking for a limited number of key patterns, or some of the junk might itself look like "foo123", then you'll obviously have to be more specific.
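To make the difference between the two character classes concrete, here is a small sketch on a made-up input string (`targetstr` and its contents are illustrative only). It also shows that, without word boundaries, the letters-only pattern can pick up the tail of a longer token.

```python
import re

# Made-up sample input: two targets, some junk, an underscore token, one more target.
targetstr = "foo1 bar22 junk 12345 under_score9 word baz333"

# Letters followed by digits; may match the tail of a token such as "under_score9".
letters_then_digits = re.findall(r"[A-Za-z]+\d+", targetstr)

# \w also matches digits and underscores, so pure numbers and "under_score9" match too.
word_then_digits = re.findall(r"\w+\d+", targetstr)

# Word boundaries restrict matches to whole tokens.
whole_tokens = re.findall(r"\b[A-Za-z]+\d+\b", targetstr)

print(letters_then_digits)
print(word_then_digits)
print(whole_tokens)
```

With this input the unanchored letters-only pattern also returns `score9`, which is one reason a more specific pattern may be needed when the junk resembles the targets.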

Jim Dennis