I have several strings which look like the following:

<some_text> TAG[<some_text>@11.22.33.44] <some_text>

I want to get the ip_address and only the ip_address from this line. (For the sake of this example, assume that the ip address will always be in this format xx.xx.xx.xx)

Edit: I'm afraid I wasn't clear.

The strings will look something like this:

<some_text> TAG1[<some_text>@xx.xx.xx.xx] <some_text> TAG2[<some_text>@yy.yy.yy.yy] <some_text>

Note that the 'some_text' can be of variable length. I need to associate different regexes with different tags so that when r.group() is called, the ip address will be returned. In the above case the regexes would not be different, but it is a bad example.

The regexes I have tried so far have been inadequate.

Ideally, I would like something like this:

r = re.search('(?<=TAG.*@)(\d\d.\d\d.\d\d.\d\d)', line)

where line is in the format specified above. However, this does not work because you need to have a fixed width look-behind assertion.
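To make the limitation concrete, here is a minimal sketch using the pattern from the attempt above. The standard re module rejects it when the pattern is compiled, before any line is searched.

```python
import re

# The .* inside (?<=...) has no fixed length, so this is a
# variable-width look-behind.
pattern = r'(?<=TAG.*@)(\d\d.\d\d.\d\d.\d\d)'

try:
    re.compile(pattern)
    compiled_ok = True
except re.error as exc:
    # The standard re module requires look-behind to be fixed width.
    compiled_ok = False
    print("re.error:", exc)
```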

Additionally, I have tried non-capturing groups as such:

r = re.search('(?<=TAG\[)(?:.*@)(\d\d.\d\d.\d\d.\d\d)', line)

However, I cannot use this because r.group() will return <some_text>@11.22.33.44

I understand that r.group(1) will return just the ip address. Unfortunately, the script I am writing requires that all my regex will return the correct result after calling r.group().
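A small sketch of the difference, run against a made-up sample line (the text "foo", "bar" and "baz" is only placeholder data):

```python
import re

# Hypothetical sample line in the format described above.
line = "foo TAG[bar@11.22.33.44] baz"

r = re.search(r'(?<=TAG\[)(?:.*@)(\d\d.\d\d.\d\d.\d\d)', line)

# group() is everything the pattern consumed, including the
# non-capturing (?:.*@) part; the look-behind itself is not included.
whole = r.group()

# group(1) is only the text matched by the capturing parentheses.
only_ip = r.group(1)

print(whole)
print(only_ip)
```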

What kind of regex could I use for this situation?

The code is in python.

Note: All of the some_text can be variable length

+1  A: 

Why do you want to use groups or look behinds at all? What is wrong with re.search(r'TAG\[.*@(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]', line)?

Frank
This regex will return the whole section: TAG[<some_text>@xx.xx.xx.xx], when called with r.group(). I need it so r.group() only returns the ip_address
anon-user
Sorry, forgot the opening parenthesis before the first \d. I edited it, and it should be correct now.
Frank
Shouldn't those be `{1,3}`, not `{1-3}`?
JAB
This will still return the whole TAG[<some_text>@xx.xx.xx.xx] string if I am not mistaken.
anon-user
Yes, I corrected it. Thank you for finding this.
Frank
+1  A: 

I don't think it's possible to do that - r.group() will always return the whole string that matched, so you're forced to use lookbehind, which as you say must be fixed width.

Instead, I'd suggest modifying the script that you're writing. I'm guessing that you have a whole load of regexps that it matches, and you don't want to have to specify for each one "this one uses r.group(0)", "this one uses r.group(3)" etc.

In that case, you could use Python's named groups facility: you can name a group in a regular expression like this:

(?P<name>CONTENTS)

then retrieve what matched with r.group("name").

What I suggest doing in your script is: match the regular expression, then test if r.group("usethis") is set. If so - use that; if not - then use r.group() as before.

That way you can cope with awkward situations like this by specifying the group name usethis in the regexp - but your other regexps don't have to know or care.
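A minimal sketch of that idea follows. The helper name extract and the sample line are hypothetical; only the group name usethis comes from the suggestion above.

```python
import re

def extract(pattern, line, group_name="usethis"):
    """Return the named group if the pattern defines and matched it,
    otherwise the whole match; None if the pattern does not match."""
    m = re.search(pattern, line)
    if m is None:
        return None
    # groupdict() only contains named groups; .get() returns None when
    # the name is absent or the group did not take part in the match.
    value = m.groupdict().get(group_name)
    return value if value is not None else m.group()

# Hypothetical sample line with two tags.
line = "foo TAG1[bar@11.22.33.44] baz TAG2[qux@55.66.77.88] end"

# Awkward case: the regex needs context, so it marks the wanted part.
tag2_ip = extract(r'TAG2\[.*?@(?P<usethis>\d\d\.\d\d\.\d\d\.\d\d)\]', line)

# Simple case: no named group, so the whole match is used as before.
first_ip = extract(r'(?<=@)\d\d\.\d\d\.\d\d\.\d\d', line)

print(tag2_ip, first_ip)
```

The existing regexes keep working unchanged, and only the awkward ones need the extra group name.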

psmears
The problem is exactly as you mentioned. I do not want to specify that this 'tag' uses r.group(0) and this other 'tag' uses r.group(3). I have thought about using Python's named groups facility which, from looking at the responses, seems to be the best option.
anon-user
+1  A: 

Try re.search('(?<=@)\d\d\.\d\d\.\d\d\.\d\d(?=\])', line).

In fact, re.search('\d\d\.\d\d\.\d\d\.\d\d', line) may get you what you need if the only occurrence of the xx.xx.xx.xx format in the strings being checked is in those IP address sections.

EDIT: As stated in my comment, to find all occurrences of the wanted pattern in a string, you just do re.findall(pattern_to_match, line). So in this case, re.findall('\d\d\.\d\d\.\d\d\.\d\d', line) (or more generally, re.findall('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)).
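For illustration, a short sketch of re.findall on a made-up line with two tags (the surrounding text is placeholder data):

```python
import re

# Hypothetical sample line.
line = "foo TAG1[bar@11.22.33.44] baz TAG2[qux@55.66.77.88] end"

# With no capturing groups in the pattern, findall returns the full
# text of every non-overlapping match, in order of appearance.
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)

print(ips)
```

This gives every address on the line, but it does not say which tag each one belongs to.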

EDIT 2: From your comment, this should work (with tagname being the tag of the IP address you currently want).

r = re.search(tagname + r'\[.+?@(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})', line)

And then you'd just refer to it with r.group("ip") like psmears said.

...In fact, there's an easy way to make the regex a bit more concise (at the cost of being slightly looser, since the dots become optional).

r = re.search(tagname + r'\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)

In fact, you could even do this:

r = re.findall(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)

Which would return you a list containing the tags and their associated IP addresses, and so you wouldn't have to recheck any one string once you found the matches if you wanted to refer to the IP address of a different tag from the same string.

...In fact, going two steps further (farther?), you could do the following:

r = dict((m.group("tag"), m.group("ip")) for m in re.finditer(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line))

Or in Python 3:

r = {m.group("tag"): m.group("ip") for m in re.finditer(r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})', line)}

And then r would be a dict with the tags as keys and the IP addresses as the respective values.
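As a quick check, here is a sketch of the tag-to-address mapping on a made-up line (the tag names and addresses are sample data only):

```python
import re

# Hypothetical sample line with two tags.
line = "foo TAG1[bar@11.22.33.44] baz TAG2[qux@55.66.77.88] end"

pattern = r'(?P<tag>\S+)\[.+?@(?P<ip>(?:\d{1,3}\.?){4})'

# One dict entry per match: tag name -> IP address.
tag_to_ip = {m.group("tag"): m.group("ip")
             for m in re.finditer(pattern, line)}

print(tag_to_ip["TAG2"])
```

Note that a dict needs the key: value form inside the braces; a bare tuple inside braces would build a set of pairs instead.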

JAB
The problem is there are multiple occurrences of @xx.xx.xx.xx in the string
anon-user
In that case you just use `re.findall(pattern)`
JAB
My apologies. I was not clear enough in the question. The string will look something like this: some_text TAG1[<some_text>@xx.xx.xx.xx] some_text TAG2[<some_text>@yy.yy.yy.yy] some_text. I need it to find, say, just yy.yy.yy.yy.
anon-user
Ah, I see. Updated my answer again, then.
JAB
A: 

Almost, but I think that you need to change the .* at the start to .*?, since you may have multiple TAGs on a single line (as in the example):

re.search(r'TAG(\d+)\[.*?@(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\]', line)

The tag ID will be in the first capture group, r.group(1), and the IP address will be in the second, r.group(2).
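To show why the lazy quantifier matters, here is a small sketch comparing .* and .*? on a made-up line with two tags (sample data only):

```python
import re

# Hypothetical sample line with two tags.
line = "foo TAG1[bar@11.22.33.44] baz TAG2[qux@55.66.77.88] end"

ip = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

# Greedy .* runs to the end of the line and backtracks to the LAST '@',
# so the first tag ID ends up paired with the second tag's address.
greedy = re.search(r'TAG(\d+)\[.*@' + ip + r'\]', line)

# Lazy .*? stops at the first '@' after the tag, keeping the pair aligned.
lazy = re.search(r'TAG(\d+)\[.*?@' + ip + r'\]', line)

print(greedy.groups())
print(lazy.groups())
```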

Jonathan Stanton