I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.

I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!

Why?

A: 

The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?

The expression you want is simply as follows:

(\w+)
Konrad Rudolph
I get nothing.
farka
I'm using a regex; I want only temperatoA and CelcieusB, i.e. the words before the !
farka
@farka: Show us *how* you’re using the expression. It’s not the expression that’s wrong, it’s how you’re using it.
Konrad Rudolph
I want to extract only the words before the !, from between the items 1/word1 2/word2 3/word21...n/word
farka
+1  A: 

With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that there is a digit and a slash immediately before the start of the match.

\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.

(?=.*!) asserts that there is a ! further ahead in the string, i.e. the regex will fail once we have passed the !.

Depending on the language you're using, you might need to escape some of the characters in the regex.

E.g., for use in C (with the PCRE library), you need to escape the backslashes:

myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
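
To check the pattern quickly outside C, here is a minimal sketch in Python. Using Python's re module is my assumption, not what the asker uses; it accepts this lookbehind because \d/ has a fixed width. The variable names are made up for illustration.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# Same pattern as above; a raw string avoids doubling the backslashes.
pattern = re.compile(r"(?<=\d/)\w+(?=.*!)")

# findall returns every non-overlapping match, scanning left to right.
words = pattern.findall(text)
print(words)
```

Nothing after the ! is returned, because the lookahead can no longer find a ! to the right of those candidates.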
Tim Pietzcker
I'm using PCRE (Perl Compatible Regular Expressions).
farka
In which programming language? PCRE is available for many different languages. The good news is that the regex will work because PCRE supports lookaround.
Tim Pietzcker
But it's not working :-)))
farka
Please answer my question - otherwise there's no way to help you.
Tim Pietzcker
I'm writing the program in C.
farka
A: 

Will this work?

/([[:alpha:]]\w+)\b(?=.*!)

I made the following assumptions...

  1. A word begins with an alphabetic character.
  2. A word always immediately follows a slash. No intervening spaces, no words in the middle.
  3. Words after the exclamation point are ignored.
  4. You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.

[[:alpha:]] matches any alphabetic character.

The \b matches a word boundary.

And the (?=.*!) came from Tim Pietzcker's post.
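
The loop from assumption 4 might look roughly like this. It is a sketch in Python rather than C, and it swaps [[:alpha:]] for [A-Za-z] because Python's re module does not support POSIX classes, so it is ASCII-only. The variable names are hypothetical.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# [A-Za-z] stands in for [[:alpha:]] (an assumption for this sketch).
# The leading slash is a literal character to match, not a delimiter.
pattern = re.compile(r"/([A-Za-z]\w+)\b(?=.*!)")

words = []
for m in pattern.finditer(text):
    # group(1) is the word without the leading slash
    words.append(m.group(1))

print(words)
```

Note that [A-Za-z]\w+ needs at least two characters, so a one-letter word would be skipped; \w* would allow it.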

Robert Wohlfarth
+2  A: 

Your whole match evaluates to '1/temperatoA,2/CelcieusB!' because that is what the following expression (your regex, spelled out) matches:

qr{ (       # begin group 
      \d+   # at least one digit
      /     # followed by a slash
     (\w+)  # followed by at least one word character
     ,?     # maybe a comma
    )*      # ANY number of repetitions of this pattern
    !       # followed by a literal exclamation mark
}x;

'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to match as many repetitions as it can, it carries on and finds that the pattern is repeated in '2/CelcieusB' (the comma being optional). So the whole match is what you said it is, but what you probably weren't expecting is that a repeated group keeps only its last repetition: '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
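
The same overwrite can be seen in other engines. Here is a small Python sketch (Python is my choice for illustration, not something from the question) that runs the asker's pattern and prints the whole match and both groups; the variable names are invented.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# The asker's pattern: a repeated group followed by a literal '!'.
match = re.search(r"(\d+/(\w+),?)*!", text)

whole = match.group(0)  # everything the pattern consumed
outer = match.group(1)  # outer group: only its last repetition survives
inner = match.group(2)  # inner group: likewise only the last repetition

print(whole)
print(outer)
print(inner)
```

The first word is matched but never retrievable from the groups, which is why a single match with a repeated group cannot return both words.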

Any time you want to capture everything that fits a certain pattern in a string, it is best to use the *g*lobal flag and assign the captures to an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.

When I do this:

use Data::Dumper;    # provides Dumper()

my $str   = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) { 
    print Dumper( \@matches );
}

I get this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB',
          '23/33',
          '33',
          '55/66',
          '66'
        ];

Now, I figure that's probably not what you expected. But digits such as '3' and '6' are word characters too, and so, coming after a slash, '33' and '66' comply with the expression.

So, if this is an issue, you can tighten your regex to qr{(\d+/(\p{Alpha}\w*))}, which specifies that the first character must be alphabetic, followed by any number of word characters. Then the dump looks like this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB'
        ];

And if you only want 'temperatoA' and 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.

However, the secret to capturing more than one chunk with a capture expression is to assign the match to an array; you can then go through the array to see whether it contains the data you want.
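
For comparison, the "collect all captures into a list" idea can be sketched in Python, where re.findall plays the role of a global match in list context. This is an illustration under assumptions: Python's re module has no \p{Alpha}, so [A-Za-z] is used as an ASCII-only stand-in, and the names loose and strict are made up.

```python
import re

text = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# With one capture group, findall returns just that group for each match.
loose = re.findall(r"\d+/(\w+)", text)

# Requiring a leading letter filters out the purely numeric "words".
strict = re.findall(r"\d+/([A-Za-z]\w*)", text)

print(loose)
print(strict)
```

The loose pattern also picks up '33' and '66', as in the first dump above; the strict one leaves only the two words.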

Axeman
+1 This looks like a damn fine explanation to me - above and beyond the call of duty.
Mike