Hey there,

I'm having trouble building the correct regex for my string. I want to extract all entities that sit inside single-quoted runs, i.e. runs that start and end with '. Each entity is a # followed by one or more digits. However, anything that looks like an entity but is not inside such a quoted run (in this case a phone number starting with #) should not be matched at all.

I hope someone can help me, or at least tell me that what I want to do isn't possible in one regex. Thanks :)

String:

'Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa'

RegEx:

'[#[0-9]+]*'

Wanted matches:

  • '#39'
  • '#39'
  • '#39'
  • '#226'
  • '#8218'
  • '#172'
  • '#39'

Found matches:

  • '#39'
  • '#39'
  • '#39#226#8218#172#39' <- Needs to be split (if possible, in the same RegEx)

Another RegEx:

#[0-9]+

Found matches:

  • '#39'
  • '#39'
  • '#39'
  • '#226'
  • '#8218'
  • '#172'
  • '#39'
  • '#7478347878347834' <- Should not be here :(

Language: C# .NET (4.0)

+3  A: 

You cannot do this in one regex; you'll need two:

First take all matches that are between single quotes:

'[\d#]+'

Then over all those matches, do this:

#\d+

So you'll end up with something like (in C#):

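// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
static IEnumerable<string> GetEntities(string inputString)
{
    // Pass 1: quoted runs that contain only digits and '#'.
    // MatchCollection is non-generic, so the loop variable must be typed as Match.
    foreach (Match m in Regex.Matches(inputString, @"'[\d#]+'"))
    {
        // Pass 2: each #number entity inside the quoted run.
        foreach (Match m2 in Regex.Matches(m.Value, @"#\d+"))
        {
            yield return m2.Value;
        }
    }
}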
Jan Jongboom
Too bad that it isn't possible in one RegEx; I guess this will have to do. Thanks for typing it out for me as well ;)
Willy
Gnarf posted an answer that does it in one RegEx, thanks though!
Willy
+1  A: 

Assuming your regex engine supports lookaheads and variable-length lookbehinds (JGSoft / .NET only):

(?<='[#0-9]*)#\d+(?=[#0-9]*')

Should work... Tested it using this site and got these results:

   1. #39
   2. #39
   3. #39
   4. #226
   5. #8218
   6. #172
   7. #39

Breaking it down is pretty simple:

(?<=        # Start positive lookbehind group - assure that the text before the cursor
            # matches the following pattern: 
  '         # Match the literal '
  [#0-9]*   # Matches #, 0-9, zero or more times
)           # End lookbehind...
#\d+        # Match literal #, followed by one or more digits
(?=         # Start lookahead -- Ensures text after cursor matches (without advancing)
  [#0-9]*   # Allow #, 0-9, zero or more times
  '         # Match a literal '
)

So, this pattern will match #\d+ if the text before it is '[#0-9]* and the text after is [#0-9]*'
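
A minimal C# sketch of how this pattern might be used, assuming .NET's `Regex` class (which accepts variable-length lookbehinds). The names `EntityFinder` and `FindEntities` are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EntityFinder
{
    // Hypothetical helper: collects every #number entity that sits inside
    // a single-quoted run made up only of '#' and digits.
    public static List<string> FindEntities(string input)
    {
        var result = new List<string>();

        // Lookbehind: a ' followed by any mix of # and digits precedes the match.
        // Lookahead: any mix of # and digits followed by a ' comes after it.
        foreach (Match m in Regex.Matches(input, @"(?<='[#0-9]*)#\d+(?=[#0-9]*')"))
        {
            result.Add(m.Value);
        }

        return result;
    }
}
```

Calling `EntityFinder.FindEntities` on the string from the question should yield the seven quoted entities and skip the unquoted phone number.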

gnarf
Wow, that works perfectly! Exactly what I was looking for. Could you explain what this does exactly? Thanks a lot :)
Willy
You sir, are KING!
Willy
@Willy - Honestly though -- I voted for @Jan's answer. It is WAY easier to understand what you are doing there...
gnarf
You are correct, sir. It IS a lot easier to understand, but I wanted to do it in one RegEx if possible, which is what your method does :). Which method would be faster and better performance-wise?
Willy
@Willy - It's hard to say which method will perform better (especially since I don't have a .NET compiler). You should set up some profiling tests to see...
gnarf
+2  A: 

As you don't specify a language, here is a solution in Perl:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $s = qq!Blaa lablalbl balbla balb lbal '#39'blaaaaaaaa'#39' ('#39#226#8218#172#39') blaaaaaaaa #7478347878347834 blaaaa blaaaa!;

my @n = $s =~ /(?<=['#\d])(#\d+)(?=[#'\d])/g;

print Dumper(\@n);

Output :

$VAR1 = [
          '#39',
          '#39',
          '#39',
          '#226',
          '#8218',
          '#172',
          '#39'
        ];
M42
I had no idea that RegEx was language-specific; the RegEx bit works universally, right? This does the trick as well: #\d+(?=#|'). Thanks! Your RegEx is a lot shorter than the one Gnarf posted; what are the differences?
Willy
His only tests that the character following the match is a `#` or `'` -- and not all regular expressions can handle lookahead, lookbehind, etc. If you put a `#` after `#7478347878347834` in your test string, it would then match that as well...
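
A small C# sketch of that failure case, assuming .NET's `Regex` class. The names `LookaheadOnlyDemo` and `Find` are hypothetical, and the short test strings are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadOnlyDemo
{
    // Hypothetical helper: applies the lookahead-only pattern #\d+(?=#|')
    // and returns every match it finds.
    public static List<string> Find(string input)
    {
        var result = new List<string>();

        // Only the character AFTER the digits is checked; nothing verifies
        // that the match sits inside a quoted run.
        foreach (Match m in Regex.Matches(input, @"#\d+(?=#|')"))
        {
            result.Add(m.Value);
        }

        return result;
    }
}
```

With this helper, `Find("bla #7478347878347834 bla")` finds nothing, while `Find("bla #7478347878347834# bla")` picks up the phone number, because the trailing `#` alone satisfies the lookahead.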
gnarf
Tested, you are right :)
Willy
@gnarf: Yes, you're right. I've updated the regex, adding a fixed-length lookbehind, because variable-length lookbehind isn't allowed in Perl and in some other languages.
M42