I have a text list that I am trying to capture data from using a regex.

Here is a sample of the text format:

(1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/

I have a regex that currently captures this correctly, but I am having trouble making it handle edge cases.

Here is my regex:

/\(?\d\d?\)([^\)]+)(\/|\z)/

Unfortunately some of the data contains parentheses like this:

(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/

The substrings '(1998-1999)' and '(blah)' make it fail!
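
To make the failure concrete, here is a minimal sketch in Python. Python is only an illustration choice, not the language from the question, and its re module writes the end-of-string anchor as \Z rather than \z. The variable names are made up for the example.

```python
import re

# The original pattern from the question, with \z written as \Z for Python's re module.
original = re.compile(r"\(?\d\d?\)([^\)]+)(\/|\Z)")

tricky = "(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/"

# Group 1 of each match is the captured item text.
captured = [m.group(1) for m in original.finditer(tricky)]
print(captured)

# [^\)]+ cannot cross a closing parenthesis, so items 1 and 2 are never captured whole.
missing = [item for item in ("this is a sample string", "something strange")
           if not any(item in text for text in captured)]
print("not captured:", missing)
```

The character class [^\)]+ stops at the first closing parenthesis, and nothing in the pattern lets the match continue past it.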

Anyone care to have a crack at this one? Thank you :D

+1  A: 

I would try this:

\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))

This rather scary-looking regex does the following:

  • It looks for one or more digits wrapped in parentheses and captures them;
  • There must be at least one white space character after the digits in parentheses. This white space is ignored (not captured);
  • A non-greedy wildcard expression (.*?) captures the item text. This is (imho) preferable to using negated character classes (eg [^/]+) for this kind of problem;
  • The positive lookahead ((?=...)) says the expression must be followed by a forward slash and then one of:
    • one or more digits wrapped in parentheses; or
    • the string terminator.
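
The breakdown above can also be written as a commented pattern. This is a sketch in Python rather than PHP, assuming the re module's verbose mode and its \Z spelling of the end-of-string anchor. The names ITEM, sample and items are made up for the example.

```python
import re

# Python's re module spells the end-of-string anchor \Z (the \z above is the PCRE/Ruby form).
ITEM = re.compile(r"""
    \((\d+)\)       # group 1: item number inside literal parentheses
    \s+             # required whitespace, not captured
    (.*?)           # group 2: item text, as little as possible
    (?=             # lookahead: stop only where what follows is...
        /           #   a forward slash, then
        (?:\(\d+\)  #   the next parenthesised item number
        |\Z)        #   or the end of the string
    )
""", re.VERBOSE)

sample = "(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/"

# findall returns one (number, text) tuple per list item.
items = ITEM.findall(sample)
for number, text in items:
    print(number, repr(text))
```

Because the lookahead only accepts a slash that is followed by an item number or the end of the string, the embedded slash in the last item does not end the match.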

To give you an example in PHP (you don't specify your language):

$s = '(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/';
preg_match_all('!\((\d+)\)\s+(.*?)(?=/(?:\(\d+\)|\z))!', $s, $matches);
print_r($matches);

Output:

Array
(
    [0] => Array
        (
            [0] => (1) this is a sample string (1998-1999) 
            [1] => (2) something strange (blah) 
            [2] => (3) another bit of text 
            [3] => (4) the last one/ something!
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => this is a sample string (1998-1999) 
            [1] => something strange (blah) 
            [2] => another bit of text 
            [3] => the last one/ something!
        )

)

Some notes:

  • You don't specify what you want to capture. I've assumed the list item number and the text. This could be wrong in which case just drop those capturing parentheses. Either way you can get the whole match;
  • I've dropped the trailing slash from the match. This may not be your intent. Again just change the capturing to suit;
  • I've allowed any number of digits for the item number. Your version allowed only two. If you prefer it that way replace \d+ with \d\d?.
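
As a rough illustration of these notes, the sketch below (Python, with the end-of-string anchor written \Z as the re module expects) captures only the text, limits the item number to one or two digits, and strips the surrounding whitespace. The names are hypothetical.

```python
import re

# Only the item text is captured; \d\d? limits the item number to one or two digits.
text_only = re.compile(r"\(\d\d?\)\s+(.*?)(?=/(?:\(\d+\)|\Z))")

sample = "(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/"

# With a single capturing group, findall returns a flat list of strings.
texts = [t.strip() for t in text_only.findall(sample)]
print(texts)
```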
cletus
This was certainly the Rolls Royce of answers. It captures everything nicely in Ruby too. Formatted for Ruby I'm using this ... /\(\d+\).*?\/(?=\(|$)/
crunchyt
Cletus: I just noticed the embedded forward slash in the last entry is being clipped. I've already voted you up, and I'm deciphering the regex now, but can you suggest how to include text after a forward slash? Thx
crunchyt
@crunchyt can you explain? The trailing `/`, do you want it in the second captured group? Or do you mean something else?
cletus
Hi @cletus, the last part of the string was "/(4) the last one/ something!/" but the regex missed out "/ something". In your sample result, the 3rd array dimension is what I'm looking to capture, but including any text after an embedded forward slash. Cheers
crunchyt
@crunchyt fixed. Check out the new version.
cletus
@cletus, that fixed it. Your answer is the one. I worked out a similar solution to yours, but you outdid me with the \(\d+\)|\z))/ on the end. Very nice!!! Thank you.
crunchyt
+1  A: 

Prepend a / to the beginning of the string, append (0) to the end, then split the whole string on the pattern \/\(\d+\) and discard the empty first and last elements.
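
A minimal sketch of this approach in Python (chosen for illustration; split_items is a made-up helper name). It assumes, as in the sample, that the input ends with a trailing /, so the appended (0) forms a final /(0) delimiter.

```python
import re

def split_items(s):
    # Pad so that every item, including the first and last, sits between two "/(N)" delimiters.
    padded = "/" + s + "(0)"
    parts = re.split(r"/\(\d+\)", padded)
    # The pieces before the first delimiter and after the last one are empty strings.
    return parts[1:-1]

sample = "(1) this is a sample string (1998-1999) /(2) something strange (blah) /(3) another bit of text /(4) the last one/ something!/"

for item in split_items(sample):
    print(repr(item))
```

Each piece keeps the whitespace around it, so call strip() on the results if that matters.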

KennyTM
+1  A: 

As long as / cannot appear in the text...

 \(?\d?\d[^/]+
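
A quick check of this pattern in Python (an illustrative sketch, names made up) shows what happens when a / does appear inside an item: the last entry stops at the embedded slash.

```python
import re

# [^/]+ stops at the first forward slash, embedded or not.
no_slash = re.compile(r"\(?\d?\d[^/]+")

sample = "(1) this is a sample string /(2) something strange /(3) another bit of text /(4) the last one/ something!/"

matches = no_slash.findall(sample)
print(matches)
```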
Paul Creasey
This was close, but I need the whole string in between the numbers.
crunchyt