I needed a utility function earlier today to strip some data out of a file and wrote an appalling regular expression to do it. The input was a file with lots of lines in the format:

<address> <11 * ascii character value>      <11 characters>
00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre

I wanted to strip out everything bar the 11 characters at the end and used the following expression:

"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"

This matched the bits I didn't want, which I then removed from the original string. I'd like to see how you'd do this, but the particular areas I couldn't get working were:

1: having the regex engine return the characters I wanted rather than the characters I didn't and

2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.

Looking at it again, the easiest thing to do would be to match the last 11 characters of each input line, but this isn't very flexible, and in the interests of learning regex I would like to see how to match through from the start of the sequence.

Edit: Thanks guys, this is what I wanted:

"(?:^[0-9A-F]{8}  )(?:[0-9A-F]{2} ){11} (.*)"

Wish I could turn more than one of you green.
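
As a quick check of the final expression, here is a minimal sketch in Python. Python is an assumption, since the question doesn't name a language, and the names LINE_RE and extract_text are made up for illustration.

```python
import re

# Final pattern from the edit above: address, two spaces, eleven "XX " bytes,
# one more space, then capture whatever remains on the line.
LINE_RE = re.compile(r"(?:^[0-9A-F]{8}  )(?:[0-9A-F]{2} ){11} (.*)")

def extract_text(line):
    """Return the trailing text column, or None if the line doesn't fit."""
    m = LINE_RE.match(line)
    return m.group(1) if m else None

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"
print(extract_text(sample))  # ults><Curre
```

Only the last pair of parentheses captures, so group 1 is the text column; lines that don't follow the layout simply return None.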

+5  A: 

As the file has a fixed format, you could use this regular expression to just match the last 11 characters.

^.{44}(.{11})
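
The 44 comes from the column widths: 8 address characters, 2 spaces, eleven 3-character "XX " groups, and one extra space. A small Python sketch (Python and the name PREFIX_WIDTH are assumptions for illustration) shows that the regex and a plain slice agree on the sample line:

```python
import re

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"

# address + gap + eleven "XX " groups + one extra space
PREFIX_WIDTH = 8 + 2 + 11 * 3 + 1

by_regex = re.match(r"^.{44}(.{11})", sample).group(1)
by_slice = sample[PREFIX_WIDTH:PREFIX_WIDTH + 11]

print(by_regex)  # ults><Curre
print(by_slice)  # ults><Curre
```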
Gumbo
A: 

The address and ASCII character values are all hex, so the whole 44-character prefix can be matched with one character class:

^[0-9A-F\s]{44}

+1  A: 

1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)

Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.

2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F]{2}\s){11}\s(.*)

(?:) is grouping without extraction. So (?:[0-9A-F]{2}\s){11} considers the subpattern there (two hex digits followed by a whitespace character) as a unit and looks for it repeated 11 times.

I'm assuming PCRE here, by the way.
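
To make the difference between the two kinds of group concrete, here is a hedged sketch in Python. Its re module is not PCRE, but it treats (...) and (?:...) the same way. The repeated unit here is two hex digits plus one whitespace character, and the variable names are illustrative only.

```python
import re

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"

# Every plain pair of parentheses becomes a numbered group.
capturing = re.compile(r"^([0-9A-F]{8})\s{2}((?:[0-9A-F]{2}\s){11})\s(.*)")

# (?:...) groups for repetition only, so just the final (.*) is numbered.
non_capturing = re.compile(r"^(?:[0-9A-F]{8})\s{2}(?:[0-9A-F]{2}\s){11}\s(.*)")

print(capturing.groups)      # 3
print(non_capturing.groups)  # 1
print(non_capturing.match(sample).group(1))  # ults><Curre
```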

chaos
+2  A: 

Last eleven is:

...........$

or:

.{11}$

Matching a hex byte + space and repeat eleven times:

([0-9A-Fa-f]{2} ){11}
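
One thing to watch with a repeated capturing group: in engines such as Python's re (an assumption here, as the thread doesn't fix a language), the group holds only its last repetition. A short sketch, with illustrative variable names:

```python
import re

hex_column = "75 6C 74 73 3E 3C 43 75 72 72 65 "

m = re.match(r"([0-9A-Fa-f]{2} ){11}", hex_column)

# The whole match covers all eleven bytes...
print(repr(m.group(0)))
# ...but the repeated group remembers only the final iteration.
print(repr(m.group(1)))  # '65 '

# To get every byte, run a second, simpler pattern over the matched text.
byte_values = re.findall(r"[0-9A-Fa-f]{2}", m.group(0))
print(len(byte_values))  # 11
```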
kmkaplan
A: 

Matching the end of the line would be

.{11}$

To match only the end, you can use a positive look behind.

"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"

This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.

(?<=....) defines a condition that must be met before matching is possible.
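
For illustration, a minimal Python sketch of the look-behind idea. Python's re module only accepts fixed-width look-behinds, which this prefix happens to be (8 + 2 + 34 = 44 characters); the stray + in the first character class and the inner capturing group are left out here as a simplification.

```python
import re

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"

# The look-behind must be satisfied immediately before the match starts,
# but the text it covers is not part of the match itself.
tail_re = re.compile(r"(?<=^[0-9A-F]{8}\s{2}[0-9A-F\s]{34}).*$")

m = tail_re.search(sample)
print(m.group(0))  # ults><Curre
```

The engine has to test the look-behind at each candidate start position until one succeeds, which is more work than anchoring the prefix directly.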

I am a bit short of time, but if you look on the net for any tutorial that contains the words "regex" and "lookbehind", you will find good stuff (if a regex tutorial covers look ahead/behind, it will usually be pretty complete and advanced).

Another piece of advice is to get a regex training tool and play with it. Have a look at this excellent Regex designer.

Sylverdrag
The one with look-behind assertion causes horrible backtracking. Don’t use it.
Gumbo
A: 

If you're using Perl, you could also use unpack() to get each element.

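my @data;

open my $fh, '<', $filename or die "Cannot open $filename: $!";
while (my $line = <$fh>) {
  # a8: address, xx: skip 2 spaces, (a2x)11: 11 hex pairs each followed
  # by a space, x: skip the extra space, a11: the 11-character text column
  my ($address, @list) = unpack 'a8xx(a2x)11xa11', $line;
  my $str = pop @list;

  # convert the hexadecimal bytes back into characters
  my $data = join '', map { pack 'H2', $_ } @list;

  die "Hex bytes and text column disagree on line $." unless $data eq $str;

  push @data, [$address, $data, $str];
}
close $fh;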

I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().
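
For readers not using Perl, roughly the same fixed-width split can be sketched in Python. This is an illustrative equivalent, not a translation of unpack(); the function name parse_dump_line and the column offsets are assumptions derived from the sample line in the question.

```python
def parse_dump_line(line):
    """Split one dump line by fixed column positions (assumed layout)."""
    address = line[0:8]
    hex_column = line[10:42]   # eleven "XX" pairs separated by single spaces
    text = line[44:55]         # the 11-character text column

    # bytes.fromhex ignores the spaces between the pairs
    decoded = bytes.fromhex(hex_column).decode("latin-1")
    return address, decoded, text

sample = "00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre"
print(parse_dump_line(sample))
```

Decoding as latin-1 maps every byte value to a character, so the conversion itself cannot fail on non-ASCII bytes.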

Brad Gilbert