I am parsing a text file that contains many lines in the form shown below.

I want to split each line into three parts: part 1: sf; part 2: name; part 3: direction.

I am having difficulty writing the regular expression. I have thought about splitting on whitespace and then concatenating the pieces of the array back into new strings:

S15,F49  Large Recipe Download Request (LRDR)   S,H->E,reply

my ($sf, $name, $direction) =~ / I don't know how to implement here/

How can I get $sf = S15,F49 // other lines have S1,F11; S6,F1; etc.

$name = Large Recipe Download Request (LRDR) // a different name for each $sf.

$direction = S,H->E,reply // sometimes it is M,H<-E,reply or S,H<->E or S,H->E,[reply], etc. There is no whitespace between the sub-items of part 3, $direction.

+4  A: 

If there is no whitespace within the $sf and the $direction items, then you could apply the following code to each line:

if ($subject =~ m/^(\S+)\s+(.*?)\s+(\S+)$/) {
    $sf = $1;
    $name = $2;
    $direction = $3;
} else {
    # no match found
}

Explanation:

^: Anchor the regex at the start of the string.

(\S+): Match one or more non-whitespace characters. Capture the match in $1.

\s+: Match one or more whitespace characters (= separator(s) to the next item).

(.*?): Match any number of characters, as few as possible to still allow the overall match to succeed, and capture that in $2.*

\s+(\S+): Similar to the above - match whitespace separator(s), then non-whitespace characters --> $3.

$: Anchor the search at the end of the string.


*The reason for the lazy quantifier *? is that otherwise, this part of the regex would also capture all the following space separators except the last one.
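
To make the footnote concrete, here is a minimal sketch that runs both variants against the sample line from the question. The variable names `$lazy_name` and `$greedy_name` are only illustrative, and the sketch assumes the separators in the sample are plain spaces.

```perl
use strict;
use warnings;

# Sample line from the question (two spaces after sf, three before direction).
my $line = "S15,F49  Large Recipe Download Request (LRDR)   S,H->E,reply";

# Lazy middle group: stops as soon as the rest of the pattern can match.
my ($sf, $lazy_name, $direction) = $line =~ /^(\S+)\s+(.*?)\s+(\S+)$/;

# Greedy middle group: takes as much as it can, leaving only one space for \s+.
my (undef, $greedy_name) = $line =~ /^(\S+)\s+(.*)\s+(\S+)$/;

# Brackets make any trailing whitespace in the captures visible.
print "lazy:   [$lazy_name]\n";
print "greedy: [$greedy_name]\n";
```

With the greedy group, the captured name keeps the extra separator spaces at its end, so it would need trimming afterwards; the lazy group leaves all of them to the following \s+.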

Tim Pietzcker
At first I wondered whether this would work, given the non-greediness of the second group. But since the last group requires at least one character, it works fine. If the second group were greedy, I think it could be a tiny bit faster because it should backtrack less often, but I'm not 100% sure. Of course, this would be a micro-optimization, but we don't know how often this code gets called.
musiKk
I don't think it's going to make much of a difference in terms of performance. However, the match results will be different, depending on whether I use a lazy or a greedy quantifier (see my edit at the bottom).
Tim Pietzcker
@Tim Pietzcker, It works quite well.
Nano HE
+2  A: 
my $str = "S15,F49  Large Recipe Download Request (LRDR)   S,H->E,reply";

$str =~ /^([^\s]+)   # sf: anything except whitespace until first whitespace
           \s+
           (.+?)     # name: anything, matched lazily so trailing whitespace is excluded
           \s+
           ([^\s]+)$ # direction: anything except whitespace, from last
                     # whitespace to the end
        /x;
my ($sf, $name, $direction) = ($1, $2, $3);
print $sf, "\n", $name, "\n", $direction, "\n";
Thomas Kappler
+1  A: 

From what you show, this should work:

my ( $sf, $name, $direction ) = split /\s{2,}/, $line;

This splits on runs of two or more whitespace characters.

This version also strips the trailing newline:

my ( $sf, $name, $direction ) = split /\s{2,}|\n/, $line;
Axeman
@Axeman, I could not get your split method to work. Please see http://codepad.org/8n5b8pAd for more details. I get this warning on my laptop (ActivePerl 5.10): Use of uninitialized value $direction in concatenation (.) or string at D:\learning\perl\nextLine.pl line 24, <DATA> line 3. The output is: direction =
Nano HE
On the paste site, you have a single tab between the name and the direction, so I changed the regex to `/\s{2,}|\t|\n/` and got what I needed. In your example above, you had three spaces.
Axeman
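
The adjusted split can be tried with a small sketch like the one below. The second input line is hypothetical: it was made up to contain a single tab before the direction field, as described for the pasted data. The `@lines` and `@records` arrays are likewise only illustrative.

```perl
use strict;
use warnings;

# First line: the sample from the question (separated by runs of spaces).
# Second line: hypothetical data with a single tab before the direction.
my @lines = (
    "S15,F49  Large Recipe Download Request (LRDR)   S,H->E,reply\n",
    "S1,F11  Status Variable Namelist Request\tS,H->E,reply\n",
);

my @records;
for my $line (@lines) {
    # Separators: two or more whitespace characters, a single tab, or a newline.
    # split drops the empty trailing field produced by the final newline.
    my ($sf, $name, $direction) = split /\s{2,}|\t|\n/, $line;
    push @records, [$sf, $name, $direction];
    print "sf=$sf name=$name direction=$direction\n";
}
```

This approach relies on the name never containing two consecutive spaces or a tab; if it might, the anchored regex from the first answer is likely the safer choice.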