The following lines of comma-separated values contain several consecutive empty fields:

my $rawData =
    "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
  . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.

I tried this first of all:

$rawData =~ s/,([,\n])/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'

which returned

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n

Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
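
To make the skipping visible, here is a minimal sketch that records where the first pattern matches. The sample string and the variable names are only illustrative, taken from the tail of the second data row.

```perl
use strict;
use warnings;

# Illustrative sample: the tail of the second row (four commas, then a newline).
my $fields = "10.4,,,,\n";

# Collect the start offset of every match of the consuming pattern.
my @starts;
while ($fields =~ /,([,\n])/g) {
    push @starts, $-[0];    # $-[0] is the offset where this match began
}

# The commas sit at offsets 4, 5, 6 and 7, but each match swallows two
# characters, so a match never starts at offset 5 or 7.
print "match starts: @starts\n";    # match starts: 4 6
```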

I thought this could have something to do with lookahead vs. lookbehind assertions, so I tried the following regex:

$rawData =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'

which resulted in:

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n

That didn't work either. It just shifted the comma-pairings by one.

I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?

The final string should look like this:

2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
+3  A: 

EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:

#!/usr/bin/perl

use strict; use warnings;
use autodie;

my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA

open my $str_h, '<', \$str;

while(my $row = <$str_h>) {
    chomp $row;
    print join(',',
        map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
    ), "\n";
}

Output:

E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A

You can also use:

pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;

Explanation: When s/// matches a pair of commas and replaces it with ,N/A followed by the second comma, the search resumes at the character after that second comma. So it will miss some consecutive commas if you only use

$str =~ s{,(,|\n)}{,N/A$1}g;

Therefore, I used a loop to move pos $str back by a character after each successful substitution.

Now, as @ysth shows:

$str =~ s!,(?=[,\n])!,N/A!g;

would make the while unnecessary.

Sinan Ünür
Nice. Good example that while regular expressions are frequently used in Perl, they're not always the best solution.
jamessan
@Sinan: I'd rather not deal with filehandles. The data is already loaded into a string with `\n`s. Is what I want possible with one regex `s///`?
Zaid
@Sinan: Evidently I have much to learn about Perl. That's a wonderful one-liner, which does exactly what I need it to do. Absolutely stunning.
Zaid
decrement works too: `--pos $str`
ysth
+1  A: 

The quick and dirty hack version:

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;

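Not the fastest code, but the shortest. It should loop at most twice. Note that it only fills fields that sit between two commas; an empty field immediately before a newline is left untouched.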

Jack M.
Concise, but like you said, quick and dirty.
Zaid
+2  A: 

I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there: everything after the lookbehind should be enclosed in (?: ... ) so that the | doesn't let the second alternative skip the lookbehind.

Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:

s!,(?=[,\n])!,N/A!g;

Example:

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);

Output:

"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
ysth
@ysth: Agreed. This definitely works. Is this because the lookahead assertion is non-capturing?
Zaid
+1 I don't know why but I avoid assertions. In this case, I could not see the obvious because of my aversion.
Sinan Ünür
Funny how simple these regex solutions tend to be....
Zaid
@Zaid: non-capturing isn't good enough (`(?: )` wouldn't work). What matters is how much of the string has matched. The lookahead part is not included in what s/// considers to have matched, so the next iteration of the substitution matching starts looking for a match right after the new N/A.
ysth
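
The difference between "non-capturing" and "not consumed" can be sketched side by side. This is only an illustration on a shortened sample row; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Illustrative sample: the tail of the second row.
my $row = "10.4,,,,\n";

# Non-capturing group: the character after the comma is still part of the
# match, so it is consumed (and, with this replacement, dropped).
(my $grouped = $row) =~ s/,(?:[,\n])/,N\/A/g;

# Lookahead: the character after the comma is only inspected, so the next
# match attempt can start right at it.
(my $lookahead = $row) =~ s/,(?=[,\n])/,N\/A/g;

print $grouped;      # 10.4,N/A,N/A
print $lookahead;    # 10.4,N/A,N/A,N/A,N/A
```

Only the lookahead version fills all four empty fields, because each match covers a single comma.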
+2  A: 

You could search for

(?<=,)(?=,|$)

and replace that with N/A.

This regex matches the (empty) space between two commas or between a comma and end of line.
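
As a rough sketch of how this could be applied to a multi-line string: the /m modifier below is an assumption added here, not part of the pattern as posted. It lets $ match before every newline rather than only at the end of the string. The sample data is made up for the example.

```perl
use strict;
use warnings;

# Illustrative two-line sample with empty fields in the middle and at line ends.
my $data = "a,,b,\nc,,,\n";

# Zero-width match: after a comma, and before either a comma or an end of line.
# /m makes $ match before each newline; /g repeats the substitution.
(my $filled = $data) =~ s/(?<=,)(?=,|$)/N\/A/mg;

print $filled;
# a,N/A,b,N/A
# c,N/A,N/A,N/A
```

Because nothing is consumed, the replacement is just N/A with no comma to put back.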

Tim Pietzcker
+1 but it would have to be `s!(?<=,)(?=,|\n)!N/A!g;` to catch an empty field at the end of a line.
Sinan Ünür
Yeah, I had just noticed that, too.
Tim Pietzcker
+1  A: 

Not a regex, but not too complicated either:

$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);

The ,-1 is needed at the end to force split to include any empty fields at the end of the string.
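
A small sketch of what the limit argument changes, using a made-up row that has already been chomped:

```perl
use strict;
use warnings;

# Illustrative sample: one value followed by four empty trailing fields.
my $row = "10.4,,,,";

my @default  = split /,/, $row;        # trailing empty fields are dropped
my @keep_all = split /,/, $row, -1;    # a negative limit keeps them

print scalar(@default),  "\n";    # 1
print scalar(@keep_all), "\n";    # 5

# With all five fields preserved, the empty ones can be relabelled.
my $filled = join ",", map { $_ eq "" ? "N/A" : $_ } @keep_all;
print "$filled\n";                # 10.4,N/A,N/A,N/A,N/A
```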

mobrule
This would fail for an empty field at the end of the line because it would contain `"\n"` which is why I `chomp` first in my `split` example.
Sinan Ünür
@SU - good catch. Best to use this on chomped input.
mobrule