I have a bunch of strings in Perl that all look like this:

10 NE HARRISBURG
4 E HASWELL
2 SE OAKLEY
6 SE REDBIRD
PROVO
6 W EADS
21 N HARRISON

What I need to do is remove the numbers and the direction letters that come before the city names. The problem is that the format varies a lot from city to city; the data is almost never the same. Is it possible to remove this prefix and keep it in a separate string?

+1  A: 

Looks like you always want the very last element in the result of split(). Or you can go with m/(\S+)$/.
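
A minimal sketch of both ideas, assuming every city name is a single word. The helper names `last_by_split` and `last_by_regex` are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.
sub last_by_split {
    my ($s) = @_;
    my @parts = split ' ', $s;    # split on runs of whitespace
    return $parts[-1];            # last field; undef for an empty string
}

sub last_by_regex {
    my ($s) = @_;
    my ($city) = $s =~ /(\S+)\s*$/;    # last run of non-whitespace characters
    return $city;
}

print last_by_split('10 NE HARRISBURG'), "\n";
print last_by_regex('PROVO'), "\n";
```

Both return only the final word, so a name such as SAN FRANCISCO would come back as FRANCISCO.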

Arkadiy
I think it is always the last element, but you never know when someone else is inputting the data.
shinjuo
@shinjuo: If you cannot define your input, how do you expect to define your output? You need at least *some* specification. Even if it's broad, you still need to say that you will reject data that doesn't conform.
Daenyth
That is what I will have to do. I wasn't saying I wanted to keep screwed-up data; I am just saying there is a chance something could be tacked onto the end, which may change how it is written.
shinjuo
+3  A: 

Try this:

for my $s (@strings) {
    my @fields = split /\s+/, $s, 3;
    my $city = $fields[-1];
}

You can test the array size to determine the number of fields:

my $n = @fields;
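
As a rough sketch of how the field count could drive the parsing, assuming records are either "NUMBER DIRECTION CITY" or a bare city. The helper name `split_record` and the direction check are assumptions added for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prefix, city); prefix is '' when absent.
sub split_record {
    my ($s) = @_;
    my @fields = split ' ', $s, 3;    # at most 3 fields; the third keeps its spaces
    if (   @fields == 3
        && $fields[0] =~ /^\d+$/
        && $fields[1] =~ /^(?:[NS][EW]?|[EW])$/)
    {
        return ("$fields[0] $fields[1]", $fields[2]);
    }
    return ('', $s);    # anything else is treated as a city on its own
}

my ($prefix, $city) = split_record('2 SE SAN FRANCISCO');
print "prefix=$prefix city=$city\n";
```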
eugene y
I'd add a limit to the number of fields that can be split, otherwise you'll get a surprise when you try to parse `2 SE SAN FRANCISCO`.
Ether
@Ether: Thanks, corrected.
eugene y
+1  A: 

Can't we assume there is always a city name and that it appears last on a line? If that's the case, split the line and keep the last portion of it. Here's a command-line one-liner:

perl -lne 'my @f = split; print $f[-1]' input.txt

Output:

HARRISBURG
HASWELL
OAKLEY
REDBIRD
PROVO
EADS
HARRISON

Update 1

This solution won't work if you have multi-word city names like SAN FRANCISCO (a case spotted in the comments).

Where is your input data coming from? If you have generated it yourself, you should add delimiters. If someone generated it for you, ask them to regenerate it with delimiters. Parsing it will then become child's play.

# replace ";" with your delimiter
perl -lne 'my @f = split /;/; print $f[-1]' input.txt
Philippe A.
I want to keep the front portion also.
shinjuo
@Philippe: You can probably reduce that to `perl -anE 'say $F[-1]' input.txt` if you're using whitespace as the delimiter.
Daenyth
I am not making it nor can I ask them to adjust it.
shinjuo
@Daenyth: good to know. Thanks!
Philippe A.
+3  A: 
my @l = (
'10 NE HARRISBURG',
'4 E HASWELL',
'2 SE OAKLEY',
'6 SE REDBIRD',
'PROVO',
'6 W EADS',
'21 N HARRISON',
);

foreach (@l) {
    # Regex changed following hobbs's suggestion: the direction is N, S, E, or W
    # alone, or one of N/S followed by one of E/W. It is matched once and must
    # end at a word boundary, so a city such as NEW ABILENE is left intact.
    my ($beg, $rest) = /^(\d+\s+(?:[NS]|[NS]?[EW])\b)?\s*(.*)$/;
    print "beg=$beg \trest=$rest\n";
}

output:

beg=10 NE   rest=HARRISBURG
beg=4 E     rest=HASWELL
beg=2 SE    rest=OAKLEY
beg=6 SE    rest=REDBIRD
beg=    rest=PROVO
beg=6 W     rest=EADS
beg=21 N    rest=HARRISON

For shinjuo: if you want to run it on only one string, you can do:

  my ($beg, $rest) = $l[3] =~ /^(\d+\s+(?:[NS]|[NS]?[EW])\b)?\s*(.*)$/;
  print "beg=$beg \trest=$rest\n";

To avoid a warning about an uninitialized value, test whether $beg is defined:

print defined $beg ? "beg=$beg\t" : "", "rest=$rest\n";
M42
Awesome, this looks like it will work well.
shinjuo
@M42, @shinjuo: I think the regular expression fails on the second-to-last record. It should be: beg= 6 W rest= EADS.
Nikhil Jain
I did not notice that, but you are correct. Thanks.
shinjuo
You're right. I've corrected the regex.
M42
I like this one because it's using a feature of the data that the others don't. You could probably extend it even a little more, in that a direction isn't just `/[NSEW]*/`; it's `/[NS]|[NS]?[EW]/` (that is, it's either N, S, E, or W alone, or it's one of N/S followed by one of E/W. The number and the order aren't arbitrary. That might save you some day if the city happens to be `NEW ABILENE` :)
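
A small check of that point, assuming the direction is matched exactly once. The `\b` word boundary and the helper name `find_direction` are additions for illustration, not part of the comment above:

```perl
use strict;
use warnings;

# Direction: N, S, E, or W alone, or one of N/S followed by one of E/W.
# The \b stops a city such as NEW from being read as the direction NE or N.
my $dir = qr/(?:[NS]|[NS]?[EW])\b/;

# Hypothetical helper: returns (distance, direction, city), or an empty list.
sub find_direction {
    my ($s) = @_;
    return $s =~ /^(\d+)\s+($dir)\s+(.*)$/ ? ($1, $2, $3) : ();
}

for my $s ('2 NE ABILENE', '2 NEW ABILENE') {
    my @parts = find_direction($s);
    if (@parts) {
        print "distance=$parts[0] direction=$parts[1] city=$parts[2]\n";
    }
    else {
        print "no direction found in '$s'\n";
    }
}
```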
hobbs
How can I make it run on one string at a time instead of an array of them? Instead of foreach I tried using this: for($fields[3]) { ($beg, $rest) = ($_ =~ /^(\d*\s[NSEW]*)?(.*)$/); print $beg; }
shinjuo
But it gives me an uninitialized variable error on $beg.
shinjuo
@hoobs thanks, updated regex
M42
Is hoobs, hobbs?
Armando
@Armando they're similar, but not the same ;)
hobbs
@hobbs: sorry for the misspelling.
M42
@hobbs/hoobs: =]
Armando
+1  A: 

Regex Solution


Solution 1: Keep everything (vol7ron's emailed solution)


#!/usr/bin/perl -w    

use strict; 
use Data::Dumper;   

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO' 
                    , ''   
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'    
                    );       

      my %hash;
      my $count=0;
      for (@strings){    
         # anchored at the start so only a leading prefix is captured
         if (/^(\d*\s*[NS]{0,2}[EW]{0,1}\s+)(.*)/){
            # if there was a speed / direction
            $hash{$count}{wind} = $1;
            $hash{$count}{city} = $2;
         } else {
            # if there was only a city
            $hash{$count}{city} = $_;
         }
         $count++;
      }    

      print Dumper(\%hash);  
   }    

   main();  


Solution 2: Strip off what you don't need


#!/usr/bin/perl -w    

use strict;    

   sub main{    
      my @strings = (    
                      '10 NE HARRISBURG'    
                    , '4 E HASWELL'    
                    , '2 SE OAKLEY'    
                    , '6 SE REDBIRD'    
                    , 'PROVO'    
                    , '6 W EADS'    
                    , '21 N HARRISON'    
                    , '32 SAN FRANCISCO'    
                    , '15 NEW YORK'    
                    , '15 NNW NEW YORK'    
                    , '15 NW NEW YORK'     
                    , 'NW NEW YORK'     
                    );    

      for my $elem (@strings){    
         # anchored at the start so a bare 'NEW YORK' is not altered mid-string
         $elem =~ s/^\d*\s*[NS]{0,2}[EW]{0,1}\s+//;
      }    

      $"="\n";    
      print "@strings\n";        
   }    

   main();    

Update:

Making the changes from vol7ron's suggestion and example (bounded repetition such as {0,2}) worked. This strips off the leading digits and the direction, and it won't break if the digits or the direction (or both) are missing.
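
A quick way to check those cases, as a sketch: the helper name `strip_prefix` is made up for illustration, and the `^` anchor is an assumption added so that only a leading prefix is removed.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution.
sub strip_prefix {
    my ($s) = @_;
    $s =~ s/^\d*\s*[NS]{0,2}[EW]{0,1}\s+//;    # drop leading digits and/or direction
    return $s;
}

# digits and direction, digits only, direction only, neither
print strip_prefix($_), "\n"
    for '15 NW NEW YORK', '15 NEW YORK', 'NW NEW YORK', 'NEW YORK';
```

Each of the four inputs should come back as NEW YORK.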

Armando
looks good, but instead of `\w+` might want to use `\w{1,2}`, since the direction seems to only be a max of 2 chars. If the OP uses 3 char directions (eg `NNE`,`NSW`) then you'd change the 2 for a 3.
vol7ron
Instead of `\w` you might also want to use char selection (`[NSEW]{0,3}`). That way if something like `2 SAN FRANCISCO` comes along it won't chop off the `SAN`.
vol7ron
I haven't tried any of these suggestions out, but perhaps `[NS]{0,2}[EW]{0,1}` would be what you want, since it would take care of `N,S,NE,SE,NW,SW,NNE,NNW,NSE,NSW,SSE,SSW,SNE,SNW`, which wouldn't fail on `NEW` as Hobbs pointed out might happen.
vol7ron