ansaurus

Question

How do I skip lines that aren't whitespace or a number in Perl?

Answer 1

+8 A:

[@tmp = split;] is shorthand for:

@tmp = split " ", $_, 0;

which is similar to

@tmp = split /\s+/, $_, 0;

but ignores any leading whitespace, so " foo bar baz" becomes ("foo", "bar", "baz") instead of ("", "foo", "bar", "baz").
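To see the difference for yourself, here is a minimal sketch. The sample string and the variable names are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading whitespace and mixed separators.
my $line = "  foo bar\tbaz\n";

# split ' ' (what a bare "split" uses): leading whitespace is skipped,
# so there is no empty first field.
my @awk = split ' ', $line;

# split /\s+/: the leading whitespace counts as a separator,
# which produces an empty first field.
my @pattern = split /\s+/, $line;

print scalar(@awk), " fields: ", join('|', @awk), "\n";
print scalar(@pattern), " fields: ", join('|', @pattern), "\n";
```

In both cases the trailing newline is swallowed, because split drops trailing empty fields by default.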

It takes each line in the filehandle $fh and splits it, using whitespace as the delimiter.

Regarding what you want to do, why don't you just run the regex on $_ to begin with? That's a string.

You could do:

while (<$fh>) {
    last unless  /^[\s\d]*$/; # break if a line containing something 
                              # other than whitespace or a number is found
    @tmp = split;
    push @AoA, [@tmp];
}
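
If you want to try the loop without a real file, something like the following should work. It is only a sketch: the input data is invented, and it reads from an in-memory filehandle instead of whatever $fh is in your program.

```perl
use strict;
use warnings;

# Hypothetical input: two numeric rows, then a line that should stop the loop.
my $data = "1 2 3\n4 5 6\nstop here\n7 8 9\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @AoA;
while (<$fh>) {
    # Stop at the first line containing anything but whitespace or digits.
    last unless /^[\s\d]*$/;
    # [split] builds a fresh anonymous array for each row.
    push @AoA, [split];
}
close $fh;

print "rows read: ", scalar(@AoA), "\n";
print "row $_: @{ $AoA[$_] }\n" for 0 .. $#AoA;
```

Note that "last" abandons the rest of the file, so the "7 8 9" row is never read. Use "next" instead if you only want to skip the offending line.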

Nathan Fellman 2009-04-04 04:21:43

I was writing the same answer as you wrote :)

Omnipresent 2009-04-04 04:26:24

That is not quite true. split splits on any whitespace, not just space characters so it would be equivalent to split /\w/, $_

1800 INFORMATION 2009-04-04 04:26:48

thanks. I updated it

Nathan Fellman 2009-04-04 04:27:21

\s *is* 'any whitespace'. \w is word characters, not whitespace.

chaos 2009-04-04 04:27:50

That's basically what I was trying to say, but you said it better. :-)

Benson 2009-04-04 04:28:08

yeah sorry I should have written \s

1800 INFORMATION 2009-04-04 04:28:55

split; is not the same as split /\s/;. it is more akin to split /\s+/;, but even that does not do the same thing. split; and split " "; both skip leading whitespace. See the third paragraph on http://perldoc.perl.org/5.8.8/functions/split.html

Chas. Owens 2009-04-04 04:33:47

Your "next if" line will skip lines matching the regex. The user said he wants to stop adding lines if the line doesn't match. That would be "last unless".

Chris Lutz 2009-04-04 04:58:48

By the way, if you are curious about what Perl expands shortcuts like this to, you can use the B::Deparse module: perl -MO=Deparse -lne 'print +(split)[0]'

Chas. Owens 2009-04-04 05:13:33

Answer 2

+3 A:

[@tmp = split;] splits each incoming line of the file on whitespace and stores the words, as an array, in @tmp. (The while() loop is iterating across each line in the file.) An array reference containing @tmp is then pushed onto @AoA.

The best way to accomplish 'converting @tmp into a string', if you want to do something with it right there, is to never convert it out of being a string; the split is operating on $_, which is a string (the while loop is implicitly setting this). If you do regex operations like s/foo/bar/ within that loop, they'll automatically operate on $_.

So one way to accomplish what you say you want (with the code simplified somewhat) is:

while (<$fh>) {
    last if /[^\s\d]/;
    push @AoA, [split];
}

If you truly desired to reconvert @tmp to a string, you could do:

my $tmp = join ' ', @tmp;

chaos 2009-04-04 04:22:48

Answer 3

A:

The first line is a while loop like any other, but its "condition" reads a line of input from the filehandle $fh into the default variable $_. If the read succeeds (i.e. we're not at the end of the file), the body executes. It's essentially a "for each line in the file $fh".

The next line is splitting the items in $_ (the default variable, remember, so it's left out of the call to split) by whitespace (the default separator), and storing the result in @tmp. The last line adds a REFERENCE to @tmp to @AoA, an array of array references.

So, what you want to do is say (at the top of the loop)

last if $_ =~ <appropriate regex here>;

Benson 2009-04-04 04:24:39

It doesn't add a reference to @tmp. It adds a reference to an anonymous array, and @tmp is copied into this anonymous array. If it was a reference to @tmp, it would look like push @AoA, \@tmp;

Chris Lutz 2009-04-04 04:52:48

Which would be bad since @tmp is not a lexical variable (all of the references in @AoA would point to the same array, @tmp).

Chas. Owens 2009-04-04 05:17:10

A bit pedantic, but you're correct. However, what I said was effectively true (I tested both the [@tmp] and \@tmp syntax). Effectively the interpreter is either creating another reference to the @tmp list, or it's creating an anonymous list with the same values. It works the same either way.

Benson 2009-04-04 05:41:41

No, it isn't. All of the \@tmp references in @AoA point to the same array, this is a bad thing. See example here: http://codepad.org/KEvNAyzh

Chas. Owens 2009-04-04 06:06:24

That's a bit odd, then, because I tested it and it worked like I expected.

Benson 2009-04-06 16:55:15

Answer 4

A:

split takes the string it is given and turns it into an array by splitting on whitespace. Since no parameter is given, it splits the $_ variable (which is given each line from the file in $fh in turn).

It is not necessary to convert @tmp into a string, since that string is already in the $_ variable.

In order to stop the loop if you match any single character that is not whitespace or numeric:

last if /[^\s\d]/;

This is slightly different from your version, which would match any complete line that consisted of only non-whitespace and/or non-numeric.

1800 INFORMATION 2009-04-04 04:24:54

Answer 5

A:

ok cool!

shorthand explains a lot.

So I can do this..

while (<$fh>)
{
        if( /^[\s\d]*$/ ){
          # do something
        }else{
          # do something else
        }

        @tmp = split;
        push @AoA, [@tmp];
}

Omnipresent 2009-04-04 04:25:20

I would say "push @AoA, [split];" instead, there is no need for the temporary variable.

Chas. Owens 2009-04-04 04:27:37

oh god, these shorthands are mesmerizing!

Omnipresent 2009-04-04 04:46:28

coming from the java world, perl seems SO MUCH stronger

Omnipresent 2009-04-04 04:47:04

Take a look at Groovy (http://groovy.codehaus.org/). It is a high-level language that targets the JVM and can interoperate with Java. Still not as good as Perl though.

Chas. Owens 2009-04-04 05:53:40

I would prefer while( my $line = <$fh> ){

Brad Gilbert 2009-05-07 21:23:17

Answer 6

+3 A:

while(<$fh>) {

This reads the file in line-by-line. The current line of the file is stored in $_. It's basically the same as while($_ = <$fh>) {. Technically it expands to while(defined($_ = <$fh>)) {, but they're very close to the same thing (and either way, it's automatic, so you don't need to worry about it).
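
If you are wondering why the defined() matters at all, here is a small sketch. The data is made up: a file whose last line is just "0" with no trailing newline. That line is a false value in Perl, but it is still defined, so a loop that tests defined() does not stop early.

```perl
use strict;
use warnings;

# Hypothetical file contents: the final line is "0" without a newline.
my $data = "1\n0";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @lines;
# Written out explicitly; this is the form Perl expands while(<$fh>) into,
# except with a lexical $line instead of $_.
while (defined(my $line = <$fh>)) {
    push @lines, $line;
}
close $fh;

# The last line is false in boolean context, yet it was still read.
my $last_is_false = $lines[-1] ? 0 : 1;
print "lines read: ", scalar(@lines), ", last line is false: $last_is_false\n";
```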

  @tmp = split;

"split" with no arguments is (mostly) equivalent to "split /\s+/, $_". It splits the current line into a list of items between whitespace. So it splits the current line into a list of words (more or less) and stores this list in an array. However, this line is bad. @tmp should be qualified with my. Perl would catch this if you have use strict; and use warnings; at the top.

  push @AoA, [@tmp];
}

This pushes a reference to an anonymous array containing the elements that were in @tmp into @AoA, which is an array of arrays (as you probably already knew).
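
The difference between [@tmp] and \@tmp is easy to check. The following is a sketch with invented data; @copies and @aliases are names I made up for the comparison, and @tmp is deliberately declared once outside the loop to mimic the global @tmp in the original code.

```perl
use strict;
use warnings;

my (@copies, @aliases);
my @tmp;   # one shared array, reused on every iteration

for my $line ("1 2", "3 4") {
    @tmp = split ' ', $line;
    push @copies,  [@tmp];   # new anonymous array holding a copy of the values
    push @aliases, \@tmp;    # reference to the single shared @tmp
}

print "copies:  @{ $copies[0] } / @{ $copies[1] }\n";
print "aliases: @{ $aliases[0] } / @{ $aliases[1] }\n";
```

Every element of @aliases points at the same array, so both rows show whatever was assigned last. If @tmp were declared with my inside the loop, each iteration would get a fresh array and \@tmp would be safe as well.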

So in the end, you have a list @AoA where each element in the list corresponds to a line of the file, and each element of the list is another list of the words on that line.

In short, @tmp should really be declared using my, and you should use strict; and use warnings;. In fact, as has been said, you could do away with @tmp altogether:

while(<$fh>) { push @AoA, [split] }

But using a temporary array may be nicer on anyone who has to add to this code later.

EDIT: I missed the regex you wanted to add:

while(<$fh>) {
  last unless /^[\d\s]*$/;
  push @AoA, [split];
}

However, /^[\d\s]*$/ won't catch all integers - specifically, it won't match -1. If you want it to match negative numbers, use /^[\d\s-]*$/. Also, if you want to match non-integers (floating-point numbers), you could use /^[\d\s\.-]*$/, but I don't know if you want to match those. However, these regexes will match invalid entries like 1-3 and 5.5.5, which are NOT integers or numbers. If you want to be more strict about that, try this:

LOOP: while(<$fh>) {
  my @tmp = split;
  for(@tmp) {
    # use this line to accept floating-point numbers:
    last LOOP unless /^-?\d+(?:\.\d+)?$/;
    # or use this line instead to accept integers only:
    # last LOOP unless /^-?\d+$/;
  }
  push @AoA, [@tmp];
}
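
To compare the loose character-class check with the stricter per-field check, you could run something like this. It is only a sketch: the test values and the %result hash are invented, and the strict pattern is my floating-point variant written with an optional group.

```perl
use strict;
use warnings;

my $loose  = qr/^[\d\s.-]*$/;         # whole-line character-class check
my $strict = qr/^-?\d+(?:\.\d+)?$/;   # optional sign, digits, optional fraction

my %result;
for my $field ('42', '-1', '3.14', '1-3', '5.5.5') {
    $result{$field} = {
        loose  => ($field =~ $loose  ? 1 : 0),
        strict => ($field =~ $strict ? 1 : 0),
    };
    printf "%-6s loose=%d strict=%d\n",
        $field, @{ $result{$field} }{qw(loose strict)};
}
```

The loose pattern accepts all five values, while the strict one rejects 1-3 and 5.5.5.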

Chris Lutz 2009-04-04 04:27:29

while (<$fh>) { expands to while(defined($_ = <$fh>)) {, of course, while ($_ = <$fh>) { also expands to while(defined($_ = <$fh>)) {

Chas. Owens 2009-04-04 05:21:21

+1 for mentioning strict/warnings. I would give another +1 for addressing negative and floating-point numbers, but, alas, I can't.

Dave Sherohman 2009-04-04 12:46:48

Answer 7

+2 A:

Actually, the while (<$fh>) line splits the file by lines; each iteration of the loop will have a new line stored in $_.

The marked line splits the line stored in $_ by whitespace. So, @tmp will be an array containing all of the words on the line: if the line contains foo bar baz, @tmp will be ('foo', 'bar', 'baz').

If you want to do a regexp match on the line in question, then you should do that before you split the line. A regular expression in perl matches against $_ by default, so the line is pretty simple:

while (<$fh>)
{
    last unless /^[\s\d]*$/;
    @tmp = split;
    push @AoA, [@tmp];
}

Brian Campbell 2009-04-04 04:53:54

Wow, I somehow missed that 6 other people had already answered this, and it had been accepted.

Brian Campbell 2009-04-04 04:55:36

Answer 8

+1 A:

Warning: \d doesn't mean [0-9] in Perl 5.8 and 5.10 (unless you use the bytes pragma). It means any Unicode character that has the digit property, such as MONGOLIAN DIGIT FIVE U+1815 (᠕). If you want to restrict it to only whitespace and numbers you can do math with, you need to say /^[\s0-9]*$/.
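
A quick way to check this is the sketch below. The variable names are made up; the character is the MONGOLIAN DIGIT FIVE mentioned above.

```perl
use strict;
use warnings;

my $mongolian_five = "\x{1815}";   # MONGOLIAN DIGIT FIVE

# \d matches any character with the Unicode digit property...
my $matches_d     = $mongolian_five =~ /^\d$/    ? 1 : 0;
# ...while an explicit [0-9] class only matches the ASCII digits.
my $matches_ascii = $mongolian_five =~ /^[0-9]$/ ? 1 : 0;

print "\\d matches: $matches_d, [0-9] matches: $matches_ascii\n";
```

On Perl 5.14 and later, the /a modifier is another way to restrict \d to ASCII digits, but the explicit [0-9] class works on older versions too.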

Chas. Owens 2009-04-04 05:09:50

Answer 9

+5 A:

When you wonder what a Perl built-in does, read its documentation. Most of the answers you are getting are merely restating the documentation. The key to using any language is learning how to use its documentation. If you've read the docs and don't understand them, mention that in your question :)

You can look in the perlfunc page to see all the built-ins.
At the command line, you can use the -f switch to perldoc to pull out just the documentation for a built-in: perldoc -f split

Good luck, :)

brian d foy 2009-04-04 12:34:32

Answer 10

A:

The core questions have been pretty well covered already, but there's one aspect of the "turning @tmp back into a string" subquestion that hasn't been explicitly mentioned:

$_ and join ' ', @tmp are not equivalent. $_ will contain the line as originally read. join ' ', @tmp will contain the words found on the line, joined by single spaces. If the line contains non-space whitespace (e.g., tabs), words separated by multiple spaces, or leading whitespace, then the two versions of the "complete" line will be different.
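
A minimal sketch of the difference, using an invented line with leading whitespace, a tab, and a run of spaces:

```perl
use strict;
use warnings;

# Hypothetical line as it might be read from the file.
my $original = "  10\t20   30\n";

my @tmp      = split ' ', $original;   # same splitting as a bare "split" on $_
my $rejoined = join ' ', @tmp;         # words joined by single spaces

print "rejoined: [$rejoined]\n";
print "same as original: ", ($rejoined eq $original ? "yes" : "no"), "\n";
```

The rejoined string has lost the leading whitespace, the tab, the extra spaces, and the trailing newline, so if you need the line exactly as read, keep using $_.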

Dave Sherohman 2009-04-04 12:53:46
