I am reading data from a file like this

while (<$fh>)
{
        @tmp = split; # <-- ?
        push @AoA, [@tmp];
}

I have a couple of questions regarding this. What does the marked line do? Does it split the file by lines and store the elements of each line in an array? If so, is it possible to convert @tmp into a string or run a regex on @tmp?

Basically I want to stop pushing data onto the AoA if I find anything other than a space or an integer in the file. I have the regex for it already: /^[\s\d]*$/

+8  A: 

[@tmp = split;] is shorthand for:

@tmp = split " ", $_, 0;

which is similar to

@tmp = split /\s+/, $_, 0;

but ignores any leading whitespace, so " foo bar baz" becomes ("foo", "bar", "baz") instead of ("", "foo", "bar", "baz").
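The leading-whitespace difference can be checked with a short sketch. The sample string and the variable names are illustrative only, not taken from the question:

```perl
use strict;
use warnings;

my $line = "  foo bar   baz\n";   # hypothetical input with leading whitespace

# Special case: a single-space string skips leading whitespace and splits
# on runs of whitespace (this is what a bare `split` does to $_).
my @awk_mode = split ' ', $line;

# A regex pattern keeps the empty leading field created by the leading whitespace.
my @regex_mode = split /\s+/, $line;

print scalar(@awk_mode),   " fields: ", join('|', @awk_mode),   "\n";
print scalar(@regex_mode), " fields: ", join('|', @regex_mode), "\n";
```

In both cases the trailing newline does not produce an extra field, because trailing empty fields are dropped when the limit is 0.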

It takes each line read from the filehandle $fh and splits it, using whitespace as the delimiter.

Regarding what you want to do, why don't you just run the regex on $_ to begin with? That's a string.

You could do:

while (<$fh>) {
    last unless  /^[\s\d]*$/; # break if a line containing something 
                              # other than whitespace or a number is found
    @tmp = split;
    push @AoA, [@tmp];
}
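
To try the loop without a real file, here is a minimal self-contained sketch. It assumes an in-memory filehandle and made-up sample data; only the loop body comes from the answer:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real file.
my $data = "1 2 3\n4 5 6\nseven 8 9\n10 11 12\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @AoA;
while (<$fh>) {
    # stop at the first line holding anything other than whitespace or digits
    last unless /^[\s\d]*$/;
    push @AoA, [split];
}
close $fh;

print join(',', @$_), "\n" for @AoA;
```

Only the first two rows end up in @AoA; the line starting with "seven" ends the loop, so the valid line after it is never read.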
Nathan Fellman
I was writing the same answer as you wrote :)
Omnipresent
That is not quite true. split splits on any whitespace, not just space characters so it would be equivalent to split /\w/, $_
1800 INFORMATION
thanks. I updated it
Nathan Fellman
\s *is* 'any whitespace'. \w is word characters, not whitespace.
chaos
That's basically what I was trying to say, but you said it better. :-)
Benson
yeah sorry I should have written \s
1800 INFORMATION
split; is not the same as split /\s/;. it is more akin to split /\s+/;, but even that does not do the same thing. split; and split " "; both skip leading whitespace. See the third paragraph on http://perldoc.perl.org/5.8.8/functions/split.html
Chas. Owens
Your "next if" line will skip lines matching the regex. The user said he wants to stop adding lines if the line doesn't match. That would be "last unless".
Chris Lutz
By the way, if you are curious about what Perl expands shortcuts like this to, you can use the B::Deparse module: perl -MO=Deparse -lne 'print +(split)[0]'
Chas. Owens
+3  A: 

[@tmp = split;] splits each incoming line of the file on whitespace and stores the words, as an array, in @tmp. (The while() loop is iterating across each line in the file.) An array reference containing @tmp is then pushed onto @AoA.

The best way to accomplish 'converting @tmp into a string', if you want to do something with it right there, is to never convert it out of being a string; the split is operating on $_, which is a string (the while loop is implicitly setting this). If you do regex operations like s/foo/bar/ within that loop, they'll automatically operate on $_.

So one way to accomplish what you say you want (with the code simplified somewhat) is:

while(<$fh>) {
    last
        if /[^\s\d]/;
    push @AoA, [split];
}

If you truly desired to reconvert @tmp to a string, you could do:

my $tmp = join ' ', @tmp;
chaos
A: 

The first line is a while loop like any other, but its "condition" reads a line of input from the filehandle $fh into the default variable $_. If the read succeeds (i.e. we're not at the end of the file), the body executes. It's essentially a "for each line in the file $fh".

The next line is splitting the items in $_ (the default variable, remember, so it's left out of the call to split) by whitespace (the default separator), and storing the result in @tmp. The last line adds a REFERENCE to @tmp to @AoA, an array of array references.

So, what you want to do is say (at the top of the loop)

last if $_ =~ <appropriate regex here>;
Benson
It doesn't add a reference to @tmp. It adds a reference to an anonymous array, and @tmp is copied into this anonymous array. If it was a reference to @tmp, it would look like push @AoA, \@tmp;
Chris Lutz
Which would be bad since @tmp is not a lexical variable (all of the references in @AoA would point to the same array, @tmp).
Chas. Owens
A bit pedantic, but you're correct. However, what I said was effectively true (I tested both the [@tmp] and \@tmp syntax). Effectively the interpreter is either creating another reference to the @tmp list, or it's creating an anonymous list with the same values. It works the same either way.
Benson
No, it isn't. All of the \@tmp references in @AoA point to the same array, this is a bad thing. See example here: http://codepad.org/KEvNAyzh
Chas. Owens
That's a bit odd, then, because I tested it and it worked like I expected.
Benson
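
The difference between [@tmp] and \@tmp debated in the comments above can be sketched as follows. This is an illustrative example, not the code behind the codepad link; it assumes @tmp is a single package variable, as it is in the question's undeclared form:

```perl
use strict;
use warnings;

our @tmp;                      # one shared package variable, like the question's @tmp
my (@copies, @aliases);

for my $line ("1 2", "3 4") {
    @tmp = split ' ', $line;
    push @copies,  [@tmp];     # new anonymous array holding a copy of the current values
    push @aliases, \@tmp;      # reference to @tmp itself, the same array every time
}

print "copies:  ", join(' | ', map { "@$_" } @copies),  "\n";
print "aliases: ", join(' | ', map { "@$_" } @aliases), "\n";
```

Every element of @aliases shows the values from the last line, because they all point at the same array. If @tmp were instead declared with my inside the loop body, each iteration would get a fresh array and \@tmp would behave like the copy.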
A: 

split takes the string it is given and turns it into an array by splitting on whitespace. Since no parameter is given, it splits the $_ variable (which is given each line from the file in $fh in turn).

It is not necessary to convert @tmp into a string, since that string is already in the $_ variable.

In order to stop the loop if you match any single character that is not whitespace or numeric:

last if /[^\s\d]/;

This is the inverse of your version, which matches a complete line consisting only of whitespace and/or digits.

1800 INFORMATION
A: 

ok cool!

shorthand explains a lot.

So I can do this..

while (<$fh>)
{
        if( /^[\s\d]*$/ ){
          # do something
        }else{
          # do something else
        }

        @tmp = split;
        push @AoA, [@tmp];
}
Omnipresent
I would say "push @AoA, [split];" instead, there is no need for the temporary variable.
Chas. Owens
oh god, these shorthands are mesmerizing!
Omnipresent
coming from the java world, perl seems SO MUCH stronger
Omnipresent
Take a look at Groovy (http://groovy.codehaus.org/). It is a high-level language that targets the JVM and can interoperate with Java. Still not as good as Perl though.
Chas. Owens
I would prefer while( my $line = <$fh> ){
Brad Gilbert
+3  A: 
while(<$fh>) {

This reads the file in line-by-line. The current line of the file is stored in $_. It's basically the same as while($_ = <$fh>) {. Technically it expands to while(defined($_ = <$fh>)) {, but they're very close to the same thing (and either way, it's automatic, so you don't need to worry about it).
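
One case where the implicit defined() matters is a final line that is a false value. The sketch below assumes an in-memory filehandle and made-up data whose last line is "0" with no trailing newline:

```perl
use strict;
use warnings;

# Hypothetical data: the last line is "0" with no newline,
# which is false in Perl but still defined.
my $data = "5\n0";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $count = 0;
while (<$fh>) {      # implicitly: while (defined($_ = <$fh>))
    $count++;
}
close $fh;

print "lines read: $count\n";
```

Both lines are read, because the loop tests whether the read returned a defined value, not whether the value is true.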

  @tmp = split;

"split" with no arguments is (mostly) equivalent to "split /\s+/, $_". It splits the current line into a list of items between whitespace. So it splits the current line into a list of words (more or less) and stores this list in an array. However, this line is bad. @tmp should be qualified with my. Perl would catch this if you have use strict; and use warnings; at the top.

  push @AoA, [@tmp];
}

This pushes a reference to an anonymous array containing the elements that were in @tmp into @AoA, which is an array of arrays (as you probably already knew).

So in the end, you have a list @AoA where each element in the list corresponds to a line of the file, and each element of the list is another list of the words on that line.

In short, @tmp should really be declared using my, and you should use strict; and use warnings;. In fact, as has been said, you could do away with @tmp altogether:

while(<$fh>) { push @AoA, [split] }

But using a temporary array may be nicer on anyone who has to add to this code later.

EDIT: I missed the regex you wanted to add:

while(<$fh>) {
  last unless /^[\d\s]*$/;
  push @AoA, [split];
}

However, /^[\d\s]*$/ won't catch all integers - specifically, it won't match -1. If you want it to match negative numbers, use /^[\d\s-]*$/. Also, if you want to match non-integers (floating-point numbers), you could use /^[\d\s\.-]*$/, but I don't know if you want to match those. However, these regexes will match invalid entries like 1-3 and 5.5.5, which are NOT integers or numbers. If you want to be more strict about that, try this:

LOOP: while(<$fh>) {
  my @tmp = split;
  for(@tmp) {
    # use exactly one of the two checks below
    # for floating-point numbers:
    # last LOOP unless /^-?\d+(?:\.\d+)?$/;
    # for integers only:
    last LOOP unless /^-?\d+$/;
  }
  push @AoA, [@tmp];
}
Chris Lutz
while (<$fh>) { expands to while(defined($_ = <$fh>)) {, of course, while ($_ = <$fh>) { also expands to while(defined($_ = <$fh>)) {
Chas. Owens
+1 for mentioning strict/warnings. I would give another +1 for addressing negative and floating-point numbers, but, alas, I can't.
Dave Sherohman
+2  A: 

Actually, the while (<$fh>) line splits the file by lines; each iteration of the loop will have a new line stored in $_.

The marked line splits the line stored in $_ by whitespace. So, @tmp will be an array containing all of the words on the line: if the line contains foo bar baz, @tmp will be ('foo', 'bar', 'baz').

If you want to do a regexp match on the line in question, then you should do that before you split the line. A regular expression in perl matches against $_ by default, so the line is pretty simple:

while (<$fh>)
{
    last unless /^[\s\d]*$/;
    @tmp = split;
    push @AoA, [@tmp];
}
Brian Campbell
Wow, I somehow missed that 6 other people had already answered this, and it had been accepted.
Brian Campbell
+1  A: 

Warning: \d doesn't mean [0-9] in Perl 5.8 and 5.10 (unless you use the bytes pragma). It means any Unicode character that has the digit property, such as MONGOLIAN DIGIT FIVE U+1815 (᠕). If you want to restrict it to only whitespace and numbers you can do math with, you need to say /^[\s0-9]*$/.
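
A quick way to see the difference is to test the character mentioned above against both patterns. This is a minimal sketch; the variable names are illustrative:

```perl
use strict;
use warnings;

my $mongolian_five = "\x{1815}";   # MONGOLIAN DIGIT FIVE

# Compare the Unicode-aware \d with an explicit ASCII range.
my $matches_d     = $mongolian_five =~ /^\d$/    ? 1 : 0;
my $matches_ascii = $mongolian_five =~ /^[0-9]$/ ? 1 : 0;

print "\\d matches:    $matches_d\n";
print "[0-9] matches: $matches_ascii\n";
```

On Perl 5.14 and later, the /a regex modifier is another way to restrict \d to ASCII digits.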

Chas. Owens
+5  A: 

When you wonder what a Perl built-in does, read its documentation. Most of the answers you are getting are merely restating the documentation. The key to using any language is learning how to use its documentation. If you've read the docs and don't understand them, mention that in your question :)

  • You can look in the perlfunc page to see all the built-ins.

  • At the command line, you can use the -f switch to perldoc to pull out just the documentation for a built-in: perldoc -f split

Good luck, :)

brian d foy
A: 

The core questions have been pretty well covered already, but there's one aspect of the "turning @tmp back into a string" subquestion that hasn't been explicitly mentioned:

$_ and join ' ', @tmp are not equivalent. $_ will contain the line as originally read. join ' ', @tmp will contain the words found on the line, joined by single spaces. If the line contains non-space whitespace (e.g., tabs), words separated by multiple spaces, or leading whitespace, then the two versions of the "complete" line will be different.
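
A small sketch makes the difference concrete. The input line is made up for illustration and contains leading spaces, a tab, and a run of several spaces:

```perl
use strict;
use warnings;

my $line = "  foo\tbar   baz\n";             # hypothetical input line
my $rejoined = join ' ', split ' ', $line;   # split into words, then glue back with single spaces

print $rejoined eq $line ? "same\n" : "different\n";
print "[$rejoined]\n";
```

The rejoined string is "foo bar baz": the leading whitespace, the tab, the extra spaces, and the trailing newline are all gone.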

Dave Sherohman