Sample Data:

603       Some garbage data not related to me, 55, 113 ->

1-ENST0000        This is sample data blh blah blah blahhhh
2-ENSBTAP0        This is also some other sample data
21-ENADT)$        DO NOT WANT TO READ THIS LINE. 
3-ENSGALP0        This is third sample data
node #4           This is 4th sample data
node #5           This is 5th sample data

This is also part of the input file but i dont wish to read this. 
Branch -> 05 13, 
      44, 1,1,4,1

17, 1150

637                   YYYYYY: 2 : %

EDIT: In the data above, the column width is fixed for the sections, but there might be some sections I do not wish to read. The sample data above has been edited to reflect that.

So in this input file I want to read the contents of the first section, '1-ENST0000', into an array, the contents of '2-ENSBTAP0' into a separate array, and so on.

I am having trouble coming up with a regex that defines the pattern: the first three wanted lines start with <someNumber>-ENS<someOtherStuff>, and there can also be lines that start with node #<someNumber>.

A: 

Good question! Looks very similar to this one (linking so original answer can get more votes):

Reading sections from a file in Perl

DreadPirateShawn
They look similar because they were asked by the same person, who presumably hasn't bothered to learn anything
friedo
A: 

OK, based on your later comment, this is a little different than the previous question. Also, I now realize that node #54 is a valid entry in the first column.

Update: I now also realize you do not need the first column.

Update: In general, you neither want to nor need to deal with character arrays in Perl.

Update: Now that you have clarified what should and should not be skipped, here is a version that deals with that. Add patterns to taste in the if condition.

#!/usr/bin/perl

use strict;
use warnings;

my @data;

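while ( <DATA> ) {
    chomp;

    # Keep only lines whose first column is <digits>-ENS<any 5 chars>
    # or "node #<digits>"; $1 captures the text after the padding spaces.
    if ( /^[0-9]+-ENS.{5} +(.+)$/
            or /^node #[0-9]+ +(.+)$/
    ) {
        # Store the captured text as an array of single characters.
        push @data, [ split //, $1 ];
    }
}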

use Data::Dumper;
print Dumper \@data;

__DATA__
603       Some garbage data not related to me, 55, 113 ->

1-ENST0000        This is sample data blh blah blah blahhhh
2-ENSBTAP0        This is also some other sample data
21-ENADT)$        DO NOT WANT TO READ THIS LINE. 
3-ENSGALP0        This is third sample data
node #4           This is 4th sample data
node #5           This is 5th sample data

This is also part of the input file but i dont wish to read this. 
Branch -> 05 13, 
      44, 1,1,4,1

17, 1150

637                   YYYYYY: 2 : %

As for learning how to fish, I recommend you read everything related in perldoc perltoc.
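For readers who want the two patterns in the if condition spelled out, here is a minimal sketch that merges them into one regex written with the /x modifier, so each piece can carry a comment. The names $wanted and wanted_text are made up for this illustration and are not part of the code above; the sample lines are copied from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical name: one commented pattern equivalent to the two in the answer.
my $wanted = qr{
    ^
    (?:
        [0-9]+ - ENS .{5}     # digits, a hyphen, ENS, then any five characters
      |
        node [ ] \# [0-9]+    # or the literal text "node #" followed by digits
    )
    [ ]+                      # one or more spaces separating the two columns
    (.+)                      # capture the rest of the line
    $
}x;

# Hypothetical helper: returns the captured text, or undef for unwanted lines.
sub wanted_text {
    my ($line) = @_;
    return $line =~ $wanted ? $1 : undef;
}

for my $line (
    '1-ENST0000        This is sample data',
    '21-ENADT)$        DO NOT WANT TO READ THIS LINE.',
    'node #4           This is 4th sample data',
) {
    my $text = wanted_text($line);
    print defined $text ? "keep: $text\n" : "skip: $line\n";
}
```

Under /x, unescaped whitespace and # inside the pattern are ignored, which is why the literal space is written as [ ] and the literal # as \#.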

Sinan Ünür
Also, if I again want each character to be stored in a different element of the array, should I change @row = split ' ', $_, 2; to @row = split \\, $_, 2; ?
no no !...data does begin at a fixed column but there are other sections in the file with the same column width which i do not wish to read. So I'll take the regex from your previous edited version.
Here is your comment from above: "yeah. fourth and fifth lines do have the heading of node #4 and node #5. After the heading there are spaces, Yes. So contents for all heading start at the same location and are aligned.... – Aaron 15 mins ago"
Sinan Ünür
:( I'm sorry....
I've updated the question to bring more clarity
Nope, you did not bring clarity, you added one more twist. Maybe you could put some more work into formulating your question next time. So, what really is the criterion for skipping? The sample case you give above does not a *specification* make, I am afraid.
Sinan Ünür
OK, thanks! But could you explain the regex with a simple one-line comment? There is so much other crap in the file that I don't want to read, so maybe I can modify your regex to handle that. I think all I want to read is integer-ENS[anyfivecharacters] followed by 9 spaces OR node #integer followed by 9 spaces.
please please explain your code in while loop. I'm not a perlmonk :(
@Aaron All you need is an intro book to understand what is going on in this code. Of course, reading `perldoc perlretut` would also help. For quick reference, see `perldoc perlreref`.
Sinan Ünür
+1  A: 

Is this really a fixed-column file? If so, then don't bother with regexps. Just split at the column width, perhaps trimming trailing white space from column 1.
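A minimal sketch of that idea, assuming the first column is 18 characters wide (measured from the sample data in the question, so adjust it for the real file). The names $COL1_WIDTH and parse_fixed are invented for this example. Because the question says other sections share the same column width, the sketch still checks the heading after splitting; that check is an addition, not part of the suggestion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: column 1 is 18 characters wide, as in the sample data.
my $COL1_WIDTH = 18;

# Hypothetical helper: maps each wanted heading to the text of its line.
sub parse_fixed {
    my @lines = @_;
    my %content;
    for my $line (@lines) {
        chomp $line;
        next if length($line) <= $COL1_WIDTH;    # nothing in column 2

        # "A" fields strip trailing spaces, so the heading comes out trimmed.
        my ($head, $body) = unpack "A$COL1_WIDTH A*", $line;

        # Skip sections whose heading is not one of the wanted forms.
        next unless $head =~ /^(?:[0-9]+-ENS.{5}|node #[0-9]+)$/;
        $content{$head} = $body;
    }
    return \%content;
}

# Build sample lines with sprintf so the column padding is exact.
my $content = parse_fixed(
    sprintf('%-18s%s', '1-ENST0000', 'This is sample data'),
    sprintf('%-18s%s', '21-ENADT)$', 'DO NOT WANT TO READ THIS LINE.'),
    sprintf('%-18s%s', 'node #4',    'This is 4th sample data'),
);
print "$_ => $content->{$_}\n" for sort keys %$content;
```

The unpack template does the splitting without a regex; the only pattern left is the heading check, which can be dropped if every fixed-width line in the file is wanted.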

djna
+1 for pointing that out ... although it is hard to be sure that is the case based on the wording of the question.
Sinan Ünür
Edited the question to reflect this.