views:

76

answers:

4

Hello,

I am relatively new to Perl and have only used it for converting small files into different formats and feeding data between programs.

Now, I need to step it up a little. I have a file of DNA data that is 5,905 lines long, with 32 fields per line. The fields are not delimited by anything and vary in length within the line, but each field is the same size on all 5905 lines.

I need each line fed into a separate array from the file, and each field within the line stored as its own variable. I am having no problems storing one line, but I am having difficulties storing each line successively through the entire file.

This is how I separate the first line of the full array into individual variables:

my $SampleID = substr("@HorseArray", 0, 7);
my $PopulationID = substr("@HorseArray", 9, 4);
my $Allele1A  = substr("@HorseArray", 14, 3);
my $Allele1B = substr("@HorseArray", 17, 3);
my $Allele2A  = substr("@HorseArray", 21, 3);
my $Allele2B = substr("@HorseArray", 24, 3);

...etc.

My issues are: 1) I need to store each of the 5905 lines as a separate array. 2) I need to be able to reference each line based on the sample ID, or a group of lines based on population ID and sort them.

I can sort and manipulate the data fine once it is defined in variables, I am just having trouble constructing a multidimensional array with each of these fields so I can reference each line at will. Any help or direction is much appreciated. I've poured over the Q&A sections on here, but have not found the answer to my questions yet.

+3  A: 

Do not store each line in it's own array. You need to construct a data structure. Start by reading the following tutorials form perldoc:

Here's some starter code:

use strict;
use warnings;

# Array of data samples. We could use a hash as well; which is better 
# depends on how you want to use the data.
my @sample;

while (my $line = <DATA>) {
    chomp $line;

    # Parse the input line
    my ($sample_id, $population_id, $rest) = split(/\s+/, $line, 3);

    # extract A/B allele pairs
    my @pairs;
    while ($rest =~ /(\d{1,3})(\d{3})|(\d{1,3}) (\d{1,2})/g) {
        push @pairs, {
            A => defined $1 ? $1 : $3,
            B => defined $2 ? $2 : $4,
        };
    }

    # Add this sample to the list of samples. Store it as a hashref so
    # we can access attributes by name
    push @sample, {
        sample     => $sample_id,
        population => $population_id,
        alleles    => \@pairs,
    };
}


# Print out all the values of alleles 2A and 2B for the samples in
# population py18. Note that array indexing starts at 0, so allele 2
# is at index 1.
foreach my $sample (grep { $_->{population} eq 'py18' } @sample) {
    printf("%s: %d / %d\n",
        $sample->{sample},
        $sample->{alleles}[1]{A},
        $sample->{alleles}[1]{B},
    );
}

__DATA__
00292-97 py17 97101 129129 152164 177177 100100 134136 163165 240246 105109 124124 166166 292292 000000 000000 000000
00293-97 py18 89 97 129139 148154 179179 84 90 132134 167169 222222 105105 126128 164170 284292 000000 000000 000000
00294-97 py17 91 97 129133 152154 177183 100100 134140 161163 240240 103105 120128 164166 290292 000000 000000 000000
00295-97 py18 97 97 131133 148162 177179 84100 132134 161167 240252 111111 124128 164166 284290 000000 000000 000000
Michael Carman
Michael, that's what I'm trying to do... set up an array of arrays. Each line is it's own array with 32 elements, all are encompassed by the whole array, called @HorseArray in my script.
Brian D.
Brian, I think by "array" you probably actually mean "hash", but you might not realize it if you haven't used hashes before.
Nathan
Michael, thank you! That does go along the lines of where I want to go. However, it is important that each of the alleles are accessible separately. If you look at the input, and the initial code I posted, each allele is 3 characters long and must be stored in separate variables. The only way I can figure out to do this is by using substr. The Sample ID is 7char, the popID is 4char, and each allele is 3char. The output from the programs I use puts two three-character alleles together without a space, which is a problem, else I could separate each value by using ' ' as the delimiter.
Brian D.
For instance, I would need to print alleles 2A and 2B for all samples in population py18, along with their associated SampleID.
Brian D.
I've revised the example per your comments, but it's not clear how some thing should be handled. For example, is the `89 97` for the second sample one allele pair or two? If two, what are the A/B values?
Michael Carman
Thank you so much for your help, Michael. In every sample, there are two alleles per six-digit column (A allele, B allele). I need to separate them into 1A, 1BB, 2A, 2B, etc... So the alleles for sample 1 are: 1A - 97. 1B - 101, 2A - 129, 2B - 129, etc... Each needs to be able to be accessed individually, or by its sample ID, or in a group by the population ID. I have played with this overnight and can't seem to get the script to behave.
Brian D.
Error is Argument "" isn't numeric in printf horse.pl line 55 <DATA> line 5905.This only occurs on the alleles that are 2 digits, with a space for the first character. perhaps I need to do a global whitespace removal before printf.I sincerely cannot thank you enough. Being a geneticist, I barely have time to learn what I need experimentally, much less trying to train my brain to be an effective coder. I bow.
Brian D.
Ah, okay. The problem is that your allele data isn't consistently formatted. Some are six chars, some are five, some are two and two. A simple `split` or `substr` won't handle that; we have to parse them out instead. I've updated the answer to do this.
Michael Carman
A: 

This problem calls for a hash of arrays or a hash of hashes.


More specifically, the keys of the top level hash could be the sampleID (which is unique for each line, right?)

$line = "@HorseArray";

# consider  unpack  as a concise and fast alternative to multiple  substr  calls
my @array = unpack("A7aA4aA3A3aA3A3", $line);

# for hash of lists
$MY_DATA{$array[0]} = [ @array ];    # for hash of lists
...
$allele2a = $MY_DATA{"00292-97"}[4];

# for hash of hashes
$MY_DATA{$array[0]} = { SampleID => $array[0], PopulationID => $array[1],
                        Allele1A => $array[2], ... };
...
$allele2a = $MY_DATA{"00292-97"}{"Allele2A"};

This is a great problem for digging into some really powerful features of Perl, and I wish you the best. Once you have gotten a handle on this problem, you won't be so new to Perl anymore.

mobrule
I have seen that a HoA or AoA is the answer. I am having problems getting the data stored into one and accurately referencing the elements based on sample or population ID.
Brian D.
Nice... this is what I was looking for. I'm going to try it out and praise you if I can get all the data loaded and referenced appropriately.
Brian D.
+2  A: 

I'd start by looping through the lines and parsing each into a hash of fields, and I'd build a hash for each index along the way.

my %by_sample_id;           # this will be a hash of hashes
my %by_population_id;       # a hash of lists of hashes
foreach (<FILEHANDLE>) {
    chomp;  # remove newline
    my %h;  # new hash
    $h{SampleID} = substr($_, 0, 7);
    $h{PopulationID} = substr($_, 9, 4);
    # etc...

    $by_sample_id{ $h{SampleID} } = \%h;   # a reference to %h
    push @{$by_population_id{ $h{PopulationID} }}, \%h;  # pushes hashref onto list
}

Then, you can use either index to access the data in which you're interested:

say "Allele1A for sample 123123: ", $by_sample_id{123123}->{Allele1A};
say "all the Allele1A values for population 432432: ", 
     join(", ", map {$_->{Allele1A}} @{$by_population_id{432432}});
Nathan
Hmm, it looks like Michael's `split()` line is probably more what you want than the `substr()` calls from your original sample.
Nathan
+1  A: 

I'm going to assume this isn't a one-off program, so my approach would be slightly different. I've done a fair amount of data-mashing, and after a while, I get tired of writing queries against data structures.

So -

I would feed the data into a SQLite database(or other sql DB), and then write Perl queries off of that, using Perl DBI. This cranks up the complexity to well past a simple 'parse-and-hack', but after you've written several scripts doing queries on the same data, it becomes obvious that this is a pain, there must be a better way.

You would have a schema that looks similar to this create table brians_awesome_data (id integer, population_id varchar(32), chunk1 integer, chunk2 integer...);

Then, after you used some of mobrule and Michael's excellent parsing, you'd loop and do some INSERT INTO your awesome_data table.

Then, you could use a CLI for your SQL program and do "select ... where ..." queries to quickly get the data you need.

Or, if it's more analytical/pipeliney, you could Perl up a script with DBI and get the data into your analysis routines.

Trust me, this is the better way to do it than writing queries against data structures over and over.

Paul Nathan