views:

468

answers:

3

Let's say I have a tab-delimited text file that contains data arranged in columns (with headers).

It is possible that different columns may be "stacked" into a "worksheet"-like arrangement, i.e. there is some divider (that may or may not be known ahead of time) that allows different columns to be arranged vertically.

Is there a Perl module that facilitates parsing of columnar data in this text file into a data structure (e.g., a hash table with the key being the column header, and the value being an array of column data scalars)?

EDIT By "stacked", I mean that a column of text may include multiple, individual "vectors" of data, each with different headers and different lengths. Admittedly, this complicates parsing.

EDIT I'm honestly not sure where the confusion is. Nonetheless, here's an example:

header_one\theader_three
data_1\tdata_7
data_2\tdata_8
data_3\tdata_9
\tdata_10
header_two\tdata_11
data_4\theader_four
data_5\tdata_12
data_6\tdata_13
\tdata_14

The script would turn this into a hash table with four keys: header_one, header_two, header_three, and header_four, each key referencing an array reference pointing to the data_n elements underneath the header.

+2  A: 

I'd start with DBD::CSV if possible, though your "stacked" requirement (which I don't fully understand) probably would require some manual parsing, using Text::CSV_XS.

Don't be fooled by their names - they can parse with any separators, not just commas.

Tanktalus
A: 

Not very smooth, but I've been doing it like this:

        my $recordType = unpack("A3", $_);

        if ($recordType eq "APT")
        {
            $currentKey = parseFAAAirportAirportRecord($_);
        }
        elsif ($recordType eq "ATT")
        {
            parseFAAAirportAttendenceRecord($currentKey, $_);
        }
        elsif ($recordType eq "RWY")
        {
            parseFAAAirportRunwayRecord($currentKey, $_);
        }
        elsif ($recordType eq "RMK")
        {
            parseFAAAirportRemarkRecord($currentKey, $_);
        }
...
sub parseFAAAirportAirportRecord($)
{
    my ($line) = @_;

    my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion,
        $faaFieldOffice, $state, $stateName, $county, $countyState,
        $city, $name, $ownershipType, $facilityUse, $ownersName,
        $ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager,
        $managersAddress, $managersCityStateZip, $managersPhone,
        $formattedLat, $secondsLat, $formattedLong, $secondsLong,
        $refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph,
        $sectional, $distFromTown, $dirFromTown, $acres,
        $bndryARTCC, $bndryARTCCid,
        $bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname,
        $fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone,
        $altFss, $altFssName,
        $altFssPhone, $notamFacility, $notamD, $arptActDate,
        $arptStatusCode, $arptCert,
        $naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights,
        $militaryJoint, $militaryRights, $nationalEmergency, $milUse,
        $inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs
,
        $engineRepairs, $bottledOyxgen, $bulkOxygen,
        $lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle,
        $lens, $landingFee, $isMedical,
        $numBasedSEL, $numBasedMEL, $numBasedJet,
        $numBasedHelo, $numBasedGliders, $numBasedMilitary,
        $numBasedUltraLight,
        $numScheduledOperation, $numCommuter, $numAirTaxi,
        $numGAlocal, $numGAItinerant,
        $numMil, $countEndingDate,
        $aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate,
        $contractFuel, $transientStorage, $otherServices, $windIndicator,
        $icaoId) =
        unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " .
        "A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " .
        "A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " .
        "A4 A30 A16 A4 " .
        "A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " .
        "A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " .
        "A6 A6 A6 A6 A6 A6 A10" .
        "A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);
Paul Tomblin
+2  A: 

I think this is close to what you are talking about. If the number of columns changes then the input is treated as if it were a different table. This code could be easily modified to recognize some other marker (such as a line of equal signs) instead of using the column counts.

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

#setup the parser, here we want tab separated and we allow
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is 
#("foo", "bar\tbaz", "quux")
my $p = Text::CSV_XS->new(
    {
     sep_char           => "\t",
     allow_loose_quotes => 1,
    }
);

my @stacked;
my $cur = 0;
while (<>) {
    $p->parse($_) or die $p->error_input;
    my @rec = $p->fields;
    #normal case, just add the record to the last
    #section in @stacked
    if (@rec == $cur) {
     push @{$stacked[-1]}, \@rec;
     next;
    }
    #if the number of columns don't match then
    #we have a new section
    push @stacked, [\@rec];
    $cur = @rec; #set the new number of columns
}

for my $table (@stacked) {
    print "header: ", join("::", @{$table->[0]}), "\n";
    for my $i (1 .. $#$table) {
     print "data: ", join("::", @{$table->[$i]}), "\n";
    }
    print "\n";
}
Chas. Owens