tags:

views:

200

answers:

2

I am looping through a big data file and like to detect the type of variable in each column, eg if it is an Intenger or a Float etc. It works perfectly fine, however, at the moment it is still very basic and I like to add another idea. So far the declaration of the variable is based on the second row of the data set. (The first one is used as the header.) Here is the beginning of the code:

#!/usr/bin/perl

use warnings;
use diagnostics;
use Getopt::Std;

getopts("i:s:t:") or die "bad options: $!";

if($opt_i) {
open INFILE, "< $opt_i";
chomp($headerline = <INFILE>);
$second = <INFILE>;
} else {
die "the input file has to be given\n";
}

if($opt_t) {
$tablename = $opt_t;
} else {
$tablename = $opt_i;
$tablename =~ s/\.\w+//;
}

if($opt_s) {
$sep = $opt_s;
} else {
$sep = ",";
}

$headerline =~ s/\"//g;
$headerline =~ s/\./\_/g;
@header = split/$sep/, $headerline;

$second =~ s/\"//g ;
@second = split/$sep/, $second;
@terms = split/$sep/, $second;
@types = split/$sep/, $second;

And now I have implemented a small loop. The problem is that I don't know how to handle the missing values which are declared with NULL. At the moment the loop simply assigns "" i.e. nothing to the variable $vartype[$j].

$j = 0;
while($j <= $#second) {
if ($types[$j] =~ /NULL/) {
$vartype[$j] = "";
} elsif($types[$j] =~ /[A-Za-z]/) {
$vartype[$j] = "varchar";
} elsif ($types[$j] =~ /\./) {
$vartype[$j] = "double";
} else {
$vartype[$j] = "int";
}
$j++;
}

So how can I implement another loop structure into the existing loop so that whenever I have a NULL value in one column the loop reads the next value in that same column and does so until he finds a number or a word.

A sample of my data would be eg:

Country.Name        Time.Name  AG.LND.AGRI.ZS   NY.GDP.MKTP.CD   NE.IMP.GNFS.ZS
Brunei Darussalam   1960       NULL             1139121335.16    3.46
Brunei Darussalam   1960       NULL             1677595756.64    0.9
Brunei Darussalam   1960       NULL             1488339328.59    4.19
Brunei Darussalam   1961       3.98             1869828587.8     3.14
Brunei Darussalam   1961       3.98             2346769422.22    3.38
Brunei Darussalam   1961       3.98             2363109706.3     3.17

As already mentioned the for loop only uses the second row to decide on the type of the variable.

Now I would like to implement another loop so that eg in the third column (AG.LND.AGRI.ZS) he goes through the column until he detects the first real value, in this case 3.98. At the moment the loop recognizes the missing value marked with NULL and just assigns an empty value.

+2  A: 

Stop programming like C.

for my $variable (@types) {
  if ($variable =~ /NULL/) {
    push(@vartype, undef);
  }
  elsif ($variable =~ /[A-Za-z]/) {
    push(@vartype, "varchar");
  }
  elsif ($variable =~ /\./) {
    push(@vartype, "double";
  }
  else {
    push(@vartype, "int");
  }
}

Although, for perl, you should really be storing related data in a datastructure of hashes. Something like:

my $data = [ { value => 'NULL', type => undef },
             { value => 'a string', type => 'varchar' },
             { value => 9.5, type => 'double'},
             { value => 30, type => 'int'},
           ];
Oesor
Kettle, meet pot ;-). Use `elsif` or `given/when`. See http://perldoc.perl.org/perlsyn.html#Switch-statements
Sinan Ünür
Actually I think that is not what I am looking for. The code is still only using the second row of the dataset to detect the variables and to assign a classification. So shouldn't there be an additional read in of another row in the dataset?
mropa
In my defense, I rarely program if-elsif-else loops as it's much easier to map a dispatch table to my dataset. And one of my platforms is 5.8.8 so I avoid switches. :/
Oesor
does `else if` even work in perl, given that all else clauses must use braces?
Ether
@mropa: Show a small sample of your input data (a few lines) and the expected output data. Show the contents of your `@second` array.
toolic
@mropa: I guess I'm not quite sure what your code is operating on. Can you give an example of a typical dataset in $types and $second?
Oesor
@Ether: That's why I use strict and warnings; to avoid brain farts like that
Oesor
...do I post the data sample as an answer or here in the comment section? still relatively new ;-)
mropa
@mropa: Update your original question; do not add comments or answers.
toolic
....thanks so far
mropa
@Oesor How about adding `(` and `)` around those conditions?
Sinan Ünür
@mropa You're just loading a CSV file and figuring out what the contents of each field contain? How about just using Text::CSV::Slurp and operating on the arrayref in each hash to figure it out?edit: well, not CSV; but Text::CSV handles arbitrary separators.
Oesor
@Oesor: yes i am loading an .csv file. i have heard about the Text::CSV modul but haven't looked into it. i take a look at it. thanks!
mropa
+1  A: 

I am having a hard time figuring out what you are trying to do. Assuming you are trying to guess column types based on column contents, here is a way to do it. The important thing to do is not to set anything when the field is NULL, skip a field if you have already decided its type, and get out of the loop once all field types have been determined.

#!/usr/bin/perl

use strict; use warnings;
use Scalar::Util qw(looks_like_number);

my @names = split ' ', scalar <DATA>;
my @types;

while ( <DATA> ) {
    chomp;
    my @values = split / {2,}/;

    for my $i ( 0 .. $#values ) {
        next if defined $types[$i];
        my $val = $values[$i];
        next if $val eq 'NULL';
        if ( $val =~ /^[0-9]+\z/ ) {
            $types[$i] = 'int';
        }
        elsif ( $val =~ /^[0-9.]+\z/
                and looks_like_number($val) ) {
            $types[$i] = 'double';
        }
        else {
            $types[$i] = 'varchar';
        }
    }
    last unless grep { not defined } @types;
}

print "$_\n" for @types;


__DATA__
Country.Name        Time.Name  AG.LND.AGRI.ZS   NY.GDP.MKTP.CD   NE.IMP.GNFS.ZS
Brunei Darussalam   1960       NULL             1139121335.16    3.46
Brunei Darussalam   1960       NULL             1677595756.64    0.9
Brunei Darussalam   1960       NULL             1488339328.59    4.19
Brunei Darussalam   1961       3.98             1869828587.8     3.14
Brunei Darussalam   1961       3.98             2346769422.22    3.38
Brunei Darussalam   1961       3.98             2363109706.3     3.17

Output:

varchar
int
double
double
double
Sinan Ünür
yes, that is what i am trying to do. i have a couple of datasets which i like to load into a database and i like to write a perl script which automatically detects each column type. in that why i don't have to open each dataset and browse through the columns myself.thanks for your help, i take a look at your answer!
mropa
@Sinan: Oesor has suggested to use the Text::CSV modul. Is that going to reduce the amount of code and a recommendable approach?
mropa
@mropa Whether you use `Text::CSV` is more or less orthogonal to your problem as it is stated. If the data fields are tab-separated (as opposed to apparently multiple spaces) and the fields may contain quoted strings, using it would make your life easier. It would not reduce the amount of code though. The point of my code is to show you the logic for deducing field types from field contents.
Sinan Ünür