ansaurus

Question

Looping through a dataset and handling missing values

Answer 1

+2 A:

Stop programming like C.

for my $variable (@types) {
  if ($variable =~ /NULL/) {
    push(@vartype, undef);
  }
  elsif ($variable =~ /[A-Za-z]/) {
    push(@vartype, "varchar");
  }
  elsif ($variable =~ /\./) {
    push(@vartype, "double");
  }
  else {
    push(@vartype, "int");
  }
}

Although, in Perl, you should really store related data together in a data structure built from hashes. Something like this array of hashes:

my $data = [ { value => 'NULL', type => undef },
             { value => 'a string', type => 'varchar' },
             { value => 9.5, type => 'double'},
             { value => 30, type => 'int'},
           ];

Oesor 2010-02-03 14:58:50

Kettle, meet pot ;-). Use `elsif` or `given/when`. See http://perldoc.perl.org/perlsyn.html#Switch-statements

Sinan Ünür 2010-02-03 15:29:50

Actually, I don't think that is what I am looking for. The code still uses only the second row of the dataset to detect the variables and assign a classification. Shouldn't it read in an additional row of the dataset?

mropa 2010-02-03 15:38:35

In my defense, I rarely write if-elsif-else chains, as it's much easier to map a dispatch table to my dataset. And one of my platforms is 5.8.8, so I avoid switches. :/

Oesor 2010-02-03 15:54:45
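
A table-driven version of the same classification might look like the sketch below. It is only an illustration in the spirit of the dispatch table mentioned above, not code from the thread: the names `@rules` and `classify` are hypothetical, and the tests simply mirror the regexes in the answer. Rules are tried in order and the first match wins.

```perl
use strict;
use warnings;

# Hypothetical rule table: [ test, type ] pairs, checked in order.
my @rules = (
    [ sub { $_[0] =~ /NULL/ },     undef     ],  # missing value: no type yet
    [ sub { $_[0] =~ /[A-Za-z]/ }, 'varchar' ],
    [ sub { $_[0] =~ /\./ },       'double'  ],
    [ sub { 1 },                   'int'     ],  # catch-all, like the final else
);

# Return the type of the first rule whose test accepts the value.
sub classify {
    my ($value) = @_;
    for my $rule (@rules) {
        my ($test, $type) = @$rule;
        return $type if $test->($value);
    }
    return;
}

# One type per field; undef marks a NULL field.
my @vartype = map { classify($_) } ('NULL', 'Brunei Darussalam', '3.98', '1960');
```

Adding a new type then means adding one row to the table rather than another `elsif` branch. Only core Perl is used, so this should also run on 5.8.8.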

Does `else if` even work in Perl, given that all else clauses must use braces?

Ether 2010-02-03 15:57:59

@mropa: Show a small sample of your input data (a few lines) and the expected output data. Show the contents of your `@second` array.

toolic 2010-02-03 15:58:03

@mropa: I guess I'm not quite sure what your code is operating on. Can you give an example of a typical dataset in $types and $second?

Oesor 2010-02-03 16:01:19

@Ether: That's why I use strict and warnings; to avoid brain farts like that

Oesor 2010-02-03 16:02:23

...do I post the data sample as an answer or here in the comment section? still relatively new ;-)

mropa 2010-02-03 16:07:44

@mropa: Update your original question; do not add comments or answers.

toolic 2010-02-03 16:12:22

....thanks so far

mropa 2010-02-03 16:20:39

@Oesor How about adding `(` and `)` around those conditions?

Sinan Ünür 2010-02-03 16:58:51

@mropa You're just loading a CSV file and figuring out what each field contains? How about just using Text::CSV::Slurp and operating on the hashes in the arrayref it returns to figure it out? Edit: well, not CSV; but Text::CSV handles arbitrary separators.

Oesor 2010-02-03 17:55:41

@Oesor: Yes, I am loading a .csv file. I have heard about the Text::CSV module but haven't looked into it. I'll take a look at it. Thanks!

mropa 2010-02-03 18:22:29

Answer 2

+1 A:

I am having a hard time figuring out what you are trying to do. Assuming you are trying to guess column types based on column contents, here is a way to do it. The important points are: do not set anything when the field is NULL, skip a field if you have already decided its type, and exit the loop once all field types have been determined.

#!/usr/bin/perl

use strict; use warnings;
use Scalar::Util qw(looks_like_number);

my @names = split ' ', scalar <DATA>;
my @types;

while ( <DATA> ) {
    chomp;
    my @values = split / {2,}/;

    for my $i ( 0 .. $#values ) {
        next if defined $types[$i];
        my $val = $values[$i];
        next if $val eq 'NULL';
        if ( $val =~ /^[0-9]+\z/ ) {
            $types[$i] = 'int';
        }
        elsif ( $val =~ /^[0-9.]+\z/
                and looks_like_number($val) ) {
            $types[$i] = 'double';
        }
        else {
            $types[$i] = 'varchar';
        }
    }
    # Stop once every column named in the header has a type.
    last unless grep { not defined $types[$_] } 0 .. $#names;
}

print "$_\n" for @types;


__DATA__
Country.Name        Time.Name  AG.LND.AGRI.ZS   NY.GDP.MKTP.CD   NE.IMP.GNFS.ZS
Brunei Darussalam   1960       NULL             1139121335.16    3.46
Brunei Darussalam   1960       NULL             1677595756.64    0.9
Brunei Darussalam   1960       NULL             1488339328.59    4.19
Brunei Darussalam   1961       3.98             1869828587.8     3.14
Brunei Darussalam   1961       3.98             2346769422.22    3.38
Brunei Darussalam   1961       3.98             2363109706.3     3.17

Output:

varchar
int
double
double
double

Sinan Ünür 2010-02-03 17:11:51

Yes, that is what I am trying to do. I have a couple of datasets which I'd like to load into a database, and I'd like to write a Perl script which automatically detects each column type. That way I don't have to open each dataset and browse through the columns myself. Thanks for your help, I'll take a look at your answer!

mropa 2010-02-03 17:18:45

@Sinan: Oesor has suggested using the Text::CSV module. Would that reduce the amount of code, and is it a recommended approach?

mropa 2010-02-03 18:24:20

@mropa Whether you use `Text::CSV` is more or less orthogonal to your problem as it is stated. If the data fields are tab-separated (as opposed to apparently multiple spaces) and the fields may contain quoted strings, using it would make your life easier. It would not reduce the amount of code though. The point of my code is to show you the logic for deducing field types from field contents.

Sinan Ünür 2010-02-03 18:43:54
