tags:

views:

531

answers:

6

I'm trying to create a method that provides "best effort" parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:

  • "." as thousands separator and "," as decimal separator
  • "," as thousands separator and "." as decimal separator

The method is implemented as parse_decimal(..) in the code below. Furthermore, I've defined 20 test cases that show how the heuristics of the method should work.

While the code below passes the tests it is quite horrible and unreadable. I'm sure there is a more compact and readable way to implement the method. Possibly including smarter use of regexpes.

My question is simply: Given the code below and the test-cases, how would you improve parse_decimal(...) to make it more compact and readable while still passing the tests?

Clarifications:

  • Clarification #1: As pointed out in the comments the case ^\d{1,3}[\.,]\d{3}$ is ambiguous in that one cannot determine logically which character is used as thousands separator and which is used as a decimal separator. In ambiguous cases we'll simply assume that US-style decimals are used: "," as thousands separator and "." as decimal separator.
  • Clarification #2: If you believe that any of test cases is wrong, then please state which of the tests that should be changed and how.

The code in question including the test cases:

#!/usr/bin/perl -wT

use strict;
use warnings;
use Test::More tests => 20;

ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);

sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}

sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}
+1  A: 

Here is a somewhat shorter and more complete version of the function:

sub parse_decimal($) {
    my $input = shift;
    my %other = ("." => ",", "," => ".");
    $input =~ s/[^\d,.]//g;
    if ($input !~ /[,.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /(\.).*(,)/ or $input =~ /(,).*(\.)/) { # Both separators present
        return &parse_with_separators($input, $2, $1);
    } elsif ($input =~ /([,.])$/ or $input =~ /^([,.])/) { # Number ends or begins with decimal separator
        return &parse_with_separators($input, $1, $other{$1});
    } elsif ($input =~ /\d{4}([,.])/ or $input =~ /([,.])\d{4}/) { # group of 4+ digits next to a separator
        return &parse_with_separators($input, $1, $other{$1});
    } elsif ($input =~ /([,.]).*\1/) { # More than one of the same separator
        return &parse_with_separators($input, $other{$1}, $1);
    } elsif ($input =~ /\d*([,.])\d{0,2}$/) { # Fewer than 2 digits to the right of the separator
        return &parse_with_separators($input, $1, $other{$1});
    } else { # Assume '.' is decimal separator and ',' is thousands separator
        return &parse_with_separators($input, '.', ',');
    }
}

Some important things:

  • Your "12.345" test case is also ambiguous. Anything of the form \d{1,3}[,.]\d\d\d is ambiguous.
  • This function handles numbers of the form ".123" "123."
  • This function makes terrible, terrible assumptions about the input being well-formed.
Eric
Very nice, but it fails test case #10. In ambiguous cases "." should be treated as the decimal separator (US style).
knorv
12.345 using US style separators should be 12.345, not 12345 (as your test case expects).
Eric
That may be shorter, but by golly, it's hard on the eyes!
dland
+5  A: 

This is like programs that automatically guess the character encoding of the input -- it might work sometimes, but in general is a very bad strategy that leads to buggy and confusing non-deterministic behavior.

For example, if you see "123,456", you don't have enough information to guess what that means.

So I would approach this with caution, and never use this technique for anything important.

jrockway
I agree. Op's test cases show that `12.34` should be intepreted as `12.34`, but that `12.345` should be interpreted as `12345`. I don't see how this can be correct unless he already has tight control over his inputs, in which case why would he need a "flexible" parser?
Adam Bellaire
Good point! I've added a clarification regarding ^\d{1,3}[\.,]\d{3}$ ambiguity.
knorv
Adam: In the case of "12.34" the dot is clearly a decimal separator.
knorv
In these sorts of tasks, sometimes the trick is to convert as much data as you can and flag the odd cases for human decisions. If data were always clean, life would be grand, but we know that's not reality. I think "non-deterministic" is a bit harsh. It might have unexpected conversions, but the same input should give the same output even if it is incorrect.
brian d foy
A: 

Trying to guess the locale of anything is always an ongoing effort, at best. What are you using this function for? The following tests look simply wrong to me:

ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);

If I were parsing a single document with values on it I would be very irritated to find this result.

I would design this function with some knobs and dials in the package to use either the locale information (using localeconv()) or ad-hoc values (like in this answer.)

Edit: Okay, let me try to explain it better. For "single source" I mean a context or scope delimiter. I know you can import from different sources; that's the very nature of importing data. We also know that we cannot know beforehand the encoding of these different sources.

What I would do is to is to do a preliminary scan of the file being imported (just taking a sample, not reading it whole) and check the numeric values. If I can determine the locale from the sample, then I would try to import the whole file using the same locale. To me, one file is a single source, and I wouldn't expect it to suddenly change its locale.

That's why I would ask again: what is the purpose of this program?

Leonardo Herrera
The dot in "12.34" is clearly not a thousand separator, so it must hence be a decimal separator giving the resulting output 12.34.The dot in "12.345" could be either a thousand separator or a decimal separator (see clarification #1). As stated in clarification #1 we will use the dot as a thousand separator in all ambiguous cases.Does that answer your question?
knorv
I understand what it is expected to do, I don't understand the reasoning behind this design.
Leonardo Herrera
In my experince with data migration and merging, locale is irrelevant. You just have a set of strings from all sorts of sources and your processing them locally to normalize them. You do the best you can, maybe develop some odd rules, and hope you can handle as many cases as you can without human inspection. As a friend of mine likes to say about problems like these: "I didn't make this (dog pooh) sandwich, I just have to eat it".
brian d foy
Oh, that is true. But --also in my experience with data migration-- the locale context is shared between _all data coming from a single source_; thus, importing a big file with numbers in it and getting 12.34 in one row but 12345 on the next is just plain wrong. So I not only agree that some babysitting is required, that's precisely why I'm questioning this solution.
Leonardo Herrera
Leonardo: If the locale context was shared between all data coming from a single source, do you really think I would have asked this question? :-)
knorv
A: 

To answer your question, not questioning the validity of the application,

Your algorithm boils down to the following pseudo-code:

if (num_separators == 1) {
    if (trailing_digits != 3 || leading_digits > 3) {
         # Replace comma with period
         return output
    }
}

# Replace first separator with nothing, and second with '.'

return output

Which is the following in perl:

my $output = $input;

my $num_separators = (scalar split /[\,\.]/, $input) - 1;

if($num_separators == 1) {
    my ($leading_digits, $trailing_digits) = split /[\,\.]/, $input;

    if(length($trailing_digits) != 3 || length($leading_digits) > 3) {
        $output =~ s/\,/\./;
        return eval($output);
    }
}

if($output =~ /^\d+\.\d+/) {
    # Swap commas and periods if periods are first
    $output =~ tr/\.\,/\,\./;
}

# remove commas
$output =~ s/\,//g;

return eval($output);

Don't know if that is any better, but it is generalized.

Jeff B
You're on the right track, but you're working too hard and using a string eval when you don't need it. :)
brian d foy
+3  A: 

The idea in these problems is to look at the code and figure out where you typed anything twice. When you see that, work to remove it. My program handles everything in your test data, and I don't have to repeat program logic structures to do it. That lets me focus on the data rather than program flow.

First, let's clean up your tests. You really have a set of pairs that you want to test, so let's put them into a data structure. You can add or remove items from the data structure as you like, and the tests will automatically adjust:

use Test::More 'no_plan';

my @pairs = (
     #  got          expect
 [ "1,234,567",  1234567  ],
 [ "1,234567",   1.234567 ],
 [ "1.234.567",  1234567  ],
 [ "1.234567",   1.234567 ],
 [ "12,345",     12345    ],
 [ "12,345,678", 12345678 ],
 [ "12,345.67",  12345.67 ],
 [ "12,34567",   12.34567 ],
 [ "12.34",      12.34    ],
 [ "12.345",     12345    ],  # odd case!
 [ "12.345,67",  12345.67 ],
 [ "12.345.678", 12345678 ],
 [ "12.34567",   12.34567 ],
 [ "123,4567",   123.4567 ],
 [ "123.4567",   123.4567 ],
 [ "1234,567",   1234.567 ],
 [ "1234.567",   1234.567 ],
 [ "12345",      12345    ],
 [ "12345,67",   12345.67 ],
 [ "1234567",    1234567  ],
);

Now that you have it in a data structure, your long line of tests reduces to a short foreach loop:

foreach my $pair ( @pairs ) {
  my( $original, $expected ) = @$pair;
  my $got = parse_number( $original );
  is( $got, $expected, "$original translates to $expected" );
  }

The parse_number routine likewise condenses into this simple code. Your trick is to find out what you are doing over and over again in the source and not do that. Instead of trying to figure out weird calling conventions and long chains of conditionals, I normalize the data. I figure out which cases are odd, then turn them into not-odd cases. In this code, I condense all of the knowledge about the separators into a handful of regexes and return one of two possible lists to show me what the thousands separator and decimal separator are. Once I have that, I remove the thousands separator completely and make the decimal separator the full stop. As I find more cases, I merely add a regex that returns true for that case:

sub parse_number
 {
 my $string = shift;

 my( $separator, $decimal ) = do {
  local $_ = $string;
  if( 
   /\.\d\d\d\./           || # two dots
   /\.\d\d\d,/            || # dot before comma
   /,\d{4,}/              || # comma with many following digits
   /\d{4,},/              || # comma with many leading digits
   /^\d{1,3}\.\d\d\d\z/   || # odd case of 123.456
   0
   )
   { qw( . , ) }
  else { qw( , . ) }     
  };

 $string =~ s/\Q$separator//g;
 $string =~ s/\Q$decimal/./;

 $string;
 }

This is the sort of thing I talk about in the dynamic subroutines chapter of Mastering Perl. Although I won't go into it here, I would probably turn that series of regexes into a pipeline of some sort and use a grep.

This is just the part of the program that passes your tests. I'd add another step to verify that the number is an expected format to deal with dirty data, but that's not so hard and is just a simple matter of programming.

brian d foy
Your first two code blocks are basically a rewrite of Test::TableDriven from CPAN, btw :)
jrockway
Hi Brian! Thanks for an excellent, educating and well-structured answer. Your answers continue to impress me! Thanks a lot!
knorv
@jrockway: yeah, but my rewrite is less typing :)
brian d foy
A: 

My thoughts would run like this -- I think they're basically the same as your current rules, but a bit more succintly organized:

  1. If one of dot and comma occurs more than once, and the other one occurs no more than once, then the one that occurs more than once should be taken to be the thousands separator.
  2. If dot and comma each occur more than once, then fall back on some other heuristic because this string is weird. Looking for groups of threes might work. I don't think your test cases cover this right now.
  3. If each of them occurs exactly once, then the one that appears earlier in the string should be taken to be the thousands separator.
  4. If only one of them appears, then it should be taken as a thousands separator if it has three digits to its right and no more than three digits to its left. (Optional: only if it's a comma besides all that. Depends how you want "12.345" to come out.)
  5. If neither appears then you have nothing to do.
hobbs