I'm trying to create a method that provides "best effort" parsing of decimal inputs in cases where I do not know which of these two mutually exclusive ways of writing numbers the end-user is using:
- "." as thousands separator and "," as decimal separator
- "," as thousands separator and "." as decimal separator
The method is implemented as parse_decimal(..) in the code below. Furthermore, I've defined 20 test cases that show how the heuristics of the method should work.
While the code below passes the tests, it is quite horrible and unreadable. I'm sure there is a more compact and readable way to implement the method, possibly including smarter use of regexes.
My question is simply: given the code below and the test cases, how would you improve parse_decimal(...) to make it more compact and readable while still passing the tests?
Clarifications:
- Clarification #1: As pointed out in the comments, the case ^\d{1,3}[\.,]\d{3}$ is ambiguous in that one cannot determine logically which character is used as the thousands separator and which is used as the decimal separator. In ambiguous cases we'll simply assume that US-style decimals are used: "," as thousands separator and "." as decimal separator.
- Clarification #2: If you believe that any of the test cases is wrong, please state which tests should be changed and how.
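To make the ambiguous case concrete, here is a small sketch that only checks which inputs match the pattern from Clarification #1. The helper name is_ambiguous is hypothetical and is not part of the code in question; it detects the ambiguity and does not resolve it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): returns 1 when the
# input is one to three digits, a single "." or ",", and exactly three
# digits, so the separator could be read either way.
sub is_ambiguous {
    my ($input) = @_;
    return $input =~ /^\d{1,3}[.,]\d{3}$/ ? 1 : 0;
}

# Print a classification for a few sample inputs.
for my $sample ("12,345", "12.345", "1234,567", "12.34", "1,234,567") {
    printf "%-10s %s\n", $sample,
        is_ambiguous($sample) ? "ambiguous" : "unambiguous";
}
```

Only inputs with a single separator and a group of exactly three trailing digits match; anything with four or more leading digits, or with more than one separator, falls outside the pattern.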
The code in question including the test cases:
#!/usr/bin/perl -wT
use strict;
use warnings;
use Test::More tests => 20;
ok(&parse_decimal("1,234,567") == 1234567);
ok(&parse_decimal("1,234567") == 1.234567);
ok(&parse_decimal("1.234.567") == 1234567);
ok(&parse_decimal("1.234567") == 1.234567);
ok(&parse_decimal("12,345") == 12345);
ok(&parse_decimal("12,345,678") == 12345678);
ok(&parse_decimal("12,345.67") == 12345.67);
ok(&parse_decimal("12,34567") == 12.34567);
ok(&parse_decimal("12.34") == 12.34);
ok(&parse_decimal("12.345") == 12345);
ok(&parse_decimal("12.345,67") == 12345.67);
ok(&parse_decimal("12.345.678") == 12345678);
ok(&parse_decimal("12.34567") == 12.34567);
ok(&parse_decimal("123,4567") == 123.4567);
ok(&parse_decimal("123.4567") == 123.4567);
ok(&parse_decimal("1234,567") == 1234.567);
ok(&parse_decimal("1234.567") == 1234.567);
ok(&parse_decimal("12345") == 12345);
ok(&parse_decimal("12345,67") == 12345.67);
ok(&parse_decimal("1234567") == 1234567);
sub parse_decimal($) {
    my $input = shift;
    $input =~ s/[^\d,\.]//g;
    if ($input !~ /[,\.]/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d+\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d+,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d+\.\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d+,\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d{4},\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d{4}\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d,\d{3}$/) {
        return &parse_with_separators($input, '.', ',');
    } elsif ($input =~ /\d\.\d{3}$/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d,\d/) {
        return &parse_with_separators($input, ',', '.');
    } elsif ($input =~ /\d\.\d/) {
        return &parse_with_separators($input, '.', ',');
    } else {
        return &parse_with_separators($input, '.', ',');
    }
}
sub parse_with_separators($$$) {
    my $input = shift;
    my $decimal_separator = shift;
    my $thousand_separator = shift;
    my $output = $input;
    $output =~ s/\Q${thousand_separator}\E//g;
    $output =~ s/\Q${decimal_separator}\E/./g;
    return $output;
}
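
For comparison, here is one possible direction for a more compact version, given only as a rough sketch. The name parse_decimal_sketch and its structure are my own assumptions, not an established solution. It counts the separators instead of chaining regexes, and it treats a lone separator in the \d{1,3}[.,]\d{3} form as a thousands separator, which is what the test cases above expect.

```perl
use strict;
use warnings;

# Hypothetical alternative to parse_decimal; the name and the decision
# rules are illustrative assumptions derived from the 20 test cases.
sub parse_decimal_sketch {
    my ($input) = @_;
    $input =~ s/[^\d,.]//g;               # keep digits and separators only

    my $commas = () = $input =~ /,/g;     # number of "," in the input
    my $dots   = () = $input =~ /\./g;    # number of "." in the input

    my $decimal;
    if ($commas && $dots) {
        # Both present: whichever appears last is the decimal separator.
        $decimal = rindex($input, ',') > rindex($input, '.') ? ',' : '.';
    } elsif ($commas > 1 || $dots > 1) {
        # A repeated separator can only be a thousands separator.
        $decimal = $commas ? '.' : ',';
    } elsif ($input =~ /^\d{1,3}([.,])\d{3}$/) {
        # Ambiguous form: read the lone separator as a thousands separator.
        $decimal = $1 eq ',' ? '.' : ',';
    } elsif ($input =~ /([.,])/) {
        # A single separator in any other position is the decimal separator.
        $decimal = $1;
    } else {
        $decimal = '.';                   # no separators at all
    }

    my $thousands = $decimal eq '.' ? ',' : '.';
    $input =~ s/\Q$thousands\E//g;        # drop thousands separators
    $input =~ s/\Q$decimal\E/./;          # normalize the decimal separator
    return $input;
}
```

The sketch returns a string such as "12345.67", so it can be compared numerically with == in the same way as the tests above. Inputs outside the 20 listed cases (for example, mixed separators in an inconsistent order) are not considered here.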