tags:

views:

243

answers:

8

I am working on a system that is recognizing paper documents using OCR engines. These documents are invoices containing amounts such as total, vat and net amounts. I need to parse these amount strings into numbers, but they are coming in many formats and flavors using different symbols for decimal and thousands separation in the number in each invoice. If I am trying to use the normal double.tryparse and double.parse methods in .NET then they normally fail for some of the amounts

These are some of the examples I receive as amount

"3.533,65" =>  3533.65 
"-133.696" => -133696
"-33.017" => -33017
"-166.713" => -166713
"-5088,8" => -5088.8 
"0.423" => 0.423
"9,215,200" => 9215200
"1,443,840.00" => 1443840

I need some way to guess what the decimal separator and the thousand separator is in the number and then present the value to the user to decide if this is correct or not.

I am wondering how to solve this problem in an elegant way.

+2  A: 

You should be able to that with Double.TryParse. Your biggest problem as I see it is that you have inconsistencies in the way you interpret the numbers.

For example, how can

"-133.696" => -133696

When

"-166.713" => -166.713

?

klausbyskov
The amounts are consistent inside the document, but not consistent if we look at them over all of the documents
gyurisc
+9  A: 

I'm not sure you'll be able to get an elegant way of figuring this out, because it's always going to be ambigious if you can't tell it where the data is from.

For example, the numbers 1.234 and 1,234 are both valid numbers, but without establishing what the symbols mean you won't be able to tell which is which.

Personally, I would write a function which attempted to do a "best guess" based on some rules...

  • If the number contains , BEFORE ., then , must be for thousands and . must be for decimals
  • If the number contains . BEFORE ,, then . must be for thousands and , must be for decimals
  • If there are >1 , symbols, the thousand separator must be ,
  • If there are >1 . symbols, the thousand separator must be .
  • If there is only 1 , how many numbers follow it? If it's NOT 3, then it must be the decimal separator (same rule for .)
  • If there are 3 numbers separating it (e.g. 1,234 and 1.234), perhaps you could put this number aside and parse other numbers on the same page to try and figure out if they use different separators, then come back to it?

Once you've figured out the decimal separate, remove any thousand separators (not needed for parsing the number) and ensure the decimal separator is . in the string you are parsing. Then you can pass this into Double.TryParse

Richard
Your first rule is wrong for European numbers, of which it looks liek there are some examples, e.g. 1.840.456,34 is a number in European format.
ck
Yeah, I thought that would probably happen. I missed the . as thousand separators before. I've duplicated the rules to account for them now.
Richard
+2  A: 

If the rules for converting the numbers aren't consistent then you won't be able to solve this in code. As klausbyskov pointed out, why does the period in "-133.696" have a different meaning than the one in "-166.713"? How would you know what to do with a number containing a decimal point given these 2 examples where one is using it as expected but the other is using it as a thousand separator?

TabbyCool
Correct. In that case my algorithm is going to fail and the user should decide about the correct format
gyurisc
Good luck with that! I think that for those of us working with external or legacy data, this kind of thing can really be a pain(you should see the variety of different date formats we have to deal with here!). Did you see the comment from ammoQ? Is there any pattern regarding the number of digits after a comma or decimal point that could give you a hint as to how the numbers should be formatted?
TabbyCool
+2  A: 

You'll need to define the various cases you're likely to encounter, create some logic to match each incoming string to one of your cases, and then parse it specifying an appropriate FormatProvider. For example - if your string contains a decimal point BEFORE a comma, then you can assume that for this particular string, they're using the decimal point as the thousands separator and the comma as the decimal separator, so you can construct a format provider to cope with this scenario.

Try something along these lines:

public IFormatProvider GetParseFormatProvider(string s) {
  var nfi = new CultureInfo("en-US", false).NumberFormat;
  if (/* s contains period before comma */) {
    nfi.NumberDecimalSeparator = ",";
    nfi.NumberGroupSeparator = ".";
  } else if (/* some other condition */) {
     /* construct some other format provider */
  }
  return(nfi);
}

and then use Double.Parse(myString, GetParseFormatProvider(myString)) to perform the actual parsing.

Dylan Beattie
+1  A: 

"and then present the value to the user to decide if this is correct or not."

If there are multiple possibilities, why not show the user both of them?

You can have multiple methods calling TryParse with the different cultures you want to be able to handle, and collect the parse results for those methods that succeed in a list (removing duplicates).

You could even estimate the likelihood of the different possiblities being correct based on what frequency the various formats are used elsewhere in the document, and present the alternatives in a list sorted by likelihood of being correct. For example, if you have seen a lot of numbers like 3,456,231.4 already then you can guess that comma is probably the thousands seperator when you see 4,675 later in the same document, and present "4675" first in the list, and "4.675" second.

Mark Byers
+3  A: 

You will have to create your own function to guess what is the decimal separator and the thousand separator. Then you will be able to double.Parse but with the corresponding CultureInfo.

I recommend to do something like this (just an i.e. this is not a production tested function):

private CultureInfo GetNumbreCultureInfo(string number)
    {
        CultureInfo dotDecimalSeparator = new CultureInfo("En-Us");
        CultureInfo commaDecimalSeparator = new CultureInfo("Es-Ar");

        string[] splitByDot = number.Split('.');
        if (splitByDot.Count() > 2) //has more than 1 . so the . is the thousand separator
            return commaDecimalSeparator; //return a cultureInfo where the thousand separator is the .

        //the same for the ,
        string[] splitByComma = number.Split(',');
        if (splitByComma.Count() > 2)
            return dotDecimalSeparator;

        //if there is no , or . return an invariant culture
        if (splitByComma.Count() == 1 && splitByDot.Count() == 1)
            return CultureInfo.InvariantCulture;

        //if there is only 1 . or 1 , lets check witch is the last one
        if (splitByComma.Count() == 2)
            if (splitByDot.Count() == 1)
                if (splitByComma.Last().Length != 3) // , its a decimal separator
                    return commaDecimalSeparator;
                else// here you dont really know if its the dot decimal separator i.e 100.001 this can be thousand or decimal separator
                    return dotDecimalSeparator;
            else //here you have something like 100.010,00 ir 100.010,111 or 100,000.111
            {
                if (splitByDot.Last().Length > splitByComma.Last().Length) //, is the decimal separator
                    return commaDecimalSeparator;
                else
                    return dotDecimalSeparator;
            }
        else
            if (splitByDot.Last().Length != 3) // . its a decimal separator
                return dotDecimalSeparator;
            else
                return commaDecimalSeparator; //again you really dont know here... i.e. 100,101
    }

you can do a quick test like this:

string[] numbers = { "100.101", "1.000.000,00", "100.100,10", "100,100.10", "100,100.100", "1,00" };

        decimal n;
        foreach (string number in numbers)
        {
            if (decimal.TryParse(number, NumberStyles.Any, GetNumbreCultureInfo(number), out n))
                MessageBox.Show(n.ToString());//the decimal was parsed
            else
                MessageBox.Show("there was problems parsing");
        }

Also look the if where you dont really know witch is the separator (like 100,010 or 100.001) where can be a decimal or thousand separator.

You can save this looking in the document for a number with the amount of data necessary to know witch is the culture of the document, save that culture and use always the same culture (if you can asume that the document is all in the same culture...)

Hope this will help

MF
You could also add a couple of additional checks: if `splitByDot[0]` is either `0` or `-0` then return `dotDecimalSeparator`, and likewise for `splitByComma[0]`/`commaDecimalSeparator`.
LukeH
Seems a very long winded way of doing things when you could do the same very simply using regular expressions...
BobTheBuilder
Yeap you are right, I doesn't thought that when I response...
MF
A: 

If you have a dot or comma followed by no more than two digits, it's the decimal point. Otherwise, ignore it.

ammoQ
+6  A: 

I would probably set up a list of rules that are specified in order of preference, this way you can plug rules in by precedence. You can then parse the list based on regex matches returning the correct rule.

A quick prototype would be very easy to set up similar to:

public class FormatRule
{
    public string Pattern { get; set; }
    public CultureInfo Culture { get; set; }

    public FormatRule(string pattern, CultureInfo culture)
    {
        Pattern = pattern;
        Culture = culture;
    }
}

Now a list of of FormatRule used to store your rules in order of precedence:

List<FormatRule> Rules = new List<FormatRule>()
{
    /* Add rules in order of precedence specifying a culture
     * that can handle the pattern, I've chosen en-US and fr-FR
     * for this example, but equally any culture could be swapped
     * in for various formats you may need to use */
    new FormatRule(@"^0.\d+$", CultureInfo.GetCultureInfo("en-US")),
    new FormatRule(@"^0,\d+$", CultureInfo.GetCultureInfo("fr-FR")),
    new FormatRule(@"^[1-9]+.\d{4,}$", CultureInfo.GetCultureInfo("en-US")),
    new FormatRule(@"^[1-9]+,\d{4,}$", CultureInfo.GetCultureInfo("fr-FR")),
    new FormatRule(@"^-?[1-9]{1,3}(,\d{3,})*(\.\d*)?$", CultureInfo.GetCultureInfo("en-US")),
    new FormatRule(@"^-?[1-9]{1,3}(.\d{3,})*(\,\d*)?$", CultureInfo.GetCultureInfo("fr-FR")),

    /* The default rule */
    new FormatRule(string.Empty, CultureInfo.CurrentCulture)
}

You should then be able to iterate your list looking for the correct rule to apply:

public CultureInfo FindProvider(string numberString)
{
    foreach(FormatRule rule in Rules)
    {
        if (Regex.IsMatch(numberString, rule.Pattern))
            return rule.Culture;
    }
    return Rules[Rules.Count - 1].Culture;
}

This setup allows you to easily manage rules and set precedence on when something should be handled one way or another. It also allows you to be able to specify different cultures to handle one format one way and a different format another.

public float ParseValue(string valueString)
{
    float value = 0;
    NumberStyles style = NumberStyles.Any;
    IFormatProvider provider = FindCulture(valueString).NumberFormat;
    if (float.TryParse(numberString, style, provider, out value))
        return value;
    else
        throw new InvalidCastException(string.Format("Value '{0}' cannot be parsed with any of the providers in the rule set.", valueString));
}

Finally, call your ParseValue() method to convert the string value you have to a float:

string numberString = "-123,456.78"; //Or "23.457.234,87"
float value = ParseValue(numberString);

You may decide to use a dictionary to save on the extra FormatRule class; the concept is the same... I used a list in the example because it makes it easier to query use LINQ. Also, you could easily replace the float type I've used for single, double or decimal if needed.

BenAlabaster
+1 This is the way I'd go about it. Nice code!
Daniel May
@Daniel - What this old thing? You're too generous ;)
BenAlabaster
I selected this as an answer because of the codesample. Thanks Ben
gyurisc
You're welcome, hope it provided inspiration
BenAlabaster