Hello,

I have a UTF-8 formatted data file that contains thousands of floating point numbers. At the time it was designed the developers decided to omit the 'e' in the exponential notation to save space. Therefore the data looks like:

 1.85783+16 0.000000+0 1.900000+6-3.855418-4 1.958263+6 7.836995-4
-2.000000+6 9.903130-4 2.100000+6 1.417469-3 2.159110+6 1.655700-3
 2.200000+6 1.813662-3-2.250000+6-1.998687-3 2.300000+6 2.174219-3
 2.309746+6 2.207278-3 2.400000+6 2.494469-3 2.400127+6 2.494848-3
-2.500000+6 2.769739-3 2.503362+6 2.778185-3 2.600000+6 3.020353-3
 2.700000+6 3.268572-3 2.750000+6 3.391230-3 2.800000+6 3.512625-3
 2.900000+6 3.750746-3 2.952457+6 3.872690-3 3.000000+6 3.981166-3
 3.202512+6 4.437824-3 3.250000+6 4.542310-3 3.402356+6 4.861319-3

The problem is that float.Parse() will not work with this format. My interim solution was:

    protected static float ParseFloatingPoint(string data)
    {

        int signPos;
        char replaceChar = '+';

        // Skip over first character so that a leading + is not caught
        signPos = data.IndexOf(replaceChar, 1);

        // Didn't find a '+', so lets see if there's a '-'
        if (signPos == -1)
        {
            replaceChar = '-';
            signPos = data.IndexOf('-', 1);
        }

        // Found either a '+' or '-'
        if (signPos != -1)
        {
            // Create a new char array with an extra space to accommodate the 'e'
            char[] newData = new char[EntryWidth + 1];

            // Copy from string up to the sign
            for (int i = 0; i < signPos; i++)
            {
                newData[i] = data[i];
            }

            // Replace the sign with an 'e + sign'
            newData[signPos] = 'e';
            newData[signPos + 1] = replaceChar;

            // Copy the rest of the string
            for (int i = signPos + 2; i < EntryWidth + 1; i++)
            {
                newData[i] = data[i - 1];
            }

            return float.Parse(new string(newData), NumberStyles.Float, CultureInfo.InvariantCulture);
        }
        else
        {
            return float.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
        }
    }

I can't call a simple String.Replace() because it will replace any leading negative signs. I could use substrings but then I'm making LOTS of extra strings and I'm concerned about the performance.

Does anyone have a more elegant solution to this?

+3  A: 
string test = "1.85783-16";
char[] signs = { '+', '-' };

int decimalPos = test.IndexOf('.');
int signPos = test.LastIndexOfAny(signs); 

string result = (signPos > decimalPos) ?
     string.Concat(
         test.Substring(0, signPos), 
         "E", 
         test.Substring(signPos)) : test;

float.Parse(result).Dump();  //1.85783E-16

The ideas I'm using here ensure the decimal comes before the sign (thus avoiding any problems if the exponent is missing) as well as using LastIndexOf() to work from the back (ensuring we have the exponent if one existed). If there is a possibility of a prefix "+" the first if would need to include || signPos < decimalPos.

Other results:

"1.85783" => "1.85783"; //Missing exponent is returned clean
"-1.85783" => "-1.85783"; //Sign prefix returned clean
"-1.85783-3" => "-1.85783E-3" //Sign prefix and exponent coexist peacefully.

According to the comments a test of this method shows only a 5% performance hit (after avoiding the String.Format(), which I should have remembered was awful). I think the code is much clearer: only one decision to make.
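
For reference, the snippet above can be wrapped in a reusable function roughly as follows. This is only a sketch: the class and method names (`ExponentParser`, `InsertExponentMarker`) are made up for illustration, it assumes every field that has an exponent also contains a decimal point, and it passes `CultureInfo.InvariantCulture` so the result does not depend on the machine's locale.

```csharp
using System;
using System.Globalization;

public static class ExponentParser
{
    private static readonly char[] Signs = { '+', '-' };

    // Inserts the missing 'E' in front of the exponent sign, if there is one.
    public static string InsertExponentMarker(string field)
    {
        int decimalPos = field.IndexOf('.');
        int signPos = field.LastIndexOfAny(Signs);

        // Assumption: a sign located after the decimal point can only be
        // the exponent sign. Fields without a '.' are returned unchanged.
        return (decimalPos >= 0 && signPos > decimalPos)
            ? string.Concat(field.Substring(0, signPos), "E", field.Substring(signPos))
            : field;
    }

    public static float Parse(string field)
    {
        return float.Parse(
            InsertExponentMarker(field),
            NumberStyles.Float,
            CultureInfo.InvariantCulture);
    }
}
```

Calling `ExponentParser.Parse(" 2.100000+6")` goes through `" 2.100000E+6"`; the leading space is accepted because `NumberStyles.Float` allows leading white space.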

Godeke
Benchmarking against his corpus, expanded to 25k lines, shows about a 40% slowdown, mostly due to the String.Format.
sixlettervariables
Hmmm. I'm not seeing anything close to that much slowdown vs standard concatenation, although it is slower. Nevertheless, edited to concatenation.
Godeke
Intel X5260: 50-60ms v. 80-90ms. Range of runtimes after warming each of them up.
sixlettervariables
You're correct, switching to String.Concat(a, b, c) improves the performance of yours greatly, to only 5% behind Brian's.
sixlettervariables
You could use `LastIndexOfAny(new[] { '+', '-' })` to find `signPos` in a single hit.
LukeH
As far as I can tell, that seems to make no appreciable difference in speed, but it is still worth switching to. I think that is probably because of how narrow the columns are (11 characters).
sixlettervariables
Cleaned up the sample to use LastIndexOfAny. I was expecting that to slow it down to be honest.
Godeke
If you look in Reflector, it calls an unmanaged routine which knows the length of the string. Probably why it doesn't slow it down.
sixlettervariables
Are you really assigning `test` to `result` unconditionally every time? If so, try changing things so that you only do that assignment when your `if` condition isn't met: `string result; if (signPos > decimalPos) result = string.Concat(...); else result = test;`
LukeH
True. I whipped it up in LINQPad and didn't make it a function. To be honest I think it would read better as result = (signPos > decimalPos) ? string.Concat(...) : test;
Godeke
A: 

Could you possibly use a regular expression to pick out each occurrence?

Some information here on suitable expressions:

http://www.regular-expressions.info/floatingpoint.html
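
A rough sketch of what a regex-based version might look like, applied to one field at a time. The class name `RegexExponentParser` and the pattern are my own assumptions rather than anything from the linked page: the pattern treats a sign that follows ".digits" and is followed by a digit as the exponent sign, and relies on .NET's support for variable-length lookbehind. I have not benchmarked it against the index-based approaches.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexExponentParser
{
    // Assumption: a sign preceded by ".digits" and followed by a digit
    // is the exponent sign; a leading sign never matches.
    private static readonly Regex ExponentSign =
        new Regex(@"(?<=\.\d+)([+-])(?=\d)", RegexOptions.Compiled);

    public static float Parse(string field)
    {
        // Put an 'E' in front of the captured sign, e.g. "-4" becomes "E-4".
        string normalized = ExponentSign.Replace(field, "E$1");
        return float.Parse(normalized, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```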

Lee Meyers
+2  A: 

In terms of speed, your original solution is the fastest I've tried so far (@Godeke's is a very close second). @Godeke's has a lot of readability, for only a minor amount of performance degradation. Add in some robustness checks, and his may be the long term way to go. In terms of robustness, you can add that in to yours like so:

static char[] signChars = new char[] { '+', '-' };

static float ParseFloatingPoint(string data)
{
    if (data.Length != EntryWidth)
    {
        throw new ArgumentException("data is not the correct size", "data");
    }
    else if (data[0] != ' ' && data[0] != '+' && data[0] != '-')
    {
        throw new ArgumentException("unexpected leading character", "data");
    }

    int signPos = data.LastIndexOfAny(signChars);

    // Found either a '+' or '-'
    if (signPos > 0)
    {
        // Create a new char array with an extra space to accommodate the 'e'
        char[] newData = new char[EntryWidth + 1];

        // Copy from string up to the sign
        for (int ii = 0; ii < signPos; ++ii)
        {
            newData[ii] = data[ii];
        }

        // Replace the sign with an 'e + sign'
        newData[signPos] = 'e';
        newData[signPos + 1] = data[signPos];

        // Copy the rest of the string
        for (int ii = signPos + 2; ii < EntryWidth + 1; ++ii)
        {
            newData[ii] = data[ii - 1];
        }

        return Single.Parse(
            new string(newData),
            NumberStyles.Float,
            CultureInfo.InvariantCulture);
    }
    else
    {
        Debug.Assert(false, "data does not have an exponential? This is odd.");
        return Single.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}

Benchmarks on my X5260 (including the times to just grok out the individual data points):

Code                Average Runtime  Values Parsed
--------------------------------------------------
Nothing (Overhead)            13 ms              0
Original                      50 ms         150000
Godeke                        60 ms         150000
Original Robust               56 ms         150000
sixlettervariables
I greatly appreciate the benchmarking. In the end I think you're right that Godeke's is the better long-term solution. I'll accept the minor performance degradation in exchange for the maintainability and readability.
Brian Triplett
A: 

Why not just write a simple script to reformat the data file once and then use float.Parse()?

You said "thousands" of floating point numbers, so even a terribly naive approach will finish pretty quickly (if you said "trillions" I would be more hesitant), and code that you only need to run once will (almost) never be performance critical. Certainly it would take less time to run than posting the question to SO takes, and there's much less opportunity for error.

Stephen Canon
Knowing a bit of information from the inside, this input file itself is not available for modification.
sixlettervariables
+1  A: 

Thanks Godeke for your continually improving edits.

I ended up changing the parameters of the parsing function to take a char[] rather than a string and used your basic premise to come up with the following.

    protected static float ParseFloatingPoint(char[] data)
    {
        int decimalPos = Array.IndexOf<char>(data, '.');
        int posSignPos = Array.LastIndexOf<char>(data, '+');
        int negSignPos = Array.LastIndexOf<char>(data, '-');

        int signPos = (posSignPos > negSignPos) ? posSignPos : negSignPos;

        string result;
        if (signPos > decimalPos)
        {
            char[] newData = new char[data.Length + 1];
            Array.Copy(data, newData, signPos);
            newData[signPos] = 'E';
            Array.Copy(data, signPos, newData, signPos + 1, data.Length - signPos);
            result = new string(newData);
        }
        else
        {
            result = new string(data);
        }

        return float.Parse(result, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

I changed the input to the function from string to char[] because I wanted to move away from ReadLine(). I'm assuming this would perform better than creating lots of strings. Instead I read a fixed number of bytes from the data file (since it will ALWAYS be 11 char width data), convert the byte[] to char[], and then perform the above processing to convert to a float.
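
To illustrate the fixed-width idea, here is a minimal sketch that walks a single line in 11-character steps and reuses one buffer for the 'E' insertion. It is not the code I actually run: it works on a string that is already in memory rather than on bytes read from the file, the names `FixedWidthReader` and `ParseLine` are made up for the example, and any trailing characters shorter than one field are silently ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class FixedWidthReader
{
    private const int EntryWidth = 11; // every field is 11 characters wide
    private static readonly char[] Signs = { '+', '-' };

    public static List<float> ParseLine(string line)
    {
        var values = new List<float>();
        char[] buffer = new char[EntryWidth + 1]; // one extra slot for the 'E'

        for (int start = 0; start + EntryWidth <= line.Length; start += EntryWidth)
        {
            // Search only inside the current field.
            int decimalPos = line.IndexOf('.', start, EntryWidth);
            int signPos = line.LastIndexOfAny(Signs, start + EntryWidth - 1, EntryWidth);

            int length;
            if (decimalPos >= 0 && signPos > decimalPos)
            {
                int head = signPos - start;                  // chars before the exponent sign
                line.CopyTo(start, buffer, 0, head);
                buffer[head] = 'E';
                line.CopyTo(signPos, buffer, head + 1, EntryWidth - head);
                length = EntryWidth + 1;
            }
            else
            {
                // No exponent found: parse the field as it is.
                line.CopyTo(start, buffer, 0, EntryWidth);
                length = EntryWidth;
            }

            values.Add(float.Parse(
                new string(buffer, 0, length),
                NumberStyles.Float,
                CultureInfo.InvariantCulture));
        }

        return values;
    }
}
```

With the sample data, `FixedWidthReader.ParseLine(" 1.900000+6-3.855418-4")` yields two values even though there is no space between the fields.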

Brian Triplett