ansaurus

Question

Answer 1

+3 A:

string test = "1.85783-16";
char[] signs = { '+', '-' };

int decimalPos = test.IndexOf('.');
int signPos = test.LastIndexOfAny(signs); 

string result = (signPos > decimalPos) ?
     string.Concat(
         test.Substring(0, signPos), 
         "E", 
         test.Substring(signPos)) : test;

float.Parse(result).Dump();  // Dump() is LINQPad's output helper; prints 1.85783E-16

The ideas I'm using here are to check that the decimal point comes before the sign (thus avoiding any problems if the exponent is missing) and to use LastIndexOfAny() to work from the back (ensuring we find the exponent if one exists). If there is a possibility of a prefix "+", the condition would need to include || signPos < decimalPos.

Other results:

"1.85783" => "1.85783"; //Missing exponent is returned clean
"-1.85783" => "-1.85783"; //Sign prefix returned clean
"-1.85783-3" => "-1.85783E-3"; //Sign prefix and exponent coexist peacefully.

According to the comments a test of this method shows only a 5% performance hit (after avoiding the String.Format(), which I should have remembered was awful). I think the code is much clearer: only one decision to make.
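As a rough sketch, the snippet above could be wrapped in a reusable method along these lines. The names `ExponentFixer`, `InsertExponentMarker` and `Parse` are illustrative and not from the original post. The sketch assumes every value contains a decimal point and that '.' is always the decimal separator, which is why it parses with the invariant culture.

```csharp
using System;
using System.Globalization;

public static class ExponentFixer
{
    private static readonly char[] Signs = { '+', '-' };

    // Inserts an 'E' before a trailing exponent sign,
    // e.g. "1.85783-16" becomes "1.85783E-16".
    // Assumption: the input always contains a decimal point; a value
    // such as "-5" (no '.') would be mangled by this check.
    public static string InsertExponentMarker(string text)
    {
        int decimalPos = text.IndexOf('.');
        int signPos = text.LastIndexOfAny(Signs);

        // Only a sign that appears after the decimal point is an exponent sign.
        return (signPos > decimalPos)
            ? string.Concat(text.Substring(0, signPos), "E", text.Substring(signPos))
            : text;
    }

    // Parses with the invariant culture so '.' is always the decimal separator.
    public static float Parse(string text)
    {
        return float.Parse(
            InsertExponentMarker(text),
            NumberStyles.Float,
            CultureInfo.InvariantCulture);
    }
}
```

Calling `ExponentFixer.Parse("-1.85783-3")` would then go through the same single decision as the snippet above before handing the string to float.Parse().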

Godeke 2009-11-13 22:33:57

Benchmarking against his corpus, expanded to 25k lines, shows about a 40% slowdown, mostly due to the String.Format.

sixlettervariables 2009-11-13 22:54:13

Hmmm. I'm not seeing anything close to that much slowdown vs standard concatenation, although it is slower. Nevertheless, edited to concatenation.

Godeke 2009-11-13 23:02:03

Intel X5260: 50-60ms v. 80-90ms. Range of runtimes after warming each of them up.

sixlettervariables 2009-11-13 23:11:32

You're correct, switching to String.Concat(a, b, c) improves the performance of yours greatly, to only 5% behind Brian's.

sixlettervariables 2009-11-13 23:16:45

You could use `LastIndexOfAny(new[] { '+', '-' })` to find `signPos` in a single hit.

LukeH 2009-11-13 23:28:27

As far as I can tell, that seems to make no appreciable difference in speed, but it is still worth switching to. I think that is probably because of how narrow the columns are (11 characters).

sixlettervariables 2009-11-13 23:39:28

Cleaned up the sample to use LastIndexOfAny. I was expecting that to slow it down to be honest.

Godeke 2009-11-13 23:44:27

If you look in Reflector, it calls an unmanaged routine which knows the length of the string. Probably why it doesn't slow it down.

sixlettervariables 2009-11-13 23:55:11

Are you really assigning `test` to `result` unconditionally every time? If so, try changing things so that you only do that assignment when your `if` condition isn't met: `string result; if (signPos > decimalPos) result = string.Concat(...); else result = test;`

LukeH 2009-11-14 00:24:15

True. I whipped it up in LINQPad and didn't make it a function. To be honest I think it would read better as result = (signPos > decimalPos) ? string.Concat(...) : test;

Godeke 2009-11-14 00:39:59

Answer 2

A:

Could you possibly use a regular expression to pick out each occurrence?

Some information here on suitable expressions:

http://www.regular-expressions.info/floatingpoint.html
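One way the regular-expression idea might look is sketched below. The pattern is an assumption of mine rather than something taken from the linked page: it treats a sign followed by digits at the end of the string as the exponent, but only when that sign directly follows a digit. The names `RegexExponentFixer` and `InsertExponentMarker` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExponentFixer
{
    // A sign plus digits at the end of the string, preceded by a digit.
    // The lookbehind keeps a leading sign such as "-1.85783" from matching.
    private static readonly Regex BareExponent =
        new Regex(@"(?<=\d)([+-]\d+)$", RegexOptions.Compiled);

    // Rewrites "1.85783-16" as "1.85783E-16"; other input is returned unchanged.
    public static string InsertExponentMarker(string text)
    {
        return BareExponent.Replace(text, "E$1");
    }
}
```

The result can be passed to float.Parse() as usual. Given the timings discussed in this thread, a regular expression is unlikely to beat the index-based approaches, so this is mainly of interest for readability.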

Lee Meyers 2009-11-13 22:37:13

Answer 3

+2 A:

In terms of speed, your original solution is the fastest I've tried so far (@Godeke's is a very close second). @Godeke's is much more readable, for only a minor amount of performance degradation. Add in some robustness checks, and his may be the long-term way to go. In terms of robustness, you can add that to yours like so:

static char[] signChars = new char[] { '+', '-' };

static float ParseFloatingPoint(string data)
{
    if (data.Length != EntryWidth)
    {
        throw new ArgumentException("data is not the correct size", "data");
    }
    else if (data[0] != ' ' && data[0] != '+' && data[0] != '-')
    {
        throw new ArgumentException("unexpected leading character", "data");
    }

    int signPos = data.LastIndexOfAny(signChars);

    // Found either a '+' or '-'
    if (signPos > 0)
    {
        // Create a new char array with an extra space to accommodate the 'e'
        char[] newData = new char[EntryWidth + 1];

        // Copy from string up to the sign
        for (int ii = 0; ii < signPos; ++ii)
        {
            newData[ii] = data[ii];
        }

        // Replace the sign with an 'e + sign'
        newData[signPos] = 'e';
        newData[signPos + 1] = data[signPos];

        // Copy the rest of the string
        for (int ii = signPos + 2; ii < EntryWidth + 1; ++ii)
        {
            newData[ii] = data[ii - 1];
        }

        return Single.Parse(
            new string(newData),
            NumberStyles.Float,
            CultureInfo.InvariantCulture);
    }
    else
    {
        Debug.Assert(false, "data does not have an exponential? This is odd.");
        return Single.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}

Benchmarks on my X5260 (including the times to just grok out the individual data points):

Code                Average Runtime  Values Parsed
--------------------------------------------------
Nothing (Overhead)            13 ms              0
Original                      50 ms         150000
Godeke                        60 ms         150000
Original Robust               56 ms         150000

sixlettervariables 2009-11-13 23:33:39

I greatly appreciate the benchmarking. In the end I think you're right about Godeke's being the better long-term solution. I'll accept the minor performance degradation in exchange for the maintainability and readability.

Brian Triplett 2009-11-16 03:10:39

Answer 4

A:

Why not just write a simple script to reformat the data file once and then use float.Parse()?

You said "thousands" of floating point numbers, so even a terribly naive approach will finish pretty quickly (if you said "trillions" I would be more hesitant), and code that you only need to run once will (almost) never be performance critical. Certainly it would take less time to run than posting the question to SO takes, and there's much less opportunity for error.

Stephen Canon 2009-11-14 01:06:05

Knowing a bit of information from the inside, this input file itself is not available for modification.

sixlettervariables 2009-11-15 16:38:46

Answer 5

+1 A:

Thanks Godeke for your continually improving edits.

I ended up changing the parameters of the parsing function to take a char[] rather than a string and used your basic premise to come up with the following.

    protected static float ParseFloatingPoint(char[] data)
    {
        int decimalPos = Array.IndexOf<char>(data, '.');
        int posSignPos = Array.LastIndexOf<char>(data, '+');
        int negSignPos = Array.LastIndexOf<char>(data, '-');

        int signPos = (posSignPos > negSignPos) ? posSignPos : negSignPos;

        string result;
        if (signPos > decimalPos)
        {
            char[] newData = new char[data.Length + 1];
            Array.Copy(data, newData, signPos);
            newData[signPos] = 'E';
            Array.Copy(data, signPos, newData, signPos + 1, data.Length - signPos);
            result = new string(newData);
        }
        else
        {
            result = new string(data);
        }

        return float.Parse(result, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

I changed the input to the function from string to char[] because I wanted to move away from ReadLine(). I'm assuming this would perform better than creating lots of strings. Instead I read a fixed number of bytes from the data file (since it will ALWAYS be 11-character-wide data), convert the byte[] to char[], and then perform the above processing to convert to a float.
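The byte[] to char[] step described above might be sketched as follows. The exact file layout is not given in this thread, so the sketch assumes the buffer holds back-to-back 11-byte ASCII fields with no separators or line terminators between them; the names `FixedWidthReader` and `SplitEntries` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FixedWidthReader
{
    // Field width stated in the post.
    public const int EntryWidth = 11;

    // Splits a buffer into consecutive 11-character entries.
    // Assumption: plain ASCII data with nothing between fields.
    // Any trailing partial field is ignored.
    public static List<char[]> SplitEntries(byte[] buffer)
    {
        var entries = new List<char[]>();
        for (int offset = 0; offset + EntryWidth <= buffer.Length; offset += EntryWidth)
        {
            entries.Add(Encoding.ASCII.GetChars(buffer, offset, EntryWidth));
        }
        return entries;
    }
}
```

Each char[] returned here could then be handed to a method like ParseFloatingPoint(char[]) above. If the real file has line terminators or padding between fields, the offsets would need adjusting accordingly.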

Brian Triplett 2009-11-16 04:29:57


Non-exponential formatted float
