I have a string in the following format:

[Season] [Year] [Vendor] [Geography]

so an example might be: Spring 2009 Nielsen MSA

I need to be able to parse out Season and Year in the fastest way possible. I don't care about prettiness or cleverness, just raw speed. The language is C# using VS2008, but the assembly is being built for .NET 2.0.

+4  A: 

Try this.

        string str = "Spring 2009 Nielsen MSA";
        string[] words = str.Split(' ');
        str = words[0] + " " + words[1];
Spidey
I think you mean str = words[0] + " " + words[1];
AngryHacker
Are you sure it wouldn't be [0] and [1]? But that's the way I would have done it.
Daok
Split is the easiest to code, but not a good answer to the OP's question, which asks for the *fastest* implementation. Split will iterate over the whole length of the string. Methods like the one Jon Skeet proposes stop after the second space character in the string.
JeffH
No it is 2 and 3 because he wanted to parse out the season and year which would be 0 and 1.
Spidey
@Spidey: "parse out" means "get". Are you thinking "parse out" means "skip"?
JeffH
I think I misunderstood his question, 0 + 1 would be the correct indexes.
Spidey
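On JeffH's point about Split processing the whole string: a minimal sketch, assuming the `Split(char[], int)` overload is acceptable here. It caps the number of substrings returned, so everything after the year comes back as one unsplit remainder. Whether that actually saves time over plain Split depends on the framework's implementation and would need measuring. The class and method names are made up for illustration.

```csharp
public static class LimitedSplit
{
    // Hypothetical helper: returns "Season Year" from
    // "[Season] [Year] [Vendor] [Geography]".
    public static string SeasonAndYear(string text)
    {
        // Ask for at most three pieces: season, year, and the unsplit remainder.
        string[] words = text.Split(new char[] { ' ' }, 3);
        return words[0] + " " + words[1];
    }
}
```

This keeps the simplicity of Split while allocating fewer strings than splitting on every space.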
+10  A: 

If you only need the season and year, then:

int firstSpace = text.IndexOf(' ');
string season = text.Substring(0, firstSpace);
int secondSpace = text.IndexOf(' ', firstSpace + 1);
int year = int.Parse(text.Substring(firstSpace + 1, 
                                    secondSpace - firstSpace - 1));

If you can assume the year is always four digits, this is even faster:

int firstSpace = text.IndexOf(' ');
string season = text.Substring(0, firstSpace);
int year = int.Parse(text.Substring(firstSpace + 1, 4));

If additionally you know that all years are in the 21st century, it can get stupidly optimal:

int firstSpace = text.IndexOf(' ');
string season = text.Substring(0, firstSpace);
int year = 2000 + 10 * (text[firstSpace + 3] - '0') 
                + text[firstSpace + 4] - '0';

which becomes even less readable but possibly faster (depending on what the JIT does) as:

int firstSpace = text.IndexOf(' ');
string season = text.Substring(0, firstSpace);
int year = 1472 + 10 * text[firstSpace + 3] + text[firstSpace + 4];

Personally I think that's at least one step too far though :)
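For anyone wondering where 1472 comes from: the character '0' has code 48, so subtracting '0' from both digits amounts to subtracting 10 * 48 + 48 = 528, and 2000 - 528 = 1472. The sketch below (class and method names are illustrative only) checks that the readable and folded forms agree for every pair of digit characters.

```csharp
public static class YearConstantCheck
{
    // Readable form: convert each digit character by subtracting '0'.
    public static int Readable(char tens, char units)
    {
        return 2000 + 10 * (tens - '0') + (units - '0');
    }

    // Folded form: '0' is 48, so 2000 - 10 * 48 - 48 = 1472.
    public static int Folded(char tens, char units)
    {
        return 1472 + 10 * tens + units;
    }

    // Returns true if both forms agree for all digit pairs '00'..'99'.
    public static bool FormsAgree()
    {
        for (char t = '0'; t <= '9'; t++)
        {
            for (char u = '0'; u <= '9'; u++)
            {
                if (Readable(t, u) != Folded(t, u))
                    return false;
            }
        }
        return true;
    }
}
```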

EDIT: Okay, taking this to extremes... you're only going to have a few seasons, right? Suppose they're "Spring", "Summer", "Fall", "Winter" then you can do:

string season;
int yearStart;
if (text[0] == 'S')
{
    season = text[1] == 'p' ? "Spring" : "Summer";
    yearStart = 7;
}
else if (text[0] == 'F')
{
    season = "Fall";
    yearStart = 5;
}
else
{
    season = "Winter";
    yearStart = 7;
}

int year = 1472 + 10 * text[yearStart + 2] + text[yearStart + 3];

This has the advantage that it will reuse the same string objects. Of course, it assumes that there's never anything wrong with the data...

Using Split as shown in Spidey's answer is certainly simpler than any of this, but I suspect it'll be slightly slower. To be honest, I'd at least try that first... have you measured the simplest code and found that it's too slow? The difference is likely to be very slight - certainly compared with whatever network or disk access you've got reading in the data in the first place.
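To actually measure it, something along these lines should do. This is only a rough sketch: the iteration count is arbitrary, the class and method names are invented for the example, and a single Stopwatch run like this is a coarse indication rather than a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;

public static class ParseBenchmark
{
    // The simple approach: split on every space.
    public static string ViaSplit(string text)
    {
        string[] words = text.Split(' ');
        return words[0] + " " + words[1];
    }

    // The IndexOf approach: stop at the second space.
    public static string ViaIndexOf(string text)
    {
        int firstSpace = text.IndexOf(' ');
        int secondSpace = text.IndexOf(' ', firstSpace + 1);
        return text.Substring(0, secondSpace);
    }

    public static void Main()
    {
        const string text = "Spring 2009 Nielsen MSA";
        const int iterations = 1000000; // arbitrary; adjust to taste
        int total = 0;                  // consume results so the work is not skipped

        // Warm up so JIT compilation is not included in the timings.
        total += ViaSplit(text).Length + ViaIndexOf(text).Length;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            total += ViaSplit(text).Length;
        sw.Stop();
        Console.WriteLine("Split:   {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            total += ViaIndexOf(text).Length;
        sw.Stop();
        Console.WriteLine("IndexOf: {0} ms", sw.ElapsedMilliseconds);

        Console.WriteLine("(checksum {0})", total);
    }
}
```

Run it in a Release build without the debugger attached, and compare the numbers against the time spent reading the data in the first place.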

Jon Skeet
Would that actually execute faster? I can see that it looks obviously faster but I'm always suspicious of 'looks faster'.
Lazarus
Also if the year is always 4 digits there is probably a further optimisation eliminating the second IndexOf search.
Lazarus
@Lazarus: Good point about the year. Will edit.
Jon Skeet
Wow Jon... You must be having a slow day at work (that beats out any optimized insanity I would have considered, at least until after I saw it was still too slow)
Matthew Whited
+1  A: 
string input = "Spring 2009 Nielsen MSA";

int seasonIndex = input.IndexOf(' ') + 1;

string season = input.Substring(0, seasonIndex - 1);
string year = input.Substring(seasonIndex, input.IndexOf(' ', seasonIndex) - seasonIndex);
Adam Robinson
I think that's going to miss off the last letter of the season.
Jon Skeet
A: 

Class Parser:

using System.IO;   // StringReader
using System.Text; // StringBuilder

public class Parser : StringReader {

    public Parser(string s) : base(s) {
    }

    public string NextWord() {
        while ((Peek() >= 0) && (char.IsWhiteSpace((char) Peek())))
            Read();
        StringBuilder sb = new StringBuilder();
        do {
            int next = Read();
            if (next < 0)
                break;
            char nextChar = (char) next;
            if (char.IsWhiteSpace(nextChar))
                break;
            sb.Append(nextChar);
        } while (true);
        return sb.ToString();
    }
}

Use:

    string str = "Spring 2009 Nielsen MSA";
    Parser parser = new Parser(str);
    string season = parser.NextWord();
    string year = parser.NextWord();
    string vendor = parser.NextWord();
    string geography = parser.NextWord();
DreamWalker
I don't think this would be faster than `IndexOf`.
Groo
A: 

I'd go with Spidey's suggestion, which should give decent enough performance with simple, easy-to-follow, easy-to-maintain code.

But if you really need to push the performance envelope (and C# is the only tool available), then a couple of loops in series that search for the spaces, followed by Substring calls to pull the strings out, would probably outdo it marginally.

You could do the same with IndexOf instead of the loops, but rolling your own may be slightly faster (but you'd have to profile that).
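A rough sketch of the two-loops idea, assuming the input always contains at least two spaces (there is no validation, and the class and method names are just placeholders):

```csharp
public static class ManualScan
{
    // Finds the first two spaces with hand-rolled loops,
    // then extracts season and year with Substring.
    public static void Parse(string text, out string season, out string year)
    {
        int i = 0;
        while (text[i] != ' ')   // first loop: end of the season
            i++;
        int firstSpace = i;

        i++;
        while (text[i] != ' ')   // second loop: end of the year
            i++;

        season = text.Substring(0, firstSpace);
        year = text.Substring(firstSpace + 1, i - firstSpace - 1);
    }
}
```

Whether this beats IndexOf is exactly the thing to profile before keeping it.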

Phil Nash
+5  A: 

To add to the other answers, if you are expecting them to be in this format:

Spring xxxx
Summer xxxx
Autumn xxxx
Winter xxxx

then an even faster way would be:

string season = text.Substring(0, 6);
int year = int.Parse(text.Substring(7, 4));

That is rather nasty, though. :)

I wouldn't even consider coding like this.

Groo
My guess is that it'll be "Fall" rather than "Spring".
Jon Skeet
:) You're right, it never occurred to me (not my native language, sorry).
Groo
And I hope Jon meant "Autumn" and not "Spring"
Matthew Whited
+1  A: 
string[] split = stringName.Split(' ');
string seasonAndYear = split[0] + " " + split[1];
Antony Koch