tags:
views: 1103
answers: 4

Let's say you want to count the occurrences of chars in some text.

The fastest way I could think of was to use an array like unsigned char charcounts[256], initialize it to zeros, then look at each char in the text input and do charcounts[c]++. Then linearly search charcounts[] using two vars to keep track of the lowest (so far) char and its count, replacing them with the new char/count whenever we find a lower one, until we get to the end.

So "text" would be t=2, e=1, x=1.

Is there a faster way to do this?

A: 

That sounds like one of the most efficient ways to do what you describe. I'm not sure what you want to do with the second part; it sounds like you want to find the character that has the minimum number of occurrences in the source data?

Greg Hewgill
yeah that's right.
So you want to know the character that occurs least in the string, but at least once? How about if two characters have the same number of occurrences?
Matthew Schinckel
The first (by ASCII ordinal) least occurring char is the one I want. I'm mainly just curious about the linear search of the counts array. It's O(n), and I was curious if there is a faster algorithm. I looked into heaps, which can return the lowest in O(1) but adjust in O(lg n), which would be O(n lg n)
+1  A: 

You've described two tasks here. The first is to count the number of times each ASCII character occurs in a stream, and the second attempts to find the lowest frequency character.

The first algorithm seems like it's pretty efficient. Off the top of my head I can't think of a quicker way.

I'm less sure of your second algorithm, however. You don't explicitly say why you want to find the lowest frequency character, or what the input data is, but I can imagine it's easily possible to have more than one character that has a frequency count of zero, so how do you want to differentiate between them?

Andrew
+3  A: 

The first part - counting letter frequencies

Two issues to point out, assuming the language here is C or C++:

  • Your code won't handle letters occurring > 255 times (or 127 if char happens to be signed.) Making "charcounts" an array of ints probably won't impact performance much at all.
  • Your code won't work for unicode/international characters

The second part - locating the least frequent letter

  • If you're dealing with short strings ("text", "fred"), then scanning all 256 entries in your table is the rate-determining step. You're better off keeping track of the lowest frequency letter in the initial scanning loop (see the sketch after this list).
  • But, if you do want to scan all 256 entries, you can exit the loop as soon as you hit a "one" value (or zero, if that's how your algorithm is intended to work)
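
One way to act on the first point is to remember which characters actually appeared during the counting pass, so the final search only touches those entries instead of all 256 table slots. A rough sketch, assuming C and ties broken towards the lowest character value:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "text";
        unsigned int counts[256] = {0};     /* int-sized counts avoid the 255 wrap-around */
        unsigned char seen[256];            /* distinct characters, in order of first appearance */
        int nseen = 0;

        for (const unsigned char *p = (const unsigned char *)text; *p; p++)
            if (counts[*p]++ == 0)          /* first time we meet this character */
                seen[nseen++] = *p;

        /* The final search only visits characters that actually occurred */
        int best = -1;
        for (int i = 0; i < nseen; i++) {
            unsigned char c = seen[i];
            if (best < 0 || counts[c] < counts[best] ||
                (counts[c] == counts[best] && c < best))    /* tie: lower ASCII value wins */
                best = c;
        }

        if (best >= 0)
            printf("least frequent: '%c' (%u times)\n", best, counts[best]);
        return 0;
    }

The early-exit idea in the second bullet applies to the full ascending scan of the table: once you hit a count of one you can stop, since nothing can occur less often and lower character values have already been checked.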
Roddy
I tried to accept your answer but it didn't work. That seems like the fastest way...
+2  A: 

The first part of your algorithm is counting the characters - this is just generating keys to sort by.

If you know that you are only using alphabetical characters [A-Za-z ]* then you could optimize your algorithm by reducing the number of buckets used, but this is only a minor tweak.

The second part is just a stable sort - there are many ways of doing this - the wikipedia page on sorting gives a good summary. If you are only interested in the character that occurs least, then the ("Phase 2") method you describe is probably as efficient as you can get.

The only other way I can think of improving this is if you can split your letters into a fixed number of buckets (say, 16) evenly across the character range, and then recurse on each bucket. Any buckets with no characters can be thrown away, saving some time in the scanning/sorting phase. Similarly, if a bucket contains only one distinct character then it is done. You also want to make sure that you only split a bucket into 16 more when you know that there is more than one different character in it.

Using the word "test" as an example (assuming 4 buckets and only lower case characters):

  1. generate 4 buckets (A-G, H-M, N-T, U-Z)
  2. split the word test:
    • A-G: e
    • H-M:
    • N-T: tst
    • U-Z:
  3. recurse into the remaining buckets (A-G has only one character, so this must be the least frequent and we can stop)
  4. If this wasn't the case (as for the word "testes"), we can see H-M and U-Z are empty, A-G contains only e's (one distinct character, so it is done), and we would only need to recurse into N-T (which would contain tsts).
    • We create 4 buckets (N-O, P-Q, R-S and T).
    • Split up the letters
    • etc.

The advantage of this method is that we have not had to scan every letter. If the character range is a fixed size, then both these methods are at best O(n) where n is the length of the string (this is inevitable since we always have to look at every character), although constructing the lists of characters in my example may make the algorithm as bad as O(n^2). However as the character range increases, especially for short strings, using sub-buckets will greatly increase the performance. For a Unicode string you could use a hybrid approach - for example, separating out all the non-ASCII characters in the first phase and using your simpler method for the ASCII portion.
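
A rough sketch of the bucketing idea, assuming C, a 4-way split as in the example, and ties broken towards the lowest character value (the temporary lists built with malloc are what can push this towards O(n^2), as noted above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Result of searching one bucket: ch == -1 means the bucket was empty. */
    struct result { int ch; size_t count; };

    #define NBUCKETS 4   /* 4-way split, as in the example above; 16 works the same way */

    /* Least frequent character (lowest value on ties) among chars[0..n-1],
       all of which are known to lie in the code range [lo, hi]. */
    static struct result least_frequent(const unsigned char *chars, size_t n, int lo, int hi)
    {
        struct result best = { -1, 0 };
        if (n == 0)
            return best;

        /* A bucket whose characters are all identical is done: one candidate, count n. */
        int all_same = 1;
        for (size_t i = 1; i < n; i++)
            if (chars[i] != chars[0]) { all_same = 0; break; }
        if (lo == hi || all_same) {
            best.ch = chars[0];
            best.count = n;
            return best;
        }

        /* Otherwise split the range into NBUCKETS sub-ranges and recurse on each. */
        int span = hi - lo + 1;
        for (int b = 0; b < NBUCKETS; b++) {
            int blo = lo + (span * b) / NBUCKETS;
            int bhi = lo + (span * (b + 1)) / NBUCKETS - 1;
            if (bhi < blo)
                continue;

            unsigned char *sub = malloc(n);    /* characters falling in this sub-range */
            size_t m = 0;
            for (size_t i = 0; i < n; i++)
                if (chars[i] >= blo && chars[i] <= bhi)
                    sub[m++] = chars[i];

            struct result r = least_frequent(sub, m, blo, bhi);  /* empty buckets return ch == -1 */
            free(sub);

            if (r.ch >= 0 && (best.ch < 0 || r.count < best.count ||
                              (r.count == best.count && r.ch < best.ch)))
                best = r;
        }
        return best;
    }

    int main(void)
    {
        const char *text = "testes";
        struct result r = least_frequent((const unsigned char *)text, strlen(text), 0, 255);
        if (r.ch >= 0)
            printf("least frequent: '%c' (%zu times)\n", r.ch, r.count);
        return 0;
    }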

Oliver Hallam