tags:

views:

1303

answers:

5

Given a string S, what is the best algorithm to find a substring which repeats maximum number of times.

For example, in "assdssfssd", it is "ss" which repeats maximum number of times.

+1  A: 

I can see building a tree to solve that particular problem.

There is a notional root node. The first character is the first child. The second character is a child of the first character a -> s in your case. It also begins a new leaf of the root node. If, in adding a node, you visit an existing node, you increment its count (initial value 1).

Once done, you visit every node of the tree to find the one with the highest count at the deepest level (because if "asdf" occurs 5 times then "a", "as" and "asd" occur a minimum of 5 times, by definition).

cletus
+1  A: 

The substring that repeats the most will be a single letter, so you'll find the letter that occurs the most. This is quite easy:

>>> str = 'Can Berk Güder'
>>> letters = [l for l in str]
>>> uniq_letters = set(letters)
>>> counts = [(letters.count(l), l) for l in uniq_letters]
>>> counts
[(1, 'B'), (1, 'C'), (1, 'G'), (1, 'a'), (1, 'd'), (1, 'k'), (1, 'n'), (1, 'ü'), (2, ' '), (2, 'e'), (2, 'r')]
Can Berk Güder
Don’t forget the empty string! :)
Bombe
That depends on the definition of the problem. In the case of "asasd", what is the most common substring? "a", "s" or "as"? I'd say its the longest substring in the case of a tie otherwise the problem is trivial, like you say, unless you need to find all equally occuring substrings.
cletus
@Bombe: I assumed the empty substring was ignored, since it occurs infinitely many times in any string. =)
Can Berk Güder
@cletus: Note that a longer substring cannot occur more than a shorter one, so a shorter answer (substring) is just as good. Of course, things change if all correct answers need to be listed.
Can Berk Güder
@CanBerk: the problem also needs to be clarified as to whether "any", "all" or "the longest" most frequent substring(s) need to be returned.
cletus
A: 
// C# code, finds one of the most occurred non-empty substrings in O(n), proof by the reader!
int[] x = new int[65536];
foreach (char c in myString)
     x[(int)c]++;
int max = 0;
for (int i = 0; i < x.Length; ++i)
    if (x[max] < x[i])
        max = i;
return ((char)max).ToString();

This probably isn't what you want, though. You might need to look at something like Huffman coding...

Mehrdad Afshari
+1  A: 

Sounds like you are looking for something close to a compression algorithm. Compression works by finding redundant (repeated) information, and replacing it with a pointer to the first occurrence. Here are a few samples of code to do so:

http://www.developerfusion.com/code/1642/string-compression/

http://www.howtodothings.com/computers/a1223-simple-string-compression.html

Bork Blatt
A: 

In a "N" lenghted Strings,

   No Of "1" character will be "N" which requires comparision of N * (N-1) / 2

   No of "2" characters will be "N-1" which requires comparision of (N-1) * (N-2) / 2


   No of "3" characters will be "N-2"  which requires comparision of (N-2) * (N-3) / 2

.............

and No of "N" characters will be "1" which requires compariosn of (1 * 0 / 2)

Hence No Of Max Substrings = "N" + "N-1" + .... "1" = (N * ( N+1) / 2 ) and comparions required is (N+1) * (N) * (N-1) / 6

If you do a Bucket placing (not sorting) at each no of characters to the same size, then

   No Of "1" character will be "N" which requires comparision of N -1 with buckets of N

   No of "2" characters will be "N-1" which requires comparision of (N-2) with Buckets of N-1

   No of "3" characters will be "N-2"  which requires comparision of (N-3) with Buckets of N-2

.............

and No of "N" characters will be "1" which requires compariosn of 0 with Bucket of 1

Here it reduces the total comparisions to "N * (N-1) / 2"

Finally after doing bucket placing, take the bucket which has highest nos for your answer.

lakshmanaraj