ansaurus

Question

algorithm to find maximum occurence of a substring

Answer 1

+1 A:

I can see building a tree to solve that particular problem.

There is a notional root node. The first character is the first child. The second character is a child of the first character a -> s in your case. It also begins a new leaf of the root node. If, in adding a node, you visit an existing node, you increment its count (initial value 1).

Once done, you visit every node of the tree to find the one with the highest count at the deepest level (because if "asdf" occurs 5 times then "a", "as" and "asd" occur a minimum of 5 times, by definition).

cletus 2008-12-18 14:09:54

Answer 2

+1 A:

The substring that repeats the most will be a single letter, so you'll find the letter that occurs the most. This is quite easy:

>>> str = 'Can Berk Güder'
>>> letters = [l for l in str]
>>> uniq_letters = set(letters)
>>> counts = [(letters.count(l), l) for l in uniq_letters]
>>> counts
[(1, 'B'), (1, 'C'), (1, 'G'), (1, 'a'), (1, 'd'), (1, 'k'), (1, 'n'), (1, 'ü'), (2, ' '), (2, 'e'), (2, 'r')]

Can Berk Güder 2008-12-18 14:10:03

Don’t forget the empty string! :)

Bombe 2008-12-18 14:14:08

That depends on the definition of the problem. In the case of "asasd", what is the most common substring? "a", "s" or "as"? I'd say its the longest substring in the case of a tie otherwise the problem is trivial, like you say, unless you need to find all equally occuring substrings.

cletus 2008-12-18 14:14:09

@Bombe: I assumed the empty substring was ignored, since it occurs infinitely many times in any string. =)

Can Berk Güder 2008-12-18 15:24:06

@cletus: Note that a longer substring cannot occur more than a shorter one, so a shorter answer (substring) is just as good. Of course, things change if all correct answers need to be listed.

Can Berk Güder 2008-12-18 15:25:36

@CanBerk: the problem also needs to be clarified as to whether "any", "all" or "the longest" most frequent substring(s) need to be returned.

cletus 2008-12-19 01:47:10

Answer 3

A:

// C# code, finds one of the most occurred non-empty substrings in O(n), proof by the reader!
int[] x = new int[65536];
foreach (char c in myString)
     x[(int)c]++;
int max = 0;
for (int i = 0; i < x.Length; ++i)
    if (x[max] < x[i])
        max = i;
return ((char)max).ToString();

This probably isn't what you want, though. You might need to look at something like Huffman coding...

Mehrdad Afshari 2008-12-18 14:11:10

Answer 4

+1 A:

Sounds like you are looking for something close to a compression algorithm. Compression works by finding redundant (repeated) information, and replacing it with a pointer to the first occurrence. Here are a few samples of code to do so:

http://www.developerfusion.com/code/1642/string-compression/

http://www.howtodothings.com/computers/a1223-simple-string-compression.html

Bork Blatt 2008-12-18 14:13:39

Answer 5

A:

In a "N" lenghted Strings,

   No Of "1" character will be "N" which requires comparision of N * (N-1) / 2

   No of "2" characters will be "N-1" which requires comparision of (N-1) * (N-2) / 2


   No of "3" characters will be "N-2"  which requires comparision of (N-2) * (N-3) / 2

.............

and No of "N" characters will be "1" which requires compariosn of (1 * 0 / 2)

Hence No Of Max Substrings = "N" + "N-1" + .... "1" = (N * ( N+1) / 2 ) and comparions required is (N+1) * (N) * (N-1) / 6

If you do a Bucket placing (not sorting) at each no of characters to the same size, then

   No Of "1" character will be "N" which requires comparision of N -1 with buckets of N

   No of "2" characters will be "N-1" which requires comparision of (N-2) with Buckets of N-1

   No of "3" characters will be "N-2"  which requires comparision of (N-3) with Buckets of N-2

.............

and No of "N" characters will be "1" which requires compariosn of 0 with Bucket of 1

Here it reduces the total comparisions to "N * (N-1) / 2"

Finally after doing bucket placing, take the bucket which has highest nos for your answer.

lakshmanaraj 2008-12-19 11:14:29

ansaurus

tags:

views:

answers:

algorithm to find maximum occurence of a substring

related questions