views:

217

answers:

3

[Description] Given a string of char type, find a shortest digest, which is defined as: a shortest sub-string which contains all the characters in the original string.

[Example]

A = "aaabedacd"

B = "bedac" is the answer.

[My solution]

  1. Define an integer table with 256 elements, which is used to record the occurring times for each kind of character in the current sub-string.

  2. Scan the whole string, statistic the total kinds of character in the given string by using the above table.

  3. Use two pointers, start, end, which are initially pointing to the start and (start + 1) of the given string. The current kinds of character is 1.

  4. Expand sub-string[start, end) at the end until it contains all kinds of character. Update the shortest digest if possible.

  5. Contract sub-string[start, end] at the start by one character each time, try to restore its digest property if necessary by step 4.

The time cost is O(n), and the extra space cost is constant.

Any better solution without extra space?

+1  A: 

That works very poorly, when you consider the fact that characters are not in fact limited to 256; there are closer to 2^32 codepoints in Unicode; and if you try what you are planning on an UTF-8 string, it is going to blow up. In a big way.

A better approach would be to use a digest algorithm like MD5 or FNV, or doing what you are doing, but rather with a linked list sparse array; adding the codepoints of the characters as you encounter them, and concatenating the codepoints afterwards, converting to, say, UTF-8 as you go.

EDIT:

Counterexample: "På japansk heter regn '雨'."

Williham Totland
or using a map (in C++: std::map<wchar_t,size_t>).
Patrick
@Patrick: True, some form of hashmap might be a better solution, or any kind of sparse array, really.
Williham Totland
By using std::map, the search time is O(lgn), which would make the digest algorithm run in O(n*lgn).What about some kind sparse hash table? Such as Google Sparse Hashtable...
meta
I really don't see how this algorithm is O(n) to begin with; surely the worst-case scenario (/a+b+/) would run in O(2n)?
Williham Totland
For my solution, first time you've to scan the whole string to update the values in the table for each character. This is done by O(n).Then use my algorithm to find the shortest digest, this also done by O(n), since you've scan characters as most n times. Each time get the occurring time for the character is O(1).The total time is O(n) + O(n) = O(n).
meta
Any other solution without extra space, and still keep O(n) time cost...
meta
A hash algorithm is not a digest in the same sense as the OP is using.
Nick Johnson
Well, the time complexity isn't really the point. The point is that you can't use a 255 element array; seeings as how there are more than 255 characters that can appear in a string. The time complexity may increase with a sparse array (of some description); but as using a naïve array would require an array of 2^32 elements (to be safe), or at least 2^16 elements; the space cost becomes somewhat cumbersome.
Williham Totland
+1  A: 

I don't think your algorithm is correct. Consider the string: "baaabedacdc". The correct answer is still "bedac", but your algorithm will advance the start pointer forward until it finds the "e" (the only character with an occurrence count of 1), and then the end pointer backward until it finds the "e" (the only character with an occurrence count of 1), for a result of "e".

I may have misunderstood the algorithm, though.

Steve Jessop
Thanks for your carefully reading, I made a mistake on the description of my solution.Beside moving the pointer, it should also decrease the occurring number for current element in the table, just to avoid lacking some kind of character in the final sub-string.For your example, assume that index start from 1, the start will stop at 5, and the end will stop at 9. The final answer is "bedac", the same as yours.
meta
How about "bedaaacbedacbedaac"? The corrected forward pass finds the "b" of "bedaac". This is not the shortest substring which contains all the letters. Of course you didn't say you were looking for the shortest, but if you don't care which one you find, you could just always take the whole string
Steve Jessop
Yes, it should be "acbed".
meta
I've updated my algorithm, thanks for your carefully reading.
meta
A: 

Why not use a simpler algorithm for this question, may not be most time efficient, but it works ;) :

step 1. (Asssuming we are dealing with character set of 26 alphabets), create a boolean array of size 26 and scan through the string and check the boolean corresponding to the character. For instance, set elem[0] = true when you run into a, elem[1] = true when you run into b etc.

step 2. create a string using the characters where elem[x] = true. so, the string for this case would be "abcde" and its length = 5.

step 3. traverse through the given string a second time while extracting sub-strings of length 5, sorting them in ascending order and matching them with the string from step 2.

jarajapu