views:

330

answers:

4

My girlfriend got this question in an interview, and I liked it so much I thought I'd share it... Write an algorithm that receives a dictionary (Array of words). The array is sorted lexicographically, but the abc order can be anything. For example, it could be z, y, x, .., c, b, a. Or it could be completely messed up: d, g, w, y, ... It doesn't even need to include all the abc letters, and finally it doesn't have to be letters at all. It could be any symbols that form a string. For example it could be made up of 5, α, !, @, θ... You get the idea. It's up to your algorithm to discover what the letters are (easy part).

The algorithm should return the correct lexicographic order of the symbols.

Note/Things to consider: 1. For a given dictionary, can you always discover the full order of all the letters? Consider a dictionary that only has 1 word, with more than 1 symbol... 2. You CANNOT assume that the dictionary is without error. The algorithm should determine if the dictionary contains contradictions and output that there is an error. 3. HINT: Think of a good data structure to represent the relations you discover between symbols. This should make the problem much easier.

I'll post my solution probably tomorrow. By no means do I claim that it's the most efficient one. I want to see other people''s thoughts first. Hope you enjoy the question

P.S. I think the best format to post solutions is with pseudo code, but I leave this to your consideration

+10  A: 

This is topological sorting on a directed acyclic graph. You need to first build the graph: vertices are letters, and there's an edge if one is lexicographically less than the other. The topological order then gives you the answer.

A contradiction is when the directed graph is not acyclic. Uniqueness is determined by whether or not a Hamiltonian path exists, which is testable in polynomial time.


Building the graph

You do this by comparing each two consecutive "words" from the dictionary. Let's say you have these two words appearing one after another:

x156@
x1$#2z

Then you find the longest common prefix, x1 in this case, and check the immediately following characters after this prefix. In this case,, we have 5 and $. Since the words appear in this order in the dictionary, we can determine that 5 must be lexicographically smaller than $.

Similarly, given the following words (appearing one after another in the dictionary)

jhdsgf
19846
19846adlk

We can tell that 'j' < '1' from the first pair (where the longest common prefix is the empty string). The second pair doesn't tell us anything useful (since one is a prefix of another, so there are no characters to compare).

Now suppose later we see the following:

oi1019823
oij(*#@&$

Then we've found a contradiction, because this pair says that '1' < 'j'.


The topological sort

There are two traditional ways to do topological sorting. Algorithmically simpler is the depth-first search approach, where there's an edge from x to y if y < x.

The pseudocode of the algorithm is given in Wikipedia:

L ← Empty list that will contain the sorted nodes
S ← Set of all nodes with no incoming edges

function visit(node n)
    if n has not been visited yet then
        mark n as visited
        for each node m with an edge from n to m do
            visit(m)
        add n to L

for each node n in S do
    visit(n)

Upon conclusion of the above algorithm, the list L would contain the vertices in topological order.


Checking uniqueness

The following is a quote from Wikipedia:

If a topological sort has the property that all pairs of consecutive vertices in the sorted order are connected by edges, then these edges form a directed Hamiltonian path in the DAG. If a Hamiltonian path exists, the topological sort order is unique; no other order respects the edges of the path. Conversely, if a topological sort does not form a Hamiltonian path, the DAG will have two or more valid topological orderings, for in this case it is always possible to form a second valid ordering by swapping two consecutive vertices that are not connected by an edge to each other. Therefore, it is possible to test in polynomial time whether a unique ordering exists, and whether a Hamiltonian path exists.

Thus, to check if the order is unique or not, you simply check if all two consecutive vertices in L (from the above algorithm) are connected by direct edges. If they are, then the order is unique.


Complexity analysis

Once the graph is built, topological sort is O(|V|+|E|). Uniqueness check is O(|V| edgeTest), where edgeTest is the complexity of testing whether two vertices are connected by an edge. With an adjacency matrix, this is O(1).

Building the graph requires only a single linear scan of the dictionary. If there are W words, then it's O(W cmp), where cmp is the complexity of comparing two words. You always compare two subsequent words, so you can do all sorts of optimizations if necessary, but otherwise a naive comparison is O(L) where L is the length of the words.

You may also shortcircuit reading the dictionary once you've determined that you have enough information about the alphabet, etc, but even a naive building step would take O(WL), which is the size of the dictionary.

polygenelubricants
Actually I don't think you can shortcircuit reading the dict at all, since you don't know what the alphabet is apriori (regardless of order), so you never know if some new "letter" is going to appear. So every character in the dictionary must be read, all `O(WL)` of them.
polygenelubricants
+1. Terrific, full answer. Just a few quick notes:1. The psuedocode you wrote will not detect cycles (error). As wikipedia suggests in the footnote of this psuedocode: "The algorithm can be refined to detect cycles by watching for nodes which are visited more than once during any nested sequence of recursive calls to visit()".
eladidan
2. A pedantic correction: "a Hamiltonian path (or traceable path) is a path in an undirected graph which visits each vertex exactly once" (Wikipedia). In our case, our graph IS directed. So we are looking for a path that visits each vertex only once, and contains no cycles (Hamiltonian path can be 1 large cycle). This note is just to clarify the meaning of what you meant. Though it was clear from the syntax.
eladidan
I was asked the same question in an interview. I chose the second approach to find the topological sort, i.e. maintain a Map from in-degree to Collection<Node>, and always remove the node with zero in-degree. Thus recognizing errors was simpler: if there is no node with a zero degree-->cycle-->error. If there are more than one node with zero degree-->ambiguity-->error.
Dimitris Andreou
The reason I didn't go with the DFS approach is that I couldn't see how to track ambiguities. Checking whether consecutive nodes works, but it has shortcomings. You either have to spend quadratic space for an adjacency matrix to make this fast, or you no longer have a linear algorithm with incidence lists! (E.g. if you start with the graph that is the transitive closure of a simple path).
Dimitris Andreou
There is a far better way though to recognize ambiguities using the DFS approach. Find the length of longest path (in the same traversal). E.g. while visiting the nodes in post-order, label the node with zero, if it has no descedants, or with max(label of descendant) + 1. Then all you have to do is check whether all consecutive nodes differ by 1. Linear time, linear space.
Dimitris Andreou
+2  A: 

You want to find a total order of symbols.

1) You clearly may can not always determine a total order.

2) You can use a directed, acyclic graph to represent the partial order between symbols. Each letter is represented only once in the graph. You populate it by looking at all pairs of letters in in the same position in every pair of consecutive words. Any cycles you create in the graph point to an error in the dictionary. You flatten set of relations like a->d, a->b, b->d into a->b->d. If what you obtain at the end is a graph with one source (letter with no predecessors) and one sink (letter with no successors), you have a total order, like the alphabet.

Mau
+1  A: 

I solved this using the other suggested algorithm in Wikipedia. Here is the psuedocode from Wikipedia:

L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edges
while S is non-empty do
    remove a node n from S
    insert n into L
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    output error message (graph has at least one cycle)
else 
    output message (proposed topologically sorted order: L)

I think it offers a more "common-sense" approach for finding a path (i.e. how I would solve it if I was looking at a graph and has to find a path). Still, it offers no advantage to the depth-first approach suggested by polygenelubricants (if you add the cycle detection code as suggested in the comments) Thanks for your comments. answers

eladidan
A: 

You have the dictionary in an array of words and I suggest to use this data structure, because it's good enough and a conversion takes time.

You want to find the right order of characters,

C1, C2, C3, C4, ..., Cn

You have m number of words in your dictionary. It would be excellent to have a set of rules, like (Ci, Cj), which means that Ci <= Cj, where i, j are positive natural numbers and i < m, j < m. We should have a set of errors

Because V is a huge number and it's greater than m * m, a good approach in my opinion would be the following:

foreach i = 1, m do
    foreach j = i + 1, m do
        findFirstDifferenceInTheWords
        /*charI is the first difference in Wi from Wj and charJ is the first difference in 
          Wj*/
        if (Wi <> Wj)
            if (Wi contains Wj or Wj contains Wi) == false
            charSet.add(charI, charJ)
        else if k exists and the following is true ((i < k < j) and (Wi <> Wk))
            error.addIfDoesntExist("paradox", Wi)
        else
            error.addIfDoesntExist("redundant", Wi)
        end_if
    end_foreach
end_foreach

We've found l number of rules

foreach i = 1, l do foreach j = i + 1, l do if charSet.exists(charI, charJ) and charSet.exists(charJ, charI) error.add("paradox relation", charI, charJ)

After this you can build the order of words by considering charI = charJ is both (charI, charJ) and (charJ, charI) exists in the set and where charI <= charJ is the only rule, not vice-versa, we can consider that charI < charJ.

Conclusion: Using the "<=" relation is better than using the "<" relation, because "<=" is a full and good ordered relation in number theory, "<" is just good ordering. NOTE: The output must show correctly if charI < charJ or charI = charJ. For example you can use this notation: characters(C1C2C3, C4, C5C6, ...) Where C1 = C2 = C3 < C4 < C5 = C6...

I hope this helps.

Of course, if Wj contains Wi then we can't extract a rule from there

Lajos Arpad
Of course, the deep-first search can be a very good algorithm too if we use the coloring technique (we color the nodes as "black" if they were visited and all the nodes are "white" before the visit)
Lajos Arpad