views:

70

answers:

2

Assume I have a matrix MxN, filled with values between 0 and 5. I now want to determine the largest connected tree in that matrix, where the values of the matrix are considered to be the nodes. A pair of nodes is said to be connected if it's nodes are adjacent to each other either horizontally or vertically, and if the value of both nodes is the same. The size of a tree is equal to the nodes in the tree.

An example:

1 0 3 0 0              2 2 0 0 0 0 0
1 1 2 2 2              0 2 0 0 0 0 0
0 1 0 3 0              0 2 0 0 0 0 2
3 1 0 3 0              0 2 0 2 2 2 2
                       0 0 0 0 0 0 0
                       3 0 0 3 3 0 0
                       3 3 3 3 0 0 0

On the left side, the 1-nodes on the left side form the largest tree. On the right side, the 3-nodes form the largest tree, while there are two other trees consisting of 2-nodes.

I know I could probably do a simple depth-first search, but I'm wondering if there is something well-known that I'm missing, maybe in the realm of graph theory (like Kruskal's minimum spanning tree algorithm, but for this example).

A: 

Effectively what you're looking for are Connected components. Connected component is a set of nodes, where you could travel from any node to any other within that component.

Connected components are applicable generally to a graph. Connected components can be found using BFS/DFS and from algorithm complexity perspective given adjacency matrix input there is no better way to do that. The running time of the algorithm is O(N^2), where N is a number of nodes in a graph.

In your case graph has more constrains, such as each node can be adjacent to at most 4 other nodes. With BFS/DFS, this gives you a running time of O(4N) = O(N), where N is a number of nodes. There can't possibly be an algorithm with a better complexity, as you need to consider each node at least once in the worst case.

Leonid
I am tempted to downvote this answer on the grounds that some of the statements in it are dead wrong--and demonstrably so. However, a few judicious edits to correct those statements could rehabilitate the answer to a certain degree. If I had the mojo, I would make those edits myself.
bbadour
I've modified the answer, thanks for bringing my attention to it. I misunderstood the problem from the first read. Please feel free to suggest any more improvements!
Leonid
Since you mention Connected Components, you could include a link to http://en.wikipedia.org/wiki/Connected_component_(graph_theory)#AlgorithmsYou could replace the false statement "running time ... is O(N^2).." with true a true statement like "both running time and extra storage are O(N)..." You could remove the nonsense about "contrains".Once it is cleaned up, I will be happy to delete my comments. In fact, if I had the mojo to edit your answer, I would just clean it up.
bbadour
P.S. It's probably best to restate the complexity in terms of M and N from the original question than to make up a new N. If you want to refer to the nodes and edges, I suggest set notation of V and E for vertices and edges with |V| and |E| for the cardinalities respectively.
bbadour
@bbadour: The meaning of N is clearly mentioned in the answer. You can use whatever notation you're used to in your asnwers but that doesn't make any sense to inforce N or M to be something that you like. O(N^2) complexity still holds for *a graph*, where N is a number of nodes, and it's very common to denote number of nodes as N in a graph. So your comment about the "false statement" obviously doesn't hold. Not really sure what do you mean by "contrains", or constraints? Maybe you don't have a motto because you're not precise enough in your comments and make spelling errors?
Leonid
@bbadour: P.S. |V| and |E| notations are used more when describing graphs in math terms, I prefer to use N or M, which are more generic. In my last 5 years I solved more than 1000 algorithmical problems and participated in a number of competitions like Google Code Jam Onsite, ACM ICPC, Top Coder, Challenge 24 - for a reason, so I do know something about Graph Theory. I agree that my answer may lack precision and verbosity, but my original intent was simply to help. Btw. you can always type in Google `Connected Components` and click the first link in the result set.
Leonid
Good for you. I apologize if you feel offended or if you feel I have been anything less than helpful. If my comments here seem terse, keep in mind that comments are limited in length and offer no options for formatting.
bbadour
You are welcome to define any symbol to mean anything you want as long as you are explicit. However, by redefining N to mean something different from the definition already given in the question, you do not achieve anything, and you lose the ability to state the correct complexity for your algorithm, which is O(MN) according to the symbols defined in the original question.
bbadour
The correct complexity for DFS/BFS is O(|V| + |E|). As the graph approaches a fully connected graph, |E| predominates with a maximum of O(|V|^2). However, if we accept the original premise that the graphs are trees, |E| = |V| - |T| where T is the set of trees so O(|E|) = O(|N|). Even if we reject that premise, |E| <= (MN - M - N) due to the representation of the graphs making the DFS/BFS O(MN). Regardless, the complexity of the overall algoritm is O(MN) due to the requirement to look for nodes everywhere in the array.
bbadour
While anyone can google 'Connected Components', you told me to feel free to make suggestions, and I made suggestions. If I had the mojo, I would have just improved your answer for you. Sadly, that will have to wait a while.
bbadour
A: 

You are looking for disjoint sets so I would suggest a disjoint-set data structure and a find/union algorithm:

see http://en.wikipedia.org/wiki/Disjoint-set_data_structure#Disjoint-set_forests

The union operation is symmetric so you really only need to compare each element of the matrix with its neighbor to the right and its neighbor below applying the union operation when the compared elements have the same value.

Sweep through each of the elements again using the find operation to count the size of each set keeping track of the largest. You will need storage for the counts.

The computational complexity will be O(MN A-1(MN,MN)) where A-1 is the Inverse Ackermann function, which one can consider a small constant (< 5) for any useful value of MN. And the extra storage complexity will be O(MN).

bbadour
Disjoint sets are a bit more complex than BFS/DFS, wondering if there is any reason to use Disjoint sets over BFS/DFS to solve the problem?
Leonid
Pedagogy. The original question already offered DFS as a solution and asked about other solutions the asker might **not** know of. The asker mentioned Kruskal's algorithm, which is much more superlinear than Disjoint Sets, and for all practical purposes the big-O complexity of Disjoint Sets is equivalent to DFS. However, I suspect my best DFS would outperform my best Disjoint Sets solution for all N.
bbadour