views:

20

answers:

1

I am importing massive amounts of data from Excel that have various table layouts. I have good enough table detection routines and merge cell handling, but I am running into a problem when it comes to dealing with borders. Namely performance. The bordered regions in some of these files have meaning.

Data Setup:
I am importing directly from Office Open XML using VB6 and MSXML. The data is parsed from the XML into a dictionary of cell data. This wonks wonderfully and is just as fast as using docmd.transferspreadsheet in Access, but returns much better results. Each cell contains a pointer to a style element which contains a pointer to a border element that defines the visibility and weight of each border (this is how the data is structured inside OpenXML, also).

Challenge:
What I'm trying to do is find every region that is enclosed inside borders, and create a list of cells that are inside that region.

What I have done:
I initially created a BFS(breadth first search) fill routine to find these areas. This works wonderfully and fast for "normal" sized spreadsheets, but gets way too slow for imports into the thousands of rows. One problem is that a border in Excel could be stored in the cell you are checking or the opposing border in the adjacent cell. That's ok, I can consolidate that data on import to reduce the number of checks needed.

One thing I thought about doing is to create a separate graph that outlines the cells using the borders as my edges and using a graph algorithm to find regions that way, but I'm having trouble figuring out how to implement the algorithm. I've used Dijkstra in the past and thought I could do similar with this. So I can span out using no endpoint to search the entire graph, and if I encounter a closed node I know that I just found an enclosed region, but how can I know if the route I've found is the optimal one? I guess I could flag that to run a separate check for the found closed node to the previous node ignoring that one edge.

This could work, but wouldn't be much better performance wise on dense graphs. Can anyone else suggest a better method? Thanks for taking the time to read this.

+1  A: 

Your question is pretty complicated, but it sounds as though you need an algorithm to find the connected components of a graph (connected component = set of nodes all connected to one another but to no other nodes), which can be accomplished in linear time by repeated traversals. Pseudocode:

FindComponents(G):
    For all vertices v in G:
        Let C be a mutable empty collection
        Traverse(G, C, v)
        If C is nonempty, then it is a connected component

Traverse(G, C, v):
    If v has not been visited:
        Mark v as visited
        Add v to C
        For each neighbor w of v in G:
            Traverse(G, C, w)

Iterative variant of Traverse:

Traverse(G, C, r):
    Let S be an empty stack
    Push r onto S
    While S is not empty:
        Pop the top element v of S
        If v is not marked as visited:
            Mark v as visited
            Add v to C
            For each neighbor w of v in G:
                Push w onto S
What algorithm is that? VB6 would certainly throw a stack overflow with that much recursion.
dmaruca
It's a depth-first traversal that keeps track of the components of an undirected graph. I don't think it has a catchy name. In any case, I gave an iterative variant that uses a stack.
Thanks for that. These graph algorithms can really send my head spinning. I have to read them 30 times lol. I appreciate you taking the time to answer my post.
dmaruca