I am working with some spreadsheet data and I have a set of cell regions that are of arbitrary bounds. Given any cell, what is the fastest way to determine the subset of regions which contain the cell?

Currently, the best I have is to sort the regions with the primary sort field being the region's starting row index, followed by its ending row index, starting column index, and then ending column index. When I want to search based on a given cell, I binary search to the first region whose starting row index is after the cell's row index and then I check all regions before that one to see if they contain the cell, but this is too slow.

+1  A: 

Based on some Googling, this is an example of the two-dimensional point enclosure searching problem, or the "stabbing problem". See:

http://www.cs.nthu.edu.tw/~wkhon/ds/ds10/tutorial/tutorial6.pdf

or here (starting at p.21/52):

http://www.cs.brown.edu/courses/cs252/misc/slides/orthsearch.pdf

The key data structure involved is the segment tree:

http://en.wikipedia.org/wiki/Segment_tree

For the 2-D case, it looks like you can build a segment tree containing segment trees and get O(log^2(n)) query complexity. (I think your current solution is O(n) since on average you'll just exclude half of your regions with your binary search.)

However, you said "spreadsheet", which means you've probably got a relatively small area to work with. More importantly, you've got integer coordinates. And you said "fastest", which means you're probably willing to trade space and setup time for a faster query.

You didn't say which spreadsheet, but the code below is a wildly inefficient, but dirt-simple, brute-force Excel/VBA implementation of a 2-D lookup table that, once set up, has O(1) query complexity:

Public Sub brutishButShort()
    Dim posns(1 To 65536, 1 To 256) As Collection

    Dim regions As Collection
    Set regions = New Collection

    Call regions.Add([q42:z99])
    Call regions.Add([a1:s100])
    Call regions.Add([r45])

    Dim rng As Range
    Dim cell As Range
    Dim r As Long
    Dim c As Long

    For Each rng In regions
        For Each cell In rng
            r = cell.Row
            c = cell.Column

            If posns(r, c) Is Nothing Then
                Set posns(r, c) = New Collection
            End If

            Call posns(r, c).Add(rng)
        Next cell
    Next rng

    Dim query As Range
    Set query = [r45]

    If Not posns(query.Row, query.Column) Is Nothing Then
        Dim result As Range
        For Each result In posns(query.Row, query.Column)
            Debug.Print result.address
        Next result
    End If
End Sub

If you have a larger grid to worry about or regions that are large relative to the grid, you can save a ton of space and setup time by using two 1-D lookup tables instead. However, then you have two lookups, plus a need to take the intersection of the two resulting sets.
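As a rough illustration of the two 1-D lookup tables, here is a minimal sketch in Python rather than VBA, purely so it is short and easy to run. The names `build_lookups` and `regions_containing` are made up for this example, and regions are assumed to be inclusive (first row, last row, first column, last column) tuples with columns numbered from 1 (so Q = 17, R = 18, S = 19, Z = 26).

```python
from collections import defaultdict

def build_lookups(regions):
    """Index each region under every row and every column it covers.

    regions: dict mapping a region name to an inclusive
    (first_row, last_row, first_col, last_col) tuple (assumed layout).
    """
    by_row = defaultdict(set)
    by_col = defaultdict(set)
    for name, (r0, r1, c0, c1) in regions.items():
        for r in range(r0, r1 + 1):
            by_row[r].add(name)
        for c in range(c0, c1 + 1):
            by_col[c].add(name)
    return by_row, by_col

def regions_containing(by_row, by_col, row, col):
    """Two lookups, then the intersection of the two resulting sets."""
    return by_row.get(row, set()) & by_col.get(col, set())

# Same three regions as the VBA example, written as numeric bounds.
regions = {
    "Q42:Z99": (42, 99, 17, 26),
    "A1:S100": (1, 100, 1, 19),
    "R45": (45, 45, 18, 18),
}
by_row, by_col = build_lookups(regions)
hits = regions_containing(by_row, by_col, 45, 18)  # cell R45
```

Setup cost per region is its height plus its width rather than its area, which is where the space and setup-time savings come from. The price is the set intersection on every query.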

jtolle
Thanks, this is very helpful. I can't use the brute force algorithm because I am working with Excel 2007 data, so there are potentially 1048576 rows by 16384 columns, and that would use too much memory. I also can't use the segment tree because while regions aren't added often, they are added occasionally, so the build up time for the tree would be too much of a slowdown. But I think the two 1-D interval trees might be the way to go. I'll try it out.
Mike Dour
Actually I was thinking two straight 1-D lookups, not actual trees. You'd build them in much the same way as above, only you'd just pre-process the rows for each region into the rows lookup, and the columns into the columns, etc. Then you'd still have to find just the unique regions between the two lookup results, but that is easily done with a Scripting.Dictionary.
jtolle
And adding to the lookup tables piecemeal is easy. Removing from them with the above approach would require using string keys in the collections. You could use the region address as long as you're sure that your regions are never identical.
jtolle
Oh I see. That's not a bad idea, except now I'd be allocating up to 1064960 collections, which I guess isn't horrible. I might try to take a hybrid approach and create your lookup table, but in blocks of 128 rows/columns. This limits my maximum collection count to 8320, but increases the lookup time slightly because I will have to check each region in the returned set to see if it contains the cell. I think I'll implement this as well as the two 1-D interval trees and see which one is quicker in actual code. Thanks again.
Mike Dour
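The block-based hybrid from the comment above could look roughly like the following. This is a sketch in Python, not the commenter's actual code: the names `build_buckets` and `query` are invented for illustration, regions are assumed to be inclusive (first row, last row, first column, last column) tuples, and the block size of 128 is the one mentioned in the comment.

```python
from collections import defaultdict

BLOCK = 128  # rows/columns per bucket, as suggested in the comment

def build_buckets(regions):
    """Index each region under every row block and column block it touches."""
    row_buckets = defaultdict(set)
    col_buckets = defaultdict(set)
    for name, (r0, r1, c0, c1) in regions.items():
        for b in range(r0 // BLOCK, r1 // BLOCK + 1):
            row_buckets[b].add(name)
        for b in range(c0 // BLOCK, c1 // BLOCK + 1):
            col_buckets[b].add(name)
    return row_buckets, col_buckets

def query(regions, row_buckets, col_buckets, row, col):
    """Bucket lookup gives candidates; an exact bounds check filters them."""
    candidates = (row_buckets.get(row // BLOCK, set())
                  & col_buckets.get(col // BLOCK, set()))
    result = set()
    for name in candidates:
        r0, r1, c0, c1 = regions[name]
        if r0 <= row <= r1 and c0 <= col <= c1:
            result.add(name)
    return result

# Hypothetical regions chosen to exercise the filtering step.
regions = {
    "big": (1, 1000, 1, 300),
    "small": (130, 140, 5, 5),
    "far": (100000, 100500, 2000, 2100),
}
row_buckets, col_buckets = build_buckets(regions)
```

For example, `query(regions, row_buckets, col_buckets, 129, 5)` finds both "big" and "small" as candidates, because rows 129 and 130 share a block, and the bounds check then drops "small". That extra check is the slight increase in lookup time the comment mentions.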
Yeah, based on what you know about the number and distribution of your regions, there are all kinds of potential intermediate trade-offs between space and time. I'd start maximally simple/brute and then move more complex/efficient only as needed.
jtolle
After working on this a bit, I realized I could use the segment tree (sort of). I created a 1D segment tree that is generated as segments are added. I was able to do this because there are a finite number of rows and columns. I just assume that each row or column could be the end point of some segment in the future and that the full tree will be needed. But I don't create any nodes in the tree until they actually need to store a segment or they have a child node that has to store a segment. I use one tree for rows and one for columns, and then take the intersection of the results for a cell.
Mike Dour
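One way to read the comment above is as a segment tree over a fixed coordinate range whose nodes are only allocated when a segment first needs them. The sketch below is in Python and is only an interpretation of that description, not the commenter's code: the class and method names are hypothetical, and the bounds `MAX_ROW` and `MAX_COL` are illustrative values.

```python
class Node:
    __slots__ = ("left", "right", "items")

    def __init__(self):
        self.left = None
        self.right = None
        self.items = set()  # segments that fully cover this node's range

class LazySegmentTree:
    """1-D segment tree over the fixed range [lo, hi]; nodes created on demand."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.root = None

    def insert(self, start, end, item):
        self.root = self._insert(self.root, self.lo, self.hi, start, end, item)

    def _insert(self, node, lo, hi, start, end, item):
        if end < lo or hi < start:
            return node  # no overlap: allocate nothing
        if node is None:
            node = Node()
        if start <= lo and hi <= end:
            node.items.add(item)  # segment covers this node's whole range
            return node
        mid = (lo + hi) // 2
        node.left = self._insert(node.left, lo, mid, start, end, item)
        node.right = self._insert(node.right, mid + 1, hi, start, end, item)
        return node

    def stab(self, point):
        """Collect every stored segment that contains point."""
        found = set()
        if not (self.lo <= point <= self.hi):
            return found
        node, lo, hi = self.root, self.lo, self.hi
        while node is not None:
            found |= node.items
            if lo == hi:
                break
            mid = (lo + hi) // 2
            if point <= mid:
                node, hi = node.left, mid
            else:
                node, lo = node.right, mid + 1
        return found

MAX_ROW, MAX_COL = 1 << 20, 1 << 14  # illustrative fixed bounds
rows = LazySegmentTree(1, MAX_ROW)
cols = LazySegmentTree(1, MAX_COL)

def add_region(name, r0, r1, c0, c1):
    rows.insert(r0, r1, name)
    cols.insert(c0, c1, name)

def regions_at(row, col):
    # One tree per axis, then intersect the two result sets.
    return rows.stab(row) & cols.stab(col)

add_region("Q42:Z99", 42, 99, 17, 26)
add_region("A1:S100", 1, 100, 1, 19)
add_region("R45", 45, 45, 18, 18)
```

Each insert and each tree walk touches a number of nodes proportional to the logarithm of the axis length. A query also pays for collecting and intersecting the two result sets, so its cost depends on how many regions cover the queried row and column as well.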
A: 

I think you want to test whether the Intersect of the cell and each region is Nothing; if it isn't, the region contains the cell.

Sub RegionsContainingCell(rCell As Range, ParamArray vRegions() As Variant)

    Dim i As Long

    For i = LBound(vRegions) To UBound(vRegions)
        If TypeName(vRegions(i)) = "Range" Then
            If Not Intersect(rCell, vRegions(i)) Is Nothing Then
                Debug.Print vRegions(i).Address
            End If
        End If
    Next i

End Sub

Sub test()

    RegionsContainingCell Range("B50"), Range("A1:Z100"), Range("C2:C10"), Range("B1:B70"), Range("A1:C30")

End Sub
Dick Kusleika
This is an O(N) algorithm. I need something faster as I am working with a lot of regions. But thanks for your help anyway.
Mike Dour