views:

332

answers:

6

I'm working on moving some spatial searching capabilities from Postgres with PostGIS to SQL Server and I'm seeing some pretty terrible performance, even with indexes.

My data is around a million points, and I want to find out which of those points are within given shapes, so the query looks something like this:

DECLARE @Shape GEOMETRY = ...
SELECT * FROM PointsTable WHERE Point.STWithin(@Shape) = 1

If I select a fairly small shape, I can sometimes get sub-second times, but if my shape is fairly large (which they sometimes are), I can get times over 5 minutes. If I run the same searches in Postgres, they're always under a second (in fact, almost all are under 200 ms).

I've tried several different grid sizes on my indexes (all high, all medium, all low), different cells per object (16, 64, 256), and no matter what I do the times stay fairly constant. I'd like to try more combinations but I don't even know what direction to go. More cells per object? Less? Some strange combination of grid sizes?

I've looked at my query plans and they're always using the index, it's just not helping at all. I've even tried without the index, and it's not much worse.

Is there any advice anyone can give on this? Everything I can find suggests "we can't give you any advice on indexes, just try everything and maybe one will work", but with it taking 10 minutes to create an index, doing this blindly is a massive waste of time.

EDIT: I also posted this on a Microsoft forum. Here's some information they asked for on there:

The best working index I could get was this one:

CREATE SPATIAL INDEX MapTesting_Location_Medium_Medium_Medium_Medium_16_NDX
    ON MapTesting (Location)
 USING GEOMETRY_GRID
  WITH (
    BOUNDING_BOX = ( -- The extent of our data, data is clustered in cities, but this is about as small as the index can be without missing thousands of points
        XMIN = -12135832,
        YMIN = 4433884,
        XMAX = -11296439,
        YMAX = 5443645),
    GRIDS = (
        LEVEL_1 = MEDIUM,
        LEVEL_2 = MEDIUM,
        LEVEL_3 = MEDIUM,
        LEVEL_4 = MEDIUM),
     CELLS_PER_OBJECT = 256 -- This was set to 16 but it was much slower
  )

I had some issues getting the index used, but this is different.

For these tests I ran a test search (the one listed in my original post) with a WITH(INDEX(...)) clause for each of my indexes (testing various settings for grid size and cells per object), and one without any hint. I also ran sp_help_spatial_geometry_index using each index and the same search shape. The index listed above ran fastest and also was listed as most efficient in sp_help_spatial_geometry_index.

When running the search I get these statistics:

(1 row(s) affected)
Table 'MapTesting'. Scan count 0, logical reads 361142, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'extended_index_592590491_384009'. Scan count 1827, logical reads 8041, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 6735 ms,  elapsed time = 13499 ms.

I also tried using random points as data (since I can't give out our real data), but it turns out that this search is really fast with random data. This lead us to believe that our problem is how the grid system works with our data.

Our data is addresses across the entire state, so there are a few very high density regions, but mostly sparse data. I think the problem is that no setting for the grid sizes works well for both. With grids set to HIGH, the index returns too many cells in low-density areas, and with grids set to LOW, the grids are useless in high density areas (at MEDIUM, it's not as bad, but still not good at either).

I am able to get the index used, it's just not helping. Every test was run with "show actual execution plan" turned on, and it always shows the index.

+1  A: 

Here are some remarks about SQL-Server's spatial extensions and how to ensure that the index is efficiently used:

Apparently, the planner has difficulties to build a good plan if he does not know the actual geometry during parse time. The autor suggest to insert exec sp_executesql:

Replace:

-- does not use the spatial index without a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1
go

with:

-- this does use the spatial index without using a hint
declare @latlonPoint geometry = geometry::Parse('POINT (45.518066 -122.767464)')
exec sp_executesql 
N'select a.id, a.shape.STAsText() 
from zipcodes a 
where a.shape.STIntersects(@latlonPoint)=1', N'@latlonPoint geometry', @latlonPoint
go
Luther Blissett
My spatial index is being used though. I hit "Include actual execution plan" and it shows the spatial index being used.
Brendan Long
I tried this suggestion just to be sure and the times and execution plan was the same.
Brendan Long
+1 informative, even if it did not solve the OP's problem
Peter
A: 

My gut reaction is "because Microsoft hasn't bothered to make it fast, because it's not an Enterprise Feature". Maybe I'm being cynical.

I'm not sure why you're migrating away from Postgres either.

tc.
I suspect it has more to do with it being a new feature; I heard they're supposed to make it a lot better in the next version. What confuses me is that I haven't heard anything about it being slow, so I'm worried that I'm just missing something.
Brendan Long
A: 

I'm not familiar with spatial queries, but it could be a parameterized query problem

try writing a query (without using parameters) with a fixed value (use a value that performs slow with the parameterized query) and run it. Compare the times with the parameterized version. If its much faster, then your problem is parameterized queries.

If the above is much faster, then I would dynamically build your sql string with the parameter values embedded in the string, that way you can remove parameters from causing problems.

pete
A: 

Have you set up your spatial index correctly? Is your bounding box correct? Are all points inside? In your case probably HHMM for GRIDS would work the best (depending again on a bouding box).

Can you try to use sp_help_spatial_geometry_index, to see what's wrong? http://msdn.microsoft.com/en-us/library/cc627426.aspx

Try using Filter operation instead and tell us what perf numbers you get? (it executes only primary filter (use index) without going through secondary filter (true spatial operation))

Something is wrong with your setup. Spatial is indeed new feature but it's not that bad.

Desinderlase
I've tried every combination of two sizes (LLLL, LLMM, LLHH, MMLL, etc.) and the best was MMMM with 256 cells per object. `sp_help_spatial_geometry_index` said the primary filter was 90% efficient, which I think might be the problem (others were as low as 70%). `Filter` was much faster than `STIntersects` but still 2-5x slower than Postgres (and not as accurate).
Brendan Long
We think the problem is that our data is fairly sparse with high-density regions, so the static-grid-size approach isn't helpful. If we set the grids to high, the index is too specific in sparse areas, but if we set it to low, the index is useless in high-density areas.
Brendan Long
Then try setting up multiple spatial indexes around each high-density region. Or at least break entire US into few big areas. I expect you have most of your data on east and west coast.
Desinderlase
@Desinderlase, our data isn't the whole US, it's just Colorado. The problem is that users can select areas that cross the entire state. For example, my test query is a thin box from Fort Collins to Denver. This has two high-density regions separated by a low density region, and I was under the impression that SQL Server will only use one spatial index at a time (and not indexing one of those two areas will be even worse than I have now).
Brendan Long
I can try loading this database up again on Monday to test it again, for the moment we just removed it because this configuration is way more complicated than just using two databases.
Brendan Long
A: 

I believe STIntersects is better optimized for using the index would have better performance than STWithin, especially for larger shapes.

Giri
We tried both and it made no difference.
Brendan Long
Would it be possbile for you to post the plan information after setting set statistics profile on, for STIntersects query?
Giri
A: 

You might try breaking it down into two passes:

  1. select candidates into a temp table w/ .Filter().
  2. query candidates w/ .STWithin().

eg:

SELECT * INTO #this FROM PointsTable WHERE Point.Filter(@Shape) = 1
SELECT * FROM #this WHERE Point.STWithin(@Shape) = 1

(replacing SELECT * with only the actual columns you need to reduce I/O)

This kind of micro-optimization shouldn't be necessary, but I have seen decent performance improvements before. Also, you will be able to gauge how selective your index is by the ratio of (1) to (2).

Peter