I have a million points and a large shapefile (8 GB) which is too big to load into memory in R. The shapefile is single-layer, so a given (x, y) will hit at most one polygon - as long as it's not exactly on a boundary! Each polygon is labelled with a 'severity' - e.g. 1, 2, 3. I'm using R on a 64-bit Ubuntu machine with 12 GB RAM.

What's the simplest way to "tag" each point with the polygon's 'severity', so that I end up with a data.frame with an extra column, i.e. x, y, severity?

Thanks, Sean

+3  A: 

I think you should preprocess the data and create a structure that lists the possible polygons for each rectangular cell in a grid. This way, you reduce the number of polygons you have to check against each point, and the additional structure fits into memory since it just holds pointers to the polygons.

Here's an image to illustrate:

http://img191.imageshack.us/img191/5189/stackoverflowpolygonmes.png

You want to check which polygon the yellow point is in. Normally you would check against all the polygons, but with the grid optimization (the orange lines; I didn't draw the whole grid, just one of its cells) you only have to check the filled polygons, as they are all inside or partially inside that grid cell.

A similar approach would be to store not all the polygon data in memory, but just each polygon's bounding box, which requires only 2 X/Y pairs instead of 3 or more per polygon (plus a pointer to the actual polygon data) - though this doesn't save as much space as the first suggestion.
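For illustration, here is a rough R sketch of the grid preprocessing (not from the original answer; the bounding-box table bboxes and the extent variables x0, x1, y0, y1 are assumptions - you'd fill them in one cheap streaming pass over the file):

    # Assign each polygon to every grid cell its bounding box overlaps.
    grid_n <- 100  # a 100 x 100 grid over the full extent (x0..x1, y0..y1)
    cell_of <- function(v, lo, hi)
      pmin(grid_n, pmax(1, ceiling((v - lo) / (hi - lo) * grid_n)))

    # bboxes: data.frame(id, xmin, xmax, ymin, ymax), one row per polygon
    index <- vector("list", grid_n * grid_n)
    for (i in seq_len(nrow(bboxes))) {
      cx <- cell_of(c(bboxes$xmin[i], bboxes$xmax[i]), x0, x1)
      cy <- cell_of(c(bboxes$ymin[i], bboxes$ymax[i]), y0, y1)
      for (gx in cx[1]:cx[2]) for (gy in cy[1]:cy[2]) {
        k <- (gy - 1) * grid_n + gx
        index[[k]] <- c(index[[k]], bboxes$id[i])  # polygons touching this cell
      }
    }

    # Candidates for a point (px, py) - only these few polygons need a
    # full point-in-polygon test:
    k <- (cell_of(py, y0, y1) - 1) * grid_n + cell_of(px, x0, x1)
    candidates <- index[[k]]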

schnaader
Thanks for this schnaader - but could you give me some hint as to how I'd do this in R? Normally, for small shapefiles, I can just use library(maptools), read them directly into memory, and have access to everything - but I don't know how to manage shapefiles that are just far too big to load. Thanks again.
Sean
I haven't used R so far, so I have absolutely no idea how to do it in detail :) But I think you should either parse the file yourself or convert it to something you can parse yourself, ideally some big text file where each polygon is one line.
schnaader
Thanks schnaader - I'd vote up but I don't yet have the reputation! :-)
Sean
Thanks for uploading the image, JD.
schnaader
you bet! Good answer and the graphic is very helpful
JD Long
+2  A: 

I don't have a really good answer, but let me throw out an idea. Can you flip the problem around and, instead of asking which polygon each point fits in, ask "which points are in each polygon?" Maybe you could bust your shapefile into, for example, 2000 counties, then incrementally take each county and check each point to see if it is in that county. If a point is in a given county, you tag it and take it out of your search next time.

Along the same lines, you might break the shapefile into 4 regions. Then you can fit a single region plus all your points into memory and just iterate through the data 4 times.
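A hedged sketch of that chunked overlay, here using the modern sf package rather than anything from this thread (the region file names and the 'severity' column are assumptions):

    library(sf)
    df$severity <- NA
    for (f in c("region1.shp", "region2.shp", "region3.shp", "region4.shp")) {
      polys <- st_read(f, quiet = TRUE)
      todo  <- which(is.na(df$severity))      # skip points already tagged
      pts   <- st_as_sf(df[todo, ], coords = c("x", "y"), crs = st_crs(polys))
      hits  <- st_within(pts, polys)          # per point: containing polygon(s)
      found <- lengths(hits) > 0
      df$severity[todo[found]] <- polys$severity[sapply(hits[found], `[`, 1)]
    }

Re-building pts from only the untagged rows each pass is what implements the "take it out of your search next time" step.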

Another idea would be to use a GIS tool to lower the resolution (number of nodes and vertices) of the shapefile. That depends, of course, on how important accuracy is to your use case.

JD Long
Thanks JD - I'd vote up but I don't yet have the reputation! :-)
Sean
+5  A: 

Just because all you have is a hammer, doesn't mean every problem is a nail.

Load your data into PostGIS, build a spatial index on your polygons, and do a single SQL spatial overlay. Export the results back to R.
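A hedged sketch of that workflow driven from R (the database, table, and column names, and the SRID, are assumptions, not from this answer). Load the polygons outside R first, e.g. with shp2pgsql, and index them:

    # shell: shp2pgsql -s 4326 severity_polys.shp polys | psql gis
    # SQL:   CREATE INDEX polys_gix ON polys USING GIST (geom);
    library(RPostgreSQL)

    con <- dbConnect(PostgreSQL(), dbname = "gis")
    dbWriteTable(con, "pts", df)  # df is the data.frame with x, y columns
    tagged <- dbGetQuery(con, "
      SELECT p.x, p.y, s.severity
      FROM pts AS p
      LEFT JOIN polys AS s
        ON ST_Contains(s.geom, ST_SetSRID(ST_MakePoint(p.x, p.y), 4326))
    ")
    dbDisconnect(con)

The LEFT JOIN keeps points that hit no polygon, with severity returned as NULL.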

By the way, saying the shapefile is 8 GB is not a very useful piece of information. Shapefiles consist of at least three files: the .shp, which holds the geometry; the .dbf, which holds the attribute database; and the .shx, which connects the two. If your .dbf is 8 GB, then you can easily read the shapes themselves in by swapping in a different, minimal .dbf. Even if the .shp is 8 GB, it might only be three polygons, in which case it might be easy to simplify them. How many polygons have you got, and how big is the .shp part of the shapefile?

Spacedman
Thanks for the clarity Spacedman! Much appreciated!
Sean
Glad you posted an answer Spacedman. I was just digging around in some PostGIS docs to figure out how to do this as I thought that was probably the right tool.
JD Long
A: 

Do a Google search on "blocks superblocks kriging". This is old technology (at least 30 years old now).

Hi User - I'm not sure how this answers the question. I'm aware of kriging, but I'm not looking to do interpolation - only direct lookups on a dataset too big to fit in memory. I don't want to use an external tool if possible, as I want to automate as much as possible. Thanks.
Sean