views: 1334

answers: 5

I have a site I'm building: an application that creates mail merges (more or less...) based on a couple of user preferences. It can generate a Cartesian join's worth of data without a problem, but in come the needs of enterprise to make life a bit more difficult...

I have to build the application so that, after verifying the zip codes of remote employees, it creates emails to media targets based on how far each media target is from the employee. Say, for instance, that employees are well-known volunteers where they work. The enterprise wants to email media within a 5-mile radius of these employees a message about the work the employee is doing. This is where things get messy... I have several choices here; below I outline the attempts and the failures:

  1. The largest radius is 20 miles. I create a database table that holds a record for every zip code in the US, joined to every zip code within 20 miles of it. The dataset looks something like this (the names are different; this is for the sake of argument):
    [SourceZip] | [City] | [State] | [CloseZip] | [City] | [State] | [Distance]
    Fails: As an example, NY has 350k records in the above dataset (and other states are worse!). Average load time on that page? Six minutes... not happening. I verified this by setting breakpoints; it is during the DataAdapter.Fill() stage that the stall occurs.

  2. (This one was never implemented due to a logistics problem.) I make a database connection for each employee zip, querying for media-target zips within a distance of x or less. Except that the source files and the media targets combined can reach upwards of 34k individualized emails. 34k DB connections? Even if I could devise a way to reuse zip-code searches, I did some test checks in the DB and found that there are 500 distinct zip codes in NY where employees worked. 500 DB connections? I doubt that would work, but I could be surprised.

  3. My latest scheme to get around the problem hopes that the web server runs a better game than the .NET DataSet object. I pull a much smaller dataset that looks like:
    [zip] | [longitude] | [latitude]
    Then I apply a distance formula in code to figure out which rows qualify. This relies heavily on the processors on the web server. Is this a worthwhile gamble, or will I find the same load-time damage on this attempt as well?

    Is there a better way?

    I appreciate any input, even if it confirms my fears that this project just might not work.

Additional notes: I don't have control of the server, and I'm running SQL2k :(. I'm programming the site in Visual Studio 2005 against Framework 2.0. We might get upgraded to SQL2005 and VS2008 within the next few months, though.

+2  A: 

If you have a ZIP code database with longitude/latitude coordinates, you can calculate the distance on the fly with my Haversine function (see my answer to this question).

This performs very well in web apps against the entire US ZIP code dataset.

The query would look something like this:

select * from zip where 
   dbo.udf_Haversine(zip.lat,zip.long, @lat, @lon) < 20   -- (miles)

You would not apply this to each recipient's address; instead, you would determine the ZIP codes within your radius first (in a nested query, or with a CTE) and then join in all the addresses that you need to send mail to.
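For reference, since the linked function isn't reproduced here, a minimal sketch of a haversine UDF in T-SQL might look like the following (the 3956-mile earth radius and the parameter order are my assumptions, not necessarily those of the linked answer):

    CREATE FUNCTION dbo.udf_Haversine
        (@lat1 float, @long1 float, @lat2 float, @long2 float)
    RETURNS float
    AS
    BEGIN
        -- great-circle distance in miles, using Earth's mean radius (~3956 mi)
        DECLARE @dlat float, @dlong float, @a float
        SET @dlat  = RADIANS(@lat2 - @lat1)
        SET @dlong = RADIANS(@long2 - @long1)
        -- haversine: a = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlong/2)
        SET @a = SQUARE(SIN(@dlat / 2))
               + COS(RADIANS(@lat1)) * COS(RADIANS(@lat2)) * SQUARE(SIN(@dlong / 2))
        RETURN 3956.0 * 2 * ASIN(SQRT(@a))
    END

On SQL2k, which has no CTEs (they arrive in SQL 2005), the "ZIP codes first, addresses second" shape is the nested-query form, e.g. (media table and column names are hypothetical):

    SELECT m.*
    FROM media m
    WHERE m.zip IN (SELECT z.zip
                    FROM zip z
                    WHERE dbo.udf_Haversine(z.lat, z.long, @lat, @lon) < 5)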

cdonner
The problem lies in the way the data is being generated: since many employees will be attached to multiple media targets, I'd have to make a separate database query for each source. Would all of these queries cause the kind of load times that the massive dataset generated?
C Bauer
Nice function, I think I'll have to use that. Apparently the one we were using didn't come with the same efficiency.
neouser99
@unknown: Just try it out.
cdonner
I'll have to wait on the new database table, but I will come back to report how it went once I get it.
C Bauer
A: 

Are you using SQL 2008? If so, the new spatial data features might be just what you're looking for here. You can find coordinates within range of one another as easily as doing a LIKE comparison on strings.

http://www.microsoft.com/sqlserver/2008/en/us/spatial-data.aspx
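For completeness, a sketch of what that would look like on SQL 2008, assuming the same kind of zip table with lat/long columns (4326 is the WGS 84 SRID, and STDistance returns meters):

    DECLARE @home geography
    SET @home = geography::Point(@lat, @lon, 4326)

    SELECT z.zip
    FROM zip z
    WHERE geography::Point(z.lat, z.long, 4326).STDistance(@home)
          < 20 * 1609.344   -- 20 miles, in meters

With a persisted geography column and a spatial index on it, the per-row Point() construction goes away and the lookup becomes index-assisted.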

Barry Fandango
I did note, in the additional notes at the bottom, the versions I'm able to work with. SQL2008 is not in my near future, unfortunately :)
C Bauer
Ah right, sorry I didn't RTFQ all the way :)
Barry Fandango
A: 

EDIT: After investigation, the answer with the Haversine function is the route I would take... it's not as intensive as the function our db uses (which will be fixed :))

You should not calculate the distances every time; long/lat-to-long/lat is a heavy calculation, and if you are doing it more than once, it's unnecessary.

That being said, I'm unsure why you have already written off option #2. We are actually doing something similar to this. Maybe I'm confused by the numbers, but what you are describing should be nothing for SQL2k to sweat over.

Even if you calculate offline the distance from every zip to every zip in the US, there are only ~2bn rows. Yes, it's a lot, but it's roughly static, could be sharded if it's slow, etc.
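For concreteness, a sketch of what such a precomputed table might look like (all names hypothetical). With the clustered key on the source zip, a radius lookup is an index seek rather than a table scan:

    CREATE TABLE ZipDistance (
        SourceZip char(5)      NOT NULL,
        TargetZip char(5)      NOT NULL,
        Miles     decimal(6,2) NOT NULL,
        CONSTRAINT PK_ZipDistance
            PRIMARY KEY CLUSTERED (SourceZip, TargetZip)
    )

    SELECT TargetZip
    FROM ZipDistance
    WHERE SourceZip = @zip
      AND Miles <= 5

Storing only the pairs within the 20-mile maximum radius (as in option #1) keeps the row count far below the full ~2bn.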

neouser99
Ah, I was more worried about the C# end: would it handle hundreds of database connections well? I suppose I assumed that making that many DB connections was just bad mojo :/
C Bauer
+1  A: 

If you have a dataset for your employees, a dataset for your media, and a third dataset for the distance between source and target zips, you may save a bit of time by joining the 3 tables together...

SELECT *
FROM Employees_List
   INNER JOIN
       (Media_List INNER JOIN Distance_List ON Media_List.Zip = Distance_List.Target_Zip)
   ON Employees_List.Zip = Distance_List.Source_Zip
WHERE Distance_Miles <= 5

This way you set up the relations between Employee and Media using the Distance table.
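If the join gets slow at scale, an index on the distance table's join and filter columns should keep the lookup side cheap. A sketch, using the names from the query above:

    CREATE INDEX IX_Distance_List
        ON Distance_List (Source_Zip, Target_Zip, Distance_Miles)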

Ah, I didn't know I could inner join on the result set of an inner join. This solution generates exactly what I need! Thanks
C Bauer
A: 

A SELECT of 350K rows (your example for NY) will not take 6 minutes if you order the table and index it by SOURCEZIP (ALTER TABLE .. ORDER BY (SOURCEZIP) in MySQL). It should take only a fraction of a second... The ALTER will take a long time (or you could create the table in that order), but since it is a static table it would be well worth doing.
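Since the question is on SQL Server 2000 rather than MySQL, the equivalent of physically ordering the table there is a clustered index on the source zip. A sketch against the option #1 table (table name hypothetical):

    CREATE CLUSTERED INDEX IX_SourceZip
        ON ZipRadius (SourceZip)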

Dave Pullin