views:

397

answers:

9

Hi folks,

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.

I tried to map this into a HashMap[Integer, Float]. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.

While filling in the data I run into OutOfMemoryExceptions after 205k entries. Do you have a tip how I can store this in a clever way? I even don't know if it's clever to have the whole bunch in memory. My options are SQL and MS Access...

The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.

Thansk for your replies and suggestions!

Marco

+2  A: 

Can you simply boost the memory available to the JVM ?

java -Xmx512m ...

By default the maximum memory configuration is 64Mb. Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don't need to calculate on the fly).

Brian Agnew
I would point out that this isn't a general solution. For example, in the UK there are millions of postcodes (the equivalent of a ZIP code).
Stephen C
+1  A: 

You will simply need more memory. When starting your Java process, kick it off like so:

java -Xmx256M MyClass

The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.

Benjamin Cox
+7  A: 

A 2d array would be more memory efficient. You can use a small hashmap to map the 952 places into a number between 0 and 951 . Then, just do:

float[][] distances= new float[952][952];

To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.

By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.

However, 906304 really isn't that many entries, you may just need to increase the Xmx maximum heap size

Chi
Since `distance(a, b) == distance(b, a)` you can reduce memory a further 50% by using a triangular matrix.
Stephen C
As per CPerkins answer, the "hash lookups" to get integers should be HashMap<String,Integer> containing your 952 zip codes as teh keys, with integers from 0 to 951 assigned.
Stephen Denne
A: 

Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.

Also, hash1 = hash2 does not imply object1 = object2. Don't ever do this. It's a hack.

z5h
+5  A: 

I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.

EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.

  • The Vicenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.

  • The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.

Stephen C
Note these *might* be (say) driving distances. In which case the values would have to be derived from maps etc. and wouldn't be a relatively simple geometric derivation (I realise the question doesn't specify either, btw)
Brian Agnew
Driving distances between ZIP codes is an even fuzzier concept than geodesic distances between ZIP codes. If you go down that path, you probably want driving distances between specific locations ... and a lookup table probably won't suffice.
Stephen C
We calculate the distance with an app the client is already using. So I get this f*** Excel sheets I need to process afterwards :) But the algorithms look very interesting!
Marco
@Marco - you still may be able to persuade the client to tell give you the raw coordinates ...
Stephen C
+1  A: 

Lately I've managed similar requisites for my master thesis.

I ended with a Matrix class that uses a double[], not a double[][], in order to alleviate double deref costs (data[i] that is an array, then array[i][j] that is a double) while allowing the VM to allocate a big, contiguous chunk of memory:

public class Matrix {

    private final double data[];
    private final int rows;
    private final int columns;

    public Matrix(int rows, int columns, double[][] initializer) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];

        int k = 0;

        for (int i = 0; i < initializer.length; i++) {
            System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
            k += initializer[i].length;
        }
    }

    public Matrix set(int i, int j, double value) {
        data[j + i * columns] = value;
        return this;
    }

    public double get(int i, int j) {
        return data[j + i * columns];
    }
}

this class should use less memory than an HashMap since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 Mb (for doubles) or 906304 * 4 ~ 4 Mb (for floats). My 2 cents.

NB I've omitted some sanity checks for simplicity's sake

dfa
+1  A: 

I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. You're likely to run into the problem of hash collisions.

This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble.

CPerkins
+1  A: 

The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.

Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).

That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.

This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.

Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.

Matt Caldwell
+1  A: 

Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you'd need is space for the longitude and latitude for 952 zip codes and then you could use the vicenty formula to do your calculation when you need to. This would make your memory usage O(n) in zipcodes.

Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions.

If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.

Seth