How can one generate say 1000 random points with a distribution like that of
towns and cities in e.g. Ohio ?
I'm afraid I can't define "distributed like cities" precisely;
uniformly distributed centres + small Gaussian clouds
are easy but ad hoc.
Added: There must be a family of 2d distributions
with a clustering parameter that can be varied to match a given set of points ?
views:
63answers:
4Maybe you can take a look at Walter Christaller's Theory of Central Places. I guess there must be some generator somewhere, or you can cook up your own.
Start with a model of the water features in your target area (or make one up, if it's for an imaginary place), then cluster the cities near river junctions, along lakeshores, lake-river junctions. Then make imaginary highways connecting those major cities. Now sprinkle some intermediate cities along those highways at reasonable spacing, preferring to be near junctions in the highways. Now sprinkle some small towns through the empty spaces.
Gaussian clusters with Poisson cluster sizes work fairly well.
Problem: generate random points that cluster roughly like given cities, say in the USA.
Subproblems:
a) describe clusters with rows of numbers, so that "cluster A is like cluster B"
simplifies to "clusternumbers(A) is like "clusternumbers(B)".
Running N=100 then 1000 points through fcluster below, with ncluster=25, gives
N 100 ncluster 25: 22 + 3 r 117
sizes: av 4 10 9 8 7 6 6 5 5 4 4 4 ...
radii: av 117 202 198 140 134 64 62 28 197 144 148 132 ...
N 1000 cluster 25: 22 + 3 r 197
sizes: av 45 144 139 130 85 84 69 63 43 38 33 30 ...
radii: av 197 213 279 118 146 282 154 245 212 243 226 235 ...
b) find a combiation of random generators with 2 or 3 parameters
which can be varied to generate different clusterings.
Gaussian clusters with Poisson cluster sizes can match clustering of cities fairly well:
def randomclusters( N, ncluster=25, radius=1, box=box ):
""" -> N 2d points: Gaussian clusters, Poisson cluster sizes """
pts = []
lam = eval( str( N // ncluster ))
clustersize = lambda: np.random.poisson(lam - 1) + 1
# poisson 2: 14 27 27 18 9 4 %
# poisson 3: 5 15 22 22 17 10 %
while len(pts) < N:
u = uniformrandom2(box)
csize = clustersize()
if csize == 1:
pts.append( u )
else:
pts.extend( inbox( gauss2( u, radius, csize )))
return pts[:N]
# Utility functions --
import scipy.cluster.hierarchy as hier
def fcluster( pts, ncluster, method="average", criterion="maxclust" ):
""" -> (pts, Y pdist, Z linkage, T fcluster, clusterlists)
ncluster = n1 + n2 + ... (including n1 singletons)
av cluster size = len(pts) / ncluster
"""
# Clustering is pretty fast:
# sort pdist, then like Kruskal's MST, O( N^2 ln N )
# Many metrics and parameters are possible; these satisfice.
pts = np.asarray(pts)
Y = scipy.spatial.distance.pdist( pts ) # N*(N-1)/2
Z = hier.linkage( Y, method ) # N-1, like mst
T = hier.fcluster( Z, ncluster, criterion=criterion )
clusters = clusterlists(T)
return (pts, Y, Z, T, clusters)
def clusterlists(T):
""" T = hier.fcluster( Z, t ) e.g. [a b a b c a]
-> [ [0 2 5] [1 3] ] sorted by len, no singletons [4]
"""
clists = [ [] for j in range( max(T) + 1 )]
for j, c in enumerate(T):
clists[c].append( j )
clists.sort( key=len, reverse=True )
n1 = np.searchsorted( map( len, clists )[::-1], 2 )
return clists[:-n1]
def radius( x ):
""" rms |x - xmid| """
return np.sqrt( np.mean( np.var( x, axis=0 )))
# * 100 # 1 degree lat/long ~ 70 .. 111 km