Most web applications have a Location field, in which users may enter a location of their choice.

How would you classify users into different countries, based on the location they entered?

For example, I used the Stack Overflow dump of users.xml and extracted users' names, reputation and location:

['Jeff Atwood', '12853', 'El Cerrito, CA']
['Jarrod Dixon', '1114', 'Morganton, NC']
['Sneakers OToole', '200', 'Unknown']
['Greg Hurlman', '5327', 'Halfway between the boardwalk and Six Flags, NJ']
['Power-coder', '812', 'Burlington, Ontario, Canada']
['Chris Jester-Young', '16509', 'Durham, NC']
['Teifion', '7024', 'Wales']
['Grant', '3333', 'Georgia']
['TimM', '133', 'Alabama']
['Leon Bambrick', '2450', 'Australia']
['Coincoin', '3801', 'Montreal']
['Tom Grochowicz', '125', 'NJ']
['Rex M', '12822', 'US']
['Dillie-O', '7109', 'Prescott, AZ']
['Pete', '653', 'Reynoldsburg, OH']
['Nick Berardi', '9762', 'Phoenixville, PA']
['Kandis', '39', '']
['Shawn', '4248', 'philadelphia']
['Yaakov Ellis', '3651', 'Israel']
['redwards', '21', 'US']
['Dave Ward', '4831', 'Atlanta']
['Liron Yahdav', '527', 'San Rafael, CA']
['Geoff Dalgas', '648', 'Corvallis, OR']
['Kevin Dente', '1619', 'Oakland, CA']
['Tom', '3316', '']
['denny', '573', 'Winchester, VA']
['Karl Seguin', '4195', 'Ottawa']
['Bob', '4652', 'US']
['saniul', '2352', 'London, UK']
['saint_groceon', '1087', 'Houston, TX']
['Tim Boland', '192', 'Cincinnati Ohio']
['Darren Kopp', '5807', 'Woods Cross, UT']

using the following Python script:

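from xml.etree import ElementTree

# Parse the Stack Overflow data dump and choose the attributes to extract.
root = ElementTree.parse('SO Export/so-export-2009-05/users.xml').getroot()
items = ['DisplayName', 'Reputation', 'Location']

def loop1():
    # Print the selected attributes for the first 32 users only.
    for count, i in enumerate(root):
        det = [i.get(x) for x in items]
        print(det)
        if count > 30:
            break

loop1()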

What is the simplest way to classify people into different countries? Are there any ready-made lookup tables available that, given a location, tell me which country it belongs to?

The lookup table need not be totally accurate. Reasonably accurate answers are obtained by querying the location string on Google or, better still, Wolfram Alpha.
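
To make the kind of lookup I have in mind concrete, here is a minimal sketch. The tables `US_STATE_CODES` and `COUNTRY_ALIASES` and the function `guess_country` are hypothetical and deliberately tiny; they only cover a few of the sample rows above, and a usable table would have to be far larger.

```python
# Hypothetical, deliberately incomplete tables for illustration only.
US_STATE_CODES = {"CA", "NC", "NJ", "AZ", "OH", "PA", "OR", "VA", "TX", "UT"}
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "uk": "United Kingdom",
    "wales": "United Kingdom",
    "canada": "Canada",
    "australia": "Australia",
    "israel": "Israel",
}


def guess_country(location):
    """Guess a country from the last comma-separated token, or return None."""
    if not location:
        return None
    last = location.split(",")[-1].strip()
    # A two-letter US state code at the end suggests the United States.
    if last in US_STATE_CODES:
        return "United States"
    # Otherwise fall back to a case-insensitive country/alias match.
    return COUNTRY_ALIASES.get(last.lower())
```

Entries such as 'Georgia', 'Montreal' or 'Cincinnati Ohio' fall through to None with this approach, which is exactly where a bigger table or an external service would be needed.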

+1  A: 

Force users to specify their country, because otherwise you'll have to deal with ambiguities. This would be the right way.

If that's not possible, at least make your best guess in conjunction with their IP address.

For example, ['Grant', '3333', 'Georgia']

Is this Georgia, USA? Or is this the Republic of Georgia?

If their IP address suggests somewhere in Central Asia or Eastern Europe, then chances are it's the Republic of Georgia. If it's North America, chances are pretty good they mean Georgia, USA.
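
As a rough sketch of that tie-break: the code below assumes you already have a two-letter country code derived from the user's IP by some IP-to-country database (not shown). The `AMBIGUOUS` table, the neighbouring-country sets in it, and `resolve_with_ip` are all hypothetical names and values chosen for illustration.

```python
# Hypothetical table: ambiguous text -> list of (country, IP country codes
# that make this reading plausible). The code sets are illustrative guesses.
AMBIGUOUS = {
    "georgia": [
        ("United States", {"US", "CA", "MX"}),
        ("Georgia", {"GE", "AM", "AZ", "TR", "RU"}),
    ],
}


def resolve_with_ip(location, ip_country_code):
    """Pick a reading of an ambiguous location using the IP-derived country.

    Returns the chosen country name, or None if the location is not in the
    table or the IP country does not support any candidate.
    """
    candidates = AMBIGUOUS.get(location.strip().lower())
    if candidates is None:
        return None
    code = ip_country_code.upper()
    for country, plausible_codes in candidates:
        if code in plausible_codes:
            return country
    return None
```

When the IP supports neither reading, returning None rather than guessing keeps the wrong answers out of the data, at the cost of leaving some users unclassified.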

Note that IP-address-to-country mappings aren't 100% accurate, and the database needs to be updated regularly. In my opinion, that is far too much trouble.

hythlodayr
A little error is alright, so long as the majority of answers are right.
Lakshman Prasad
+2  A: 

Your best bet is to use a geocoding API, for example through a library like geopy.

The Google Geocoding API, for example, will return the country in the CountryNameCode field of the response.

With just this one location field the number of false matches will probably be relatively high, but maybe it is good enough.
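
A minimal sketch of how this could look with geopy. Note that it swaps in geopy's Nominatim (OpenStreetMap) geocoder rather than the Google API mentioned above, and it assumes the raw result carries an `address` dict with a `country_code` entry, which is what Nominatim returns when `addressdetails=True` is passed. The helper names (`country_code_from_raw`, `classify`, `make_nominatim_geocode`) and the user agent string are made up for this example.

```python
from functools import partial


def country_code_from_raw(raw):
    """Extract an upper-case country code from a Nominatim-style raw result."""
    code = raw.get("address", {}).get("country_code")
    return code.upper() if code else None


def classify(location_text, geocode):
    """Geocode free text with the given callable; return a country code or None."""
    if not location_text or not location_text.strip():
        return None  # nothing to look up, so skip the network call
    result = geocode(location_text)
    if result is None:
        return None  # the geocoder found no match
    return country_code_from_raw(result.raw)


def make_nominatim_geocode(user_agent):
    """Build a geocode callable backed by geopy's Nominatim (needs network)."""
    from geopy.geocoders import Nominatim  # third-party: pip install geopy

    geolocator = Nominatim(user_agent=user_agent)
    return partial(geolocator.geocode, addressdetails=True)
```

Usage would be along the lines of `geocode = make_nominatim_geocode("so-location-classifier")` followed by `classify("Durham, NC", geocode)`. Passing the geocoder in as a callable makes it easy to swap in another service or to cache results, which matters because public geocoding services are rate-limited.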

If you had server logs, you could also try to look up the user's IP address with an IP geocoder (more information and pointers are on Wikipedia).

levinalex
Thanks for pointing me at geopy. Seems awesome!
Lakshman Prasad