This is not directly a programming question, but it is about selecting the right data mining algorithm.

I want to infer people's ages from their first names, the region they live in, and whether or not they have an Internet product. The idea behind it is that:

  • there are names that are old-fashioned or popular in a particular decade (celebrities, politicians, etc.); this may not hold in the USA, but it does in the country of interest,
  • young people tend to live in highly populated regions whereas old people prefer the countryside, and
  • the Internet is used more by young people than by old people.

I am not sure whether these assumptions hold, but I want to test them. So what I have is 100K observations from our customer database with:

  • approx. 500 different names (nominal input variable with too many classes)
  • 20 different regions (nominal input variable)
  • Internet Yes/No (binary input variable)
  • 91 distinct birth years (numerical target variable, range 1910-1992)

Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?

+1  A: 

New answer

I would try using regression, but in the manner that I describe here. I would binarize each variable, i.e., give each category its own 0/1 column (often called dummy or one-hot coding). The Internet variable is already binary, but I would still turn it into two separate indicator columns. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the Internet variable.

I have 4 women. Here are their data:

Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35

I would generate a matrix, A, like this (each row represents a respective woman in my list):

[[1,0,0,1,0], 
 [0,1,0,1,0],
 [1,0,0,0,1],
 [0,0,1,0,1]]

The first three columns represent the names and the last two represent Internet/No Internet. Thus, the columns are

[Gertrude, Jennifer, Mary, Internet, No Internet]

You can keep doing this with more names (500 columns for the names) and for the regions (20 columns for those). Then you will just be solving the standard least-squares problem A*x = b, where b for the above example is

b=[[57],
   [23],
   [60],
   [35]]

You may be worried that A will now be a huge matrix, but it is an extremely sparse one and can therefore be stored very efficiently in a sparse matrix format. Each row has only three 1's in it (one for the name, one for the region, one for the Internet flag) and the rest are 0. You can then solve the system with a sparse solver. Finally, you will want to do some sort of correlation test on the resulting predicted ages to see how effective the model is.
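
As a rough illustration (my addition, not part of the original answer), here is a minimal Python sketch of this approach using scipy's sparse matrices and the lsqr solver, applied to the four-row example above; all variable names are just illustrative:

# Minimal sketch of the sparse one-hot regression described above,
# using the four example rows. Assumes numpy/scipy are available.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

names    = ["Gertrude", "Jennifer", "Gertrude", "Mary"]
internet = ["Internet", "Internet", "No Internet", "No Internet"]
ages     = [57, 23, 60, 35]

# One column per distinct category, exactly as in the matrix A above.
levels = sorted(set(names)) + ["Internet", "No Internet"]
col = {v: i for i, v in enumerate(levels)}

A = lil_matrix((len(names), len(levels)))
for row, (nm, net) in enumerate(zip(names, internet)):
    A[row, col[nm]] = 1
    A[row, col[net]] = 1

# Solve the least-squares problem A*x = b with a sparse solver;
# lsqr copes with the redundant (collinear) dummy columns.
x = lsqr(A.tocsr(), np.array(ages, dtype=float))[0]
print(dict(zip(levels, np.round(x, 1))))

With the full 100K-row data the construction is the same, only with the name and region columns added.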

Justin Peel
I had thought about that, but yes, I was worried about how huge the matrix would be. You're right, though: using a sparse matrix format would definitely help. I will give it a try.
ercan
Taking the 20 regions into account as well, I would have 521 input variables and one numerical target variable. Then I would compare the average error on the test set against the baseline of just using the average age per name, and see whether the regression model brings any significant benefit compared to the simple one. If not, I would just use Occam's razor ;)
ercan
If you encode the Internet variable like I did, then you get 522 (separate columns for Internet and No Internet), but that extra column might not help.
Justin Peel
+2  A: 

I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.

One possibility is to cluster the ages, and then treat the clusters as discrete variables. If that is not appropriate, another possibility is to divide the ages into equal-frequency bins.

One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
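
For example (my sketch, not part of the original answer, assuming pandas is available; the column names are made up), the average age per name can be computed and binned like this:

# Sketch: bin the average age per name (not the raw ages) into a small,
# discrete target. The column names "name" and "age" are illustrative.
import pandas as pd

df = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "age":  [57, 23, 60, 35],
})

avg_age_per_name = df.groupby("name")["age"].mean()

# Partition the per-name averages into a few equal-frequency bins; with
# ~500 names, keeping the number of bins small avoids overly fine classes.
bins = pd.qcut(avg_age_per_name, q=2, labels=["younger names", "older names"])

# Map the discrete label back onto each customer as the new target.
df["age_group"] = df["name"].map(bins)
print(df)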

However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.

John the Statistician
Actually, the ages are pretty much Gaussian-distributed. I don't think a reasonable clustering is possible.
ercan
Fair enough, I'll edit my answer to reflect this.
John the Statistician
Actually, at first I began with the idea of simply taking the average age per name directly. However, wouldn't that be too simple? I would like a more realistic model that takes more parameters into account. Can I somehow merge this simple approach with a more sophisticated one?
ercan
I don't think I said what I meant clearly enough. When designing the output feature, instead of forming the ages into categories by the actual age distribution, form them into clusters across the distribution of average ages per name. Then you can use a decision tree or whatever you like. If it were me, I'd probably use a perceptron with the initial weights given by the percentage of that feature falling in a given bin.
John the Statistician
At this point I haven't said very much about training vs. testing and the like. Of course that applies to feature formation as well!
John the Statistician
+1  A: 

You might check out the Baby Name Wizard (babynamewizard.com). It shows changes in name frequency over time and should help you convert names into a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value for your regions. I would suggest an additional flag for the availability of DSL access: many rural areas don't have DSL coverage, and no coverage means less demand for Internet services.

My first inclination would be to divide your respondents into two groups: those very likely to have used computers in school or at work, and those much less likely to have done so. Exposure to computers early in schooling or a career probably affects how likely someone is to use a computer later in life. Then you might run regressions on the groups separately. This should eliminate some of the natural correlation among your inputs.
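
To make that concrete, here is one way to read it as a sketch (my interpretation, not the answerer's: the existing Internet flag stands in as the "computer exposure" proxy, which is an assumption, and the column names are made up):

# Sketch of fitting separate dummy-variable regressions per exposure group.
# Using the Internet flag as the exposure proxy is an assumption on my part.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  [True, True, False, False],
    "birthyear": [1953, 1987, 1950, 1975],
})

for exposed, group in df.groupby("internet"):
    # Dummy-code the remaining nominal inputs within each group.
    X = pd.get_dummies(group[["name", "region"]]).to_numpy(dtype=float)
    X = np.hstack([np.ones((len(X), 1)), X])   # add an intercept column
    y = group["birthyear"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("internet =", exposed, "-> coefficients:", np.round(coef, 2))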

Grembo
The webpage you suggested is really nice, thank you. However, the country of interest is Germany, not the US.
ercan
A: 

I would use a learning algorithm that accepts nominal attributes and a numeric class (target), like M5 (model trees or rules). Perhaps I would combine it with the bagging meta-learner to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.

The algorithm is implemented in R (package RWeka), and it can also be found in the open source machine learning software Weka.
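
M5 itself is not part of scikit-learn, so as a rough Python stand-in for the same idea (trees with a numeric target, plus bagging to reduce variance) here is a sketch with bagged regression trees over dummy-coded inputs; note that, unlike M5, plain regression trees do not fit linear models in the leaves, and the data frame below is made up:

# Rough Python stand-in for the M5-plus-bagging idea (M5 itself lives in
# Weka/RWeka): bagged regression trees over dummy-coded nominal inputs.
import pandas as pd
from sklearn.ensemble import BaggingRegressor   # default base learner is a regression tree

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  ["yes", "yes", "no", "no"],
    "birthyear": [1953, 1987, 1950, 1975],
})

X = pd.get_dummies(df[["name", "region", "internet"]])
y = df["birthyear"]

model = BaggingRegressor(n_estimators=25, random_state=0).fit(X, y)
print(model.predict(X.iloc[:2]))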

For more information see:

Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.

Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.

gd047
A: 

I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that is easy to interpret and that shows the influence of each nominal variable and its different values. You can also use regression with dummy variables to represent the nominal attributes; that is a good solution as well. Alternatively, you can use other algorithms such as SVM (SMO), after the same transformation of the nominal variables into binary dummies as in regression.
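
To illustrate the SVM route after dummy-coding (my sketch, using scikit-learn's SVR as one convenient implementation; the data frame and the hyperparameters are made up):

# Sketch of the SVM approach after converting nominal inputs to binary
# dummies. scikit-learn's SVR is used here as one convenient implementation.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  ["yes", "yes", "no", "no"],
    "birthyear": [1953, 1987, 1950, 1975],
})

X = pd.get_dummies(df[["name", "region", "internet"]])   # binary dummy variables
y = df["birthyear"]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X, y)
print(model.predict(X))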

mariana soffer