This is not directly a programming question, but it is about selecting the right data mining algorithm.

I want to infer people's ages from their first names, the region they live in, and whether or not they have an Internet product. The idea behind it is that:

  • there are names that are old-fashioned or popular in a particular decade (celebrities, politicians, etc.); this may not hold in the USA, but it does in the country of interest,
  • young people tend to live in highly populated regions whereas old people prefer the countryside, and
  • the Internet is used more by young people than by old people.

I am not sure whether these assumptions hold, but I want to test them. So what I have is 100K observations from our customer database with:

  • approx. 500 different names (nominal input variable with too many classes)
  • 20 different regions (nominal input variable)
  • Internet Yes/No (binary input variable)
  • 91 distinct birth years (numerical target variable, range 1910-1992)

Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?

+1  A: 

New answer

I would try using regression, but in the manner that I describe here. I would binarize each variable, i.e., give each category its own 0/1 column (often called dummy or one-hot coding). The Internet variable is already binary, but I would still turn it into two separate indicator columns. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the Internet variable.

I have 4 women. Here are their data:

Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35

I would generate a matrix, A, like this (each row represents a respective woman in my list):

[[1,0,0,1,0], 
 [0,1,0,1,0],
 [1,0,0,0,1],
 [0,0,1,0,1]]

The first three columns represent the names and the last two represent Internet/No Internet. Thus, the columns are

[Gertrude, Jennifer, Mary, Internet, No Internet]

You can keep doing this with more names (500 columns for the names) and for the regions (20 columns for those). Then you will just be solving the standard least-squares problem A*x = b, where b for the above example is

b=[[57],
   [23],
   [60],
   [35]]

You may be worried that A will now be a huge matrix, but it is an extremely sparse one and can therefore be stored very efficiently in a sparse matrix format. Each row has only three 1's in it (one for the name, one for the region, one for the Internet flag) and the rest are 0. You can then solve the system with a sparse solver. Finally, you will want to do some sort of correlation test on the resulting predicted ages to see how effective the model is.
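
As a rough illustration (my addition, not part of the original answer), here is a minimal Python sketch of this approach using scipy's sparse matrices and the lsqr solver, applied to the four-row example above; all variable names are just illustrative:

# Minimal sketch of the sparse one-hot regression described above,
# using the four example rows. Assumes numpy/scipy are available.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

names    = ["Gertrude", "Jennifer", "Gertrude", "Mary"]
internet = ["Internet", "Internet", "No Internet", "No Internet"]
ages     = [57, 23, 60, 35]

# One column per distinct category, exactly as in the matrix A above.
levels = sorted(set(names)) + ["Internet", "No Internet"]
col = {v: i for i, v in enumerate(levels)}

A = lil_matrix((len(names), len(levels)))
for row, (nm, net) in enumerate(zip(names, internet)):
    A[row, col[nm]] = 1
    A[row, col[net]] = 1

# Solve the least-squares problem A*x = b with a sparse solver;
# lsqr copes with the redundant (collinear) dummy columns.
x = lsqr(A.tocsr(), np.array(ages, dtype=float))[0]
print(dict(zip(levels, np.round(x, 1))))

With the full 100K-row data the construction is the same, only with the name and region columns added.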

Justin Peel
I had thought about that, but yes, I was worried about how huge the matrix would be. You're right, though: using a sparse matrix format would definitely help. I will give it a try.
ercan
Taking the 20 regions into account as well, I would have 521 input variables and one numerical target variable. Then I would compare the average error on the test set against the baseline of just using the average age per name, and see whether the regression model brings any significant benefit compared to the simple one. If not, I would just use Occam's razor ;)
ercan
If you encode the Internet variable like I did, then you get 522 (separate columns for Internet and No Internet), but that extra column might not help.
Justin Peel
+2  A: 

I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.

One possibility is to cluster the ages, and then treat the clusters as discrete variables. If that is not appropriate, another possibility is to divide the ages into equal-frequency bins.

One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages, and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
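
For example (my sketch, not part of the original answer, assuming pandas is available; the column names are made up), the average age per name can be computed and binned like this:

# Sketch: bin the average age per name (not the raw ages) into a small,
# discrete target. The column names "name" and "age" are illustrative.
import pandas as pd

df = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "age":  [57, 23, 60, 35],
})

avg_age_per_name = df.groupby("name")["age"].mean()

# Partition the per-name averages into a few equal-frequency bins; with
# ~500 names, keeping the number of bins small avoids overly fine classes.
bins = pd.qcut(avg_age_per_name, q=2, labels=["younger names", "older names"])

# Map the discrete label back onto each customer as the new target.
df["age_group"] = df["name"].map(bins)
print(df)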

However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.

John the Statistician
Actually, the ages are pretty much Gaussian-distributed. I don't think a reasonable clustering is possible.
ercan
Fair enough, I'll edit my answer to reflect this.
John the Statistician
Actually, at first I began with the idea of simply taking the average age per name directly. However, wouldn't that be too simple? I would like a more realistic model that takes more parameters into account. Can I somehow merge this simple approach with a more sophisticated one?
ercan
I don't think I said what I meant clearly enough. When designing the output feature, instead of forming the ages into categories by the actual age distribution, form them into clusters across the distribution of average ages per name. Then you can use a decision tree or whatever you like. If it were me, I'd probably use a perceptron with the initial weights given by the percentage of that feature falling in a given bin.
John the Statistician
At this point I haven't said very much about training vs. testing and the like. Of course that applies to feature formation as well!
John the Statistician
+1  A: 

You might check out the Baby Name Wizard (babynamewizard.com). It shows changes in name frequency over time and should help you convert names into a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value for your regions. I would suggest an additional flag for the availability of DSL access: many rural areas don't have DSL coverage, and no coverage means less demand for Internet services.

My first inclination would be to divide your respondents into two groups: those very likely to have used computers in school or at work, and those much less likely to have done so. Exposure to computers early in schooling or a career probably affects how likely someone is to use a computer later in life. Then you might run regressions on the groups separately. This should eliminate some of the natural correlation among your inputs.
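
To make that concrete, here is one way to read it as a sketch (my interpretation, not the answerer's: the existing Internet flag stands in as the "computer exposure" proxy, which is an assumption, and the column names are made up):

# Sketch of fitting separate dummy-variable regressions per exposure group.
# Using the Internet flag as the exposure proxy is an assumption on my part.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  [True, True, False, False],
    "birthyear": [1953, 1987, 1950, 1975],
})

for exposed, group in df.groupby("internet"):
    # Dummy-code the remaining nominal inputs within each group.
    X = pd.get_dummies(group[["name", "region"]]).to_numpy(dtype=float)
    X = np.hstack([np.ones((len(X), 1)), X])   # add an intercept column
    y = group["birthyear"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("internet =", exposed, "-> coefficients:", np.round(coef, 2))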

Grembo
The webpage you suggested is really nice, thank you. However, the country of interest is Germany, not the US.
ercan
A: 

I would use a learning algorithm that accepts nominal attributes and a numeric class (target), like M5 (model trees or rules). Perhaps I would combine it with the bagging meta-learner to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.

The algorithm is implemented in R (package RWeka), and it can also be found in the open source machine learning software Weka.
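
M5 itself is not part of scikit-learn, so as a rough Python stand-in for the same idea (trees with a numeric target, plus bagging to reduce variance) here is a sketch with bagged regression trees over dummy-coded inputs; note that, unlike M5, plain regression trees do not fit linear models in the leaves, and the data frame below is made up:

# Rough Python stand-in for the M5-plus-bagging idea (M5 itself lives in
# Weka/RWeka): bagged regression trees over dummy-coded nominal inputs.
import pandas as pd
from sklearn.ensemble import BaggingRegressor   # default base learner is a regression tree

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  ["yes", "yes", "no", "no"],
    "birthyear": [1953, 1987, 1950, 1975],
})

X = pd.get_dummies(df[["name", "region", "internet"]])
y = df["birthyear"]

model = BaggingRegressor(n_estimators=25, random_state=0).fit(X, y)
print(model.predict(X.iloc[:2]))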

For more information see:

Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.

Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.

gd047
A: 

I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that is easy to interpret and that shows the influence of each nominal variable and its different values. You can also use regression with dummy variables to represent the nominal attributes; that is a good solution as well. Alternatively, you can use other algorithms such as SVM (SMO), after the same transformation of the nominal variables into binary dummies as in regression.
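
To illustrate the SVM route after dummy-coding (my sketch, using scikit-learn's SVR as one convenient implementation; the data frame and the hyperparameters are made up):

# Sketch of the SVM approach after converting nominal inputs to binary
# dummies. scikit-learn's SVR is used here as one convenient implementation.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.DataFrame({
    "name":      ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region":    ["rural", "urban", "rural", "urban"],
    "internet":  ["yes", "yes", "no", "no"],
    "birthyear": [1953, 1987, 1950, 1975],
})

X = pd.get_dummies(df[["name", "region", "internet"]])   # binary dummy variables
y = df["birthyear"]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X, y)
print(model.predict(X))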

mariana soffer