This is not strictly a programming question, but it is about selecting the right data mining algorithm.
I want to infer people's ages from their first names, the region they live in, and whether or not they have an internet product. The idea behind it is that:
- some names are old-fashioned or were popular in a particular decade (because of celebrities, politicians etc.) (this may not hold in the USA, but it does in the country of interest),
- young people tend to live in densely populated regions, whereas old people prefer the countryside, and
- the Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
- approx. 500 different names (nominal input variable with too many classes)
- 20 different regions (nominal input variable)
- Internet Yes/No (binary input variable)
- 91 distinct birth years (numerical target variable with range: 1910-1992)
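To make the data layout and the kind of check I have in mind concrete, here is a minimal sketch in plain Python. The names, region labels, and birth years are made up for illustration (they are not taken from the customer database), and `mean_birth_year_by` is just a hypothetical helper. It only compares the average birth year per name, per region, and per internet flag, which is a first sanity check of the three assumptions rather than a predictive model.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy records for illustration only (not real customer data):
# (first_name, region, has_internet, birth_year)
records = [
    ("Gertrud", "rural_1", False, 1928),
    ("Gertrud", "rural_1", False, 1934),
    ("Kevin", "city_1", True, 1990),
    ("Kevin", "city_2", True, 1988),
    ("Thomas", "city_1", True, 1965),
    ("Thomas", "rural_1", False, 1961),
]

BIRTH_YEAR = 3  # index of the target variable in each record


def mean_birth_year_by(rows, column):
    """Average birth year for each distinct value found at index `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[BIRTH_YEAR])
    return {key: mean(years) for key, years in groups.items()}


by_name = mean_birth_year_by(records, 0)      # first name
by_region = mean_birth_year_by(records, 1)    # region
by_internet = mean_birth_year_by(records, 2)  # internet yes/no

for label, table in [("name", by_name), ("region", by_region), ("internet", by_internet)]:
    print(label, table)
```

If the group means differ clearly between names, regions, or internet usage on the real data, that would suggest the corresponding input carries some information about age.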
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable in such a scenario?