views:

45

answers:

2

Say I have a table with the following scheme (note: this example is hypothetical, though the real use case is similar).

Type      | Name         | Notes
=====================================================================================
Gender    | Gender       | Either Male or Female (not null)
GeoCoord  | Location     | Lattitude and longitude coordinates
string    | FullName     | 
Date      | BirthDate    | 
bool?     | LikesToParty | Data from a survey (null for people who didn't answer)

Manually looking at the data I know there is a strong correlation between LikesToParty and certain specific configurations of the other values. For example, men who have Wells as their middle name and who are between 15 and 30 years old and who comes from the LA area almost certainly has true in LikeToParty. I would like to predict the value of LikesToParty for users that didn't answer the survey.

How do I mine this data using C# without having to buy an expensive package like analysis services? Are there any free libraries for c#?

I've already made a neural network that is capable of most of what I describe in my example above, but it is extremely slow to train and I'm not sure about if this is the right way to go. Maybe there is a better, more efficient, way to segment the data?

+2  A: 

Because you are using both discrete and continous data, you might use a decision tree (C4.5, CART). There are some implemented libraries for them; don't beware of Java libs, as you can use the IKVM implementation of Java. For example, I have used the Weka API from C#.

lmsasu
I'll give it a spin and see how it goes. Thanks for the reply.
JohannesH
+1 for Weka, which makes it easy to try a bunch of learning algorithms. http://weka.wikispaces.com/Use+WEKA+with+the+Microsoft+.NET+Framework
Nate Kohl
I think Weka is the way to go... Thank you!
JohannesH
+2  A: 

What you describe is a standard problem in machine learning called: data classification.

Methods for data classifiation include: Neural Networks (as you mention), Support Vector Machines (see for example LIBSVM), Decision Trees (as mentioned in the previous answer). The output from these types of methods while very accurate can be difficult to interpret. You can also look probabilistic graphical models like Bayesian Networks, to answer deeper questions like: what is the probability that a male from southern California who likes to party is in his mid twenties.

carlosdc
Thanks for the answer! :)
JohannesH
lmsasu