views:

1124

answers:

22

Hi Guys,

We were set an algorithm problem in class today, as a "if you figure out a solution you dont have to do this subject". SO of course, we all thought we will give it a go.

Basically, we were provided a DB of 100 words and 10 categories. There is no match between either the words or the categories. So its basically a list of 100 words, and 10 categories.

We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorthmically.

i.e. one of the words is "fishing" the category "sport" --> so this would go into this category. There is some overlap between words and categories such that some words could go into more than one category.

If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.

Does anyone have ANY idea how to start something like this? Or any resources ? Preferrably in C#? :)

P.S - Even a keyword DB or something might be helpful ? Anyone know of any free ones?

Edit: Can use external data - but sorry guys, Google "forbidden" :) (we all said this in class hhaa)

+8  A: 

Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...

For example (results first):

weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado

With code (TODO: add threading ;-p):

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball","tornado" };
    string[] categories = { "sport", "fashion", "weather" };

    using(WebClient client = new WebClient()){
        foreach(string word in words) {
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
Marc Gravell
haha NO GOOGLE!
No google, but still a great answer :)
karim79
agree with you on that one ;)
(the "no google" was added after I posted...)
Marc Gravell
yep :) its from our AI (artificial intelligence) class. "if a human can do it, a machine can too" ... love someone to tell our prof thats not the case :D
the human has had several years of anyalysing external data to learn though!
SillyMonkey
Just replace Google with Bing :)
Daniel Daranas
+1  A: 

I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.

Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?

jerryjvl
A: 

My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)

That might be considered cheating, though.

GWLlosa
+2  A: 

You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.

Create a set of categorization rules like the one above and see how high an accuracy you get.

EDIT:

Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.

karim79
Killing is not a sport.
Lasse V. Karlsen
... unless you're psycho
Lasse V. Karlsen
@Lasse LOL true, was just one idea. How about if(!word.startsWith("kill") :)
karim79
A: 

Personally, I would use a Neural Network implementation (with the input encoding taking each character of the word, and using the integer value of the ASCII code as the input), and then evolve the weights and topology (need topology to adjust to the structure of the words) of the neural network.

Evolve using a Genetic algorithm. The trick is figuring out the best way of measuring the fitness of the set. You could perhaps make the fitness function based on the average number of words per category perhaps, but a judge of how 'similar' a word is still becomes an issue.

The upside of this approach is that once the network is evolved, you can put any new word into it, and then it will categorise it based on what it has learned.

Kazar
This might give some output, but it won't work. Words have semantic meaning beyond that characters that define them. There's no correlation between having 2 i's in the word and the word being a cooking-related term.
Greg D
Just to add, the k-nearest neighbor fits nicely here: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
karim79
This seems the best method, though for learning using humans is the best. Humans are pretty good at putting words and categories together. So hire any number of people to train your algorithm. Or produce a pairing game to fool players into doing it for free ;)
Onots
+15  A: 

First of all you need sample text to analyze, to get the relationship of words. A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.

A different approach would be naive bayes text categorization. Sample text with the assigned category are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category, see bayes spam filtering. I don't know how well that works with single words.

seb
+1 for a sensible answer ;-p
Marc Gravell
I would go in that direction too. Find keywords and categories in texts, and register correlation of keywords and categories in text. (existance, "distance in words").Any large enough and random body of text in the relevant language would do as input
Marco van de Voort
+2  A: 

This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.

Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.

I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.

The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?

Michael Borgwardt
i am thinking something along these lines. maybe even a cross pollinisation of this and nearest neighbor
+1 for seeing a possible reason to give this problem to students. It shows that any (even personal) judgement is based on a data source.
borisCallens
+1  A: 

Fire this teacher.

The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.

Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.

If you're building an expert system and doesn't even know it, the teacher is not good at giving problems.

Lasse V. Karlsen
:) this is what we thought - but our prof loves giving questions to outside the square problems. as he loves to say "want to work at google? you'll figure this out" !!
Yeah, but you're not google, you need to deduce the meaning or context of a word, and this requires huge amount of data + a really good language parsing system and a really good expert system.
Lasse V. Karlsen
thank god I don't want to work at google!
grapkulec
yeah lasse - but its why we are doing phd course :D
+1  A: 

Google is forbidden, but they have almost a perfect solution - Google Sets.

Because you need to unterstand the semantics of the words you need external datasources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or linked pages.

Daniel Brückner
yep :) its from our AI (artificial intelligence) class.
simple solution is to wiki "category word" and see if you get a real result.
Will
A: 

Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse then to start reading existing research and benchmarks.

Appart from Google there exist other 'encyclopedic" datasets you can build of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.

Be creative. There is other data out there besides Google.

Peter Stuer
A: 

Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong... I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with. 5 is a good start I think) and scan the text. In particular I"ll search for the 10 categories. If a category appears more than x time (x again being some threshold you need to experiment with) its a match. Based on that x threshold (ie how many times a category appears in the text) and how may of the top y pages it appears in you can assign a weigh to a word-category pair. for better accuracy you can then do another pass with those non-google search engines with the word-category pair (with a AND relationship) and apply the number of resulting pages to the weight of that pair. Them simply assume the word-category pair with highest weight is the right one (assuming you'll even have more than one option). You can also multi assign a word to a multiple category if the weights are close enough (z threshold maybe). Based on that you can introduce any number of words and any number of categories. And You'll win your challenge. I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....

Good luck

Harel

A: 

Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.

Stephen Denne
A: 

My naive approach:

  1. Create a huge text file like this (read the article for inspiration)
  2. For every word, scan the text and whenever you match that word, count the 'categories' that appear in N (maximum, aka radio) positions left and right of it.
  3. The word is likely to belong in the category with the greatest counter.
Nick D
This is just a codified implementation of Google's search algorithm on a small scale.
Tom Leys
A: 

My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.

jlouis
+3  A: 

So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.

In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.

Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:

  1. Start by more or less randomly assigning name-value pairs.
  2. Output the results.
  3. Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
  4. Input the results of the AMT evaluation back into the system along with the random assignments.
  5. If everything was approved, then we are done.
  6. Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
  7. Return to step 3.

Granted this would entail some financial outlay, but it might also be one of the simplest and accurate versions of the data you are going get on a fairly easy basis.

Rob
+1 For the *outside of the box* option :)
fretje
AMT is what occurred to me, too - though I would make the AMT jobs consist of the list of 10 categories and one word, and ask the user to categorize it.
Nick Johnson
+1  A: 

Yeah I'd go for the wordnet approach. Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem. Hope this helps,

X.

Xavier Guardiola
+6  A: 

Maybe you are all making this too hard.

Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.

Or not...depends on the teacher.

Mark Brittingham
+1 Mark, you left "the box" :)
grapkulec
Lol - thx. I started thinking back to my AI training when I read someone else's description of a neural network solution (I have a Ph.D. in AI) and I was thinking through how a neural network would be trained. Now, we were interested in the mathematics of the learning process when I was designing neural networks and not the source materials so we trained them manually. Then it struck me - this is a trivial problem if you train the system manually.
Mark Brittingham
If this is too simple, then the trick becomes convincing other humans to do the work for you, i.e recaptcha (http://recaptcha.net/)
Tom Leys
Excellent suggestion Tom - makes it all "webby" too.
Mark Brittingham
It would be too easy :)
furtelwart
A: 

Scrape delicious.com and search for each word, looking at collective tag counts, etc.

Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.

The knowledge is in the tags.

Nathan Ridley
A: 

As you don't need to attend the subject when you solve this 'riddle' it's not supposed to be easy I think. Nevertheless I would do something like this (told in a very simplistic way)

Build up a Neuronal Network which you give some input (a (e)book, some (e)books) => no google needed

this network classifies words (Neural networks are great for 'unsure' classification). I think you may simply know which word belongs to which category because of the occurences in the text. ('fishing' is likely to be mentioned near 'sports'). After some training of the neural network it should "link" you the words to the categories.

Calamitous
+1  A: 

Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)

Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,

# S: (n) **fishing**, sportfishing (the act of someone who fishes as a diversion)
* direct hypernym / inherited hypernym / sister term
      o S: (n) **outdoor sport, field sport** (a sport that is played outdoors)
      + direct hypernym / inherited hypernym / sister term
            # S: (n) **sport**, athletics 
            (an active diversion requiring physical exertion and competition)

What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.

Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" to "sport" - say, based on the linear distance of the word-chain, or number of occurrences, et al. (should be trivial to find resources on how to construct similarity measures using wordnet. when the prof says "not to use google", I assume he means programatically and not as a means to get information to read up on!)

As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/

viksit
A: 

You might be able to put use the WordNet database, create some metric to determine how closely linked two words (the word and the category) are and then choose the best category to put the word in.

David Johnstone
A: 

Hi Tom, trying to solve a similar problem do you have any insight?