Let's say you need to monitor the brand "ONE" online. What algorithms can be used to separate pages about the brand ONE from pages that merely contain the common word "one"?

I'm thinking maybe Bayes could work, but are there other ways to do this?

+4  A: 

You may want to associate the brand ONE with its products, its executive officers or its competitors in your monitoring.

mouviciel
Yes, additional keywords are a good idea. Thanks!
Christian Davén
+1  A: 

I've done something similar by treating Wikipedia as a giant ontology (where each hyperlink is a relation between a source node and an end node).

EDIT: A very rough algorithm, using the "Java" example:

  • Query "Java" in Wikipedia. Among others, this should give you (at least) the island and the programming language.
  • Get the inbound and outbound nodes of these base pages (from the base pages' hyperlinks).
  • You now have small sets of correlated words, one set per meaning.
  • Compute a "distance" from each set to the page you are classifying and pick the set with the minimum distance.

The distance you'll use is very subjective and must be tweaked a bit to match your needs. You might have trouble getting the "core" of each page too, as parsing HTML will be a major pain.
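
A minimal sketch of the last step, assuming Jaccard distance over word sets as the "distance" (that choice is mine, not part of the answer above, and the related-word sets are made up for illustration rather than pulled from Wikipedia):

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0 means identical sets, 1 means disjoint."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

def closest_sense(page_text, sense_words):
    """Return the sense whose related-word set is nearest to the page."""
    page = tokenize(page_text)
    return min(sense_words, key=lambda s: jaccard_distance(page, sense_words[s]))

# Hypothetical related-word sets; in practice they would come from the
# link neighbourhood of each Wikipedia base page.
senses = {
    "island": {"indonesia", "jakarta", "island", "volcano", "sumatra"},
    "language": {"programming", "compiler", "class", "jvm", "oracle"},
}
page = "The JVM loads each class after the compiler has produced bytecode."
print(closest_sense(page, senses))
```

Any other set or vector distance could be swapped in for jaccard_distance; as said above, it needs tuning.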

Sylvestre Equy
Could you please expand a bit? I don't understand what you mean I should do.
Christian Davén
+3  A: 

If it's not really a unique word, then I would suggest the following approach.

Let's imagine that our keyword is Java. Then there are at least two categories: pages about programming and pages about tourism in Indonesia. We are interested in the first one.

Let's take a small reference text about Java (maybe from books or from Wikipedia). Then let's pick some threshold (for example, 0.7). Then compare our text with different pages (one of the fastest ways is the classic Vector Space Model algorithm; you can implement it yourself or find an implementation via Google). Then compare each similarity score with your threshold and filter out the weak results.
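
Here is a rough sketch of that comparison, assuming raw term-frequency vectors and cosine similarity (a real setup would more likely use TF-IDF weights and drop stop words; the function names and sample texts are made up):

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Map each lowercase word to how often it occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0 to 1)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_about_topic(reference_text, page_text, threshold=0.7):
    """Keep a page only if it is similar enough to the reference text."""
    score = cosine_similarity(term_vector(reference_text), term_vector(page_text))
    return score >= threshold

reference = "java is a programming language compiled to bytecode for the jvm"
pages = [
    "the java compiler turns a programming language into jvm bytecode",
    "java is an island of indonesia with many volcanoes",
]
for p in pages:
    print(round(cosine_similarity(term_vector(reference), term_vector(p)), 3), p)
```

On texts this short the scores are noisy, so the 0.7 threshold is only a starting point and has to be tuned on real pages.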


About using a Bayes algorithm: it's not a bad approach, imo. But you should 'teach' your algorithm very carefully, because a few bad training inputs can spoil the whole thing.

Let me explain. The input to your Bayes classifier is a text containing your brand word. The output is a probability in [0, 1] that the text is about your brand rather than something else. In practice this algorithm very often gives results near 0 or near 1 and rarely returns values between 0.2 and 0.8. This means the algorithm is very sensitive to small variations: one or two words in a 100-word text can seriously affect the result.
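
To make this concrete, here is a small sketch of a multinomial naive Bayes classifier with add-one smoothing (the class name, labels and training sentences are invented for illustration). The score is a product of per-word probabilities, which is why every extra word pushes the result further towards 0 or 1:

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"brand": Counter(), "other": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def prob_brand(self, text):
        """P(brand | text). Assumes both labels have at least one training text."""
        vocab = set(self.word_counts["brand"]) | set(self.word_counts["other"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for w in tokens(text):
                if w in vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / (total + len(vocab)))
            log_score[label] = score
        # Normalise in log space to avoid underflow on long texts.
        top = max(log_score.values())
        exp = {k: math.exp(v - top) for k, v in log_score.items()}
        return exp["brand"] / (exp["brand"] + exp["other"])

nb = NaiveBayes()
nb.train("ONE launches new phone and tablet", "brand")
nb.train("ONE shares rise after phone sales", "brand")
nb.train("one of the best days of the year", "other")
nb.train("pick one apple from the tree", "other")
print(nb.prob_brand("ONE phone sales"))
print(nb.prob_brand("one of the best"))
```

With a training set this tiny the numbers mean little; the point is only how the labelled examples drive the output, so mislabelled examples hurt directly.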

Roman
I still don't see how VSM is any better than Bayes. Convince me, please?
Christian Davén
Actually, it's a nice reason to run some experiments. I've implemented both algorithms before and it's not difficult at all (you can also download existing implementations). Prepare test input data (it shouldn't be small) and check which algorithm satisfies your requirements better.
Roman
+2  A: 

The term you're looking for is concept learning or concept extraction. The word "one" appears in many pages, but most often it refers to the concept of one as a quantity. Only rarely does it refer to ONE the brand. (Another frequently used example is "sun", as in the astronomical object, versus the company named Sun.)

I know Ari Rappoport has a lot of research on this topic. Practically this boils down to something like mouviciel's answer, but Ari's research is also about how you can automatically infer what related words you need to look for in order to distinguish one-as-number from one-the-brand.

Ofri Raviv
A: 

I would suggest an unsupervised approach to the problem:

  1. Get as many documents as possible that describe "ONE" in the correct context and build a corpus from them.

  2. Find statistically improbable phrases (SIPs) in that corpus, compared against a standard English corpus.

This website gives a good example:
http://sip.s-anand.net/?url=http://en.wikipedia.org/wiki/Apple_Inc.

As you can see, brand-specific terms such as iPod, PowerPC etc. are easily picked out.
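
A simplified sketch of the idea, scoring single words rather than phrases, by the log-ratio of their smoothed frequency in the brand corpus to their frequency in a background corpus (this is an assumption about how such scoring can be done, not the method the linked site uses; the sample texts are made up):

```python
import math
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def improbable_terms(brand_text, background_text, top_n=5):
    """Rank brand-corpus words by how over-represented they are."""
    brand = word_counts(brand_text)
    background = word_counts(background_text)
    vocab = set(brand) | set(background)
    # Add-one smoothing so words missing from the background do not divide by zero.
    brand_total = sum(brand.values()) + len(vocab)
    background_total = sum(background.values()) + len(vocab)
    scores = {}
    for term in brand:
        p_brand = (brand[term] + 1) / brand_total
        p_background = (background[term] + 1) / background_total
        scores[term] = math.log(p_brand / p_background)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

brand_corpus = "ipod sales grew while the ipod and the mac sold well"
background_corpus = "the weather was fine and the market sold fruit while the sun shone"
print(improbable_terms(brand_corpus, background_corpus, top_n=3))
```

The same scoring works on word pairs or triples if you count n-grams instead of single words.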

Once you have extracted those, you can create a Google Alert or a similar equivalent (if Google Alerts are too simplistic) with queries like "SIP" AND "ONE", where "SIP" is one of the extracted phrases, to monitor new articles.

Of course, since this approach is unsupervised, it might not be very efficient, but it should do the job.

You can find the code for SIP, built on Google App Engine, here: http://code.google.com/p/statistically-improbable-phrases/source/browse/#svn/trunk
A: 

A different approach could be to look the page up in Google Directory, which has 'the web organized by topic into categories'. You could potentially use the category information for each page to decide what it is about.

Daniel I-S