I'm attempting to use the OpenAmplify API to evaluate the content of a URI. The point is to draw out the topics that are truly relevant to the article. Unfortunately, the topical analysis I'm getting back is:

  1. Huge, and
  2. Varied

Neither quality is terribly useful for what I'm trying to do because the signal-to-noise ratio is heavily skewed towards noise. I'm analyzing web content, so there is a certain amount (perhaps a large amount) of irrelevant content (ads, etc.) involved. I get that.

Nonetheless, many of the topics being returned are either useless (utterly nonsensical, not even words), irrelevant (as in, where did that come from?), or too granular to provide any meaning or insight. I can probably filter out most of this noise using the value, um, value that is returned for each domain, subdomain, topic, and so on, but I don't really know what it means.

Certainly I understand that the value is a measure of "the prominence of the word in the text," but the number itself appears entirely arbitrary, which prevents me from saying something like "ignore any terms with a value less than 50" and having it carry any real meaning.
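To make the idea concrete, here is a minimal sketch of the kind of filter I have in mind. The list-of-dicts shape, the `filter_by_value` helper, and the sample numbers are all made up for illustration; this is not the actual OpenAmplify response format, and 50 is just the arbitrary cutoff from my example above.

```python
# Hypothetical, simplified topic records. The real OpenAmplify response
# is structured differently; these names and values are invented.
topics = [
    {"name": "baseball", "value": 112.4},
    {"name": "xqzt", "value": 3.1},
    {"name": "advertising", "value": 47.9},
]


def filter_by_value(topics, threshold):
    """Keep only the topics whose value is at least the threshold."""
    return [t for t in topics if t["value"] >= threshold]


# 50 is an arbitrary cutoff; without a documented range for the value,
# there is no way to know whether it separates signal from noise.
relevant = filter_by_value(topics, 50)
print([t["name"] for t in relevant])
```

The mechanics are trivial; the open problem is choosing a threshold that means something.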

Are there any range criteria that I can use to help me understand how to use a topic's value score as a filtering threshold? Alternatively, is there another field that I should be using for this sort of filtration?

Thanks for your help.

A: 

From other channels, I've learned that the value attribute can't be evaluated the way I was hoping. It means different things for different signals, and none of them is defined in a way that is meaningful for this kind of requirement.

Rob Wilkerson