Hello all,

I've trained an SVM-based system that, given a question, decides whether a webpage is a good one for answering that question.

The features I selected are: "term frequency in the webpage", "whether a term matches the webpage title", "number of images in the webpage", "length of the webpage", "is it a Wikipedia page?", and "the position of this webpage in the list returned by the search engine".
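
To make this concrete, here is a rough sketch of how such a feature vector might be assembled (Python; the function and variable names are illustrative, not my actual code):

```python
import re

def extract_features(question_terms, page_text, page_title,
                     num_images, rank_in_results, is_wikipedia):
    words = re.findall(r"\w+", page_text.lower())
    tf = sum(words.count(t.lower()) for t in question_terms)
    title_match = int(any(t.lower() in page_title.lower()
                          for t in question_terms))
    return [
        tf,                   # term frequency in the webpage
        title_match,          # does a term match the webpage title?
        num_images,           # number of images in the webpage
        len(words),           # length of the webpage (in words)
        int(is_wikipedia),    # is it a Wikipedia page?
        rank_in_results,      # position in the search-engine results
    ]

print(extract_features(["svm", "classifier"],
                       "An SVM is a large-margin classifier ...",
                       "SVM - Wikipedia", 3, 1, True))
```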

Currently, my system maintains a precision of around 0.4 and a recall of 1. It makes a large proportion of false-positive errors (many bad links are classified as good by my classifier).

Since the accuracy could clearly be improved, I would like to ask for help refining the features I selected for training/testing; I could remove some or add more.

Thanks in advance.

+1  A: 

Hmm...

  • How large is your training set? i.e., how many training documents are you using?
  • What is your test set composed of?
  • Since you're getting too many FPs, I would try training with more (and varied) "bad" webpages (see the sketch after this list)
  • Can you give more details about your different features, like "tf in webpage," etc.?
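
On the FP point: if gathering more and varied "bad" pages is slow, a related knob is to re-weight the classes or raise the decision threshold. A rough sketch, assuming you're using scikit-learn's SVC (adjust for whatever SVM library you actually have):

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: 200 samples of the 6 features, labels 0 = bad, 1 = good.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Penalize mistakes on the "bad" class more heavily so the classifier
# stops labeling nearly everything "good".
clf = SVC(kernel="rbf", class_weight={0: 5.0, 1: 1.0})
clf.fit(X, y)

# Alternatively, keep the model but only accept pages whose decision
# score clears a stricter cutoff than the default of 0.0.
scores = clf.decision_function(X)
pred = (scores > 0.5).astype(int)
print(pred[:10])
```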
The Alchemist
Yes, thanks. The term frequency is the frequency of keywords appearing in the webpage. I pick those keywords manually, taking the 2 or 3 most important and decisive keywords from the original question and then calculating their frequency in the webpage.
Robert
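A minimal Python sketch of that term-frequency feature; normalizing by page length is an optional refinement, not part of Robert's description:

```python
import re

def keyword_tf(keywords, page_text):
    # Count occurrences of the hand-picked keywords in the page,
    # normalized by page length so long pages aren't favored.
    words = re.findall(r"\w+", page_text.lower())
    if not words:
        return 0.0
    hits = sum(words.count(k.lower()) for k in keywords)
    return hits / len(words)

print(keyword_tf(["doctor", "surgery"],
                 "The doctor scheduled the surgery; the doctor agreed."))
```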
Well, without many more details, I can't help out much beyond my original advice. You can probably come up with more features, such as: the number of words in the answer that also appear in the related Wikipedia entry, or the complexity of the answers (via a reading-level calculator; this will probably only work well for very technical or scientific questions). Also, if you're using phrases as the basis of recommendations, you'll probably miss synonyms: if the question is about a doctor and the answer is about a *physician*, it probably won't get caught. Somehow integrating WordNet may be worth it.
The Alchemist
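A minimal sketch of the WordNet idea using NLTK (assuming the WordNet corpus has been downloaded via nltk.download("wordnet")): expand each hand-picked keyword with its synonyms before matching, so "doctor" also matches "physician":

```python
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def expand_with_synonyms(keyword):
    # Collect WordNet lemma names across all senses of the keyword.
    synonyms = {keyword.lower()}
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " ").lower())
    return synonyms

print(expand_with_synonyms("doctor"))  # includes 'physician', 'doc', ...
```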