Is there a library for Ruby or PHP that can parse HTML pages and extract the unique data by comparing each page with other, similar pages? It should use some sort of text mining to identify which text is likely repetitive noise and which is unique and useful.

+1  A: 

I'm a PHP guy and have no idea about Ruby, but I think what you want is trivial to achieve:

  • Use something like Simple HTML DOM to parse the pages.
  • For each page, compare all the DOM elements with the same elements on the other pages.
  • Get the path of all elements that have different content; those will be your signal elements.
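
The comparison step could look roughly like the Ruby sketch below. It assumes each page has already been parsed into a hash of element path => text; that format, the `signal_paths` helper, and the sample paths are all made up for illustration, and the parsing itself (with Simple HTML DOM or whatever parser you pick) is left out.

```ruby
# Each page is assumed to be pre-parsed into a Hash of
# element path => text content (hypothetical format, not any library's output).
def signal_paths(pages)
  # Collect every path seen on any page.
  all_paths = pages.flat_map(&:keys).uniq

  all_paths.select do |path|
    # A path is "signal" if its content is not identical on every page
    # (a path missing from some page shows up as nil and also counts as different).
    pages.map { |page| page[path] }.uniq.size > 1
  end
end

# Made-up sample data: two pages that share a template.
page_a = {
  'html/body/div[1]'    => 'Site navigation',
  'html/body/div[2]/h1' => 'Article about cats',
  'html/body/div[3]'    => 'Copyright notice'
}
page_b = {
  'html/body/div[1]'    => 'Site navigation',
  'html/body/div[2]/h1' => 'Article about dogs',
  'html/body/div[3]'    => 'Copyright notice'
}

signals = signal_paths([page_a, page_b])
puts signals
```

Paths whose text is identical on every page (navigation, footer) are treated as noise. The more pages you compare, the more reliable the split should be, since with only two pages a coincidental match can hide a signal element.
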
Alix Axel