screen-scraping

Problems in a Ruby screen-scraping script

Hi! I have a small crawler/screen-scraping script that worked half a year ago but doesn't work anymore. I checked the HTML and CSS values that the regular expression matches against the page source, and they are still the same, so from this point of view it should work. Any guesses? require "open-uri" # output file f = open 'results.csv'...

How do search engines find relevant content?

How does Google find relevant content when it's parsing the web? Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would it use to find the most relevant content on a web page? My thoughts would be that it would search for all paragraphs, order by the length of each paragraph and then f...

How to extract data from web 2.0 graphs using a scraper

I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the mouse is rolled across it. Is there any way to automate the extraction of this data? ...

Trouble Scraping .HTM File

Hi all, I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores off rivals.yahoo.com (sports is my thing, so why not scrape something interesting?) but I am stuck on the NHL's game summary pages. I think this is kind of an interesting problem so I would p...

Need to retrieve content from specific news sources, blogs, etc. Third-party software, or build my own?

Hi guys, looking for some guidance. In a nutshell, I've got a requirement to get article content from specific sources that will be used for data analysis. So we've got to get the latest articles and store them in our database for processing later on. I'm not really sure of the best approach. Our code for current news retrieval...

How do I access BookCrossing data when they don't have an API?

BookCrossing doesn't have an API right now (the roadmap suggests one is planned, but with no expected date of arrival). Any ideas on how to quickly get the current location of a specific book? ...

PHP function to grab all links inside a <DIV> on a remote site by scraping

Does anyone have a PHP function that can grab all links inside a specific DIV on a remote site? Usage might be: $links = grab_links($url,$divname); and it would return an array I can use. Grabbing links I can figure out, but I'm not sure how to make it only do it within a specific div. Thanks! Scott ...
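
The "only within a specific div" part comes down to tracking whether the parser is currently inside the target div. The following is a rough sketch of that idea in Python rather than PHP, using only the standard library. It assumes the div is identified by its id attribute (the question does not say whether $divname is an id or a class), and the names DivLinkParser, grab_links and sample are made up for illustration. It works on an HTML string; fetching the remote page is left out.

```python
from html.parser import HTMLParser


class DivLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div id=div_id>."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id  # assumption: the target div is matched by id
        self.depth = 0        # div nesting depth inside the target; 0 = outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                # A nested div inside the target: go one level deeper.
                self.depth += 1
            elif attrs.get("id") == self.div_id:
                # Entering the target div.
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def grab_links(html, div_id):
    """Return all link targets found inside the div with the given id."""
    parser = DivLinkParser(div_id)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, for illustration only.
sample = """
<div id="nav"><a href="/home">Home</a></div>
<div id="content">
  <a href="/a">A</a>
  <div class="inner"><a href="/b">B</a></div>
</div>
<a href="/outside">Outside</a>
"""

print(grab_links(sample, "content"))  # ['/a', '/b']
```

For a remote page, the HTML string could come from something like urllib.request.urlopen(url).read().decode(), with the encoding chosen to suit the site. Counting div depth keeps links in nested divs while stopping at the target's closing tag, though it relies on the page's div tags being reasonably balanced.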

JSoup - Select all comments

Hey, I want to select all comments from a document using JSoup. I would like to do something like this: for(Element e : doc.select("comment")) { System.out.println(e); } I have tried this: for (Element e : doc.getAllElements()) { if (e instanceof Comment) { } } But the following error occurs in Eclipse: "Incompatible condi...
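
The error is likely because comments are not elements: in jsoup, Comment and Element are separate kinds of Node, so an Element can never be an instance of Comment. The general idea of collecting comment nodes separately from elements can be sketched as follows. This is Python's standard-library parser, not jsoup, and the names CommentCollector, collect_comments and sample are invented for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Gather the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called once per <!-- ... --> block; data excludes the delimiters.
        self.comments.append(data.strip())


def collect_comments(html):
    """Return the stripped text of all comments, in document order."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


# Hypothetical sample markup, for illustration only.
sample = "<html><!-- header --><body><p>Hi</p><!-- footer --></body></html>"

print(collect_comments(sample))  # ['header', 'footer']
```

In jsoup itself, the equivalent would presumably be to walk the child nodes of each element rather than the elements, and test each node's type.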