web-crawler

How to extract the headline and content from a crawled web page / article?

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end markup since I started working on this crawler. ...
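
For illustration, one common heuristic is to trust explicit metadata before guessing at markup: check og:title, then the first <h1>, then the <title> tag. A minimal Python sketch, assuming the requests and BeautifulSoup libraries (none of this is from the original question):

    import requests
    from bs4 import BeautifulSoup

    def extract_headline(url):
        # Fetch the page; a real crawler would reuse connections and handle errors.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Prefer explicit metadata over guessing at the markup.
        og = soup.find("meta", property="og:title")
        if og and og.get("content"):
            return og["content"].strip()
        # Fall back to the first <h1>, then the document title.
        h1 = soup.find("h1")
        if h1:
            return h1.get_text(strip=True)
        return soup.title.get_text(strip=True) if soup.title else None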

Does Flickr "Know" if a Hotlinked Image Does Not Link Back to Source?

From Flickr's community guidelines: "Do link back to Flickr when you post your photos elsewhere. The Flickr service makes it possible to post images hosted on Flickr to outside web sites. However, pages on other web sites that display images hosted on flickr.com must provide a link from each photo back to its photo page on Flickr." Our...

How to scrape the first paragraph from a wikipedia page?

Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the principal text between the title and the contents box using XPath, or DOM & PHP, or something similar? Is there any PHP library for that? I don't want to use the API because it's a bit complex. Note: I just need that to add a widget under my pages that disp...
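
For what it's worth, a sketch of the XPath approach in Python with lxml (the asker wants PHP, but the idea carries over; the mw-content-text id is an assumption about Wikipedia's markup, which changes over time):

    import requests
    from lxml import html

    def first_paragraph(article_title):
        # Grab the first non-empty <p> inside the main content div.
        url = "https://en.wikipedia.org/wiki/" + article_title
        tree = html.fromstring(requests.get(url, timeout=10).content)
        for p in tree.xpath('//div[@id="mw-content-text"]//p'):
            text = p.text_content().strip()
            if text:
                return text
        return None

    print(first_paragraph("Web_crawler"))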

What's holding up my PHP script?

We've got a PHP crawler running on our web server. When the crawler is running, there are no CPU, memory, or network bandwidth spikes. Everything is normal. But our website (also PHP), hosted on the same server, stops responding. Basically, the crawler blocks any other PHP script from running. What could be the problem? EDIT: ** fsockop...

ASP.NET crawler WebResponse: operation timed out

Hi, I have built a simple threadpool-based web crawler within my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and its meta content. Here's the problem. When I run the crawler from a debug server instance of Visual Studio Express, and provide the starting instance as the ...
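
For reference, the thread-pool pattern described here looks roughly like the following in Python (a sketch, not the asker's C# code; the seed URLs and worker count are placeholders):

    import concurrent.futures
    import urllib.request

    def fetch(url):
        # Worker body: download one page; a real crawler would also
        # extract links and feed an indexer here.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.status, len(resp.read())

    seeds = ["http://localhost/page1", "http://localhost/page2"]  # placeholders
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for url, status, size in pool.map(fetch, seeds):
            print(url, status, size)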

Rails: How to write to a custom log file from within a rake task in production mode?

Hey, I'm trying to write to my log files while running a rake task. It works fine in development mode, but as soon as I switch to the production environment, nothing is written to the log files. I read here http://stackoverflow.com/questions/1022093/how-do-i-use-a-custom-log-for-my-rake-tasks-in-ruby-on-rails/1648159#1648159 that thi...

Where are crawled files stored in the Heritrix web crawler?

Hi, I want to know where the crawled files are stored in the Heritrix web crawler... thanks in advance. ...

Good books or tutorials on web spiders for Ruby on Rails applications

Hi, I have just started learning Ruby on Rails and I really want to do a project on web spiders. I am really looking for good tutorials on this in Ruby on Rails. Could you help? Thanks. ...

Most efficient way to test links

I'm currently developing an app that goes through all the files on a server and checks every single href to see whether it is valid. Using a WebClient or an HttpWebRequest/HttpWebResponse is overkill for this because it downloads the whole page each time, which is useless; I only need to check if the link do...
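
A HEAD request is the usual answer here: it returns only the status line and headers, so nothing is downloaded. A Python sketch assuming the requests library (the 405 fallback is for servers that reject HEAD):

    import requests

    def link_is_alive(url):
        # HEAD fetches only the status line and headers, skipping the body.
        try:
            resp = requests.head(url, allow_redirects=True, timeout=5)
            # Some servers reject HEAD; fall back to a streamed GET and
            # close it without reading the body.
            if resp.status_code == 405:
                resp = requests.get(url, stream=True, timeout=5)
                resp.close()
            return resp.status_code < 400
        except requests.RequestException:
            return False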

Problem extracting text from RSS feeds

Hi, I am new to the world of Ruby and Rails. I have seen Railscast 190 and I just started playing with it. I used SelectorGadget to find the CSS and XPath selectors, and I have the following code:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    url = "http://www.telegraph.co.uk/sport/football/rss"
    doc = Nokogiri::HTML(open...
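
The excerpt is cut off, but one common pitfall with snippets like this is that RSS is XML, so Nokogiri::XML is usually the right parser rather than Nokogiri::HTML. For comparison, a stdlib Python sketch of the same extraction:

    import urllib.request
    import xml.etree.ElementTree as ET

    url = "http://www.telegraph.co.uk/sport/football/rss"
    with urllib.request.urlopen(url, timeout=10) as resp:
        tree = ET.parse(resp)

    # Print the title and link of each <item> in the feed.
    for item in tree.iter("item"):
        print(item.findtext("title"), "-", item.findtext("link"))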

Is this visitor a bot or a user? PHP

I am doing my own visitor tracking with special features that neither Google Analytics nor any other service can provide, as it is customized. I was calling this function near the end of my script, but quickly ran into our clients' pages being hit thousands of times by bots (I assume Google), and my table filled up with around 1,000,000 u...
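
A common first cut is to skip logging when the User-Agent looks like a crawler; well-behaved bots (including Google's) identify themselves. A Python sketch (the pattern list is an assumption):

    import re

    BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp|mediapartners", re.IGNORECASE)

    def looks_like_bot(user_agent):
        # Treat a missing User-Agent as suspicious too.
        return not user_agent or bool(BOT_PATTERN.search(user_agent))

Checking the request's User-Agent before the tracking insert keeps well-behaved crawlers out of the table, though it will never catch bots that impersonate browsers.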

Webcrawler, feedback?

Hey folks, every once in a while I need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is lots of software for that, and online services). Anyway, as a follow-up to my previous question, I've written a little web crawler that can...

Scrapy - Follow RSS links

Hello, I was wondering if anyone has ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work... I am using the following rule:

    rules = (
        Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
             follow=True,
             callback='parse_article'),
    )

(having in mind ...
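
SgmlLinkExtractor is geared toward HTML, which may be why it finds nothing in a feed. One alternative sometimes suggested is Scrapy's XMLFeedSpider, which iterates feed items directly; a sketch against a recent Scrapy API (the spider name and feed URL are placeholders):

    from scrapy.spiders import XMLFeedSpider

    class RSSSpider(XMLFeedSpider):
        name = "rss"
        start_urls = ["http://example.com/feed.xml"]  # placeholder feed
        itertag = "item"  # iterate over each <item> element

        def parse_node(self, response, node):
            # Pull the <link> out of each item and follow it.
            link = node.xpath("link/text()").get()
            if link:
                yield response.follow(link, callback=self.parse_article)

        def parse_article(self, response):
            yield {"url": response.url,
                   "title": response.xpath("//title/text()").get()}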

Script to get the HTML files of a URL list, save them and convert them to an image

I am looking for an application or a script, either for Linux or Mac OS X, to retrieve the HTML files of a set of URLs (only the root HTML file), save them to a file and convert the HTML to an image. Anyone? Thanks in advance. ...
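
A sketch of one way to do this with Python plus the wkhtmltoimage command-line tool (from the wkhtmltopdf project, available on both Linux and Mac OS X); the URL list and file names are placeholders:

    import subprocess
    import urllib.request

    urls = ["http://example.com/", "http://example.org/"]  # placeholders
    for i, url in enumerate(urls):
        # Save the root HTML file.
        urllib.request.urlretrieve(url, "page%d.html" % i)
        # Render the live URL to a PNG image.
        subprocess.run(["wkhtmltoimage", url, "page%d.png" % i], check=True)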

Find contact list in email like Facebook

In Facebook, when I go to the "Find Friends" page, it asks me to enter the login information for my email account, e.g. Yahoo, Gmail, Hotmail, etc., after which it fetches all the contacts from my email account. How does it do this? I want to design my own system like this, but am not sure where to start. Any help will be appreciated. ...

Can EC2 instances be set up to come from different IP ranges?

I need to run a web crawler and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. So I thought distributing this on EC2 instances might help, but I can't find any information about what the outbound IP range will be. I don't want to go to the trouble of figuring out the e...

How do I crawl all the pages on my internal website?

I want to hit every page on my internal website to see if any throw an error just from looking at them. The website does its own error logging, so I just need something to follow links. I am running Windows XP and IIS. ...
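
Tools like wget can do this, but a small script gives more control over what counts as an error. A self-contained Python sketch that walks same-host links breadth-first and reports failures (the start URL is a placeholder):

    import urllib.error
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        # Collect href values from anchor tags.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(start):
        # Breadth-first walk of same-host links, printing any HTTP errors.
        host = urlparse(start).netloc
        seen, queue = {start}, [start]
        while queue:
            url = queue.pop(0)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    body = resp.read().decode("utf-8", errors="replace")
            except urllib.error.URLError as exc:
                print("ERROR", url, exc)
                continue
            parser = LinkParser()
            parser.feed(body)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    crawl("http://localhost/")  # placeholder start URL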

Need help with site classification

Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blog authors are from a specific school and are talking about the school's affairs. May I know the best approach to the crawling, and how I should go about the classification? ...
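
On the classification side, a crude baseline before anything statistical is a keyword filter over the crawled text; the terms below are placeholders for whatever identifies the school:

    # Toy sketch, not a real classifier: flag posts that mention the
    # school by name or by other identifying terms (placeholders).
    SCHOOL_TERMS = {"acme university", "acme u", "campus housing"}

    def mentions_school(text):
        lowered = text.lower()
        return any(term in lowered for term in SCHOOL_TERMS)

Anything beyond that, such as a trained classifier, would need a labeled sample of posts from that school.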

Retrieving coordinates from this page

Hey guys, I'm trying to do some data mining and analyze data based on locations. For this site, http://www.dianping.com/shop/1898365, I am trying to figure out the latitude and longitude by crawling, but I can't seem to figure out where this information is stored. Can someone give me some pointers? ...
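
A speculative sketch: map-enabled pages often embed coordinates in inline JavaScript, so grepping the raw HTML for latitude/longitude-shaped numbers is a reasonable first probe. The key names are guesses, not anything specific to dianping.com:

    import re
    import urllib.request

    url = "http://www.dianping.com/shop/1898365"
    raw = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="replace")

    # Look for things like lat: 31.22, "lng" = 121.48, latitude=..., etc.
    pattern = re.compile(
        r'(?:lat|lng|latitude|longitude)["\']?\s*[:=]\s*(-?\d{1,3}\.\d+)',
        re.IGNORECASE)
    for match in pattern.finditer(raw):
        print(match.group(0))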

Web request returns "DOS"

I am getting "DOS" instead of the HTML string...

    open System.Net

    let getHtmlBasic (uri : System.Uri) =
        use client = new WebClient()
        client.DownloadString(uri)

    let uri = System.Uri("http://www.b-a-r-f.com/")
    getHtmlBasic uri

This gives the string "DOS". Lol, what the...? All other websites seem to work ... ...