I need some guidelines on how to detect the headline and main content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.
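To give an idea of what I've been experimenting with so far, here is a rough Python/BeautifulSoup sketch; the tag choices and the "element whose direct <p> children hold the most text" heuristic are my own assumptions, not an established method:

# Rough sketch: headline from <h1> / og:title / <title>, content from the element
# whose direct <p> children carry the most text. These heuristics are guesses and
# will break on plenty of real-world markup.
import requests
from bs4 import BeautifulSoup

def headline_and_content(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    og = soup.find("meta", property="og:title")
    if soup.h1:
        headline = soup.h1.get_text(strip=True)
    elif og and og.get("content"):
        headline = og["content"]
    else:
        headline = soup.title.get_text(strip=True) if soup.title else None

    def own_paragraph_text(el):
        return sum(len(p.get_text()) for p in el.find_all("p", recursive=False))

    blocks = soup.find_all(["article", "div", "section"])
    best = max(blocks, key=own_paragraph_text, default=None)
    return headline, best.get_text(" ", strip=True) if best else None

print(headline_and_content("http://example.com/"))  # placeholder URL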
...
From Flickr's community guidelines:
"Do link back to Flickr when you post your photos elsewhere.
The Flickr service makes it possible to post images hosted on Flickr to outside web sites. However, pages on other web sites that display images hosted on flickr.com must provide a link from each photo back to its photo page on Flickr."
Our...
Let's say I want to grab the first paragraph in this Wikipedia page. How do I get the main text between the title and the contents box using XPath, or DOM and PHP, or something similar?
Is there any PHP library for that? I don't want to use the API because it's a bit complex.
Note: I just need that to add a widget under my pages that disp...
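To illustrate what I mean, here is a rough sketch of the idea in Python with lxml; the //div[@id="mw-content-text"]//p selector is my assumption about Wikipedia's markup, and the same XPath should be usable from PHP's DOMXPath:

# Sketch: take the first non-empty <p> under the article body as the lead paragraph.
import requests
from lxml import html

def first_paragraph(url):
    page = html.fromstring(requests.get(url, timeout=10).content)
    for p in page.xpath('//div[@id="mw-content-text"]//p'):
        text = p.text_content().strip()
        if text:                                   # skip empty or boilerplate paragraphs
            return text
    return None

print(first_paragraph("https://en.wikipedia.org/wiki/Web_crawler"))  # example article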
We've got a PHP crawler running on our web server. When the crawler is running, there are no CPU, memory, or network bandwidth spikes. Everything is normal. But our website (also PHP), hosted on the same server, stops responding. Basically, the crawler blocks any other PHP script from running.
What could be the problem?
EDIT:
** fsockop...
Hi, I have built a simple thread-pool-based web crawler within my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and its meta content. Here's the problem: when I run the crawler from a debug server instance of Visual Studio Express, and provide the starting instance as the ...
Hey,
I'm trying to write to my log files while running a rake task. It works fine in development mode, but as soon as I switch to the production environment, nothing is written to the log files.
I read here
http://stackoverflow.com/questions/1022093/how-do-i-use-a-custom-log-for-my-rake-tasks-in-ruby-on-rails/1648159#1648159
that thi...
Hi,
I want to know where the crawled files are stored in the Heritrix web crawler...
Thanks in advance.
...
Hi, I have just started learning Ruby on Rails and I really want to do a project on web spiders. I am really looking for good tutorials in Ruby on Rails. Could you help? Thanks.
...
I'm currently developing an app that goes through all the files on a server and checks every single href to see whether it is valid or not. Using a WebClient or an HttpWebRequest/HttpWebResponse is overkill for this because it downloads the whole page each time, which is useless; I only need to check if the link do...
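The direction I'm thinking of is sending HEAD requests instead of full GETs; a minimal Python sketch of the idea follows (the 405 fallback is an assumption about servers that refuse HEAD, and in .NET the equivalent would be setting Method = "HEAD" on the HttpWebRequest):

# Sketch: validate a link without downloading the response body.
import requests

def link_is_valid(url):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=5)
        if resp.status_code == 405:                # server refuses HEAD, fall back to a streamed GET
            resp = requests.get(url, stream=True, timeout=5)
        return resp.status_code < 400
    except requests.RequestException:
        return False

print(link_is_valid("http://example.com/"))        # placeholder link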
Hi, I am new to the world of Ruby and Rails.
I have seen Railscasts episode 190 and I just started playing with it. I used SelectorGadget to find the CSS and XPath selectors.
I have the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.telegraph.co.uk/sport/football/rss"
doc = Nokogiri::HTML(open...
I am doing my own visitor tracking with special features that neither Google Analytics nor any other service can provide, as it is customized. I was calling this function near the end of my script, but quickly ran into our clients having thousands of pages called by bots (I assume Google), and my table filled up with around 1,000,000 u...
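What I'm considering as a first filter is skipping the tracking call when the User-Agent looks like a crawler; a tiny sketch (shown in Python just to convey the idea; the keyword list is an assumption and will miss bots that send a browser-like User-Agent):

# Sketch: don't record a visit when the User-Agent matches common crawler keywords.
BOT_KEYWORDS = ("bot", "crawl", "spider", "slurp", "mediapartners")

def is_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(keyword in ua for keyword in BOT_KEYWORDS)

print(is_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_bot("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"))                           # False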
Hey folks, every once in a while I have the need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is lots of software for that, and online services).
Anyway, as a follow-up to my previous question, I've written a little web crawler that can...
Hello,
I was wondering if anyone has ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...
I am using the following rule:
rules = (
    Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
         follow=True,
         callback='parse_article'),
)
(having in mind ...
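For reference, the fallback I'm aware of is XMLFeedSpider, which walks the feed's <item> nodes directly instead of extracting links from the markup; a sketch with a placeholder spider name and feed URL (the import path is scrapy.contrib.spiders in older Scrapy releases):

import scrapy
from scrapy.spiders import XMLFeedSpider           # scrapy.contrib.spiders in older versions

class ArticleSpider(XMLFeedSpider):
    name = "articles"
    start_urls = ["http://example.com/feed.rss"]   # placeholder feed URL
    itertag = "item"                               # iterate over each RSS <item>

    def parse_node(self, response, node):
        # Follow each item's <link> to the article page.
        links = node.xpath("link/text()").extract()
        if links:
            yield scrapy.Request(links[0], callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url,
               "title": response.xpath("//title/text()").extract()}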
I am looking for an application or a script, either for Linux or Mac OS X, to retrieve the HTML files of a set of URLs (only the root HTML file), save them to a file and convert the HTML to an image. Anyone? Thanks in advance.
...
In Facebook, when I go to the "Find Friends" page, it tells me to put in the login information for my email account, e.g. Yahoo, Gmail, Hotmail, etc. After that, it gets all the contacts from my email account. How does it do this? I want to design my own system like this, but I'm not sure where to start. Any help will be appreciated.
...
I need to run a web crawler and I want to do it from EC2 because I want the HTTP requests to come from different IP ranges so I don't get blocked. So I thought distributing this on EC2 instances might help, but I can't find any information about what the outbound IP range will be. I don't want to go to the trouble of figuring out the e...
I want to hit every page on my internal website to see if any of them throw an error just from being viewed. The website does its own error logging, so I just need something to follow links.
I am running Windows XP and IIS.
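Something as small as the breadth-first fetcher below would probably be enough, since the site logs its own errors; a Python sketch where the start URL is a placeholder and only same-host <a href> links are followed:

# Sketch: request every reachable page on one host so the server-side logging fires.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

START_URL = "http://intranet.example/"         # placeholder: the internal site
host = urlparse(START_URL).netloc

seen, queue = {START_URL}, deque([START_URL])
while queue:
    url = queue.popleft()
    resp = requests.get(url, timeout=10)
    print(resp.status_code, url)
    if "text/html" not in resp.headers.get("Content-Type", ""):
        continue
    for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            queue.append(link)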
...
Hi guys,
I have to crawl the contents of several blogs. The problem is that I need to classify whether the blog authors are from a specific school and are talking about the school's stuff. May I know what's the best approach to the crawling, and how should I go about the classification?
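For the classification part, the direction I'm leaning towards is hand-labelling a small sample of crawled posts and training a simple bag-of-words classifier; a toy scikit-learn sketch (the training texts and labels below are made up):

# Sketch: flag school-related posts with a TF-IDF + Naive Bayes pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["campus exam schedule posted", "our school canteen menu",
               "my holiday in spain", "new phone review"]        # toy examples
train_labels = [1, 1, 0, 0]                                      # 1 = about the school

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the school library opens late during exams"]))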
...
Hey guys, I'm trying to do some data mining and analyze data based on locations.
For this site, http://www.dianping.com/shop/1898365
I am trying to figure out the latitude and longitude by crawling, but I can't seem to figure out where this information is stored. Can someone give me some pointers?
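My guess is that the coordinates live in inline JavaScript rather than in the visible HTML, so one thing I might try is scanning the raw source for latitude/longitude-looking assignments; a rough sketch (both the regex and the variable names it targets are assumptions; view-source on the page would show the real ones):

# Sketch: look for lat/lng-style assignments in the raw page source.
import re
import requests

source = requests.get("http://www.dianping.com/shop/1898365", timeout=10).text
pattern = r'(?:lat|lng|latitude|longitude)["\']?\s*[:=]\s*["\']?(-?\d{1,3}\.\d+)'
print(re.findall(pattern, source, flags=re.IGNORECASE))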
...
I am getting a "DOS" instead of the HTML string ...
open System.Net

let getHtmlBasic (uri : System.Uri) =
    use client = new WebClient()
    client.DownloadString(uri)

let uri = System.Uri("http://www.b-a-r-f.com/")
getHtmlBasic uri
This gives the string "DOS".
Lol, what the...?
All other websites seem to work ...
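One thing I still want to rule out is whether the body depends on the request headers, i.e. whether the site serves that stub when there is no browser-like User-Agent; a quick comparison sketch (written in Python for brevity; that headers are the cause here is only a guess):

# Sketch: compare the body returned with and without a browser-like User-Agent.
import requests

url = "http://www.b-a-r-f.com/"
plain = requests.get(url, timeout=10)
browserlike = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
print(len(plain.text), repr(plain.text[:80]))
print(len(browserlike.text), repr(browserlike.text[:80]))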
...