web-crawler

How to limit concurrent connections used by cURL

I made a simple web crawler using PHP (and cURL). It parses roughly 60,000 HTML pages and retrieves product information (it's a tool on an intranet). My main concern is the concurrent connections. I would like to limit the number of connections, so that whatever happens, the crawler never uses more than 15 concurrent connections. The serv...
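One way to enforce that cap: in PHP this maps to curl_multi with a rolling window of at most 15 active handles; a minimal sketch of the same idea in Python, where the worker-pool size is the concurrency limit (the URL list is hypothetical):

```python
# Sketch: cap a crawl at 15 concurrent connections.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["http://intranet.example/product/%d" % i for i in range(100)]  # hypothetical

def fetch(url):
    # One worker = one open connection; the executor never runs
    # more than max_workers fetches at the same time.
    with urllib.request.urlopen(url, timeout=30) as resp:
        return url, resp.read()

with ThreadPoolExecutor(max_workers=15) as pool:  # the hard limit
    for url, body in pool.map(fetch, URLS):
        print(url, len(body))
```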

Using a web crawler for price comparison

I need an open-source, Java-based web crawler which I can extend for price comparison. How do I do the price comparison? Is there any open-source code for that? ...
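Whatever crawler is used, the comparison step itself is mostly per-shop extraction plus normalization. A rough sketch of that part only; the URLs and the price pattern are made-up assumptions, since every real shop needs its own extraction rule:

```python
# Sketch: pull a price out of each product page, then compare.
import re
import urllib.request

PRICE_RE = re.compile(r"(\d+(?:[.,]\d{2}))\s*(?:EUR|USD|\$)", re.I)  # assumed format

def extract_price(url):
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    m = PRICE_RE.search(html)
    return float(m.group(1).replace(",", ".")) if m else None

offers = {u: extract_price(u)
          for u in ["http://shop-a.example/p1", "http://shop-b.example/p1"]}
# Assumes at least one page yielded a price.
print("cheapest:", min((p, u) for u, p in offers.items() if p is not None))
```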

Inject and index a single url with Nutch

Hello; I want to inject a single URL into the crawldb as a string, not a urlDir. I'm thinking of adding a modified version of the Injector.inject method that takes the URL as a string parameter, but I can't inject the string URL into the crawldb; I guess the current injector uses the FileInput... from Hadoop. How can I do this? I also tested crawling the url...
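Since Nutch's injector reads a directory of seed files, one workaround that avoids touching the Injector code is to wrap the single URL string in a temporary seed directory and run the standard inject command. A sketch of that wrapper (the crawldb and nutch paths are assumptions):

```python
# Workaround sketch: single URL string -> temp seed dir -> normal inject.
import os
import subprocess
import tempfile

def inject_single_url(url, crawldb="crawl/crawldb", nutch="bin/nutch"):
    seed_dir = tempfile.mkdtemp(prefix="seed-")
    with open(os.path.join(seed_dir, "seed.txt"), "w") as f:
        f.write(url + "\n")
    # Equivalent to: bin/nutch inject crawl/crawldb <seed_dir>
    subprocess.check_call([nutch, "inject", crawldb, seed_dir])

inject_single_url("http://example.com/")
```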

How do I create a web crawler in ASP.NET?

I am wondering if there is a way to make a web bot/crawler for a website in ASP.NET. I have to grab information from one of our payment providers, but they do not have an API, so currently the only way to grab the information automatically is to log in to their page, fill out a form, and retrieve the information. Is there any...
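The usual approach is to script the login POST and reuse the session cookie for the follow-up form submission. A sketch of that flow in Python; in ASP.NET the same steps map onto HttpWebRequest with a CookieContainer. The URLs and field names here are placeholders; the real ones come from the provider's login form:

```python
# Sketch of a scripted login + form submission.
import requests

session = requests.Session()  # keeps the auth cookie between requests
session.post("https://provider.example/login",
             data={"username": "me", "password": "secret"})
report = session.post("https://provider.example/reports",
                      data={"from": "2010-01-01", "to": "2010-01-31"})
print(report.text)  # the HTML to scrape the payment info from
```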

Web crawler: parsing PHP/JavaScript links?

I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, JavaScript, etc. For some sites, the links are in PHP, and when my web crawler tries to navigate to these, it fails. One examp...
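A common pattern is to normalize every href before crawling it: resolve relative paths against the page URL, and skip fragment-only and javascript: links, since there is nothing fetchable behind them. A small sketch of that filter (function name is mine):

```python
# Sketch: clean up raw href values before queueing them.
from urllib.parse import urljoin, urlparse

def normalize_link(page_url, href):
    href = href.strip()
    if not href or href.startswith("#") or href.lower().startswith("javascript:"):
        return None  # nothing crawlable behind these
    absolute = urljoin(page_url, href)  # "/extra/url/to/base.html" -> full URL
    return absolute if urlparse(absolute).scheme in ("http", "https") else None

print(normalize_link("http://site.example/a/b.php", "/extra/url/to/base.html"))
print(normalize_link("http://site.example/a/b.php", "#"))                   # None
print(normalize_link("http://site.example/a/b.php", "javascript:void(0)"))  # None
```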

Which web crawler to use to save news articles from a website into .txt files?

Hi, I am currently in dire need of news articles to test an LSI implementation (it's in a foreign language, so there isn't the usual pack of ready-to-use files). So I need a crawler that, given a starting URL, let's say http://news.bbc.co.uk/, follows all the contained links and saves their content into .txt files. If we could specify the ...
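If no off-the-shelf crawler fits, the job is small enough to sketch directly. A minimal breadth-first version using BeautifulSoup; the depth cap and the same-host restriction are my assumptions to keep the crawl bounded:

```python
# Sketch: BFS crawl from a start page, saving visible text as .txt files.
import hashlib
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

start = "http://news.bbc.co.uk/"
queue, seen = deque([(start, 0)]), {start}
while queue:
    url, depth = queue.popleft()
    try:
        html = urllib.request.urlopen(url, timeout=30).read()
    except OSError:
        continue
    soup = BeautifulSoup(html, "html.parser")
    name = hashlib.md5(url.encode()).hexdigest() + ".txt"
    with open(name, "w", encoding="utf-8") as f:
        f.write(soup.get_text(" ", strip=True))  # visible text only
    if depth < 2:  # arbitrary depth cap
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(start).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```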

How do search engines crawl websites?

I am creating a multilingual web site and I use a resource manager for each language. When a user selects a language, all pages use the selected resource bundles. As the entire site is only available in one language at a time, how do search engines crawl the other languages? Or do search engines crawl the optionally provided languages? ...

JSP page import problem: class file placed in a package inside WEB-INF/classes

I have a web application, crawler_GUI, running, which has another Java project, jspider, in its build path (I use Eclipse Galileo). The GUI uses the jspider project as its backend. See http://i45.tinypic.com/avmszn.jpg for the structure. The JSP creates an instance of the jspider object. First of all, I didn't have the classes in the WEB-I...

Web crawling help required

Hi, I am completing a little hobby project of mine to create a small-scale search engine. I was wondering if anyone knows of a decent, robust, open-source web crawler that they have used? It should be easy for a noob to set up and use. Thank you for not Googling web crawlers and pasting a list. ...

How to crawl entire Wikipedia?

I've tried the WebSphinx application. I realize that if I put wikipedia.org as the starting URL, it will not crawl further. Hence, how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to specifically go and find those URLs and put in multiple starting URLs? Does anyone have suggestions for a good website with the tutori...
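For Wikipedia specifically, the usual advice is not to crawl the live site at all: the official database dumps at https://dumps.wikimedia.org/ contain every article. A sketch of streaming one dump file instead of crawling; the file name is the standard dump naming, assumed downloaded beforehand, and the XML namespace version varies per dump:

```python
# Sketch: iterate article titles straight out of a pages-articles dump.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"          # downloaded beforehand
NS = "{http://www.mediawiki.org/xml/export-0.10/}"     # check your dump's version

with bz2.open(DUMP) as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            print(elem.findtext(NS + "title"))
            elem.clear()  # keep memory flat on a multi-GB file
```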

Search Engines Crawling Question

The main page of my site is /home.php. This page has pagination with anchor tags that link to many other queries of the same page, for example /home.php?start=4, /home.php?start=8, and so on. My question is: when I include the home.php page in a sitemap, will crawlers crawl whatever pages home.php links to (e.g. /home.php?start=4)? Or d...

PHP file got executed by the Alexa crawler and caused problems

I've written a script that will be used to release new pages automatically at a particular time. It just shows a countdown timer, and when it reaches 0 it renames a particular file to index.php and renames the current index.php to index-modified.php. There's no problem with this. But at some point my customer told tha...
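The root problem is that the rename is a side effect of whoever happens to load the page, so any crawler hit can trigger it. A safer pattern is to make the release a pure function of the server clock. A sketch; the release time and the "index-new.php" staging name are my assumptions (the question only names index.php and index-modified.php):

```python
# Sketch: release by server time, not by page visits.
import os
from datetime import datetime

RELEASE_AT = datetime(2010, 6, 1, 0, 0)  # hypothetical release time

def maybe_release():
    if datetime.now() >= RELEASE_AT and os.path.exists("index-new.php"):
        os.rename("index.php", "index-modified.php")
        os.rename("index-new.php", "index.php")

maybe_release()  # run from cron or on each request; never from the countdown page
```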

Ruby open-uri returns an error when opening a PNG URL

I am making a crawler that parses images in the Gantz manga at http://manga.bleachexile.com/gantz-chapter-1.html and onward. I had success until my crawler tried to open an image (in chapter 273): bad URI(is not URI?): http://static.bleachexile.com/manga/gantz/273/Gantz[0273]_p001[Whatever-Illuminati].png But this URL is valid, I guess, because I...
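The square brackets in the filename are what the URI parser rejects; percent-encoding the URL before opening it fixes this. The idea sketched in Python (in Ruby, escaping the URL, e.g. with URI.escape or the addressable gem, does the same job):

```python
# Sketch: encode the characters a strict URI parser rejects.
from urllib.parse import quote

raw = ("http://static.bleachexile.com/manga/gantz/273/"
       "Gantz[0273]_p001[Whatever-Illuminati].png")
safe = quote(raw, safe=":/")  # encodes [ and ] but keeps the scheme and slashes
print(safe)  # ...Gantz%5B0273%5D_p001%5BWhatever-Illuminati%5D.png
```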

How to build a web crawler to find a specific advert, which is in an iframe loaded by Javascript

I'm trying to find all instances of an advert on a website. The advert is in an iframe which is loaded by JavaScript (it doesn't appear at all if JavaScript is turned off). Detecting the advert itself is extremely simple: both the name of the Flash file and the target of the href always contain a certain string. What would be the best "...
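Because the iframe only exists after JavaScript runs, a plain HTTP fetch will never see it; a crawler that drives a real browser engine will. A sketch using Selenium; the marker string and URL are placeholders:

```python
# Sketch: render the page in a browser, then inspect the iframes.
from selenium import webdriver
from selenium.webdriver.common.by import By

MARKER = "campaign1234"  # the string the advert's file/href always contains

driver = webdriver.Firefox()
driver.get("http://site-to-audit.example/")
hits = [f.get_attribute("src")
        for f in driver.find_elements(By.TAG_NAME, "iframe")
        if MARKER in (f.get_attribute("src") or "")]
print("advert instances:", hits)
driver.quit()
```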

What should I know about search engine crawling?

I don't mean SEO things. What should I know? For example: do engines run JavaScript? Do they use cookies? Will cookies carry across crawl sessions (say, cookies from today and a crawl next week or next month)? Are selected JS files not loaded for some reason (such as a suspected ad being ignored for optimization reasons)? I don't want to acc...

Language/libraries for downloading & parsing web pages?

What language and libraries are suitable for a script to parse and download small numbers of web resources? For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page containing the playlist. I want to write a script to run regularly and parse the relevant pag...
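Python with urllib plus BeautifulSoup is one common fit for exactly this job. A sketch of the whole loop; the playlist page URL is a placeholder:

```python
# Sketch: scrape a playlist page and download any new .mp3 files.
import os
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

PAGE = "http://example.org/podcast/"  # placeholder playlist page

html = urllib.request.urlopen(PAGE, timeout=30).read()
for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
    if a["href"].lower().endswith(".mp3"):
        url = urljoin(PAGE, a["href"])
        name = url.rsplit("/", 1)[-1]
        if not os.path.exists(name):  # crude "already fetched" check
            urllib.request.urlretrieve(url, name)
            print("fetched", name)
```

Run from cron (or any scheduler), this covers the "regularly" part too.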

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; this will instead be configurable in a GUI. How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamical...
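One simple route is a single Spider class whose configuration is set in __init__ from spider arguments, which a GUI can supply at launch time. A sketch under that assumption:

```python
# Sketch: one Spider class, configured per run instead of hard-coded.
import re
import scrapy

class ConfigurableSpider(scrapy.Spider):
    name = "configurable"

    def __init__(self, start_url, allowed, pattern, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [allowed]
        self.pattern = re.compile(pattern)  # the user's allowed-URL regex

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            if self.pattern.search(href):
                yield response.follow(href, callback=self.parse)

# e.g.: scrapy runspider spider.py -a start_url=http://example.com \
#       -a allowed=example.com -a pattern="/product/"
```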

Taking too long to load a page with HttpWebResponse

I'm trying to access information on a webpage; it's the first time I've done this. The problem is that it is too slow. I'm doing this on only one page, which loads very fast in a browser but takes forever here. The only thing I need is the HTML behind the page, so I have to ask: is my code somehow downloading the images? Any help would ...
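A single HTTP GET returns only the HTML document; images are separate requests a browser makes afterwards, so a bare HttpWebResponse is not downloading them. A quick way to confirm where the time goes is to time the same single fetch outside the browser (placeholder URL):

```python
# Sketch: time one raw GET to isolate the slow step.
import time
import urllib.request

t0 = time.time()
body = urllib.request.urlopen("http://slow-page.example/", timeout=60).read()
print(len(body), "bytes of HTML in", round(time.time() - t0, 2), "s")
# If this is also slow, suspect DNS, proxy auto-detection (a classic
# .NET HttpWebRequest slowdown), or the server, not image downloads.
```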

Determining an a priori ranking of what sites a user has most likely visited

This is for http://cssfingerprint.com. I have a largish database (~100M rows) of websites. This includes both main domains (both 2LD and 3LD) and particular URLs scraped from those domains (whether hosted there [like most blogs] or only linked from them [like Digg], and with a reference to the host domain). I also scrape the Alexa top mil...

How to store data crawled from a website

I want to crawl a website and store the content on my computer for later analysis. However, my OS's file system has a limit on the number of subdirectories, meaning that storing the original folder structure is not going to work. Suggestions? Map each URL to some filename so it can be stored flatly? Or just shove it into a database like SQLite to avoi...
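Both ideas combine well: hash each URL to a flat filename and keep the URL-to-file mapping in SQLite, so nothing depends on the site's directory depth. A minimal sketch of that layout (the "pages/" directory and table schema are my choices):

```python
# Sketch: flat file store keyed by URL hash, indexed in SQLite.
import hashlib
import os
import sqlite3

os.makedirs("pages", exist_ok=True)
db = sqlite3.connect("crawl.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, path TEXT)")

def store(url, content):
    path = os.path.join("pages", hashlib.sha1(url.encode()).hexdigest() + ".html")
    with open(path, "wb") as f:
        f.write(content)
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, path))
    db.commit()

store("http://example.com/a/very/deep/dir/page.html", b"<html>...</html>")
```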