web-crawler

How would you pick the best image from a webpage in a crawler?

If you were given a random webpage on the internet and had only the HTML source, what method would you use to pick the single image that best describes that webpage? Assume that there are no meta tags or hints. Facebook does something similar when you post a link, but they give you a choice of n images to choose from, they d...
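A common starting point is a size-and-shape heuristic over the page's img tags: big, roughly rectangular images are usually content, while tiny or extremely elongated ones are icons and banners. Below is a minimal sketch of that idea, assuming the requests and beautifulsoup4 packages are available; the GIF filter and the aspect-ratio cutoff are illustrative assumptions, not established constants.

```python
# A minimal sketch: score <img> tags by their declared size. Images
# without declared width/height are ignored here; a fuller version
# would download candidates and measure them.
import requests
from bs4 import BeautifulSoup

def best_image(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    best, best_area = None, 0
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if not src or src.endswith(".gif"):  # spacers and ads are often GIFs
            continue
        try:
            w = int(img.get("width", 0))
            h = int(img.get("height", 0))
        except ValueError:
            continue  # percentage or malformed dimensions
        # Skip banner-like shapes: extreme aspect ratios are rarely content
        if w and h and (w / h > 4 or h / w > 4):
            continue
        if w * h > best_area:
            best, best_area = src, w * h
    return best
```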

Obtaining images using mediawiki apis

I'm trying to follow the tutorials from MediaWiki. One of the examples they use is http://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=images So I am wondering: how would I convert File:1919 eclipse positive.jpg into the actual link to the file? ...
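The API itself can do this conversion: querying the File: title with prop=imageinfo&iiprop=url returns the direct URL of the file. A small standard-library sketch (the User-Agent string is a placeholder; identify your client honestly):

```python
# Resolve a MediaWiki "File:" title to its direct URL via prop=imageinfo.
import json
import urllib.parse
import urllib.request

def file_url(title):
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": title,  # e.g. "File:1919 eclipse positive.jpg"
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "example-crawler/0.1"},  # placeholder
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for page in data["query"]["pages"].values():
        return page["imageinfo"][0]["url"]
```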

How can I make a function that repeats itself until it finds ALL the information?

I want to create a PHP function that goes through a website's homepage, finds all the links on the homepage, follows the links that it finds, and keeps going until all the links on said website are final. I really need to build something like this so I can spider my network of sites and supply a "one stop" for searching. Here's what...
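The usual shape of this is not literal recursion but a work queue plus a "seen" set: the loop terminates exactly when every reachable same-site link has been visited. A sketch in Python of that pattern (a PHP version using curl and DOMDocument is structurally identical):

```python
# A queue of pages to visit plus a "seen" set; an empty queue means
# every same-site link has been followed, i.e. all links are "final".
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start):
    host = urlparse(start).netloc
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; skip it
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```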

Building vertical crawler using Bixo

Hi, I came across an open source crawler, Bixo. Has anyone tried it? Could you please share what you learned? Could we build a directed crawler with it easily (compared to Nutch/Heritrix)? Thanks, Nayn ...

How to retrieve the set of rules defined in scope in a custom post-processor module in Heritrix

Hi, I am using the Heritrix API for crawling web sites. I am facing a problem writing my own post processor similar to LinksScoper. The LinksScoper class that the Heritrix API provides uses isInScope(CandidateURI) to check whether a CandidateURI is in scope or not, but it applies all the rules in one shot. Is there a way to write my own post process...

Python function based on Scrapy to crawl entirely a web site

Hi, I recently discovered Scrapy, which I find very efficient. However, I really don't see how to embed it in a larger project written in Python. I would like to create a spider in the normal way but be able to launch it on a given URL with a function start_crawl(url), which would launch the crawling process on a given domain and stop o...
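With a recent Scrapy this is what CrawlerProcess is for: it runs a spider from plain Python code and blocks until the crawl finishes. A minimal sketch, where MySpider stands in for your own spider class; note that a CrawlerProcess can only be started once per interpreter, so this suits a script rather than a long-lived service.

```python
# A minimal start_crawl(url) wrapper around Scrapy's CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse

class MySpider(scrapy.Spider):
    name = "myspider"

    def parse(self, response):
        # Extract data and yield items / follow-up requests here.
        yield {"url": response.url}

def start_crawl(url):
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(
        MySpider,
        start_urls=[url],
        allowed_domains=[urlparse(url).netloc],  # stop at the domain boundary
    )
    process.start()  # blocks until the crawl finishes
```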

How to tell if a web request is coming from Google's crawler?

From the HTTP server's perspective. ...
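Checking the User-Agent header alone is spoofable. Google's documented method is a two-step DNS check: reverse-resolve the client IP, verify the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch using only the standard library:

```python
# Verify Googlebot by reverse DNS plus forward confirmation.
import socket

def is_googlebot(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirmation
    except socket.gaierror:
        return False
    return ip in addrs
```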

The best way to store a large set of urls for crawler

I'm writing a custom-built crawler and need to know whether a specific URL has been crawled or not, so I won't add the same URL twice. Right now I'm using MySQL to store hash values of each URL, but I'm wondering if this may become very slow if I have a large set of URLs, say hundreds of millions. Are there other ways to store URLs? Do people us...
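At hundreds of millions of URLs, a common alternative to one database lookup per URL is an in-memory Bloom filter: a few hash operations per membership test, at the cost of a small false-positive rate (the crawler may very occasionally skip a URL it has not actually seen, which is usually acceptable). A minimal sketch; the default bit-array size and hash count are illustrative, and a real deployment would size the array at roughly 10 bits per expected URL for about a 1% false-positive rate.

```python
# A minimal Bloom filter over URL strings.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 27, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive num_hashes independent positions from one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```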

Can SWF apps running in the browser be automatically controlled/spidered, like in browser automation?

Suppose the owner of a website that shows info "for humans only" is tired of bots and spiders grabbing the data and decides to serve this info in a SWF app running in the browser. So now he reimplements the structure of the website as a Flash app, and the bad guys can no longer navigate it using their url-following, html-parsing sc...

PHP-Based Web Crawler or Java-Based Web Crawler

Hey, I have some doubts about PHP-based web crawlers: can they run like the Java thread-based ones? I am asking because, in Java, a thread can be executed again and again, and I don't think PHP has anything like a thread function. Can you please say which web crawler will be more useful, a PHP-based or a Java-based one? ...

What is the use of Lucene?

Hey friends, I have heard the name Lucene a lot; while I try to fetch details of web crawlers, it shows up most of the time. What's the use of Lucene? ...

A web crawler in Python: where should I start and what should I follow? Help needed

I have an intermediate knowledge of Python. If I have to write a web crawler in Python, what should I follow and where should I begin? Is there any specific tutorial? Any advice would be of much help. Thanks ...
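Two things worth getting right from the very first version are robots.txt and a per-request delay; both are covered by the standard library. A polite starting skeleton (the agent string is a placeholder, and link extraction and queueing are left out):

```python
# A polite starting point: consult robots.txt before fetching and
# pause between requests.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

AGENT = "my-crawler/0.1"  # placeholder; identify your bot honestly

def allowed(url):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()
    return rp.can_fetch(AGENT, url)

def fetch_all(urls, delay=1.0):
    for url in urls:
        if allowed(url):
            yield url, urlopen(url, timeout=10).read()
        time.sleep(delay)  # be gentle with each host
```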

HttpWebRequest in page fetcher slows down

I have a crawler that downloads pages and processes them. After one hour, every request to the sources takes a minute or more to complete, but at the start of the program every address is downloaded in under a second. I suspected that the destination web sites were limiting my requests or traffic, but when I close the program and run it again, performance return...

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation. I have some HTML with tags and everything, and I want to search the HTML for every URL. I'm doing it now by checking where it says 'h', then 't', then 't', then 'p', but I don't think that is a great solution. Any good ideas? Added: I'm looking for some kind of pseudocode but, just i...
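The character-by-character scan is exactly what a regular expression does for you: match whole http(s)://... runs in one pass. Sketched below in Python as the requested pseudocode; java.util.regex.Pattern/Matcher accepts essentially the same pattern, and for markup-aware extraction an HTML parser reading href attributes is the sturdier route.

```python
# Match whole URLs instead of scanning for 'h','t','t','p' by hand.
import re

URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def find_urls(html):
    return URL_RE.findall(html)
```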

Web crawling user-authenticated websites

Hello everyone, is it possible to crawl user-authenticated websites using C#? Thanks ...
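Generally yes, provided the crawler logs in once and replays the session cookie on every subsequent request; in C# that means sharing one CookieContainer across your HttpWebRequest objects. The pattern, sketched in Python with the requests library (the URL and form field names are placeholders for the target site's real login form):

```python
# Log in once, then reuse the session cookie for every later request.
import requests

with requests.Session() as session:
    session.post(
        "https://example.com/login",  # placeholder login endpoint
        data={"username": "user", "password": "secret"},  # placeholder fields
    )
    # The session now carries the auth cookie automatically.
    page = session.get("https://example.com/members-only")
    print(page.status_code)
```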

How to optimize this ugly code?

I asked another question here the other day, but in the end I decided to do it myself for reasons of time; now I have a little more time to fix it :D I liked jsoup, but I'm kind of old school and prefer doing it myself (thanks to @Bakkal anyway). I managed to make this code, and it works fine for now, but if a webpage is not well constructed ...

Web indexer using Java

Is a parallel system or a distributed system good for a web site crawler and web indexer developed in Java? If so, which frameworks are available? ...

Web crawler in Java

Possible Duplicate: What is a good Java web crawler library? Hello, I need to crawl some websites, say some 1000 websites. I need an open source web crawler in Java which can be customized. Can anyone suggest a good one? ...

Good websites to test webcrawler on

I'm testing out a new webcrawler and I'm looking for some good websites that might trip it up (redirects, frames, anything). Does anybody know of some really complicated sites, or ones that might trip it up? Thanks ...

What should be the initial list of URLs for a crawler to start its work?

I want a list of URLs from which my crawler can start crawling efficiently, so that it can cover a maximum part of the web. Do you have any ideas on how to create an initial index for different hosts? Thank you ...