If you were given a random webpage from the internet and had only the HTML source, what method would you use to pick the single image that best describes that page? Assume that there are no meta tags or hints.
Facebook does something similar when you post a link, but they give you a choice of n images to choose from; they d...
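A common heuristic, absent any meta tags, is to parse the HTML and prefer the image with the largest declared area. A minimal sketch in Python; the scoring rule is an assumption for illustration, not what Facebook actually does:

```python
from html.parser import HTMLParser

class ImageCandidateFinder(HTMLParser):
    """Collect <img> tags with their declared width/height."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # (area, src) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src")
        if not src:
            return
        try:
            area = int(a.get("width", 0)) * int(a.get("height", 0))
        except ValueError:
            area = 0  # non-numeric sizes like "100%"
        self.candidates.append((area, src))

def best_image(html):
    """Return the src of the image with the largest declared area."""
    finder = ImageCandidateFinder()
    finder.feed(html)
    return max(finder.candidates, default=(0, None))[1]
```

In practice you would also download the candidates and read their real dimensions, since many pages omit width/height attributes, and filter out obvious icons and spacer images.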
I'm trying to follow the tutorials from MediaWiki.
One of the examples they used is
http://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=images
So I am wondering: how would I convert
File:1919 eclipse positive.jpg
into the actual link to the file?
...
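The robust route is to ask the API itself: querying the file's title with prop=imageinfo&iiprop=url returns the full URL directly. Under the hood, MediaWiki stores uploads at a path derived from the MD5 hash of the underscored filename, which this sketch reproduces; the Commons base URL is an assumption (locally hosted files live under a different base, which is exactly why the imageinfo query is safer):

```python
import hashlib
from urllib.parse import quote

def file_title_to_url(title,
                      base="https://upload.wikimedia.org/wikipedia/commons"):
    """Map 'File:1919 eclipse positive.jpg' to its upload URL.

    MediaWiki stores uploads as <base>/<d>/<dd>/<name>, where <d><dd>
    are the first hex digits of the MD5 of the underscored filename.
    """
    name = title.split(":", 1)[1].replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "%s/%s/%s/%s" % (base, digest[0], digest[:2], quote(name))
```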
I want to create a PHP function that goes through a website's homepage, finds all the links on it, follows each link it finds, and keeps going until every link on said website has been visited. I really need to build something like this so I can spider my network of sites and supply a "one stop" for searching.
Here's what...
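The first building block is extracting the same-site links from a single page. A sketch of that step in Python (PHP's DOMDocument plus curl follows the same shape); the function names are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's URL.
                self.links.add(urljoin(self.base_url, href))

def same_site_links(html, base_url):
    """Return only the links that stay on base_url's host."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    host = urlparse(base_url).netloc
    return {u for u in extractor.links if urlparse(u).netloc == host}
```

Around this you put a queue of pages to visit and a set of pages already seen, so the same URL is never fetched twice.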
Hi,
I came across an open-source crawler, Bixo.
Has anyone tried it? Could you please share what you learned? Can a directed crawler be built with reasonable ease (compared to Nutch/Heritrix)?
Thanks
Nayn
...
Hi,
I am using the Heritrix API for crawling web sites. I am facing a problem writing my own post processor similar to LinksScoper.
The LinksScoper class that the Heritrix API provides uses isInScope(CandidateURI) to check whether a CandidateURI is in scope, but it applies all the rules in one shot.
Is there a way to write my own post process...
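I can't speak to a Heritrix-specific hook, but the usual way to get per-rule answers instead of a single yes/no verdict is to walk the rule chain yourself and stop at the first rejection. A generic, hypothetical illustration (these names are not Heritrix API):

```python
def first_rejecting_rule(url, rules):
    """Apply scope rules one at a time instead of as a single verdict.

    `rules` is a list of (name, predicate) pairs -- hypothetical
    stand-ins for individual scope rules; a predicate returns True
    to accept the URL. Returns the name of the first rule that
    rejects it, or None when every rule accepts.
    """
    for name, accepts in rules:
        if not accepts(url):
            return name
    return None
```

Knowing which rule rejected a URL lets a post processor log the reason or apply a different recovery per rule, which a single in-scope boolean cannot.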
Hi, I recently discovered Scrapy, which I find very efficient. However, I really don't see how to embed it in a larger project written in Python. I would like to create a spider in the normal way but be able to launch it on a given URL with a function
start_crawl(url)
which would launch the crawling process on a given domain and stop o...
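Scrapy's own embedding hook is scrapy.crawler.CrawlerProcess: call process.crawl(MySpider, start_urls=[url]) and then process.start(), which blocks until the crawl finishes. As a framework-free sketch of the desired start_crawl(url) shape (fetching and link extraction are injected, so nothing here is Scrapy API):

```python
from collections import deque
from urllib.parse import urlparse

def start_crawl(url, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl of one domain; returns {url: body}.

    Stops when the frontier is empty or max_pages is reached.
    `fetch(url)` returns a page body and `extract_links(body, url)`
    returns absolute URLs -- both are passed in, so the sketch stays
    network-free and testable.
    """
    domain = urlparse(url).netloc
    seen, frontier, pages = {url}, deque([url]), {}
    while frontier and len(pages) < max_pages:
        current = frontier.popleft()
        pages[current] = body = fetch(current)
        for link in extract_links(body, current):
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```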
From the HTTP server's perspective.
...
I'm writing a custom built crawler, and need to know if a specific url is crawled or not, so I won't add the same url twice. Right now I'm using mysql to store hash values of each url. But I'm wondering if this may become very slow if I have a large set of urls, say, hundreds of millions.
Are there other ways to store URLs? Do people us...
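A common answer at that scale is a Bloom filter: constant memory, no false negatives, and a small tunable false-positive rate. A minimal sketch with illustrative sizes:

```python
import hashlib

class BloomFilter:
    """Constant-memory 'have I seen this URL?' set: no false
    negatives, a small tunable false-positive rate."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Double hashing: derive k bit positions from the two
        # 64-bit halves of a single MD5 digest.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

A false positive only means a URL is wrongly skipped, which is often acceptable for a crawler; if not, the filter can serve as a fast gate in front of the slower exact lookup in MySQL.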
Suppose the owner of a website that shows info "for humans only" is tired of the bots and spiders grabbing the data and decides to show this info in a SWF app running in the browser. So now he reimplements the structure of the website as a Flash app, and the bad guys can no longer navigate it using their url-following, html-parsing sc...
Hey
I have some doubts about PHP-based web crawlers: can they run like Java thread-based ones? I'm asking because in Java a thread can be executed again and again; I don't think PHP has anything like a thread function. Can you please say which web crawler will be more useful: a PHP-based or a Java-based one?
...
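Stock PHP has no thread API (parallel fetching there is usually done with curl_multi_exec), whereas Java hands you thread pools directly. The pattern in question is the same worker-pool idea regardless of language, sketched here in Python with a network-free, injectable fetch function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=8):
    """Fetch many URLs concurrently with a pool of worker threads.

    `fetch` is any callable taking a URL (e.g. one wrapping urllib),
    so the sketch itself never touches the network.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results
        # back up with their URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Since crawling is I/O-bound, the win comes from overlapping network waits, which is why single-threaded sequential fetching feels so slow in any language.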
Hey Friend,
I have heard the name Lucene many times; whenever I try to find details about web crawlers, it shows up. What is Lucene used for?
...
I have intermediate knowledge of Python. If I have to write a web crawler in Python, what should I follow and where should I begin? Is there any specific tutorial? Any advice would be of much help. Thanks.
...
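One concrete thing every crawler should do from day one is honor robots.txt, and the standard library already covers the parsing. A small sketch (the crawler's user-agent name is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt, user_agent="my-crawler"):
    """Parse a robots.txt body (fetched separately) and return a
    predicate saying whether this crawler may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

Beyond that, the usual beginner checklist is: a frontier queue, a visited set, per-host politeness delays, and a real HTML parser rather than string matching.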
I have a crawler that downloads pages and processes them. After 1 hour, every request to the sources takes 1 minute or more to complete, but at the start of the program every address is downloaded in under 1 second. I suspected that the destination web sites were limiting my requests or traffic, but when I close the program and run it again, performance return...
I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think that is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode, but just i...
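Scanning character-by-character for h-t-t-p is essentially hand-rolling a regular expression, so one pragmatic answer is to let the regex engine do it. A sketch (the pattern is a deliberate approximation, not a full RFC 3986 URL grammar):

```python
import re

# Pragmatic pattern: scheme, then any run of characters plausible
# inside a URL, stopping at whitespace, quotes, or angle brackets.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def find_urls(html):
    """Return every http(s) URL found in the text, in order."""
    return URL_RE.findall(html)
```

For links specifically (href attributes), an HTML parser is more reliable than any regex, since it also catches relative URLs.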
Hello everyone, is it possible to crawl user-authenticated websites using C#? Thanks
...
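Yes: the crawler logs in once, keeps the session cookie, and sends it on every subsequent request. In C# that is HttpClient (or HttpWebRequest) with a CookieContainer; the same idea sketched in Python, where the login URL and form field names are site-specific assumptions:

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode

def make_session():
    """Build an opener that remembers cookies across requests --
    the core of crawling behind a login."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def log_in(opener, login_url, form_fields):
    """POST the login form; the session cookie lands in the jar and
    is then sent automatically by every later opener.open() call."""
    data = urlencode(form_fields).encode("ascii")
    return opener.open(login_url, data)
```

Usage would look like: opener, jar = make_session(); log_in(opener, login_url, {"user": "...", "pass": "..."}) with the real field names taken from the site's login form; then every opener.open(page_url) is authenticated.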
I asked a question here the other day, but in the end I decided to do it myself for reasons of time; now I have a little more time to fix it :D I liked jSoup, but I'm kind of old school and prefer doing it myself (thanks to @Bakkal anyway).
I managed to make this code; it works fine for now, but if a webpage is not well constructed ...
Is a parallel system or a distributed system good for a website crawler and web indexer developed in Java? If so, which frameworks are available?
...
Possible Duplicate:
What is a good Java web crawler library?
Hello,
I need to crawl some websites, say some 1000 websites. I need an open-source web crawler in Java that can be customized. Can anyone suggest a good one?
...
I'm testing out a new web crawler and I'm looking for some good websites that might trip it up (redirects, frames, anything). Does anybody know of some really complicated sites, or ones that might trip it up? Thanks
...
I want a list of URLs from which my crawler can start crawling efficiently so that it can cover as much of the web as possible. Do you have any other ideas for creating an initial index for different hosts? Thank you.
...