web-crawler

Does a web crawler identify cookies?

Do web crawlers use cookies, or discard them? ...
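In practice, whether a crawler keeps cookies is a design choice: most batch crawlers discard them, while crawlers for session-gated sites keep a per-host jar. A minimal stdlib sketch of that choice; the header values are made-up examples:

```python
from http.cookies import SimpleCookie

def collect_cookies(set_cookie_headers, keep_cookies=True):
    """Parse the Set-Cookie headers of one response into a dict.
    A crawler that discards cookies simply returns an empty jar."""
    jar = {}
    if not keep_cookies:
        return jar
    for header in set_cookie_headers:
        cookie = SimpleCookie()
        cookie.load(header)
        for name, morsel in cookie.items():
            jar[name] = morsel.value
    return jar

# Example response headers (invented values).
jar = collect_cookies(["sessionid=abc123; Path=/", "lang=en"])
```

A stateful crawler would replay `jar` as a `Cookie` header on its next request to the same host; a stateless one passes `keep_cookies=False` everywhere.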

dealing with duplicate urls in a log-structured sitemap

As context, I am highlighting a recent paper on sitemaps (http://www2009.org/proceedings/pdf/p991.pdf) which has a write-up of their Amazon sitemap case study: Amazon publishes 20M URLs listed in Sitemaps using the "Sitemaps" directive in amazon.com/robots.txt. They use a SitemapIndex file that lists about 10K Sitemap files, each ...
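With an index fanning out to thousands of sitemap files, the same URL can easily appear in more than one file. A small Python sketch of merging sitemap files while deduplicating; the XML snippets below are stand-ins, not Amazon's actual sitemaps:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Stand-in sitemap files; real ones would be fetched via the SitemapIndex.
SITEMAP_A = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""
SITEMAP_B = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/b</loc></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>"""

def urls_from_sitemap(xml_text):
    """Pull every <loc> out of one sitemap file."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

def dedupe(sitemap_files):
    """Merge sitemap files, keeping first-seen order and dropping URLs
    that appear in more than one file."""
    seen, ordered = set(), []
    for xml_text in sitemap_files:
        for url in urls_from_sitemap(xml_text):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

At 20M URLs the `seen` set would be sizeable; a real crawler might swap it for a bloom filter or an on-disk key-value store, at the cost of exactness.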

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites of a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be i...

MP3 link Crawler

I have been looking into a good way to implement this. I am working on a simple website crawler that will go around a specific set of websites and crawl all the MP3 links into the database. I don't want to download the files, just crawl the links, index them, and be able to search them. So far I have been successful for some of the sites, ...
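For link-only indexing like this there is no need to fetch the MP3s themselves: collecting `href` values that end in `.mp3` and resolving them against the page URL is enough. A stdlib sketch; the page snippet and base URL are invented:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MP3LinkParser(HTMLParser):
    """Collects absolute URLs of every <a href="...mp3"> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".mp3"):
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))

# Invented example page.
page = '<a href="/music/track1.mp3">one</a> <a href="/about">about</a>'
parser = MP3LinkParser("http://example.com/")
parser.feed(page)
```

The collected `parser.links` would then be inserted into the database and indexed for search; sites that emit links from JavaScript would need a different approach.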

Automated filedownload using WebBrowser without url

I've been working on a WebCrawler written in C# using System.Windows.Forms.WebBrowser. I am trying to download a file off a website and save it on a local machine. More importantly, I would like this to be fully automated. The file download can be started by clicking a button that calls a javascript function that sparks the download disp...

How to detect search engine visits on my site, like phpBB does?

Is there any way to detect search engines or crawlers on my site? I have seen that in the phpBB admin panel we can see and allow search engines, and also see the last visit of each bot (like Googlebot). Any script in PHP? Not Google Analytics or a similar application. I need to implement that for my blog site; I think there is some way...

PHP crawler detection

Hiya, I'm trying to write a sitemap.php which acts differently depending on who is looking. I want to redirect crawlers to my sitemap.xml, as that will be the most up-to-date page and will contain all the info they need, but I want my regular readers to be shown an HTML sitemap on the PHP page. This will all be controlled from within the ph...
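The usual answer to both detection questions above is a User-Agent substring check, which phpBB itself relies on. It is trivially spoofable, so treat it as a convenience, not security. A Python sketch of the idea (the token list here is a deliberately short, hypothetical one; production lists need regular updates):

```python
# Hypothetical, incomplete token list for illustration only.
BOT_TOKENS = ("googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider")

def is_crawler(user_agent):
    """True if the User-Agent string contains a known bot token.
    Spoofable: anything can send 'Googlebot' in its UA header."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)
```

In the sitemap.php scenario, a server-side script would call the equivalent of `is_crawler()` on the request's User-Agent and either emit the XML sitemap or render the HTML one. Note that serving crawlers different content than humans can be treated as cloaking by search engines, so a plain link to sitemap.xml is often safer.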

Best open source, extendable crawler to use for image crawling.

Hi- We are in the starting phase of a project, and we are currently wondering which crawler is the best choice for us. Our project: basically, we're going to set up Hadoop and crawl the web for images. We will then run our own indexing software on the images stored in HDFS, based on the Map/Reduce facility in Hadoop. We will not...

When does Google re-crawl a site?

When does Google re-crawl a site? And why does Google have two versions of the same page in its cache? http://forum.portal.edu.ro/index.php?showtopic=112733 The cached pages are: forum.portal.edu.ro/index.php?showtopic=112733&st=25/ forum.portal.edu.ro/index.php?showtopic=112733&st=50 ...

How do I resolve the content of a webpage?

I'm writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real "meat" of the page (provided there is one, naturally). I have tried various approaches: many pages have RSS feeds, so I can read the feed and get this page-specific content. Many pages use "co...
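One crude but serviceable fallback when no feed or template marker is available is a text-density heuristic: strip markup and keep the longest contiguous text block, on the assumption that boilerplate (menus, footers) fragments into short runs while the article body is one long run. A sketch of that assumption; real extractors such as Readability score DOM nodes far more carefully:

```python
import re

def main_content(html):
    """Naive density heuristic: drop script/style bodies, replace tags
    with line breaks, and return the longest remaining text block."""
    html = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", "\n", html)
    blocks = [b.strip() for b in text.split("\n") if b.strip()]
    return max(blocks, key=len) if blocks else ""

# Invented example page.
SAMPLE = ('<html><body><div>menu</div>'
          '<p>A much longer paragraph that holds the real article '
          'content of the page.</p><div>footer</div></body></html>')
```

This fails on pages whose body is itself split into many short paragraphs, which is one reason the question's multi-strategy approach (feeds first, heuristics last) is sensible.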

Two charset tags on a page, which to take?

I'm working on crawling pages for information, and have run into many problems with parsing the pages in Groovy. I've made a semi-solution that works most of the time using juniversalchardet and just scanning the page for the charset tag in the head, but sometimes two of these tags are found on one page, for example: <meta http-equiv="Content-Ty...
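When two charset declarations conflict, the precedence browsers apply is: a charset in the HTTP Content-Type header beats any `<meta>` tag, and among `<meta>` tags the first one in document order wins. A Python sketch of that decision; the utf-8 fallback is a simplification of the full sniffing algorithm:

```python
import re

META_RE = re.compile(r'(?i)<meta[^>]+charset\s*=\s*["\']?\s*([a-z0-9_\-]+)')

# Example head with two conflicting declarations, as in the question.
HEAD = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1">'
        '<meta charset="utf-8">')

def pick_charset(http_content_type, html_head):
    """HTTP-header charset beats <meta>; among <meta> tags, the first
    one in document order wins. Falls back to utf-8 (a simplification;
    real browsers run a longer detection algorithm)."""
    if http_content_type:
        m = re.search(r"(?i)charset=([a-z0-9_\-]+)", http_content_type)
        if m:
            return m.group(1).lower()
    m = META_RE.search(html_head)
    return m.group(1).lower() if m else "utf-8"
```

A byte-level detector like juniversalchardet is then best kept as the tie-breaker for pages whose declared charset is plainly wrong, rather than the first resort.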

Why doesn't Nutch seem to know about "Last-Modified"?

I setup Nutch with a db.fetch.interval.default of 60000 so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page that it fetched yesterday gets fetched with a 200 response code, indicating that it's not using the previous day's date in the "If-Modif...
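Whatever Nutch does internally, the mechanism the question is after is an HTTP conditional request: send `If-Modified-Since` with the previous fetch time and let the server answer `304 Not Modified` instead of a full 200. A stdlib sketch of building those headers; the date and User-Agent are arbitrary examples:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def conditional_headers(last_fetch):
    """Build re-crawl request headers. With If-Modified-Since set, an
    unchanged page should come back as 304 with no body."""
    headers = {"User-Agent": "example-crawler/0.1"}  # hypothetical UA
    if last_fetch is not None:
        # HTTP dates are RFC 1123 format in GMT.
        headers["If-Modified-Since"] = format_datetime(last_fetch, usegmt=True)
    return headers

# Arbitrary example: last successful fetch on 2009-08-01 UTC.
hdrs = conditional_headers(datetime(2009, 8, 1, tzinfo=timezone.utc))
```

Note the check only helps if the server actually honours `If-Modified-Since`; many dynamic sites always answer 200, which would also explain the behaviour seen with Nutch here.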

Building an automatic web crawler

I am building a web application crawler that's meant not only to find all the links or pages in a web application, but also perform all the allowed actions in the app (such as pushing buttons, filling forms, notice changes in the DOM even if they did not trigger a request etc.) Basically, this is a kind of "browser simulator". I find ...

Does the URL order matter in an XML sitemap?

For search engines and website crawlers, does the URL order matter in an XML sitemap? Currently, when the sitemap is generated, I order the website URLs sequentially using a unique id in the database. Should I order the URLs in date order? Sequential Sitemap <urlset> <url> <loc>http://example.com/</loc> <lastmod>2009-08-1...
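For what it's worth, the Sitemaps protocol does not assign any meaning to the order of `<url>` entries, so either ordering is fine for crawlers; sorting by `lastmod` mainly helps humans reading the file. A sketch that emits a tiny urlset sorted newest-first; the URLs and dates are placeholders:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: iterable of (url, lastmod) pairs. Order inside <urlset>
    is not significant to crawlers, so newest-first is a readability
    choice, not an SEO one."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in sorted(entries, key=lambda e: e[1], reverse=True):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder entries.
xml = build_sitemap([("http://example.com/a", "2009-08-01"),
                     ("http://example.com/b", "2009-09-15")])
```

What crawlers do read is the `<lastmod>` value itself, so keeping it accurate matters far more than the order of the entries.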

How to track all website activity and filtering web robot data

I'm doing very rudimentary tracking of page views by logging URLs, referral codes, sessions, times, etc., but find it's getting bombarded with robots (Google, Yahoo, etc.). I'm wondering what an effective way is to filter out these robots or not log their statistics. I've experimented with robot IP lists, etc., but this isn't foolproof. Is there some k...

Crawler gets stuck on the mandatory age-check page in Drupal

Hi, we have a big community website built in Drupal, where the site has a mandatory age check before you can access the content. It checks for a cookie to be present; if not, you get redirected to the age-check page. Now we believe crawlers get stuck on this part: they get redirected to the age check and never get to crawl ...

Extensible/Customizable Web Crawling engines / frameworks / libraries?

I have a relatively simple case. I basically want to store data about links between various websites, and don't want to limit the domains. I know I could write my own crawler using some http client library, but I feel that I would be doing some unnecessary work -- making sure pages are not checked more than once, working out how to read ...

Is there a way to tell when googlebot/bingbot/yahoobot is crawling my site in asp.net 2005 IIS6?

I want to know when Google is crawling the site, preferably by sending myself an email. Is there any way to do this that won't adversely affect performance? ...
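The User-Agent alone is spoofable, but Google documents a reverse-then-forward DNS check for verifying Googlebot: the IP's PTR record must end in googlebot.com (or google.com), and that hostname must resolve back to the same IP. A Python sketch with the resolvers injectable so the logic can be tested offline; the default lambdas perform real DNS lookups:

```python
import socket

def verify_bot(ip, suffixes=(".googlebot.com", ".google.com"),
               reverse=lambda ip: socket.gethostbyaddr(ip)[0],
               forward=lambda host: socket.gethostbyname(host)):
    """Reverse-then-forward DNS check: spoofing the User-Agent is easy,
    but a fake bot cannot make Google's DNS zone point back at its IP."""
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not host.endswith(suffixes):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

To avoid a performance hit, an ASP.NET handler would do this check (and send the notification email) asynchronously and cache verified IPs, rather than resolving DNS on every request.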

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits of doing that versus the crawler doing its job? Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additi...

What is a good Web search and web crawling engine for Java?

I am working on an application where I need to integrate a search engine. It should also do crawling. Please suggest a good Java-based search engine. Thank you in advance. ...