crawler

What are the techniques to implement a visual web scraper?

I'm going to build a visual web scraper. The most important feature the software requires is to be "visual", like http://mozenda.com/. The software should provide a browser-like tool that not only allows the user to browse a webpage and perform tasks such as authenticating, clicking links, and searching, but can also track all of these tasks. Does anyone know the ...
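A common way to get the "visual" part is to embed a browser component and record every navigation the user performs, so the session can be replayed later as a scraping script. Below is a minimal sketch of the recording idea using JavaFX WebView; the WebView choice and the start URL are assumptions for illustration (Mozenda's own embedded browser stack may differ).

```java
import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class RecordingBrowser extends Application {
    @Override
    public void start(Stage stage) {
        WebView view = new WebView();
        // Log every navigation the user performs, so the scraper can replay
        // the same sequence of steps later without the user present.
        view.getEngine().locationProperty().addListener(
                (obs, oldUrl, newUrl) -> System.out.println("NAVIGATE " + newUrl));
        view.getEngine().load("https://example.com/"); // placeholder start page
        stage.setScene(new Scene(view, 1024, 768));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```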

How do I get data from a crawler to my site?

What is the best way to get data from an external crawler into my database and onto my site? I work in a LAMP environment. Are web services a good idea? The crawler runs every 15 minutes. ...
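One common pattern, if web services are on the table: have the crawler POST each batch of results to a small HTTP endpoint on the LAMP site, and let a PHP script behind it validate the rows and run the INSERTs, so the database is never exposed directly. A sketch of the crawler-side push (the ingest.php URL and JSON payload are hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushResults {
    public static void main(String[] args) throws Exception {
        String json = "{\"items\":[{\"url\":\"http://example.com\",\"title\":\"demo\"}]}";
        HttpClient client = HttpClient.newHttpClient();
        // POST the batch every 15 minutes; the receiving script (ingest.php,
        // hypothetical) validates and inserts, keeping the DB behind the site.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/ingest.php"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```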

crawler instances

I'm building a large-scale web crawler. How many instances are optimal when crawling the web, running on dedicated web servers located in internet server farms? ...
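There is no single optimal number; it is usually found empirically, and the ceiling tends to be bandwidth and per-host politeness limits rather than CPU. It helps to make concurrency one tunable parameter from the start, as in this sketch (the pool size and URLs are placeholders):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedCrawl {
    public static void main(String[] args) throws InterruptedException {
        int concurrency = 32; // tune against bandwidth and per-host limits, not core count
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (String url : List.of("http://example.com/a", "http://example.com/b")) {
            pool.submit(() -> fetch(url)); // each task fetches one page
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void fetch(String url) {
        System.out.println("fetching " + url); // placeholder for the real HTTP fetch
    }
}
```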

Is it possible to discover plugged disks from Java?

I'm writing a disk crawler, and if the user doesn't provide an existing path, the program should search all disks that are available. Does anybody know if this is possible, and if so, how to do it from Java? ...
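Yes; on the JVM this needs no platform-specific code. File.listRoots() (or FileSystems.getDefault().getRootDirectories()) enumerates the filesystem roots: every drive letter on Windows, just / on Unix-like systems. A minimal sketch:

```java
import java.io.File;

public class ListDisks {
    public static void main(String[] args) {
        // Every mounted root: C:\, D:\, ... on Windows; just "/" on Unix-like systems.
        for (File root : File.listRoots()) {
            System.out.printf("%s (total %d bytes, free %d bytes)%n",
                    root.getAbsolutePath(), root.getTotalSpace(), root.getFreeSpace());
        }
    }
}
```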

Does web crawler identify cookies?

Do web crawlers use cookies, or discard them? ...
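Both behaviors exist: simple bots ignore Set-Cookie, while crawlers that must log in or keep a session hold a cookie jar and replay it on later requests. For illustration, a cookie-aware fetch sketched with Java's built-in HttpClient:

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CookieAwareFetch {
    public static void main(String[] args) throws Exception {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        HttpClient client = HttpClient.newBuilder().cookieHandler(jar).build();
        // Set-Cookie headers from the first response land in the jar and are
        // replayed automatically on every later request to the same site.
        HttpRequest req = HttpRequest.newBuilder(URI.create("https://example.com/")).build();
        client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(jar.getCookieStore().getCookies());
    }
}
```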

MP3 link Crawler

I have been looking into a good way to implement this. I am working on a simple website crawler that will go around a specific set of websites and crawl all the mp3 links into the database. I don't want to download the files; I just want to crawl the links, index them, and be able to search them. So far I have been successful with some of the sites, ...
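One workable approach: fetch each page, pull out anchors whose href ends in .mp3, resolve them against the page URL, and store the absolute links for indexing. A regex-based sketch (regexes are fragile on messy HTML, so treat this as the rough idea; the site URL is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3LinkCrawler {
    static final Pattern MP3 =
            Pattern.compile("href=[\"']([^\"']+\\.mp3)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String page = "https://example.com/music/"; // hypothetical site from the crawl list
        String html = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(page)).build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = MP3.matcher(html);
        while (m.find()) {
            // Resolve relative links, then hand the absolute URL to the indexer/DB.
            System.out.println(URI.create(page).resolve(m.group(1)));
        }
    }
}
```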

Is it possible to crawl ASP.NET pages?

Is there a way to crawl ASP.NET pages that use doPostBack for event calls? Example: Page1.aspx contains one LinkButton that redirects to Page2.aspx. Code-behind for the LinkButton Click event: Response.Redirect("Page2.aspx"). On the client side, this code is generated for the click event: doPostBack(... Is it possible to crawl pages usin...
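A plain link-follower will not see such pages, because __doPostBack submits the surrounding form. A crawler can imitate it: scrape the hidden __VIEWSTATE and __EVENTVALIDATION inputs from the page, then re-POST them with __EVENTTARGET set to the control's id. A hedged sketch of that request (the control id, field values, and URL are placeholders):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PostBackCrawl {
    public static void main(String[] args) throws Exception {
        // In a real crawler these values are scraped from Page1.aspx's hidden inputs.
        String target = "LinkButton1";  // hypothetical control id
        String viewState = "...";       // value of the __VIEWSTATE hidden field
        String eventValidation = "..."; // value of the __EVENTVALIDATION hidden field

        String form = "__EVENTTARGET=" + enc(target)
                + "&__EVENTARGUMENT="
                + "&__VIEWSTATE=" + enc(viewState)
                + "&__EVENTVALIDATION=" + enc(eventValidation);

        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://example.com/Page1.aspx"))
                        .header("Content-Type", "application/x-www-form-urlencoded")
                        .POST(HttpRequest.BodyPublishers.ofString(form))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode()); // expect the redirect toward Page2.aspx
    }

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```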

What sort of web host lets you run crawlers on it?

I'm working on a graduation project for one of my university courses, and I need to find some place from which to run several crawlers I wrote in C#. With no web hosting experience, I'm a bit lost. Is this something that any site allows? Do I need a special host that gives more access to the server? The crawler is a simple app that does its work,...

Problem with a custom content type

I've made a custom content type based on "Page publishing". In this content type, I've also made a lookup field that lists all items in a list (nothing special about that list, though). When I use my own user to look at a page made with my custom content type, there is no problem. But when the site is crawled, the crawler doesn't want to index it...

SharePoint Crawler is denied access to sites

We create all our site collections programmatically with a custom site definition/template. Everything works as expected, except for the crawler. It's apparently denied access to the sites. The crawl log says: http://server.localnetwork.lan/somesites/siteName The object was not found. (The item was deleted because it was either not fo...

Crawler does not create custom crawled properties

Hi, these days I have faced a very strange problem. I have a development environment with MOSS 2007 SP2 and WS 2008; I have search configured and everything works great. I started configuring a staging environment (MOSS 2007 SP2 with June CU) and created a new farm and a new SSP. I have deployed my changes with a package (wsp) and manua...

SharePoint Document Unique Identifier while crawling using the SiteData Web Service

Does anybody know how I can map the "UniqueID" property to a managed property so I can display it in the advanced search results? This property is not visible when I try to create a new managed property using the Metadata Property Mappings link in Shared Services Administration. Using the SiteData or Lists web service I can see the "ows_Uni...

How to find all links / pages on a website

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site. I've looked at HTTrack, but that downloads the whole site, and I simply need the directory tree. Thanks, Jonathan ...
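Even without downloading the files, the crawler must still fetch each page's HTML once, since the link graph only exists inside the pages; the difference from HTTrack is that only the URLs are kept. A breadth-first sketch restricted to one site (the start URL is a placeholder, and the regex extraction is approximate):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteTree {
    static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String start = "https://example.com/"; // placeholder start URL
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            System.out.println(url); // print instead of saving: we only want the tree
            String html = client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    String link = URI.create(url).resolve(m.group(1)).toString();
                    // Stay on the same site and visit each page only once.
                    if (link.startsWith(start) && seen.add(link)) queue.add(link);
                } catch (IllegalArgumentException ignored) {
                    // skip malformed hrefs
                }
            }
        }
    }
}
```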

How to get HTML element coordinates using C#?

Hello, I am planning to develop a web crawler that would extract the coordinates of HTML elements from web pages. I have found out that it is possible to get HTML element coordinates by using the "mshtml" assembly. Right now I would like to know if it is possible, and how, to get only the necessary information (HTML, CSS) from a web page, and then by us...
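mshtml is one C#/COM route; the underlying idea is to ask a real layout engine, rather than the raw HTML/CSS, for the rendered geometry, typically by evaluating getBoundingClientRect() on the element inside an embedded browser. For illustration, that technique sketched in Java with JavaFX WebView (the selector and URL are placeholders):

```java
import javafx.application.Application;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebView;
import javafx.stage.Stage;
import netscape.javascript.JSObject;

public class ElementRect extends Application {
    @Override
    public void start(Stage stage) {
        WebView view = new WebView();
        view.getEngine().getLoadWorker().stateProperty().addListener((obs, old, state) -> {
            if (state == Worker.State.SUCCEEDED) {
                // Ask the live layout engine for the element's rendered coordinates.
                JSObject rect = (JSObject) view.getEngine().executeScript(
                        "document.querySelector('h1').getBoundingClientRect()");
                System.out.println(rect.getMember("left") + ", " + rect.getMember("top"));
            }
        });
        view.getEngine().load("https://example.com/"); // hypothetical target page
        stage.setScene(new Scene(view, 1024, 768));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```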

parser/crawler algorithm question

Hi. I'm in the process of high-level design for a targeted crawler/parser. The app will be used to extract data from specific websites. Furthermore, the app is being designed to run in a master/slave setup, where the master/server side processes the packets to be parsed and then allows the child nodes (client servers) in the system to fe...
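A common shape for that master/slave split is a single work queue owned by the master, with each child pulling the next packet when it is free, which load-balances naturally. A single-process sketch of the queue discipline, with threads standing in for the client servers:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MasterWorkers {
    static final String POISON = "__STOP__"; // sentinel telling a worker to exit

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> work = new LinkedBlockingQueue<>();
        int workers = 4;
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String packet = work.take(); // blocks until the master enqueues work
                        if (packet.equals(POISON)) return;
                        System.out.println(Thread.currentThread().getName() + " parsing " + packet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // Master side: enqueue the packets to parse, then one poison pill per worker.
        for (int i = 0; i < 10; i++) work.put("packet-" + i);
        for (int i = 0; i < workers; i++) work.put(POISON);
    }
}
```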

efficient web parsing approach - aggregation issue

Hi. A number of sites do aggregation (indeed.com, simplthired.com, expedia, ...). I'm trying to figure out a good/efficient way of determining that the data I get from parsing a page is valid. In particular, if I parse a page multiple times (say, once a day), how do I 'know' that the data I get at any given time is valid? I'm cons...
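One cheap heuristic is to fingerprint each run's extraction and pair that with structural sanity checks: an identical hash means nothing changed, while a row count far outside the usual band suggests the page layout broke rather than the data. A sketch under an assumed row format:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

public class ParseValidator {
    // Hash all extracted rows so runs can be compared cheaply day to day.
    static String fingerprint(List<String> rows) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (String row : rows) sha.update(row.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(sha.digest());
    }

    // Structural sanity checks catch a silently broken parse.
    static boolean looksValid(List<String> rows, int expectedMin, int expectedMax) {
        if (rows.size() < expectedMin || rows.size() > expectedMax) return false;
        return rows.stream().noneMatch(String::isBlank);
    }

    public static void main(String[] args) throws Exception {
        List<String> today = List.of("job|Acme|NYC", "job|Globex|SF"); // hypothetical rows
        System.out.println(looksValid(today, 1, 10_000) + " " + fingerprint(today));
    }
}
```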

Lucene crawler (it needs to build a Lucene index)

Hi, I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is the reason why Nutch is eliminated, for example... Does anybody know whether such a web crawler exists, and if so, where I can find it? T...
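If nothing off-the-shelf fits, the indexing half is small enough to write directly against Lucene: wrap each fetched page in a Document and hand it to an IndexWriter. A sketch against the Lucene Java API (the field layout is an assumption; the fetch step is omitted):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CrawlIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // One Document per crawled page: the URL as an exact-match key,
            // the page text analyzed for full-text search.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("content", "page text goes here", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```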

SEO: Adding to Google other than submitting directly to Google's crawler - http://www.enshaeyah.webs.com

Hi all, what are other ways of making your website searchable by Google, other than submitting the link directly to Google? Submitting links to Yahoo is a breeze; they get crawled within a day or two... Google, though, takes a while... Thanks... ...

PHP cURL getting encoded data

I have downloaded the page headers and the compressed body in one string by using cURL. The problem is that I don't know how to split them from each other, or how to decompress the body. Thank you! ...
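With cURL's CURLOPT_HEADER enabled, the headers and body come back in one buffer separated by a blank line (\r\n\r\n), so the usual fix is to split on the first blank line and gunzip the rest; in PHP that is gzdecode(), or you can set CURLOPT_ENCODING and let cURL decompress for you. The same split-and-gunzip idea, sketched in Java to match the other examples here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitAndDecompress {
    static String decode(byte[] raw) throws Exception {
        // Headers end at the first blank line; everything after is the gzip body.
        int split = indexOf(raw, "\r\n\r\n".getBytes(StandardCharsets.ISO_8859_1));
        String headers = new String(raw, 0, split, StandardCharsets.ISO_8859_1);
        System.out.println(headers);
        byte[] body = Arrays.copyOfRange(raw, split + 4, raw.length);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // Build a synthetic "headers + gzipped body" buffer to demonstrate.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write("HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(decode(out.toByteArray()));
    }
}
```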

Average Size of an RSS/Feed file, for Data Storage and Bandwidth Calculation

Hi folks, I'm doing a back-of-the-envelope calculation to determine the network bandwidth and data storage needed to monitor approximately 1,000,000 feeds every 20 minutes. Any idea what the average size of an RSS file could be? I remember reading somewhere that the folks from Technorati revealed the average size of an RSS file. Ankur Gupta ...
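The envelope math needs only one assumed number, the average feed size; published figures vary, but tens of kilobytes is a common ballpark. A worked calculation (the 25 KB average is an assumption, not a measurement):

```java
public class FeedBudget {
    public static void main(String[] args) {
        long feeds = 1_000_000L;    // feeds to poll
        long avgBytes = 25_000L;    // ASSUMPTION: ~25 KB average RSS file
        long intervalSec = 20 * 60; // poll every 20 minutes

        long perPass = feeds * avgBytes;                        // bytes per polling pass
        double mbits = perPass * 8.0 / intervalSec / 1_000_000; // sustained megabits/s

        // With these assumptions: 25 GB per pass, roughly 167 Mbit/s sustained.
        System.out.printf("per pass: %.1f GB, sustained: %.0f Mbit/s%n",
                perPass / 1e9, mbits);
        // Conditional GET (If-Modified-Since / ETag) cuts this sharply, since most
        // feeds will not have changed within any given 20-minute window.
    }
}
```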