crawler

How to rebuild Safari Web Clip functionality in PHP

Hi there, is there a way to rebuild Mac OS X Snow Leopard's Dashboard widget 'Web Clip' on a PHP website? Something like a crawler or scraper. I thought about using file_get_contents to get the page content into the page, but how do I select a section on the external page? And does this work with session/login content as well? I'm ...

Can a crawler be written entirely in JavaScript?

I was wondering: can a crawler be written entirely in JavaScript? That way, the crawler is only called when a user needs the information, and everything runs on the individual user's computer. If the crawler is written server side, doesn't that also run the risk of the IP being blocked? ...

Malicious crawler blocker for ASP.NET

I have just stumbled upon Bad Behavior - a plugin for PHP that promises to detect spam and malicious crawlers by preventing them from accessing the site at all. Does something similar exist for ASP.NET and ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it was posted. EDIT: I am int...

architecture python question

Hi. I am creating a distributed crawling Python app. It consists of a master server and associated client apps that will run on client servers. The purpose of the client app is to run across a targeted site to extract specific data. The clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifical...

redirect all bots using htaccess apache

What .htaccess RewriteRule should I use to detect known bots, for example the big ones: AltaVista, Google, Bing, Yahoo? I know I can check for their IPs or hosts, but is there a better way? ...

Spider/Crawler for testing an AJAX web app that requires a session cookie?

We have a web app that is heavy on AJAX and very customizable, so we need something that will click on every link in it to make sure that none of the forms/pages break. I know that there are lots of spiders/crawlers out there, but we haven't been able to find one that's easy to implement, works with AJAX, and allows you to have a se...

Age verification forms and crawlers

I have created a website about a beer brand and had to include an age verification page. The verification script is written in PHP and uses sessions to store the verification variable. The script works so that no matter from which link you try to enter the website, it takes you to the verification page first. The verification is...

Use jQuery on a variable instead of on the DOM? [solved]

In jQuery you can do: $("a[href$='.img']").each(function(index) { alert($(this).attr('href')); }); I want to write a jQuery function which crawls x levels from a website and collects all hrefs to GIF images. So when I use the get function to retrieve another page, $.get(href, function(data) { }); I want to be able to do s...

Is this visitor a bot or a user? PHP

I am doing my own visitor tracking with special features that neither Google Analytics nor any other tool can provide, as it is customized. I was calling this function near the end of my script, but our clients quickly ran into thousands of pages being called by bots (I assume Google), and my table filled up with around 1,000,000 u...

How to identify a website's content language, such as English, Japanese, Chinese, etc.

I am developing a website in ASP.NET to crawl other websites' content. I am able to get the content correctly, but how can I identify which language is used based on that content? For example: English, Hindi, Chinese, Japanese, etc. I used the following code. HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(TextBox1.Text); ...

Retrieving coordinates in this page

Hey guys, I'm trying to do some data mining and analyze data based on locations. For this site, http://www.dianping.com/shop/1898365, I am trying to figure out the latitude and longitude by crawling, but I can't seem to figure out where this information is stored. Can someone give me some pointers? ...

need to crawl images and the whole web pages

Hey, I am starting a project and am wondering about the relationship between the characters in images and the whole web page where the images reside. So first, I want to crawl some images and their web pages, and I need to save the crawl results to local disk for further analysis. I wonder if there is any open source software for this? Thx ^_^ ...

Free Software Solution to continuously load a large number of feeds with several servers?

I need a system that schedules and conducts the loading of a large number of Feeds. The scheduling should consider priority values for feeds provided by me and the history of past publish frequency of the feed. Later the system should make use of pubsub where available. Currently I'm planning to implement my own system based on HBase an...

Property showing up in crawled properties selection (Search)

I have a user profile property called "Chapter". It was indexed when it was created, but this property is not showing up in the crawled properties selection window for creating a metadata property mapping for search. How can I get this property to appear in the crawled properties selection window? ...

Programmatically generate additional properties during SharePoint crawl

Is it possible to hook into the MOSS 2007 crawl process and programmatically populate a metadata property as the content is being indexed? The reason I need to do this at crawl time is that the content is coming from outside SharePoint (from a file share) and so I can't add the metadata directly to the documents themselves. There's a wi...

Google APPS site - not being indexed

Help! I am having fun getting Google to index my Google Apps site and wondered if anyone can help. 1) I have successfully created CNAME records to point my www.mydomain.co.uk to myap.appspot.com. When the user goes to www.mydomain.co.uk, www.mydomain.co.uk shows in the address bar (fab). When I do a "fetch as googlebot" the detail of ...

How to get web content before visit that web page

Hi, how can I get the description/content of a web page for a given URL (something like the short description Google gives for each result link)? I want to do this in my JSP page. Thanks in advance! ...

crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality? Thanks, Nayn ...

Make a Web Crawler/ Spider

Hi, I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started. Basically, my spider is going to search for audio files and index them. I'm just wondering if anyone has any ideas for how I should do it. I've heard having it done in PHP would be extremely slow. I know VB.NET so c...

Need suggestion for web crawler software to help build a database of accountants

We are working on simplifying accounting, payables, and invoicing. In order to get accountants to use our product, we intend to run an email marketing campaign targeted exclusively at all the bookkeepers and accounting/CPA firms in the USA. We would like to know whether there is any software/application that will help to crawl the net and get ema...