views:

124

answers:

2

This is a question to proxy and plugin developers.

The usual mindset when it comes to specific sites is "They make changes that break our plugin; we change our logic to make it work again".

But what if the other side worries about this too? If we wanted to compile a set of guidelines and best practices for developing a proxy-friendly site, what would you suggest should go into it? Think of the tough nuts you've had to crack. Do you remember those moments you wished the site developer had done a certain feature differently? How?

Since this is concerned with coding, I don't think it belongs on Server Fault.

Edit: After reading Pekka's comment I feel I should add some more background information.

There are web proxy scripts out there such as Glype and PHProxy. Because such a script has to cope with many unknown conditions, it fails to serve many sites. Since the number of such sites is overwhelming, it does not make sense to try to make the proxy's internal logic sophisticated enough to handle this huge variety. This is where plugins come in handy: the main or base script implements a mechanism to invoke plugin code on a per-site basis.
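The base-script-plus-plugins arrangement could look roughly like the sketch below. This is a minimal illustration in Python; the `PLUGINS` registry, the hostname matching, and the `rewrite` hook are my own invented names, not the actual Glype or PHProxy plugin API:

```python
from urllib.parse import urlparse

# Hypothetical per-site plugin registry: hostname -> rewrite hook.
PLUGINS = {}

def plugin(hostname):
    """Register a rewrite hook for one specific site."""
    def register(func):
        PLUGINS[hostname] = func
        return func
    return register

@plugin("facebook.com")
def fix_facebook(body):
    # An example of the kind of site-specific fix a plugin author
    # might arrive at after debugging; the exact rewrite is made up.
    return body.replace('action="/login"', 'action="/proxy?u=/login"')

def rewrite(url, body):
    """Base proxy logic: apply a site plugin if one is installed."""
    host = urlparse(url).hostname or ""
    hook = PLUGINS.get(host.removeprefix("www."))
    return hook(body) if hook else body
```

Users then "install" a fix simply by dropping another `@plugin(...)`-decorated module into the plugins directory, which is essentially the model the question describes.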

So, if the proxy fails to serve, let's say, facebook.com (which is the case, by the way), a coder interested in the challenge does some research and debugging to find where and why the chain is broken and what should be done to resolve the issue. The coder implements the fix as a plugin for that particular site, and users can drop it into their plugins directory.

But it also happens that something on the site changes, and that change breaks the plugin again. So it is a constant race to keep up with the most recent changes to a site. The irony of the situation is that many site developers neither know nor care about the impact their design decisions may have on the proxy serve-ability of their content. But some sites have good reason to care about the ability of visitors to access their content through proxies. I don't want to get into politics here, so I leave it to you to guess why this might be important to some sites.

This question is an attempt to tap into the collective knowledge and experience of proxy and plugin authors to make a set of guidelines for making a site proxy-friendly.

I didn't originally tag the question php, as it mostly concerns the output of a site, not how you generate it. But I decided to add the tag because it improves the visibility of the question, and it can be justified on a target-audience basis as well. I'm also making this community wiki, so if you feel the php tag should be removed, just do so.

+1  A: 

This is an interesting question and I don't think there is a good general solution. Basically you want your site's content to be transformable by some third party over which you have no control and possibly no knowledge of the transformations.

The traditional way this problem is solved is by having a published API which allows third parties to query the data they want in a controlled manner which doesn't rely on screen-scraping. The API often will only expose a subset of the functionality, usually because the site requires users viewing pages to generate ad revenue (or some other kind of revenue).

You could generate very simple pages, in effect using an HTML API, and use JavaScript and CSS to make the pages more user-friendly. This might not be appropriate for every site, but the "progressive enhancement" approach trumpeted by jQuery and others is along the same lines: serve basic, semantic content and add functionality and pizzazz through JS and CSS.

You could use microformats to make certain page content more accessible. You should use semantic HTML and put lots of classes and ids on page elements so that plugin authors can find the relevant content they need.

It strikes me that no matter what these proxies are going to need to learn how to parse your pages at least once. You could document the process (maybe release a plugin or two).

I don't think you'll be able to avoid forcing plugin authors to re-code when you release new versions of your site. You could institute a policy of having a beta period, where both the old and new versions of the site are available, and this would give plugin authors a chance to update their plugins with no service interruptions for their users.

Mr. Shiny and New
This is mostly true for situations where we want to make our content available for automated scraping and mashups. The purpose of web proxies (aka anonymizing proxies) is not to take bits of our content; they are there to offer a side door to content otherwise not available to the viewer. As such, they try to deliver the same data they get, on behalf of the proxy user. But the last recommendation (a beta period) is a good suggestion.
Majid
I need my bounty for another question, and I think this is the best answer we're gonna get :) Catch this, Mr. Shiny and New :)
Litso
@Litso: Thanks :)
Mr. Shiny and New
+1  A: 

I am not sure if this is of concern to you or not, but at the moment I can only remember two things that should be taken care of when working on a proxy-friendly site. One is headers, which can affect sites visited behind a proxy, and the other is IP detection. The Cache-Control header (public/private) and other headers can affect the speed and privacy of the user. The IP you detect may be that of the proxy and not of the user. So these things should be kept in mind.
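Both points above can be sketched in a few lines. This is a hedged illustration in Python, not site-specific code: `X-Forwarded-For` is the de-facto header many (but not all) proxies send, and whether you trust it should depend on whether the request came through your own infrastructure:

```python
def client_ip(remote_addr, headers):
    """Best-effort real client IP: prefer X-Forwarded-For when a
    proxy added it, else fall back to the socket address."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        # The header is a comma-separated chain; the first entry
        # is the original client. Trust it only behind a proxy
        # you control, since clients can forge it.
        return forwarded.split(",")[0].strip()
    return remote_addr

def cache_header(contains_personal_data):
    """Pick a Cache-Control value: 'private' keeps shared proxy
    caches from storing per-user pages."""
    if contains_personal_data:
        return "private, no-store"
    return "public, max-age=3600"
```

The `max-age` value is arbitrary; the point is that marking per-user pages `private` prevents an intermediate cache from serving one user's page to another.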

Satya Prakash