We have a development server which we use while developing LAMP applications.

We also have a production server where our websites are deployed.

Occasionally, when deploying websites on the production server, we forget to update certain paths. As a result, we sometimes find that a production website is still referencing images, scripts and stylesheets on the development server. These path issues can be tricky to detect, and I am looking for creative ways to catch them.

1) We can block all access from the development server to the production server. This would allow us to detect incorrect paths more easily, as they would cause an error or a broken image rather than appearing fine.

2) Another option is that we could prevent any external website from hotlinking to images on the development server (http://altlab.com/htaccess_tutorial.html). Any links to an image on the development server could then be replaced with an image which says something like "INCORRECT PATH".
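As a rough PHP sketch of that second idea (the tutorial uses .htaccess rules instead; the domain and file names below are placeholders, and it assumes image requests are routed through this script, e.g. via a rewrite rule):

<?php
// Sketch: serve dev-server images through this script and swap in an
// "INCORRECT PATH" placeholder whenever the request comes from an
// external site. Domain and paths are placeholders, not real values.

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
// Treat empty referers and our own dev domain as "local" (a simplification).
$fromDev = ($referer === '' || strpos($referer, 'dev.example.com') !== false);

$requested = isset($_GET['img']) ? basename($_GET['img']) : '';
$file = ($fromDev && $requested !== '')
    ? __DIR__ . '/images/' . $requested          // serve the real image
    : __DIR__ . '/images/incorrect-path.png';    // serve the warning image

if (!is_file($file)) {
    http_response_code(404);
    exit;
}

header('Content-Type: image/png'); // assumes PNGs for this sketch
readfile($file);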

What other options are there?

(Note: I agree that the goal is to PREVENT these path issues from occurring on the production server in the first place, which can be done using code versioning and deployment tools, etc. However, in this conversation I am specifically looking for ways to detect these path issues. It is an extra layer of quality control I am looking for...)

A: 

It is better to have one option in a configuration file on the server that will allow you to switch between dev and prod environments.
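For instance, a rough sketch of a per-server config file (the constant names and domains are only examples, not a specific recommended layout):

<?php
// config.php - kept out of version control and different on each server.
// ENVIRONMENT and BASE_URL are example names only.

define('ENVIRONMENT', 'production');   // 'development' on the dev server

if (ENVIRONMENT === 'production') {
    define('BASE_URL', 'http://www.example.com/');
} else {
    define('BASE_URL', 'http://dev.example.com/');
}
?>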

FractalizeR
A: 

I think the easiest way to handle paths is to use the exact same code on both servers. Whenever you need to provide a full absolute path, define a proper constant:

<?php

include_once(FS_ROOT . 'lib/foo.php');
echo '<link href="' . WEB_ROOT . 'css/bar.css" rel="stylesheet" type="text/css">';

?>

These constants can either be set in an unversioned file:

define('FS_ROOT', '/home/project/htdocs/');
define('WEB_ROOT', '/');

... or they can be calculated dynamically with __FILE__, $_SERVER['DOCUMENT_ROOT'], dirname() and similar functions.
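For example, a sketch of the dynamic approach (assuming this file sits in the project root, inside the document root):

<?php
// bootstrap.php - derive the roots dynamically instead of hard-coding them.

define('FS_ROOT', dirname(__FILE__) . '/');

// Web path of the project = filesystem path minus the document root.
$relative = substr(FS_ROOT, strlen(rtrim($_SERVER['DOCUMENT_ROOT'], '/')));
define('WEB_ROOT', ($relative === false || $relative === '') ? '/' : $relative);
?>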

Álvaro G. Vicario
A: 

Apache Logs

To diagnose places where this is happening already, you should be able to check the server logs on the dev server to see which files are being requested from unexpected places. Using log analysis should help you identify specific cases that need fixing without interrupting service.

Basic Apache logs should have the IPs of the computers requesting files on each server. You should be able to filter out any IPs that you or your team are using and see which files on the dev server are being pulled by external IPs.

A more robust log analysis tool may be able to help you figure out which requests are coming from another domain. You might want to look for tools designed to prevent hotlinking.
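As a rough illustration of the referer approach (the log path, log format and domain below are assumptions; adjust them to your own setup), a short PHP script could scan the dev server's access log for requests referred by the production site:

<?php
// Sketch: scan the dev server's Apache access log for requests that were
// referred by the production site. Assumes the default "combined" log
// format; the log path and production domain are placeholders.

$log  = '/var/log/apache2/access.log';
$prod = 'www.example.com';

foreach (file($log) as $line) {
    // Combined format: ... "request" status size "referer" "user-agent"
    if (preg_match('/"[^"]*" \d+ \S+ "([^"]*)"/', $line, $m)) {
        if (strpos($m[1], $prod) !== false) {
            echo $line; // a production page pulled a file from the dev server
        }
    }
}
?>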

Introspection

I think there are two reasonable options for auditing code on your end to locate the problem.

The first option is to use a search tool (grep or grepWin) on your code and literally search for the dev domain in your code. If you've used some complicated code to stitch domains together, this may not work, but if you've just hard-coded values all over the place this may be the magic bullet.

The second option is to spider your own site and search for the bad domain in rendered HTML. The benefit here is that you can see the actual problem right away, but the downside is you may miss some if the spider can't or doesn't access certain pages.
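A very small PHP sketch of that spidering idea (the URLs and domain are placeholders, it requires allow_url_fopen, and a real crawler would follow links rather than use a fixed list):

<?php
// Sketch: fetch a list of production pages and flag any rendered HTML
// that still references the development server.

$devDomain = 'dev.example.com';
$pages = array(
    'http://www.example.com/',
    'http://www.example.com/about',
    // ... ideally generated from a sitemap or by following links
);

foreach ($pages as $url) {
    $html = file_get_contents($url);
    if ($html !== false && strpos($html, $devDomain) !== false) {
        echo "Dev-server reference found on: $url\n";
    }
}
?>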

banzaimonkey
A: 

I generally make sure that the paths on the development workstations and the staging/testing server are not the same. For example, on a dev workstation it is localhost/websitename/path and on staging it's just websitename/path.

This causes all kinds of stuff to break obviously when you go from localhost to staging/testing - that way you ensure that your paths are either sniffed dynamically or built from the appropriate constants.

Yes, I'm assuming version control, a deployment procedure, etc.

Yes, find/replace, grep, and spiders are all your friends for fixing the rat's nest you started with. The Firebug Net tab is helpful too.

Bryan Waters
A: 

If you can't use relative paths for the resources (e.g. because static files are served from a different hostname), then it's probably a good idea to make sure you aren't hardcoding absolute paths anywhere except a single configuration file that's included near the top of where your web app is initializing.

Jeff Standen
A: 

I would agree with banzaimonkey, except I would start with the crawler. Working with the web server logs assumes that the problem images, stylesheets, etc. are being accessed with some regularity. Resources referenced deep in the site or on rarely visited pages could easily be missed. Crawling the pages should locate them reliably.

I am by no means an expert, but I've been working on a somewhat similar problem. My solution was to use Perl and the WWW::Mechanize module to crawl entire sites and record various aspects of the pages. In my case, I wanted a list of bad links, specific forms, multimedia objects, and about five other things. I was able to build the script so it would treat specific hosts as "local" (there are about 80 sites over a number of domains). You should be able to do the same thing in reverse by identifying the "bad" links. This assumes you're doing the testing AFTER you've deployed the production site. You could probably do some sort of variation that would allow checking before deployment.

Another alternative would be to look at an already-written crawler and see what its results are. The Internet Archive has produced Heritrix, which crawls, archives, and reports on websites. It's probably a bit of overkill. An option like LinkChecker could be used with the verbose option, and then the output grepped for instances of the development server name / IP address. I'm sure there are lots of other options along this line.

I mention these primarily because I think you want something that automates the process more than someone manually checking each page. These tools can take some time to complete since they traverse the entire site, but they can give a pretty complete picture. The main things mine does not handle well are JavaScript and forms. Heritrix actually handles some JavaScript links, but still doesn't handle forms.

That said, WWW::Mechanize and other modules can programmatically submit forms, but they need to be given specific values. In your case, if you've got a large database, you may only need to submit one or two form values to verify images, etc. aren't from the development server. On the plus side, you can also check returned content to make sure the forms are working correctly. I had an issue today with paged navigation - the page was serving the same 20 results regardless of the selected page. Checking for that could be automated by testing for specific strings in the result sets (this is getting into the realm of test-driven development).

One other thing - Heritrix actually creates archives. It's the basis for the Wayback Machine at the Internet Archive. You might get a side benefit if keeping multiple versions of websites is of interest to you or your organization.

tmsilver
A: 

You must place environment-dependent settings into the virtual host configuration. Place this

SetEnv IMAGE_PATH /var/www/images/

inside the virtual host ON the production server.

In your PHP code, use

getenv('IMAGE_PATH')

to get the path

Of course, in your local environment the value in the virtual host must be different.

After you've done that, deployments won't require any site updates. If you have lots of such settings, you can define just one variable and use it to switch between loading two different configuration files at runtime.
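A minimal sketch of that single-variable switch (the variable and file names are placeholders only):

<?php
// Sketch: one environment variable set in the virtual host
// (e.g. "SetEnv APP_ENV production") selects which config file to load.

$env = getenv('APP_ENV') ?: 'development';
require __DIR__ . '/config.' . $env . '.php';   // e.g. config.production.php
?>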

Elvis