The concept

So, I've already made (upgraded, actually) this website with its own Content Management System (CMS) that everyone likes. As with most CMSs, the default behavior was to access pages with an ugly and utterly unhelpful URL such as:

www.mysite.edu/index.php?pageid=xxxx

So the idea was to change it so that we could have "real" URLs that would not only look better but hopefully cooperate better with the Google search engine. The change really wasn't that hard:

  1. Apache sees that no page exists at the requested URL and hands the request to /redirect.php using ErrorDocument 404 /redirect.php
  2. redirect.php strips the URL and finds its entry in the database.
  3. redirect.php echoes the HTML data from the page entry.

Because all the pages were created in a hierarchical structure (as per the CMS), finding the page was simply a matter of searching the database child-by-child until the last was found. This way a URL such as www.mysite.edu/me/something/useful would bring up the entry in useful, which is a child of something, which is a child of me. All the page HTML is stored in the database, so once the entry is found, it's a simple matter to echo it to the page via PHP.
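To make the child-by-child lookup concrete, here is a minimal sketch. It assumes a hypothetical in-memory array standing in for the pages table (the field names parent and name, and the function resolvePath, are made up for illustration; the real CMS schema is not shown in the question).

```php
<?php
// Hypothetical stand-in for the pages table: each row has a pageid,
// a parent pageid (null for a top-level page) and a URL segment name.
$pages = array(
    array('pageid' => 1, 'parent' => null, 'name' => 'me'),
    array('pageid' => 2, 'parent' => 1,    'name' => 'something'),
    array('pageid' => 3, 'parent' => 2,    'name' => 'useful'),
);

// Walk the path one segment at a time, descending from parent to child.
// Returns the pageid of the last segment, or null if any segment is missing.
function resolvePath(array $pages, $path)
{
    // Split on "/" and drop the empty pieces left by leading/trailing slashes.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    $parent = null;
    foreach ($segments as $segment) {
        $match = null;
        foreach ($pages as $row) {
            if ($row['parent'] === $parent && $row['name'] === $segment) {
                $match = $row['pageid'];
                break;
            }
        }
        if ($match === null) {
            return null; // no child with this name under the current parent
        }
        $parent = $match;
    }
    return $parent;
}
```

With this data, resolvePath($pages, '/me/something/useful') walks me, then something, then useful, while a path that skips a level such as '/me/useful' finds no match. In the real site each step would be a database query rather than an array scan.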

Side note: I have actually created a new table which stores the full URL of each page and links it to its pageid so the searching process is much improved, while the general idea stays the same.

The Problem

Everything works astoundingly well on the client side. However, I noticed that Google has yet to index much (if any) of our site. Basically, it was indexed to some extent before I re-engineered it, and now all that is left of the index are the files whose URLs remained the same.

I finally (today) got some data from Google Webmaster Tools that says it keeps getting 404 errors on pages listed in our sitemap.xml, yet, when I click on the links, the pages come up just fine. This leads me to believe that while the redirect is working well, Apache is still sending a Status: 404 message which probably prompts Google's bots to stop processing and/or not index the page.

The question

So with all this in mind, the question is this:

  1. Is there a way to first confirm that Apache is still sending Status: 404 messages?
    • Answer: yes!
  2. Is there a way to get it to stop while still redirecting to /redirect.php?

Thanks in advance!

Edit 1: Thank you alex for introducing me to the Net tab in Firebug. As I love and use Firebug a lot, I'm sure that this new feature will come in handy later on down the road (read: currently researching other things it can do). Thanks to your post I have been able to confirm that the Status: 404 is indeed the problem that needs addressing. Now the question is specifically how do I stop Apache from sending this error and have it simply redirect the page as I need it to.

As requested, here are some code samples from my files. One thing to note about the config files is that I am running on Debian Etch and installed via "apt-get install apache2 mysql-server php5", so they are spread out a bit, and the snippet listed below is the only one I believe to be of consequence to this problem. As it is a large file (669 lines), if you would like to see more, please tell me which parts would be useful and I will include them.

/etc/apache2/apache2.conf

...
ErrorDocument 404 /redirector.php
...

/etc/apache2/apache2.conf - blank file

/www-root/redirector.php

<?php
//get the URL string after server id.
//    e.g. www.mysite.edu/page returns "/page"
$pageReq = preg_replace("/\/$|\.php$|\.html?$/","",$_SERVER['REQUEST_URI']);

if(substr($pageReq,0,5)=='/wiki') {    //am I redirecting to the wiki app
    include "mewiki/wiki.php";
} else {                                //rest of site - what google will see
    if($pageReq=='')                    //most site looks like /ME/something
        $pageReq = '/ME';               //this fixes index to be appear as /ME
    include "config.php";

    //query the database for pageid
    mysql_connect($meweb['host'],$meweb['user'],$meweb['pass']);
    mysql_select_db($meweb['database2']);
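    //escape the URL before putting it in the query (it comes from the request)
    $qPageReq = mysql_query("SELECT pageid FROM url_redirects WHERE ".
                                "url='".mysql_real_escape_string($pageReq)."' ".
                                "ORDER BY updated DESC LIMIT 1");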
    if($qPageReq) {
        //query database for actual page
        $pageid = mysql_fetch_assoc($qPageReq);
        $qPage = mysql_query("SELECT * FROM pages WHERE pageid=".
                                                $pageid['pageid']);
        if($qPage) {
            //createPage() is in page_loader.php.  It actually does a lot
            include "page_loader.php";
            createPage(mysql_fetch_assoc($qPage));
        }
    }
    mysql_close();
}
?>
+1  A: 

You can use Firebug to see if it is sending the 404 headers. Use the Net tab: if the page is 404ing, the GET for the page will be shown in red. Alternatively, you can use Live HTTP Headers. Both are for Firefox only.
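If you would rather check from a script than from the browser, something like the following sketch should also work. The function names statusCodeFromLine and fetchStatusCode are made up for illustration; get_headers() is a standard PHP function that returns the response headers as an array, with the status line at index 0.

```php
<?php
// Extract the numeric status code from a status line
// such as "HTTP/1.1 404 Not Found". Returns null if the line does not match.
function statusCodeFromLine($statusLine)
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Fetch the headers for a URL and return the status code of the first response.
// get_headers() returns false on failure, in which case this returns null.
function fetchStatusCode($url)
{
    $headers = get_headers($url);
    return $headers === false ? null : statusCodeFromLine($headers[0]);
}
```

Calling fetchStatusCode() on one of the URLs from sitemap.xml would show whether the page body arrives together with a 404 status, which is what a crawler sees.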

Can you post some of your .htaccess which redirects to redirect.php?

alex
+2  A: 

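You need to send an OK header yourself: add header('HTTP/1.1 200 OK') to your code, before any output is sent. Apache keeps the original 404 status when it hands the request to the ErrorDocument script unless the script overrides it.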
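As a rough sketch of where that call could go in redirector.php: the helpers statusLineFor and sendStatus below are hypothetical names, not part of the original script, and the idea of keeping a real 404 for URLs that have no database entry is an assumption about what you want rather than something the question states.

```php
<?php
// Hypothetical helper: pick the status line for the ErrorDocument handler.
// A page found in the database gets 200; anything else stays a real 404.
function statusLineFor($pageFound)
{
    return $pageFound ? 'HTTP/1.1 200 OK' : 'HTTP/1.1 404 Not Found';
}

// Hypothetical wrapper: send the status line, but only if no output
// has gone out yet, since header() cannot change the status after that.
function sendStatus($pageFound)
{
    $line = statusLineFor($pageFound);
    if (!headers_sent()) {
        header($line);
    }
    return $line;
}
```

In redirector.php this would be called after the database lookup and before createPage() echoes anything, for example sendStatus($row !== false) where $row is the result of mysql_fetch_assoc($qPage). That way genuinely missing pages still report 404 to Google instead of an empty 200 page.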

vartec
Thank you so much! This works perfectly.
Mike