Background - what is a result code and why should I care?
Each time something accesses your web server, the server returns some content and a result code. The result code is not shown to the user; it is processed by the program accessing the server (the "client"). It tells the client the status of the content the server just sent.
There are several result codes, but the ones we'll look at now are "404" and "200". The result code 200 means that the page the client wanted is available and is included in the content. The result code 404 means that the page the client wanted is not available, but the server can still return content with it (e.g. a page saying "sorry, couldn't find your page"). Usually a normal web page returns 200, saying all is ok.
Background - custom error pages
Anyone who has ever looked for something special on the web will be able to tell stories of the "URL that got away" -- the link to the page with exactly the content you were looking for, that just doesn't work anymore. More and more, as people update their sites, URLs don't stay valid for long: people move from "Frontpage" to a content-management system (CMS), or from one CMS to another. When they move, the old links usually become invalid. Inside the new site the links are correct, but anything pointing in from the outside will more than likely use the old, obsolete URLs. So you find the page you were looking for -- and then just get a boring "404 - not found".
The people making these sites know that this is always going to happen sooner or later. So what can you do? Many sites now have custom error pages, so instead of landing on a plain "nothing found" page, you'll land on a page with lots of information about the site and other links that might be interesting or similar to the link you were looking for. These custom error pages are a good thing - they help get people back on track, possibly helping them find the page they were originally looking for (or at least something similar). Sometimes, instead of showing an error page, the site will just redirect you to its starting page, letting you go from there.
However, these custom error pages need to be set up correctly in order to work according to the web standards. A lot of sites just redirect errors to the error page, and the error page then returns ... result code "200" - meaning "I found what you were looking for". This is not the way it should be - an error page should always return 404 (at least for bad URLs, other errors have other codes). An error page with the result code 404 can still show exactly the same content as an error page with the result code 200. The user will not be able to tell the difference - but the software will.
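To make the difference concrete, here is a minimal sketch in Python. It is not one of the server setups covered further down; it only uses Python's standard library as a stand-in, and the handler, the page texts and the names `fetch` and `demo` are made up for this illustration. It serves the same friendly error page for any unknown URL, but sends it with result code 404 instead of 200:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page contents, only for this illustration.
ERROR_PAGE = b"<html><body><h1>Sorry, couldn't find your page</h1></body></html>"
KNOWN_PAGES = {"/": b"<html><body>Welcome</body></html>"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = KNOWN_PAGES.get(self.path)
        status = 200
        if body is None:
            # Unknown URL: serve the friendly page, but with result code 404.
            body, status = ERROR_PAGE, 404
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(host, port, path):
    """Return (result code, content) the way a client program sees them."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def demo():
    """Start a throwaway local server and fetch one good and one bad URL."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_port
        return fetch("127.0.0.1", port, "/"), fetch("127.0.0.1", port, "/no-such-page")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the (result code, content) pair for a good URL and for a bad one. The content of the error page is still delivered in full; only the result code tells the client that the page is missing.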
Check to see if your server returns 404 or 200
To help you check what your server returns, I made a small tool that checks the server's return codes.
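If you would rather check this yourself, the idea can be sketched in a few lines of Python. This is only an illustration, not the tool mentioned above: the function names and the "no-such-page-" prefix are assumptions. It asks the server for a URL that almost certainly doesn't exist and looks at the result code that comes back:

```python
import urllib.error
import urllib.request
import uuid


def random_missing_url(base_url):
    """Build a URL that almost certainly does not exist on the site."""
    return base_url.rstrip("/") + "/no-such-page-" + uuid.uuid4().hex


def result_code(url):
    """Fetch the URL and return the final result code the client receives."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises an exception for 4xx/5xx, but the code is still there.
        return err.code


def verdict(code):
    """Turn the result code for a missing page into a short message."""
    if code == 404:
        return "ok: missing pages return 404"
    if code == 200:
        return "problem: missing pages return 200"
    return "check manually: missing pages return %d" % code
```

A call would look like `verdict(result_code(random_missing_url("http://www.example.com")))`, with your own site in place of the example address. Because `urlopen` follows redirects, a server that redirects bad URLs to its start page will also show up as 200 here.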
Why should I return 404 instead of 200 when the user doesn't see it?
The result code is not meant for the user. It is for the software accessing the server. For a normal browser, it doesn't make much of a difference whether it gets 404 or 200, but for a search engine it makes a world of difference.
As everyone has seen, there are lots of bad URLs in all search engines. When a search engine crawls a URL and receives the result code 200, it thinks the content is valid and will try to add it to the index. When it gets a 404, it knows that the URL is no longer valid and will remove it (sooner or later, mostly later :-)). Can you see where this is heading? When a search engine tries to access a URL which no longer exists and your server returns the custom error page with result code 200, the search engine thinks that this is the content of that URL and will try to add it to the index like that. That's not a good thing.
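The behaviour described above can be modelled with a toy function. This is a deliberately simplified sketch, assuming the index is just a dictionary from URL to content; real search engines are far more involved, and `update_index` is a made-up name:

```python
def update_index(index, url, code, content):
    """Toy crawler step: keep content on 200, drop the URL on 404."""
    if code == 200:
        # The crawler trusts the server: this is the content of the URL.
        index[url] = content
    elif code == 404:
        # The URL is no longer valid, so it is removed from the index.
        index.pop(url, None)
    # Other result codes are ignored in this sketch.
    return index
```

Feed it an error page with code 200 and the error text ends up in the index under the dead URL. Feed it the same text with code 404 and the URL disappears.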
Shoot yourself in the foot with a 200-file not found
Assume you have several URLs which Google now has in the queue to be added to the index, and that some of these are actually file-not-found pages with the result code 200. Google's quality guidelines (http://www.google.com/intl/en/webmasters/guidelines.html) tell you: "Don't create multiple pages, subdomains, or domains with substantially duplicate content." It's listed as a recommendation, but you can believe me, it's more than that: Google will remove your duplicate content - either everything or all but one. Finding duplicate content within your website is really easy with my GSiteCrawler, by the way :-)
So it removes the error pages - great, huh? That's what we wanted in the first place. Well, it actually does a bit more. Every time your site goes against the rules, Google likes it a little bit less. Small things can add up, and thousands of URLs which your server keeps on handing out (along with 200 - all ok) but which Google has to filter out each time it comes visiting WILL make an impact. It probably won't get you thrown out by itself, but it does have an effect on the speed and frequency of the crawl.
And then we come to the worst-case scenario: instead of an error page, your server redirects to your start page. Can you see where this is going? So Google doesn't like duplicate content, let's get rid of everything or everything but ... this last URL. In the end you'll either have all those URLs gone (including the start page) or you might end up with an INVALID URL carrying the content of your start page. Yes, that does matter, because that invalid URL will not have many external links pointing to it, i.e. it will have a very low PageRank (PR - Google uses this in part to determine the order of the results in the search pages). That was a good idea to redirect to the start page, huh? NOT!
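To see how error pages served with 200 (or a redirect to the start page) turn into duplicate content, here is a small sketch that groups URLs by identical content. It only catches exact copies and says nothing about how Google or GSiteCrawler actually detect duplicates; the function name and the sample pages are made up:

```python
import hashlib
from collections import defaultdict


def duplicate_groups(pages):
    """Group URLs whose content is exactly identical.

    `pages` maps URL -> content (text); returns a list of URL groups
    that share the same content, each group sorted alphabetically.
    """
    groups = defaultdict(list)
    for url, content in pages.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl result: two dead URLs that hand out the start page.
crawled = {
    "/": "start page",
    "/old-page.htm": "start page",
    "/moved.htm": "start page",
    "/about": "about us",
}
duplicates = duplicate_groups(crawled)
```

Here the start page and the two dead URLs land in one group, which is exactly the situation where only one of them, possibly an invalid one, survives.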
How to make your 404 return 404 - file not found
How you make your custom error page return result code 404 depends on your server, on its configuration and on the programming language you used to create your site. Generally speaking, it's best to let your hoster handle this, and then check that they're doing it correctly.
Defining a custom error page with an ASP.NET server
To enable custom error handling in ASP.NET you need to configure your application to pass errors on to your own error pages. For this, edit the file "web.config" in the application root. This example has a default error page (error.asp) and a special file-not-found page (error404.asp):
<configuration>
  <system.web>
    <customErrors mode="On" defaultRedirect="error.asp">
      <error statusCode="404" redirect="error404.asp" />
    </customErrors>
  </system.web>
</configuration>
Note: do NOT use static html pages for this -- you can't change the result code on a static html page.
Adding result code 404 to a Windows-Server / ASP page
To add the result code 404 to an ASP page, you need to add the following code on TOP of the page (or use "Response.Buffer = True"):
<% Response.Status = "404 Not Found" %>
If you have that in your custom error page, it will return 404, no matter what else you display.
Adding result code 404 to a Windows-Server / ASP.NET page
<% Response.Status = "404 Not Found" %>
Defining a custom error page with an Apache server
Add the following line to your .htaccess file:
ErrorDocument 404 /error404.php
... and make sure your error404.php returns 404 (see the PHP example below). Also, DO NOT use static html pages for this -- you can't change the result code on a static html page.
If you find the following in your .htaccess file, remove it -- it's redirecting errors to your entry page:
ErrorDocument 404 /index.php
Again: DO NOT USE THIS LINE.
Adding result code 404 to a PHP page
Use this in PHP pages:
<?php header("HTTP/1.0 404 Not Found"); ?>
So your webhoster can't or won't allow you to change this?
Some webhosters don't allow you to change the default error behaviour; some even use the default error page as a way of presenting advertising for themselves or for paying advertisers. If your webhoster doesn't want to or can't change the default error behaviour, you need to look for a new webhoster. No way around it - either you get 404-pages that return 404 or you need to host your site elsewhere. Point your webhoster to this page - link to this page on your site to get the point across. Even a page filled with ads can return result code 404, so there's no excuse not to!
Here are some links to help you, including an online check-page for your site:
Copyright © 2004-2013 by Johannes Mueller, all rights reserved. These files may NOT be posted anywhere else without express written permission by the author.
Please feel free to contact me via email: softplus [at] gmail.com