Can I block search crawlers for every site on an Apache web server?

Nick Messick · Oct 22, 2008

I have a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed.

Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?

Changing robots.txt wouldn't really work, since I use scripts to copy the same code base to both servers. I'd also rather not change the virtual host conf files, as there are a bunch of sites and I don't want to have to remember to copy over a certain setting whenever I add a new site.

Answer

jsdalton · Sep 9, 2011

Create a robots.txt file with the following contents:

User-agent: *
Disallow: /

Put that file somewhere on your staging server; your directory root is a great place for it (e.g. /var/www/html/robots.txt).
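As a sanity check, that two-line policy really does block every path for every crawler. A quick sketch using Python's standard `urllib.robotparser` (the staging hostname is just a placeholder) confirms the semantics:

```python
from urllib.robotparser import RobotFileParser

# Parse the same two-line policy used in the staging robots.txt.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# No crawler may fetch any path under this policy.
print(rp.can_fetch("Googlebot", "https://staging.example.com/"))        # False
print(rp.can_fetch("Bingbot", "https://staging.example.com/any/page"))  # False
```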

Add the following to your httpd.conf file:

# Exclude all robots
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt

The SetHandler None directive is probably not required, but it can matter if a handler such as mod_python is active: it ensures requests for /robots.txt bypass that handler and the aliased file is served as plain static content.

That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.

(Note: my answer is essentially the same as what ceejayoz's answer suggests, but I had to spend a few extra minutes figuring out the specifics to get it working. I put this answer here for the sake of others who might stumble upon this question.)