I have something of a staging server on the public internet running copies of the production code for a few websites, and I'd really rather the staging sites didn't get indexed.
Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?
Changing the robots.txt wouldn't really work, since I use scripts to copy the same code base to both servers. I'd also rather not change the virtual host conf files, as there are a bunch of sites and I don't want to have to remember to copy over a certain setting whenever I make a new site.
Create a robots.txt file with the following contents:
User-agent: *
Disallow: /
Put that file somewhere on your staging server; your document root is a fine place for it (e.g. /var/www/html/robots.txt).
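If you want to confirm that those two lines really do block every crawler, Python's standard urllib.robotparser can parse them directly. This is just a quick sanity check of the rules themselves, not part of the Apache setup:

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt that denies everything to every crawler.
rules = "User-agent: *\nDisallow: /"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# No user agent may fetch any path.
print(parser.can_fetch("Googlebot", "/"))              # → False
print(parser.can_fetch("Bingbot", "/some/page.html"))  # → False
```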
Add the following to your httpd.conf file:
# Exclude all robots
<Location "/robots.txt">
SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt
The SetHandler directive is probably not required in a plain setup, but you may need it if you're using a handler such as mod_python that would otherwise intercept the request.
That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.
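One caveat worth mentioning: on Apache 2.4, if the aliased file lives outside any directory Apache is already allowed to serve, you also need to grant access to it explicitly or you'll get a 403. A sketch, assuming a hypothetical /srv/robots/ directory (adjust the path to wherever you actually put the file):

```apache
# Hypothetical shared location for the staging robots.txt.
Alias /robots.txt /srv/robots/robots.txt

# Apache 2.4 denies access outside configured directories by default,
# so grant it for the directory holding the shared robots.txt.
<Directory "/srv/robots">
    Require all granted
</Directory>
```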
(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)