How to block search engines from indexing all URLs beginning with origin.domainname.com

Loveleen Kaur · Oct 5, 2010 · Viewed 16.3k times

I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way I can prevent all URLs under origin.domainname.com from getting indexed?

Is there some rule in robots.txt to do it? Both hostnames point to the same folder. I also tried redirecting origin.domainname.com to www.domainname.com in the .htaccess file, but it doesn't seem to work.
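For reference, the kind of host redirect I mean would be roughly the following (a sketch, with domainname.com standing in for the real domain):

RewriteEngine On
# Redirect every request on origin.domainname.com to the www host
RewriteCond %{HTTP_HOST} ^origin\.domainname\.com$ [NC]
RewriteRule ^(.*)$ http://www.domainname.com/$1 [R=301,L]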

If anyone has had a similar problem and can help, I would be grateful.

Thanks

Answer

Lekensteyn · Oct 5, 2010

You can rewrite robots.txt to another file (let's call it robots_no.txt) containing:

User-Agent: *
Disallow: /

(source: http://www.robotstxt.org/robotstxt.html)

The .htaccess file would look like this:

RewriteEngine On
# Serve the disallow-all file on every host except www.example.com
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^robots\.txt$ robots_no.txt [L]
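Since both hosts point at the same folder, robots_no.txt sits in the document root next to the regular robots.txt. The robots.txt that www.example.com keeps serving can stay permissive; a minimal sketch that allows everything would be:

User-Agent: *
Disallow:

(an empty Disallow line means nothing is blocked).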

Alternatively, use a customized robots.txt for each (sub)domain:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^sub\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [OR]
RewriteCond %{HTTP_HOST} ^example\.org$
# Rewrites robots.txt for the (sub)domains above to robots_<domain>.txt,
# e.g. example.org -> robots_example.org.txt
RewriteRule ^robots\.txt$ robots_%{HTTP_HOST}.txt [L]
# In all other cases, keep the default robots.txt
RewriteRule ^robots\.txt$ - [L]
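With this variant you would create one robots_<host>.txt per hostname listed in the conditions, e.g. robots_www.example.com.txt, robots_sub.example.com.txt, robots_example.org.txt, and so on. A sketch of such a file for a host that should stay out of the index would again just contain the disallow-all directives:

User-Agent: *
Disallow: /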

Instead of asking search engines to block all pages on hosts other than www.example.com, you can also use <link rel="canonical">.

If http://example.com/page.html and http://example.org/~example/page.html both point to http://www.example.com/page.html, put the following tag in the <head>:

<link rel="canonical" href="http://www.example.com/page.html">

See also Google's article about rel="canonical".