http://www.site.com/shop/maxi-dress?colourId=94&optId=694
http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale
I have thousands of URLs like the above. Different combinations and names.
I also have duplicates of these URLs with the query string product_type=sale.
I want to prevent Google from indexing anything with product_type=sale.
Is this possible in robots.txt?
Google supports wildcards in robots.txt. The following directive in robots.txt will prevent Googlebot from crawling any page that has any parameters:
Disallow: /*?
This won't necessarily stop other spiders from crawling these URLs, because wildcards were not part of the original robots.txt standard and not every crawler supports them.
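To see which URLs a wildcard rule would catch, here is a minimal Python sketch of Google-style pattern matching, assuming "*" matches any run of characters and a trailing "$" anchors the end of the URL. The helper names (rule_to_regex, is_blocked) are made up for illustration, and the narrower rule /*product_type=sale is an assumption of mine rather than something taken from the directive above. It is a rough local check, not Googlebot's actual implementation.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard robots.txt path rule into a regex (illustrative only).

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$' the rule is a prefix match, as in robots.txt.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(url: str, rule: str) -> bool:
    """Return True if the URL's path plus query string matches the rule."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return rule_to_regex(rule).match(target) is not None


plain = "http://www.site.com/shop/maxi-dress"
params = "http://www.site.com/shop/maxi-dress?colourId=94&optId=694"
sale = params + "&product_type=sale"

for rule in ("/*?", "/*product_type=sale"):  # second rule is a hypothetical narrower variant
    for url in (plain, params, sale):
        print(rule, is_blocked(url, rule), url)
```

Under these assumptions, /*? matches both parameterised URLs but not the plain one, while the narrower pattern matches only the product_type=sale URL. Because rules are prefix matches, the narrower pattern would also match a value such as product_type=sale_items.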
Google may take a while to drop the blocked URLs from its search index, and the extra URLs may stay indexed for months. You can speed the process up by using the "Remove URLs" feature in Webmaster Tools once the URLs are blocked, but that is a manual process: you have to paste in each individual URL that you want removed.
Using this robots.txt rule may also hurt your site's Google rankings if Googlebot doesn't find the version of the URL without parameters. If you commonly link to the versions with parameters, you probably don't want to block them in robots.txt. It would be better to use one of the other options below.
A better option is to use the rel="canonical" link tag on each of your pages.
So both your example URLs would have the following in the head section:
<link rel="canonical" href="http://www.site.com/shop/maxi-dress">
That tells Googlebot not to index so many variations of the page, only to index the "canonical" version of the URL that you choose. Unlike using robots.txt, Googlebot will still be able to crawl all your pages and assign value to them, even when they use a variety of URL parameters.
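If your pages are generated by a script, the canonical tag can be built from the requested URL. Here is a minimal Python sketch, assuming none of your query parameters select a genuinely different page, so the canonical URL is simply the URL with its query string removed. The function names canonical_url and canonical_tag are hypothetical. If a parameter such as colourId does change the content, you would want to keep it in the canonical URL.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page's head section."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="{}">'.format(href)


for url in (
    "http://www.site.com/shop/maxi-dress?colourId=94&optId=694",
    "http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale",
):
    print(canonical_tag(url))
```

Both example URLs produce the same tag, pointing at http://www.site.com/shop/maxi-dress.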
Another option is to log into Google Webmaster Tools and use the "URL Parameters" feature that is in the "Crawl" section.
Once there, click on "Add parameter". You can set "product_type" to "Does not affect page content" so that Google doesn't crawl and index pages with that parameter.
Do the same for each of the parameters that you use that don't change the page.
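To check locally which of your URLs collapse together once a parameter is treated as not affecting the content, a small Python sketch like the following may help. The IGNORED_PARAMS set and the strip_ignored function are assumptions for illustration. This only mimics the effect of ignoring a parameter and says nothing about how Google handles it internally.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters you have decided do not change the page content.
IGNORED_PARAMS = {"product_type"}


def strip_ignored(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


urls = [
    "http://www.site.com/shop/maxi-dress?colourId=94&optId=694",
    "http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale",
]

# Distinct URLs left after ignoring the listed parameters.
print(sorted({strip_ignored(u) for u in urls}))
```

With product_type ignored, the two example URLs reduce to the same address, which is the duplication you are asking Google to disregard.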