Stop Search Engines From Indexing A Page
There are times when a page is already in the search engines and you don't want it indexed, or you are creating a new page and don't want that one page to be indexed. In this tutorial we are going to look at the different ways you can stop search engines from indexing your page.
There are two main ways to stop search engines from indexing a page: you can use a meta tag on the page, or you can use the robots.txt file.
The robots HTML meta tag should be used if you want to stop search engines from indexing a certain page. Inside your head tag add the following meta tag.
<meta name="robots" content="noindex">
Inside the robots content attribute you can use a number of values, and you can comma-separate these values to apply multiple settings to your robots meta tag. By default robots use INDEX, FOLLOW, which means that the search engine will index the page and follow all links. If this is the behaviour you want, you don't need to add the robots meta tag at all.
Here is a list of all the available values you can use inside the robots meta tag.
- index - Allows search engines to index the page
- noindex - Tells search engines not to index this page
- nofollow - Tells the search engines to not follow any of the links on this page
- follow - Allows the search engines to follow the links on the page
- noimageindex - Will stop search engines from indexing the images on the page
- none - This is a shortcut for noindex, nofollow
- noarchive - Will stop search engines from displaying a cached version of the page
- nocache - This is the same as noarchive, but is the form used by Bing
- nosnippet - Will stop search engines from displaying a snippet in the results
- noodp - Will stop search engines from using the DMOZ (Open Directory Project) description of this page
- noydir - Will stop Yahoo from displaying the Yahoo Directory description of this page
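As a sketch of how several of these values combine, the tag below asks crawlers not to index the page, not to follow its links and not to offer a cached copy. The particular combination is only an illustration; pick the values from the list above that fit your page.

```html
<head>
  <!-- Comma-separated values in the content attribute: keep this page out of
       the index, don't follow its links, don't offer a cached copy -->
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>
```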
Block Certain Search Engines
If you want to block only certain search engines from indexing your page, you can address each search engine robot by name. All you have to do in the meta tag is replace the robots value of the name attribute with the name of the crawler you want to block.
- GOOGLEBOT - Block Google from indexing your page
- SLURP - Block Yahoo from indexing your page
- MSNBOT - Block MSN from indexing your page (Bing's current crawler identifies itself as BINGBOT)
- TEOMA - Block ASK from indexing your page
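For example, assuming you want to keep only Google from indexing a page while leaving every other search engine on its default behaviour, the tag might look like this:

```html
<head>
  <!-- Addressed to Google's crawler only; other crawlers see no robots
       instruction here and fall back to the default of index, follow -->
  <meta name="googlebot" content="noindex">
</head>
```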
If you want search engines to stay out of a group of pages or a sub-folder, then robots.txt is a good technique to use. This is simply a text file that you upload to the root of your website. It must be named robots.txt and is used by search engines as instructions for how to crawl your website; this is called the Robots Exclusion Protocol. Before a robot crawls your site it will check for a /robots.txt file to get these instructions. Keep in mind that robots.txt controls crawling rather than indexing: a disallowed URL can still appear in results if other sites link to it, and a crawler that is blocked from a page never sees a noindex meta tag on that page.
The main directives it will be looking for are User-agent and Disallow. User-agent defines which crawler the rules apply to, so you can make sure a certain bot can't crawl your site. Disallow tells the crawler what it isn't allowed to crawl.
To stop all search engines from crawling your site use the following code. This sets User-agent to *, which means all crawlers, and Disallow to /, which means everything from the root down.
User-agent: *
Disallow: /
Because the robots.txt file sits in the root of your website and is accessible at /robots.txt, anyone can read it. If you have a hidden area on your website you shouldn't list it in robots.txt as a way of hiding it, because the file is publicly readable.
To block robots from crawling certain areas of your site you can simply build up a Disallow list of folders. The following will stop search engines from crawling anything inside the folders admins, users and tmp.
User-agent: *
Disallow: /admins/
Disallow: /users/
Disallow: /tmp/
With User-agent you can choose which search engines the rules apply to. For example, if you just want to make sure Google doesn't crawl your site, add the following.
User-agent: googlebot
Disallow: /
You could do the opposite and disallow every search engine except Google. An empty Disallow value means nothing is blocked for that crawler.
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
To get a full list of robots that can crawl your site visit the robots database.