The robots.txt file is considered one of the most important parts of SEO. It manages how robots and archivers access your site. If you have been blogging or managing a website for a while, you may have already heard about the robots.txt file. It is also known as “The Robots Exclusion Protocol”.
However, the name matters less than what the file does. Basically, the robots.txt file is used to exclude or include parts of your blog or website for the search engines and other robots that crawl websites. Though some rogue archivers don’t adhere to the robots.txt rules, most of the popular search engines and robots, such as Alexa, Bing, and Yahoo, do follow them.
From my point of view, the robots.txt file is a must-have part of SEO and website development. Creating a robots.txt file is easy, and anyone can do it. All you have to do is create a plain text file named “robots.txt” in your domain’s root directory. You can do this from cPanel, or create the file on your computer and upload it to your domain’s root directory using an FTP/SFTP client.
Once you upload the robots.txt file to your domain’s root directory, the file will be accessible at
http://www.yourdomain.com/robots.txt. For example, here is this blog’s robots.txt file:
http://www.bloggingguts.com/robots.txt. Also make sure that the file is viewable by anyone.
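Because robots only look for the file at the root of the host, its address can be worked out from any page URL. Here is a minimal Python sketch of that idea. The helper name robots_txt_url and the sample page URL are my own placeholders, not part of any standard:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourdomain.com/blog/some-post/?page=2"))
```

If the file is not reachable at that address, robots will behave as if you had no robots.txt at all.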
The robots.txt file can be simple or complicated, depending on your needs. If you want to allow all robots to crawl and index every part of your site, place the following lines in your blog’s or website’s robots.txt:
User-agent: *
Disallow:
And if you want to block your whole blog or website from being crawled by robots, use the following lines instead. (Just put a forward slash (/) right after the Disallow property to achieve this.)
User-agent: *
Disallow: /
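If you would like to check how a robot reads these two variants, Python’s standard urllib.robotparser module can parse the rules for you. This is only a sketch: the helper is_allowed and the sample page URL are names I made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_ALL = "User-agent: *\nDisallow: /"
PAGE = "http://www.yourdomain.com/any-page.html"  # placeholder URL

print(is_allowed(ALLOW_ALL, PAGE))  # an empty Disallow blocks nothing
print(is_allowed(BLOCK_ALL, PAGE))  # "/" is a prefix of every path
```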
After disallowing every part of your site, if you want to allow some specific content, try the following. (Here I’ve allowed a directory called blog by putting /blog/ right after the Allow: property.)
User-agent: *
Disallow: /
Allow: /blog/
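The “block everything except /blog/” idea can be tried out with Python’s urllib.robotparser. One caveat is an assumption on my side: some parsers apply the first rule that matches, while others pick the most specific one. To stay safe with both, this sketch lists the Allow line before the Disallow line. The page paths are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Allow comes first so first-match and most-specific-match parsers agree.
BLOG_ONLY = """\
User-agent: *
Allow: /blog/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOG_ONLY.splitlines())

for path in ("/blog/hello-world.html", "/about.html"):
    url = "http://www.yourdomain.com" + path
    print(path, parser.can_fetch("*", url))
```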
If you are not from a techie background, it’s a bit difficult to understand the full function of the robots.txt file, huh!
Okay! To make it easier, here is an explanation of each part of the robots.txt file:
Understanding the robots.txt File
- User-agent: This property specifies the robot the rules apply to, such as Googlebot or ia_archiver (the Alexa robot). Use * to address all robots.
- Disallow: This property specifies which folders or directories shouldn’t be crawled or indexed by search robots and archivers.
- Allow: This property specifies the folders and directories that may be crawled and indexed by search robots and archivers.
- Sitemap: This property tells robots where your XML sitemaps are.
- Crawl-delay: This property is used to limit how often robots crawl your site. Though it isn’t understood by every robot, it’s still useful in many cases.
The Disallow property is used more frequently than Allow because search robots and other archivers crawl every part of your site by default. That’s why you mostly need Disallow lines, to specify the directories you don’t want to be indexed or crawled by search robots and archivers.
The Allow property is useful when you want some of your blog content inside a disallowed directory to be crawled or indexed by robots. For example, say you have a category-based URL structure and you have disallowed a category, but you still want a few pages within that category to be crawled or indexed. Then you should use the Allow property as follows:
User-agent: *
Disallow: /category_a/
Allow: /category_a/page-1.html
Allow: /category_a/page-4.html
In the robots.txt content above, the category named “category_a” is blocked for robots, but the pages “page-1.html” and “page-4.html” within category_a can still be crawled and indexed.
And if you want to apply the rule above to one specific robot such as Google’s, simply use the Googlebot name as the User-agent property value, as follows:
User-agent: Googlebot
Disallow: /category_a/
Allow: /category_a/page-1.html
Allow: /category_a/page-4.html
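To see that a Googlebot-only group really leaves other robots alone, here is a small Python sketch using urllib.robotparser. Again I list the Allow lines first so the result does not depend on whether a parser uses first-match or most-specific-match. The agent name ExampleBot and the helper check are hypothetical.

```python
from urllib.robotparser import RobotFileParser

GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /category_a/page-1.html
Allow: /category_a/page-4.html
Disallow: /category_a/
"""

parser = RobotFileParser()
parser.parse(GOOGLE_ONLY.splitlines())

BASE = "http://www.yourdomain.com/category_a/"  # placeholder site


def check(agent: str, page: str) -> bool:
    """Report whether agent may fetch a page inside category_a."""
    return parser.can_fetch(agent, BASE + page)


print(check("Googlebot", "page-1.html"))   # explicitly allowed page
print(check("Googlebot", "page-2.html"))   # falls under the Disallow line
print(check("ExampleBot", "page-2.html"))  # no group targets this robot
```

A robot that finds no group addressed to it, and no * group, is not restricted by the file.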
You can also point robots to your blog’s or website’s XML sitemap using robots.txt. Many robots look for the sitemap at a default location such as
http://www.yourdomain.com/sitemap.xml, and some look in the robots.txt file. I recommend that webmasters specify the XML sitemap in the robots.txt file. To tell search and archive robots where the sitemap file is, use the Sitemap property as follows:
Sitemap: http://www.yourdomain.com/sitemap.xml
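A robot reads the Sitemap line back out of the file in much the same way. The sketch below uses site_maps() from Python’s urllib.robotparser, which I believe needs Python 3.8 or newer. The robots.txt text is a made-up sample built around the placeholder domain.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: http://www.yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap lines.
sitemaps = parser.site_maps() or []
print(sitemaps)
```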
The robots.txt file can also be used to limit how often search bots and archivers crawl your site, by adding the Crawl-delay property. This property is useful when you are on a tight bandwidth limit. Not every robot understands the Crawl-delay property (Googlebot, for example, ignores it), but several popular robots do. The Crawl-delay property is used as follows:
User-agent: *
Crawl-delay: 10
The value “10” in the code above asks robots to wait 10 seconds between one request and the next.
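From the robot’s side, honoring a Crawl-delay means reading the number and pausing that long between requests. This is a rough Python sketch of the reading part, using urllib.robotparser. The helper name polite_delay is hypothetical, and falling back to 0 seconds when no delay is set is my own choice.

```python
from urllib.robotparser import RobotFileParser


def polite_delay(robots_txt: str, agent: str = "*") -> int:
    """Return the Crawl-delay in seconds for agent, or 0 if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when the group has no valid Crawl-delay.
    return parser.crawl_delay(agent) or 0


print(polite_delay("User-agent: *\nCrawl-delay: 10"))
print(polite_delay("User-agent: *\nDisallow:"))
```

A well-behaved crawler would then sleep for that many seconds between two fetches from the same site.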
The robots.txt File in WordPress
This website is powered by WordPress, and I’ve worked out what makes a good robots.txt file for WordPress from an SEO point of view. The best practice for using the robots.txt file in WordPress for SEO is to exclude low-quality and unnecessary pages from being indexed and to include the pages that provide value. For example, here is this blog’s robots.txt file:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /out/
Disallow: /go/
Disallow: /cgi-bin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Disallow: /?attachment_id
Allow: /wp-content/uploads/
Crawl-delay: 10
Sitemap: http://www.bloggingguts.com/sitemap_index.xml
I’ve disallowed some unnecessary pages such as feeds, trackbacks, comments, attachments, and a few more. You can use the robots.txt file shown above for your own WordPress site, or adjust it to your needs.
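Before uploading a file like this, it can help to test a few URLs against it. The Python sketch below checks a subset of the rules above with urllib.robotparser. I have left out the wildcard lines (such as */feed/ and /*?) on purpose, because as far as I know that parser matches by plain path prefix and does not expand *. The sample paths and the helper crawlable are invented for the test.

```python
from urllib.robotparser import RobotFileParser

# Only the plain-prefix rules from the file above; wildcard rules are omitted.
WP_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /feed/
Allow: /wp-content/uploads/
"""

parser = RobotFileParser()
parser.parse(WP_RULES.splitlines())

SITE = "http://www.bloggingguts.com"


def crawlable(path: str) -> bool:
    """Report whether any robot may fetch the given path on SITE."""
    return parser.can_fetch("*", SITE + path)


for path in ("/wp-admin/options.php", "/wp-content/uploads/photo.jpg", "/some-post/"):
    print(path, crawlable(path))
```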
I hope you now have a clear picture of the robots.txt file and know what is best for you. If you find this page useful, consider sharing it, and feel free to ask me anything related to the robots.txt file in the comments below.