Robots.txt and Its Importance

V-empower Inc
<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>

Today's Inspirational Quote:

"It is more important to know where you are going than to get there quickly. Do not mistake activity for achievement."

-- Mabel Newcomer

<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>~<>
Greetings!
How is it going?

It is great, isn't it, when search engines frequently visit your website and index your pages? You have a great deal to be glad about (for reasons you already know!). However, there are occasional cases when you would want a search engine NOT to index a few pages of your website, either for technical reasons or personal ones.

Technically speaking, if you have two versions of a page (one for viewing in the browser and another for printing), you would rather have the printing version excluded from crawling; otherwise you risk incurring a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will prefer that search engines do not index it. Additionally, you may want to save some bandwidth by keeping robots away from images, stylesheets and JavaScript, and for this you need a way to tell spiders to stay away from these items.
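
As a rough illustration of the cases above, the sketch below feeds a made-up robots.txt to Python's standard urllib.robotparser module and asks which URLs a compliant robot may fetch. The paths (/print/, /private/, /images/), the www.example.com host and the "ExampleBot" name are assumptions invented for this example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep robots out of the print versions,
# a private area, and the image folder. Paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /private/
Disallow: /images/
"""


def allowed(url, agent="ExampleBot"):
    """Return True if a compliant robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/article.html",
        "http://www.example.com/print/article.html",
        "http://www.example.com/images/logo.png",
    ):
        print(url, "->", "allowed" if allowed(url) else "blocked")
```

Only the browser version of the article comes back as allowed here; the print copy and the image are blocked for any robot that honours the file.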

It's here that the 'robots.txt' file comes to your rescue!

Robots exclusion standard – ‘Robots.txt’

Many search engines use programs called robots to locate web pages for indexing. These programs are not limited to a pre-defined list of web pages; instead, they follow links on the pages they find, which makes them a form of intelligent agent. The process of following links is called spidering, wandering, or gathering. Once a robot has fetched a page or document, the parsing and indexing of the page begins.

If a site owner wishes to give instructions to web robots about which pages to index and which pages NOT to index, he must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). Robots that choose to follow the instructions FIRST try to fetch this file to find out whether the site owner wants to restrict them from indexing certain pages. If this file doesn't exist, web robots assume that the site owner permits indexing of all the site's pages.
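
The "look for robots.txt first, and assume everything is allowed if it is missing" behaviour can be sketched roughly as follows. This is only an illustration built on Python's standard urllib.robotparser: the function names, the "ExampleBot" agent and the /tmp/ rule are hypothetical, and the text of the file is passed in directly (None standing for "file not found") so that the sketch runs without any network access.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def build_parser(robots_body: Optional[str]) -> RobotFileParser:
    """Build a parser from the text of robots.txt.

    robots_body is None when the site has no robots.txt at all.
    """
    parser = RobotFileParser()
    if robots_body is None:
        # No file on the server: parse an empty rule set,
        # which leaves every URL allowed.
        parser.parse([])
    else:
        parser.parse(robots_body.splitlines())
    return parser


def may_crawl(robots_body: Optional[str], agent: str, url: str) -> bool:
    """Check the rules first, then decide whether `url` may be fetched."""
    return build_parser(robots_body).can_fetch(agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /tmp/\n"
    print(may_crawl(None, "ExampleBot", "http://www.example.com/tmp/a.html"))
    print(may_crawl(rules, "ExampleBot", "http://www.example.com/tmp/a.html"))
```

A real robot would download the file from the site root before calling something like may_crawl; that step is left out here on purpose.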

‘Block or remove pages from being indexed by using a robots.txt file’

IMPORTANT: All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. A robots.txt file is in no way enforceable, and some spammers and other troublemakers may choose to ignore it. Password protection is recommended for confidential information.

By definition, robots.txt, or the Robot Exclusion Standard (also known as the Robots Exclusion Protocol), is a convention for asking web spiders and other web robots not to access all or some pages of a website that are otherwise publicly viewable.

Food for Thought: Experience is like a comb that life gives you when you are bald. (Navjot Singh Sidhu)

Creation of a Robots.txt file

For instructions on creating a 'robots.txt' file, see:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449

You can also generate 'robots.txt' files using the links below:

http://www.mcanerin.com/EN/search-engine/robots-txt.asp

http://www.howrank.com/Robots.txt-Tool.php

Rule 1: Make sure it's named exactly 'robots.txt' (all lowercase).

Rule 2: This file must be uploaded to the root accessible directory of your site, not a subdirectory (i.e. in http://www.mysite.com/ but NOT in http://www.mysite.com/stuff/).

Only if you follow the above two rules will search engines interpret the instructions contained in the file. Deviate from them, and "robots.txt" becomes nothing more than a regular text file.
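
To make the two rules concrete, here is a small sketch of how a robot could work out where to look for the file, starting from any page URL on a site. The function name robots_url is hypothetical; the only library calls used are urlsplit and urlunsplit from Python's standard urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the one location where robots.txt is looked for.

    Whatever the page's own path is, the file is expected at the
    root of the same host, under the exact lowercase name.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.mysite.com/stuff/page.html"))
    # -> http://www.mysite.com/robots.txt
```

A file saved at http://www.mysite.com/stuff/robots.txt would never be requested by this lookup, which is why it has no effect there.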

Note-worthy Notes:
  1. Robots.txt is a text (not html) file you put on your site.
  2. Robots.txt is by no means mandatory unless you want to keep a few pages out of search results.
  3. Search engines generally obey what they are asked not to do, but you should NOT trust them blindly.
  4. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note “Please, do not enter” on an unlocked door.
  5. If you have really sensitive data, it is NOT RECOMMENDED to rely on robots.txt.
  6. In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). Each path is treated as a right-truncated substring match, i.e. with an implied wildcard on the right.
  7. You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).
  8. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
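
The "right-truncated substring match" from note 6 can be sketched in a few lines. This is a simplified illustration of prefix matching only, not a full robots.txt parser; the function name and the sample paths are made up for the example.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if `path` starts with any non-empty Disallow value.

    Each rule behaves as if it ended in a wildcard: "/private"
    matches "/private", "/private/a.html" and "/private-notes.html".
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    rules = ["/private"]
    for path in ("/private/report.html", "/private-notes.html", "/public/index.html"):
        print(path, "->", "blocked" if is_disallowed(path, rules) else "allowed")
```

Note that "/private" also blocks "/private-notes.html"; writing the rule as "/private/" with a trailing slash limits it to the directory.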
Note: The concept and structure of robots.txt were developed more than a decade ago. If you are interested in learning more, visit http://www.robotstxt.org/ or go straight to the Standard for Robot Exclusion, because in this article we deal only with the most important aspects of a robots.txt file. For more information you may also read: http://www.searchtools.com/robots/robots-exclusion-protocol.html

Thanks!

V-Empower Inc: Robots Topic Today 26-Nov
