To Crawl or Not to Crawl: A Short Guide to Robots.txt

Magento is one of the most popular eCommerce platforms, used by businesses all over the world. For our readers, of course, that is no news. It owes this popularity to the variety of features available out of the box.

However, when it comes to Magento SEO, this powerful solution does not ship with a robots.txt file by default. A very common question, then, is what a robots.txt file should look like and what should be in it.

You might be surprised to hear that a text file as small as robots.txt could cause the downfall of your website. If you use robots.txt improperly, you could end up telling search engine robots not to crawl your site, which can keep your web pages out of search results. That’s why it’s important to understand what the robots.txt file does and to learn how to check whether you’re using it correctly.

For those who are new to it, robots.txt is simply a text file that implements the Robots Exclusion Standard, a protocol introduced in 1994 that lets websites communicate with crawlers and other bots.

Benefits of robots.txt

Using robots.txt helps search engine bots focus on the main content of your website.

Using robots.txt enables you to disallow directories that you do not want search engine robots to crawl (those containing technical details about the site, personal data, or other kinds of sensitive data). Keep in mind that this is only a request to well-behaved bots and does not protect the data itself.

Using robots.txt can help prevent duplicate content issues, which is good for SEO.

Where to put a robots.txt file?

A robots.txt file has to be placed in the root directory of the site (usually a root-level folder called “htdocs” or “www”), so that it appears directly after your domain in the URL. You cannot use the file in a subdirectory.

*** TIP: If you use subdomains, you have to create a robots.txt file for each subdomain.

For example, if the domain is magecloud.net, then the robots.txt URL should be magecloud.net/robots.txt.
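As a small illustration of the placement rule, the sketch below derives the robots.txt address for any page URL by keeping the scheme and host and replacing the path. The function name `robots_txt_url` and the `shop.example.com` subdomain are hypothetical, chosen only for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://magecloud.net/blog/some-post?page=2"))
# https://magecloud.net/robots.txt
print(robots_txt_url("https://shop.example.com/catalog/"))
# https://shop.example.com/robots.txt
```

Note how the subdomain gets its own address, which is why each subdomain needs its own file.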

*** TIP: Make sure you have secured your store with HTTPS. Check our HTTPS Guide to learn what HTTPS is, why it is so important, whether your online store is secure, and what to do if it’s not.

Basic parts of robots.txt file:

The structure of a robots.txt file is pretty simple (and barely flexible): it is a list of user agents and of disallowed files and directories.
Robots.txt consists of two main parts: User-agent and Directives.

1. User-agent: indicates which search engine robots the following directives apply to

If you want to tell all robots the same thing, put a “*” after “User-agent”, making it “User-agent: *”. This basically means “these directions apply to all robots”.

*** TIP: If you need to impose different rules for different search engines, create several User-agent sections, one for each bot.
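To see how separate User-agent sections behave, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the paths `/drafts/` and `/tmp/`, and the bot name `OtherBot` are made up for illustration. Real search engines use their own parsers, so treat this as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one section for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own section, so only /drafts/ is off limits.
print(parser.can_fetch("Googlebot", "/tmp/report.html"))     # True
print(parser.can_fetch("Googlebot", "/drafts/post.html"))    # False

# Any other bot falls back to the "*" section.
print(parser.can_fetch("OtherBot", "/tmp/report.html"))      # False
```

A bot that finds a section with its own name follows that section and ignores the “*” section, so shared rules have to be repeated in each section.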

2. Directives:

Disallow: indicates a page, file, or directory to be excluded from crawling (use as many Disallow lines as needed).

Allow: tells a bot that it’s okay to crawl a file in a folder that has been “Disallowed” by other instructions.

*** TIP: Different search engines interpret conflicting directives differently. Under the original 1994 standard, the first matching directive wins, whereas Google applies the most specific (longest) matching rule regardless of order.

*** TIP: Put each directive on a separate line, so that search engines do not get confused when parsing the robots.txt file.
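The interplay between Allow and Disallow can be tried out locally. The sketch below uses Python's standard `urllib.robotparser`, which, as far as we are aware, applies the first matching rule in file order. The `/media/` paths and the helper `can_crawl` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Allow listed first: the exception is found before the broader Disallow.
ALLOW_FIRST = """\
User-agent: *
Allow: /media/catalog/
Disallow: /media/
"""

# The same two rules with Disallow listed first.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /media/
Allow: /media/catalog/
"""

image = "/media/catalog/shoe.jpg"
print(can_crawl(ALLOW_FIRST, image))                # True
print(can_crawl(ALLOW_FIRST, "/media/tmp/a.log"))   # False
# A first-match parser stops at "Disallow: /media/" and reports False here,
# while a longest-match parser would pick the more specific Allow rule.
print(can_crawl(DISALLOW_FIRST, image))
```

Putting the more specific Allow line before the broader Disallow line is a simple way to get the same result from both kinds of parser.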

Primary robots.txt commands

Robots.txt instructions depend on what outcome you need. All robots.txt commands result in one of the following three outcomes:

1. Full allow: when all content may be crawled.

There are three ways to welcome bots to your site:

A. Don’t make a robots.txt file.
B. Make an empty file called robots.txt.
C. Make a file called robots.txt and enter the ‘Allow full access’ command.

User-agent: *
Disallow:

*** TIP: The " * " is a special value meaning "any robot".

2. Full disallow: No content may be crawled.

If you need to deny access to the entire site, use a slash (/). This tells all bots to ignore the whole domain, meaning that none of the website’s pages or files should be crawled by the search engines.
To block all search engine crawlers from your site, use the following command in your robots.txt:

User-agent: *
Disallow: /

3. Conditional allow: Directives in robots.txt determine the ability to crawl certain content.

Block one folder:

User-agent: *
Disallow: /folder/

Block one file:

User-agent: *
Disallow: /file.html
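The three outcomes above can be checked with a few lines of Python. This is a rough sketch based on the standard `urllib.robotparser` module. The helper name `can_crawl`, the bot name `ExampleBot`, and the test paths are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


FULL_ALLOW = "User-agent: *\nDisallow:\n"
FULL_DISALLOW = "User-agent: *\nDisallow: /\n"
CONDITIONAL = "User-agent: *\nDisallow: /folder/\nDisallow: /file.html\n"

# 1. Full allow: an empty Disallow value blocks nothing.
print(can_crawl(FULL_ALLOW, "/folder/page.html"))     # True

# 2. Full disallow: "/" is a prefix of every path.
print(can_crawl(FULL_DISALLOW, "/folder/page.html"))  # False

# 3. Conditional allow: only the listed folder and file are blocked.
print(can_crawl(CONDITIONAL, "/folder/page.html"))    # False
print(can_crawl(CONDITIONAL, "/file.html"))           # False
print(can_crawl(CONDITIONAL, "/other.html"))          # True
```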

Common robots.txt mistakes

Different errors can result when using a robots.txt file. Below are some of the most common mistakes:

1. No robots.txt file at all

Without robots.txt, your eCommerce website is completely open and crawlable, which sounds like a good thing. However, Google allocates a crawl budget to each website (Google prioritizes what to crawl, when, and how much), so Googlebot may spend that budget on outdated and irrelevant content instead of valuable and important pages. Some key pages might even be skipped entirely.

2. Empty robots.txt
3. Default robots.txt allowing to access everything
4. Disallow all
5. Using robots.txt to block access to sensitive areas of your site

If you are new to the robots.txt file, you may think it’s a security feature, but it is quite the opposite. Robots.txt does not physically keep anyone from accessing files on your site. It is just a list of files and directories you don’t want search engines to crawl.
Use a password to protect any areas of your site that you do not want to be accessible. DON’T use a robots.txt file to block access to them.
You’d better believe that robots.txt is one of a hacker’s first ports of call, to see where they should try to break in.

6. Blocking JS and CSS files
7. Not using Sitemap
8. Robots.txt syntax errors

The process of creating a robots.txt file:

Step # 1. Checking if you have a robots.txt

Not sure if you have a robots.txt file? To find out, you can enter your URL into one of the online robots.txt checkers.
Alternatively, you can check from any browser just by adding “/robots.txt” to the end of your site’s domain, for example magecloud.net/robots.txt.

If a file is served at that address, it is your robots.txt file.
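The same check can be scripted. The following is only a sketch: the function name `has_robots_txt` and its `fetch` parameter are hypothetical, and the site URL must include the scheme (for example `https://`). An HTTP 200 answer only tells you that something is served at that address, so it is still worth reading the response yourself.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def has_robots_txt(site_url: str, fetch=urllib.request.urlopen,
                   timeout: float = 10.0) -> bool:
    """Return True if <site>/robots.txt answers with HTTP 200.

    `fetch` defaults to urllib.request.urlopen; it is a parameter only so
    that the function can be tried without network access.
    """
    url = urljoin(site_url, "/robots.txt")
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # 404 and other HTTP error codes: no usable robots.txt at this URL.
        return False


# Example call (needs network access, so it is left commented out):
# print(has_robots_txt("https://magecloud.net"))
```

Network failures such as DNS errors are deliberately not caught here, so that "the site is unreachable" is not confused with "the file does not exist".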

Step # 2. Starting your robots.txt file

If you’ve found that you don’t currently have a robots.txt file, we highly recommend that you create one as soon as possible.

To create a robots.txt file, simply open any text editor. A robots.txt file is just a plain text file (*.txt), which means that it can be created with Notepad or any other standard text editor.
It’s vital to upload your robots.txt in ASCII mode. If the file is created with WYSIWYG web page design software or a word processor rather than a plain text editor, search engine crawlers may ignore it because they cannot read the extra formatting code.

*** TIP: There must be only one robots.txt file for each site (or subdomain).

*** TIP: The name of the file (robots.txt) must be written in lowercase.
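If you prefer to generate the file with a script rather than a text editor, a minimal sketch could look like this. The function name `write_robots_txt` and the `web_root` argument are hypothetical. The point is simply a lowercase file name, plain text, and ASCII-only content.

```python
from pathlib import Path


def write_robots_txt(web_root: str, rules: list[str]) -> Path:
    """Write the given rule lines to <web_root>/robots.txt as plain ASCII."""
    target = Path(web_root) / "robots.txt"  # the file name must be lowercase
    # encoding="ascii" raises UnicodeEncodeError if a non-ASCII character
    # (for example a curly quote pasted from a word processor) slips in.
    with open(target, "w", encoding="ascii", newline="\n") as handle:
        handle.write("\n".join(rules) + "\n")
    return target
```

For example, `write_robots_txt("/path/to/web/root", ["User-agent: *", "Disallow:"])` would create the "allow full access" file shown earlier. The path here is a placeholder for your own root directory.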

Step # 3. Checking the file for errors

Make certain that you did not break your file with typos. Check that everything is spelled correctly and that the spacing is right.
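A simple script can catch the most obvious typos before you upload the file. The checker below is a rough sketch, not a full validator: the name `lint_robots_txt` is hypothetical, and the list of known field names (including `Sitemap` and the non-standard `Crawl-delay`) is our own assumption about what you are likely to use.

```python
# Field names this sketch accepts; anything else is reported as a typo.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()
        if name not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append((number, f"{field} before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


sample = "User-agent: *\nDisalow: /tmp/\nDisallow tmp\nDisallow: folder/\n"
for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
# line 2: unknown field 'Disalow'
# line 3: missing ':' separator
# line 4: path should start with '/'
```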

Step # 4. Optimizing robots.txt for SEO

The robots.txt file is very powerful if you’re working on a site’s SEO. At the same time, it has to be used with care. For that purpose, you may follow Neil Patel’s SEO hacks related to robots.txt.

Step # 5. Testing robots.txt file out

Once you’ve created a robots.txt file or made changes to it, test it to make sure everything is valid and works as intended.

There are lots of online tools that will warn you if you are blocking page resources that Google needs to understand your pages. You can use Google Search Console to test your robots.txt file and monitor Google Search results data for your properties (you will need access to it).
Here you can find the instructions.

*** TIP: Google Search Console is not a public tool; it requires user authentication (login).
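Before uploading a changed file, you can also run a quick local check of your own expectations. The sketch below is an assumption-laden example: the rules, the paths, and the helper `find_mismatches` are hypothetical, and Python's standard parser does not interpret every rule the way Google does (wildcards, for instance), so it complements rather than replaces the online tools.

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt: str, expectations: dict[str, bool],
                    agent: str = "ExampleBot") -> list[tuple[str, bool]]:
    """Return (path, expected) pairs where the parsed rules disagree."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (path, expected)
        for path, expected in expectations.items()
        if parser.can_fetch(agent, path) != expected
    ]


# Hypothetical rules for a store: keep bots out of checkout and accounts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /customer/
"""

# Path -> should it be crawlable?
EXPECTATIONS = {
    "/": True,
    "/catalog/product-1.html": True,
    "/checkout/cart/": False,
    "/customer/account/": False,
}

print(find_mismatches(ROBOTS_TXT, EXPECTATIONS))  # [] means all expectations hold
```

Keeping such a list of key pages next to the file makes it easy to spot an accidental “Disallow: /” before it reaches production.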

Key concepts

The robots.txt file controls how search engine spiders see and interact with your webpages.

If you have a robots.txt file, make sure it is used properly. Improper usage of robots.txt can block crawlers from your pages and thus hurt your ranking.

Do not use the robots.txt file as a security feature. For security matters, there is a separate, similarly named file, security.txt, which tells researchers how to report vulnerabilities; it does not restrict access either.

Conclusion

Robots.txt is used to tell web spiders what to crawl and what not to crawl. A robots.txt file isn’t mandatory: if everything on a site is to be crawled, the file isn’t needed.

It is almost always beneficial to have and maintain a robots.txt file. However, keep in mind how sensitive it is. Be careful when making any major changes to your site via robots.txt: while these changes can improve your search traffic, they can also do more harm than good.

Feel free to get in touch if you need help with robots.txt or have any questions concerning your Magento eCommerce store!