Robots.txt Disallow: how to create it, features and recommendations

When beginners start SEO courses, they run into a lot of terms, some clear and some not so much. It is not easy to sort them all out, especially if one of the points was poorly explained or skipped at the start. Let's look at the Disallow directive in the robots.txt file: why this document is needed, how to create it and how to work with it.

In simple words

Rather than burdening the reader with the complex explanations usually found on specialized sites, it is better to explain everything in plain terms. A search engine robot comes to your site and indexes its pages. Afterwards, you look at reports that show problems, errors and so on.

But every site also contains information that is not needed for statistics, for example the "About the company" or "Contacts" pages. Indexing them is unnecessary and sometimes undesirable, since they can distort statistical data. To prevent this, it is better to close such pages off from the robot, and that is exactly what the Disallow command in the robots.txt file is for.
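
For example, a minimal sketch that hides hypothetical "About the company" and "Contacts" pages from all robots (the /about/ and /contacts/ paths are made up for illustration) could look like this:

    User-agent: *
    Disallow: /about/
    Disallow: /contacts/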

Standard

This document is present on practically every site. It is usually created by developers and programmers, although the owner of the resource can also do it, especially if the site is small. In that case, working with the file does not take much time.

Robots.txt is known as the robots exclusion standard. It is a document that spells out the main crawling restrictions. The document is placed at the root of the resource so that it can be found at the path /robots.txt. If the resource has several subdomains, the file is placed at the root of each of them. The standard is closely tied to another one: Sitemaps.

Site map

To see the full picture, a few words about Sitemaps. It is a file written in XML that stores all of the resource's data for the search engine. The document provides information about the web pages to be indexed by robots.

The file gives the search engine quick access to any page and shows the latest changes, how often pages change and how important they are. Using these criteria, the robot scans the site more sensibly. But it is important to understand that the presence of such a file is no guarantee that every page will be indexed; it is more of a hint for the process.
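
As a rough illustration (the address, date and values are made up), one entry in such a sitemap.xml file, with the change date, update frequency and priority mentioned above, might look like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://site.com/catalog/</loc>
        <lastmod>2024-01-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>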

Usage

Following robots.txt is voluntary for crawlers. The standard itself appeared back in 1994 as an informal convention agreed on by search engine developers, and from that moment it came to be honored by almost all search engines. It is needed to fine-tune how the search robot scans the resource. The file contains a set of instructions that the search engines use.

With this set of directives, webmasters can easily specify files, pages and directories that must not be indexed. Robots.txt can also point to the files that should be checked first.

What is it for?

Although the file is voluntary, almost every site creates one. It is needed to streamline the robot's work. Otherwise the robot will check all pages in an arbitrary order, and besides skipping some of them, it will put a heavy load on the resource.

The file is also used to hide the following from the search engine:

  • Pages with personal data of visitors.
  • Pages containing data submission forms, etc.
  • Mirror sites.
  • Pages with search results.

Even if you have specified Disallow for a particular page in robots.txt, there is still a chance it will appear in the search results. This can happen if a link to the page is placed on an external resource or somewhere inside your own site.

Directives

When talking about blocking a search engine, the term "directive" comes up constantly. It is familiar to all programmers and is often used interchangeably with "instruction" or "command"; sometimes it even refers to a set of programming language constructs.

The Disallow directive in robots.txt is one of the most common, but not the only one. There are several others responsible for specific instructions. For example, User-agent addresses particular search engine robots, and Allow is the opposite of Disallow: it grants permission to scan certain pages. Below, we look at the main commands in more detail.

Business card

Naturally, Disallow is not the only directive in robots.txt, just one of the most common; most files for small resources consist of little else. Still, the "business card" of any robots.txt is the User-agent command. It exists to address the robots that should follow the instructions written further down in the document.

There are hundreds of search robots today. If you want all of them to follow the same instructions, there is no need to list each one. It is enough to specify "User-agent: *": the asterisk tells the systems that the rules below apply to all search engines.

If you are writing instructions specifically for Google, you need to name its robot: Googlebot. If only this name appears in the document, the other search engines will not follow the file's Disallow, Allow and other commands; they will assume the document contains no instructions for them.
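
As a sketch, a file with one group of rules for Googlebot and another for everyone else might look like this (the /private/ path is only an example):

    User-agent: Googlebot
    Disallow: /private/

    User-agent: *
    Disallow: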

A complete list of bot names can be found on the Internet. It is very long, so if you need instructions for particular Google or Yandex services, you will have to specify their individual names.

Ban

We have already mentioned the next command many times. Disallow simply indicates what the robot must not read. If you want to show all your content to search engines, just write "Disallow:" with an empty value; then robots will scan every page of your resource.

A complete indexing ban in robots.txt looks like "Disallow: /". Written this way, the robots will not scan the resource at all. This is usually done in the early stages, while preparing for the launch of the project, for experiments and so on. Once the site is ready to be shown, change this value so that users can find it.

In general, the command is versatile. It can block individual elements: a folder (for example, with "Disallow: /papka/"), a specific link or file, or documents with a certain extension.
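
For instance, a sketch of a partial ban on that folder plus one old page (the file name is made up); replacing the whole block with "Disallow: /" would instead block the entire site:

    User-agent: *
    Disallow: /papka/
    Disallow: /old-page.html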

Permission

To allow the robot to view specific pages, files or directories, the Allow directive is used. Sometimes it is needed so that the robot only visits files from a particular section. For example, on an online store you can open just the catalog directory; the other pages will not be crawled. But remember that you first need to forbid the site's entire content and only then specify the Allow command with the pages you want open.
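
Here is a minimal sketch of that "forbid everything, then open one section" pattern, assuming a hypothetical /catalog/ directory for the store:

    User-agent: *
    Disallow: /
    Allow: /catalog/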

Mirrors

Another directive is Host. Not all webmasters use it. It is needed if your resource has mirrors; in that case the rule becomes important, because it tells the Yandex robot which of the mirrors is the main one and should be scanned.

This way the system does not get confused and easily finds the right resource by following the instructions in robots.txt. The site itself is written in the file without the "http://" prefix, but only if it runs over HTTP. If it uses the HTTPS protocol, the prefix is included: for example, "Host: site.com" for HTTP, or "Host: https://site.com" for HTTPS.
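
A sketch of a group for the Yandex robot with the Host directive added (site.com is a placeholder for your main mirror):

    User-agent: Yandex
    Disallow: /papka/
    Host: https://site.com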

Navigator

We already talked about Sitemap as a separate file. Looking at examples of well-written robots.txt files, we see a command of the same name. The file specifies "Sitemap: http://site.com/sitemap.xml". This is done so that the robot checks all the pages listed in the site map at that address. On each return visit, the robot will look at the updates and changes that have been made and pass the data to the search engine faster.
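
Put together, a sketch with the Sitemap line (site.com again stands in for your domain):

    User-agent: *
    Disallow: /papka/
    Sitemap: http://site.com/sitemap.xml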

Additional commands

Those were the main directives covering the important, necessary commands. There are also less common ones that are not always applicable. For example, Crawl-delay sets the pause between page loads. It is useful for weak servers, so that an invasion of robots does not bring them down. The parameter is specified in seconds.

Clean-param helps avoid duplicate content that lives at different dynamic addresses, which appear when the site has sorting or tracking parameters. Such a command might look like this: "Clean-param: ref /catalog/get_product.com".
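
A sketch of a group using both of these directives (the delay value is arbitrary, and not every search engine honors these commands; they are mainly understood by Yandex):

    User-agent: Yandex
    Crawl-delay: 2
    Clean-param: ref /catalog/get_product.com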

Universal

If you do not know how to create a correct robots.txt, do not panic. Besides writing your own directives, you can use universal versions of this file, which fit almost any site. The only exception is a large resource, where the file really should be handled by professionals who work with it specifically.

A universal set of directives opens the site for indexing, registers the host and points to the site map. This lets robots always reach the pages that need to be scanned.
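
One such universal variant might look roughly like this (site.com is a placeholder, and the blocked service paths are only examples; the real ones depend on your CMS):

    User-agent: *
    Disallow: /admin/
    Disallow: /search/
    Host: https://site.com
    Sitemap: https://site.com/sitemap.xml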

The catch is that the details vary depending on the platform your resource runs on, so you need to choose the rules according to the type of site and its CMS. If you are not sure the file you created is correct, you can check it in the Google and Yandex webmaster tools.

Mistakes

Understanding what Disallow means in robots.txt does not guarantee that you will not make mistakes when creating the document. There are a number of problems that inexperienced users commonly run into.

Directive values are often confused. This may come from a misunderstanding or simply not knowing the directives; sometimes the user is just careless and mixes them up. For example, people put the value "/" into User-agent and the robot's name into Disallow.

Enumeration is another common mistake. Some users think the forbidden pages, files or folders should be listed in one row, one after another. In fact, for every forbidden or permitted link, file and folder you need to repeat the command on a new line, as shown below.
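
A quick sketch of the wrong and the right way (the paths are made up):

    # Wrong: several paths crammed into one line
    Disallow: /admin/ /search/ /tmp/

    # Right: one path per Disallow line
    Disallow: /admin/
    Disallow: /search/
    Disallow: /tmp/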

Errors can also be caused by an incorrect file name. Remember that it is called "robots.txt", in lower case, without variations such as "Robots.txt" or "ROBOTS.txt".

The User-agent field must always be filled in; do not leave this directive without a value. Returning to Host, remember that if the site uses the HTTP protocol you do not need to include it in the command, only if it is the secure HTTPS variant. And do not leave the Disallow directive dangling with no purpose: if you do not need it, simply do not include it.

Conclusions

To sum up, robots.txt is a standard that demands precision. If you have never dealt with it, you will have many questions in the early stages of creating it. It is better to hand this work over to webmasters, since they deal with the document all the time; besides, search engines occasionally change how they interpret the directives. If you have a small site, such as a small online store or a blog, it is enough to study the subject and take one of the universal examples.

Source: https://habr.com/ru/post/K20678/

