How to close a site from indexing with robots.txt: instructions and recommendations

An SEO specialist's work covers a lot of ground. Novice specialists are advised to write down an optimization algorithm so as not to miss any steps. Otherwise the promotion can hardly be called successful, because the site will keep running into crashes and errors that take a long time to fix.

One of the optimization steps is working with the robots.txt file. Every resource should have this document, because without it optimization becomes harder. The file performs many functions that you need to understand.

Robot assistant

The robots.txt file is a plain text document that can be viewed in the system's standard Notepad. When creating it, set the UTF-8 encoding so that it is read correctly. The file works with the HTTP, HTTPS and FTP protocols.

This document is an assistant to search engines. In case you did not know, each engine uses "spiders" that quickly scan the World Wide Web in order to return sites relevant to user queries. These robots must have access to the resource's data, and that is exactly what robots.txt is for.

For the spiders to find their way, you need to upload the robots.txt document to the root directory. To check whether a site has this file, enter "https://site.com.ua/robots.txt" in the browser's address bar, replacing "site.com.ua" with the domain you need.

Document Functions

The robots.txt file gives search engines several kinds of instructions. It can grant partial access, so that the spider scans only specific parts of the resource. Full access lets it check every available page. A complete ban does not let robots even begin crawling, and they leave the site.
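As a rough sketch, these three kinds of instruction could be written as follows; the folder name "/private/" is purely illustrative:

    # Partial access: everything except one folder may be scanned
    User-agent: *
    Disallow: /private/

    # Full access: no restrictions
    User-agent: *
    Disallow:

    # Complete ban: robots may not crawl anything
    User-agent: *
    Disallow: /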

After visiting the resource, the spiders receive a response to their request. There can be several responses, and which one arrives depends on the situation with robots.txt. For example, if the scan was successful, the robot receives a 2xx code.

Perhaps the site redirects from one page to another. In that case the robot receives a 3xx code. If this code comes up several times, the spider will follow the redirects until it receives a different answer, although as a rule it makes no more than five attempts; after that the well-known 404 error is recorded.

If the answer is 4xx, the robot assumes it is allowed to crawl all of the site's content. With a 5xx code, however, crawling may stop completely, since this usually indicates temporary server errors.

What is robots.txt for?

As you may have guessed, this file guides robots to the root of the site. Nowadays it is used mainly to partially restrict access to unwanted content (a sketch of such directives follows the list):

  • pages with personal information of users;
  • mirror sites;
  • search results;
  • data submission forms, etc.
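A rough sketch of directives covering such content might look like this; all the paths are invented for the example and would differ from site to site:

    User-agent: *
    Disallow: /search/      # internal search results
    Disallow: /user/        # pages with users' personal data
    Disallow: /feedback/    # data submission forms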

If there is no robots.txt file in the root of the site, the robot will scan absolutely all of the content. As a result, unwanted data may end up in the SERP, and both you and the site will suffer. If the robots.txt document contains specific instructions, the spider will follow them and surface only the information the resource owner wants shown.

Working with the file

To close the site from indexing with robots.txt, you first need to figure out how to create this file. To do so, follow these instructions:

  1. Create a document in Notepad or Notepad++.
  2. Set the file extension to ".txt".
  3. Enter the necessary data and directives.
  4. Save the document and upload it to the root of the site.

As you can see, at one of the stages you have to specify directives for the robots. There are two main ones: Allow and Disallow. Some optimizers also specify the crawl rate (Crawl-delay), the main mirror (Host), and a link to the resource's sitemap.
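A sketch that combines these directives might look like the following; the folder, host and sitemap values are placeholders, and keep in mind that Crawl-delay and Host are legacy directives that Google ignores (they were used mainly by Yandex):

    User-agent: *
    Disallow: /admin/           # hide a service section
    Allow: /admin/help.html     # but leave one page inside it open
    Crawl-delay: 2              # ask the robot to wait 2 seconds between requests
    Host: site.com.ua           # main mirror (Yandex-specific, now deprecated)

    Sitemap: https://site.com.ua/sitemap.xml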

How to close a site from indexing

Before you start working with robots.txt and completely close the site from indexing, you also need to understand the symbols it uses. For example, "/" in the document indicates that the entire site is selected, while "*" stands for any sequence of characters. This way you can point to a specific folder that either may or may not be scanned.
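For instance, both symbols could be combined like this; the folder and parameter names are made up, and the "*" wildcard is understood by the major engines such as Google and Yandex:

    User-agent: *
    Disallow: /temp/        # do not scan this particular folder
    Disallow: /*?sort=      # "*" matches any sequence of characters before "?sort="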

Bot specifics

Search engines have different spiders, so if you optimize for several search engines at once, you have to keep this in mind. Their bots have different names, which means that if you want to address a specific robot, you have to specify its name: "User-agent: Yandex" (without quotes).

If you want to set directives for all search engines, use the command "User-agent: *" (without quotes). To use robots.txt correctly to close a site from indexing, you need to know the specifics of the popular search engines.

The fact is that the most popular search engines, Yandex and Google, each have several bots, and each bot handles its own tasks. For example, YandexBot and Googlebot are the main spiders that crawl the site, while separate bots handle images, news and so on. Knowing all the bots makes it easier to fine-tune the indexing of your resource.
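As an illustration, two bots of the same engine can be addressed separately; Googlebot-Image is Google's real image crawler, while the "/photos/" folder is a made-up example:

    # The main Google bot may crawl everything
    User-agent: Googlebot
    Disallow:

    # The image bot is kept out of the photo archive
    User-agent: Googlebot-Image
    Disallow: /photos/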

Examples

So, with robots.txt you can close the site from indexing using a few simple commands; the main thing is to understand exactly what you need. For example, if you want to keep Googlebot away from your resource, you need to give it the appropriate command. It will look like this: "User-agent: Googlebot Disallow: /" (without quotes).

Now let's break down what this command consists of and how it works. "User-agent" is used to address one of the bots directly; next we indicate which one, in our case Google's. The Disallow directive must begin on a new line and forbids the robot from visiting the site. The slash symbol here indicates that all pages of the resource are selected for this command.
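Written out as an actual file, the example above therefore takes up two lines:

    User-agent: Googlebot
    Disallow: /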

In robots.txt, indexing can be disabled for all search engines with one simple command: "User-agent: * Disallow: /" (without quotes). The asterisk here denotes all search robots. Such a command is typically needed when you want to pause indexing of the site and start major work on it that could otherwise hurt the optimization.

If the resource is large and has many pages, it often contains service information that you either do not want to disclose or that can hurt the promotion. In that case, you need to understand how to close an individual page from indexing in robots.txt.

You can hide either a folder or a file. In the first case, start again by addressing a specific bot or all of them with the "User-agent" directive, and below it specify the "Disallow" command for the particular folder. It will look like this: "Disallow: /folder/" (without quotes). This hides the entire folder. If the folder contains an important file that you still want shown, write the following command below it: "Allow: /folder/file.php" (without quotes).
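Put together, such a file might look like this; "folder" and "file.php" are simply the placeholder names from the example above:

    User-agent: *
    Disallow: /folder/          # hide the entire folder
    Allow: /folder/file.php     # but keep this one file open to robots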

File check

If you have managed to close the site from indexing using robots.txt but do not know whether all your directives worked, you can check that they are set up correctly.

First, check the placement of the document again. Remember that it must sit in the root folder and nowhere else; if it is in any other folder, it will not work. Then open the browser and enter the address "http://yoursite.com/robots.txt" (without quotes). If the web browser shows an error, the file is not where it should be.
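As a simple illustration of the placement rule, only the first of these two addresses will be honored by robots; the domain is a placeholder:

    https://yoursite.com/robots.txt         # correct: the file sits in the site root
    https://yoursite.com/files/robots.txt   # ignored: the file is not in the root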

The directives can be checked in the special tools used by almost all webmasters: Google's and Yandex's products. For example, Google Search Console has a toolbar where you open the "Crawl" section and then run the "Robots.txt File Validation Tool". In the window, copy all the data from the document and start the scan. Exactly the same check can be done in Yandex.Webmaster.

Source: https://habr.com/ru/post/C41064/

