How to configure Robots.txt correctly?

A correct Robots txt for an html site gives search engine bots a plan of action, telling them what they are allowed to check. The file is often referred to as the Robot Exclusion Protocol, and it is the first thing bots look for before crawling a website: it can point them to the Sitemap or tell them not to check certain subdomains. If you want search engines to be able to crawl everything they can find, a robots.txt file is not strictly required. What matters most in this process is that the file is formatted correctly and does not let pages containing users' personal data be indexed.
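
To make the idea concrete, here is a minimal sketch of such a file; the directory and sitemap URL below are placeholders, not taken from the original article:

    User-agent: *                                 # the rules apply to all bots
    Disallow: /private/                           # do not crawl this directory
    Sitemap: https://www.example.com/sitemap.xml  # where the XML sitemap lives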

Robot Scanning Principle

When a search engine encounters the file and sees a disallowed URL, it does not crawl that URL, but it may still index it. This is because even though robots are not allowed to view the content, they remember backlinks that point to the forbidden URL. Since access to the page is blocked, the URL will appear in the search engines, but without a snippet. If the inbound marketing strategy requires a correct Robots txt for bitrix (Bitrix), make sure the pages users are meant to reach through search stay open to the scanners.

On the other hand, if the file is formatted incorrectly, the site may not be shown in the search results and may never be found, because search engines cannot get around this file. A programmer can view the robots.txt of any site by going to its domain and appending robots.txt, for example, www.domain.com/robots.txt. Another option is a tool such as the SEO optimization section of Unamo: enter any domain and the service will report whether the file is available.

Reasons to limit scanning:

  1. The site contains outdated or confidential content.
  2. Images on the site should not be included in image search results.
  3. The site is not yet ready to be shown, so robots should not index it.

It must be borne in mind that anything listed in this file is visible to everyone who enters the URL, so the text file should not be used to hide sensitive data. If requesting the file returns a 404 (not found) or 410 (gone) error, the search engine crawls the site anyway and treats the file as missing. With other errors, such as 500 (Internal Server Error), 403 (Forbidden), a timeout, or “unavailable,” robots.txt instructions are still taken into account, although the crawl may be delayed until the file becomes available.

Create Search File

Many CMS platforms, such as WordPress, already come with a robots.txt file. Before setting up Robots txt WordPress correctly, the user should get familiar with the platform's capabilities to figure out how to access the file. If the programmer creates the file on his own, it must meet the following conditions:

  1. The file name must be written in lower case.
  2. The file must use UTF-8 encoding.
  3. It must be saved in a text editor as a plain text (.txt) file.

If the user does not know where to place the file, they should contact the web server software provider to find out how to access the domain root, or go to the Google console and upload it there. Through this function, Google can also check whether the bot handles the file correctly and list the pages that have been blocked by it.

The basic format of the correct Robots txt for bitrix (Bitrix):

  1. The general layout of robots.txt.
  2. Comments are added after the # character and are used only as notes.
  3. Scanners ignore these comments, along with any typos the user makes in them.
  4. User-agent - indicates which search engine the file's instructions are addressed to.
  5. Adding an asterisk (*) tells the scanners that the instructions apply to everyone.

A specific bot can also be named, for example Googlebot, Baiduspider or Applebot. Disallow tells the crawlers which parts of the website should not be crawled. A typical opening looks like this: User-agent: *, where the asterisk means "all bots." However, you can set rules for specific bots; to do this, you need to know the name of the bot the recommendations are intended for.
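
For instance, a sketch that addresses one named bot; the disallowed path is a placeholder:

    User-agent: Googlebot   # rules below apply only to Google's main crawler
    Disallow: /drafts/      # keep this directory out of Google's crawl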

The correct robots txt for Yandex might look like this:

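The original article showed the Yandex example as a screenshot; a plausible sketch, with placeholder paths, is:

    User-agent: Yandex   # rules addressed to Yandex's crawler
    Disallow: /admin/
    Disallow: /search/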

If a particular bot should not crawl the site at all, it can be named explicitly; to find the names of user agents, it is recommended to look at the online database at useragentstring.com.

Page optimization

The two lines listed below already count as a complete robots.txt file, although one robots file may contain several user agent lines and directives that prohibit or allow scanning. The basic format of a correct Robots txt is:

  1. User-agent: [name of the agent].
  2. Disallow: [URL string that must not be crawled].

In the file, each block of directives appears as a discrete unit, separated by a blank line. Every rule written under a user agent line applies only to that block. If the file contains groups that match several agents, a robot will follow only the most specific group of instructions addressed to it.
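
As a sketch, here are two blocks with placeholder directory names; Googlebot would follow only its own, more specific block, while all other bots follow the first one:

    User-agent: *
    Disallow: /archive/

    User-agent: Googlebot
    Disallow: /archive/
    Disallow: /beta/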

Technical syntax

The syntax can be thought of as the "language" of robots.txt files. Five terms are commonly found in this format (a combined sketch follows the list), the main ones being:

  1. User-agent - the web crawler the crawl instructions are addressed to, usually a search engine.
  2. Disallow - the command used to tell the user agent to skip a specific URL. Only one Disallow line is allowed for each URL.
  3. Allow - applies to Googlebot; it grants access to a page or subfolder even though the parent folder is disallowed.
  4. Crawl-delay - specifies how many seconds the scanner should wait between requests. Googlebot does not acknowledge it; for Google, the crawl rate is set in the Google console instead.
  5. Sitemap - used to point out the location of any XML sitemaps associated with the URL.
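
Here is the combined sketch; the paths and the sitemap URL are placeholders:

    User-agent: *
    Crawl-delay: 10
    Disallow: /private/
    Allow: /private/press-kit/
    Sitemap: https://www.example.com/sitemap.xml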

Pattern Matching

When it comes to the actual URLs being blocked or allowed in a correct Robots txt, the rules can become quite complex, because pattern matching can be used to cover a range of possible URL variants. Google and Bing both honour two characters that identify pages or subfolders an SEO wants to exclude: an asterisk (*) and a dollar sign ($), where * is a wildcard that represents any sequence of characters and $ matches the end of the URL.
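
A sketch of the kind of patterns this makes possible; the query parameter and file type are placeholders:

    User-agent: *
    Disallow: /*?sessionid=   # any URL containing this query parameter
    Disallow: /*.pdf$         # any URL that ends in .pdf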

Google offers a long list of possible pattern syntax that explains to the user how to properly configure the Robots txt file. Some common use cases include:

  1. Preventing duplicate content from appearing in search results.
  2. Keeping entire sections of the website private.
  3. Keeping internal search result pages out of the public search results.
  4. Indicating the location of the sitemap.
  5. Preventing search engines from indexing specific files.
  6. Specifying a crawl delay to avoid server overload when several areas of content are scanned at the same time.

Check for Robot File

If a site has no areas where crawling needs to be controlled, a robots.txt file may not be needed at all. If the user is not sure whether the file exists, they can enter the root domain and add the file name at the end of the URL, something like this: moz.com/robots.txt. A number of search bots ignore these files, but as a rule such scanners do not belong to reputable search engines; they come from spammers, e-mail harvesters and other kinds of automated bots that roam the Internet in large numbers.

It is very important to remember that using the Robot Exclusion Standard is not an effective security measure; in fact, some bots may start precisely with the pages the user has told them not to scan. A standard exclusion file consists of several parts. Before telling a robot which pages it should not work on, you need to specify which robot you are talking to. In most cases, the user will use a simple declaration that means "all bots."

SEO optimization

Before optimizing, the user must make sure not to block any content or sections of the site that should be crawled. Links on pages blocked by the correct Robots txt will not be respected. This means:

  1. If the linked resources are not also linked from other pages available to search engines, i.e. pages that are not blocked by robots.txt or a robots meta tag, they will not be crawled and therefore cannot be indexed.
  2. No link value can be passed from the blocked page to the link's destination. If such a page matters, it is better to use a blocking mechanism other than robots.txt.

Since other pages may link directly to a page containing personal information, and that page needs to be kept out of the search results, use another method, for example password protection or a noindex meta tag. Some search engines have several user agents; Google, for example, uses Googlebot for ordinary search and Googlebot-Image for image search.

Most user agents from the same search engine follow the same rules, so there is no need to specify directives for each of its crawlers, although doing so lets you fine-tune how the site's content is scanned. The search engine caches the contents of the file and usually refreshes the cached copy at least once a day. If the user modifies the file and wants it picked up faster than that, they can submit the robots.txt URL to Google.

Search engines

To understand how a correct Robots txt works, you need to know a little about what search engines can do. In short, they send out "scanners," programs that browse the Internet for information and then store part of it so that it can later be passed on to the user.

For many people, Google already is the Internet, and in a sense they are right, since search is perhaps its most important invention. Although search engines have changed a lot since they were created, their fundamental principles are the same. Scanners, also known as bots or spiders, find pages across billions of websites. Search engines give them directions on where to go, while individual sites can also communicate with the bots and tell them which specific pages they should look at.

As a rule, site owners do not want administrative pages, backend portals, categories and tags, and other service pages displayed in search engines. The robots.txt file can also be used to prevent search engines from checking such pages. In short, robots.txt tells web crawlers what to do.

Page Ban

This is the main part of the robots exclusion file. With a simple declaration, the user tells a bot or a group of bots not to scan specific pages. The syntax is simple: for example, to deny access to everything in the site's "admin" directory, write Disallow: /admin. This line will prevent bots from crawling yoursite.com/admin, yoursite.com/admin/login, yoursite.com/admin/files/secret.html and everything else that falls under the admin directory.

To block a single page, simply name it in the Disallow line: Disallow: /public/exception.html. Now the "exception" page will not be crawled, but everything else in the "public" folder will be.

To block multiple pages, simply list them in separate Disallow lines.
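
A sketch with placeholder paths:

    User-agent: *
    Disallow: /public/exception.html
    Disallow: /public/old-archive/
    Disallow: /tmp/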

Catalogs and Pages

The following lines of the correct Robots txt for symphony apply to any user agent and are listed at the top of the # robots.txt section for https://www.symphonyspace.org/:

    Sitemap: https://www.symphonyspace.org/sitemaps/1/sitemap.xml

    # live - do not allow web crawlers to index cpresources/ or vendor/
    User-agent: *
    Disallow: /cpresources/
    Disallow: /vendor/
    Disallow: /.env

Setting Standards

The user can target specific pages and different bots by combining the two previous elements. An example of the correct Robots txt for all search engines is presented below.

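The original article showed the example as an image; a sketch consistent with the description in the next paragraph would be:

    User-agent: *
    Disallow: /admin/
    Disallow: /private/

    User-agent: Bingbot
    Disallow: /admin/
    Disallow: /private/
    Disallow: /secret/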

With these rules, the admin and private sections are invisible to Google and Bing, but Google will still see the "secret" directory, while Bing will not. You can specify general rules for all bots using the asterisk user agent and then give specific instructions to individual bots in the sections that follow. With the knowledge above, the user can write their own example of the correct Robots txt for all search engines: just open a favorite text editor and tell the bots that they are not welcome in certain parts of the site.

Server Performance Tips

SublimeText is a universal text editor and the gold standard for many programmers. Its appeal rests on efficient coding, and users appreciate the abundance of keyboard shortcuts. If the user wants to see an example robots.txt file, they can go to any site and add "/robots.txt" to the end of the address; the robots.txt file of GiantBicycles, for instance, can be inspected this way.

The file makes it possible to mark the pages that users do not want shown in search engines, and it also has a few features that few people know about. For example, while the robots.txt file tells the bots where they should not go, the Sitemap does the opposite and helps them find what they are looking for; and although search engines probably already know where the sitemap is located, listing it here does not bother them.

There are two types of sitemap: an HTML page or an XML file. An HTML sitemap is a page that shows visitors all the available pages on a website. In your own robots.txt, the reference looks like this: Sitemap: //www.makeuseof.com/sitemap_index.xml. If the site is not indexed by search engines even though web robots have scanned it several times, make sure the file is present and that its permissions are set correctly.

By default, this is handled by every SeoToaster installation, but if necessary the permissions can be reset as follows: robots.txt file - 644. Depending on the PHP server, if this does not work for the user, it is recommended to try: robots.txt file - 666.

Setting Scan Delay

The crawl-delay directive tells certain search engines how often they may request pages on a site. It is measured in seconds, although some search engines interpret it slightly differently. Some treat a crawl delay of 5 as an instruction to wait five seconds after each fetch before starting the next one.

Others interpret it as an instruction to scan no more than one page every five seconds. The robot cannot scan any faster, which conserves server bandwidth. If the server has trouble keeping up with the traffic, it makes sense to set a crawl delay, although in most cases users do not need to worry about this at all. An eight-second crawl delay is set like this: Crawl-delay: 8.

But not all search engines obey this directive, so, just as when blocking pages, you can set different crawl delays for particular search engines. Once all the instructions in the file are configured, it can be uploaded to the site; first make sure it is a plain text file named robots.txt and that it can be found at yoursite.com/robots.txt.
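
A sketch of per-engine delays; the values are placeholders, and since Googlebot ignores Crawl-delay, its rate would be managed in the Google console instead:

    User-agent: Bingbot
    Crawl-delay: 8

    User-agent: Yandex
    Crawl-delay: 4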

Best WordPress Robots txt

A WordPress site has some files and directories that need to be blocked every time. The directories users should disallow are the cgi-bin directory and the standard WP directories. Some servers do not allow access to the cgi-bin directory anyway, but users should still include it in the disallow directive before setting up Robots txt WordPress correctly.

The standard WordPress directories to block are wp-admin, wp-content and wp-includes. These directories contain no data that is useful to search engines, with one exception: the wp-content directory has a subdirectory named uploads. This subdirectory must be allowed in the robots.txt file, since it holds everything uploaded through the WP media upload function. WordPress uses tags or categories to structure content.

If categories are used, then to make the correct Robots txt for Wordpress, as recommended by the makers of the program, the tag archives must be blocked from search. First, check which base is in use by going to the Administration panel > Settings > Permalinks.

By default the base is tag, so if the field is empty, add Disallow: /tag/. If a category base is used instead, block it in the robots.txt with Disallow: /category/.

Files used mainly for displaying content are also blocked by the correct Robots txt file for Wordpress, roughly as sketched below:

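The original article showed the resulting file as an image; a sketch along the lines described above (exact paths can vary from one setup to another) is:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/
    Allow: /wp-content/uploads/
    Disallow: /tag/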

Basic Joomla setup

When a new Joomla site is installed, a default Robots txt for Joomla is already included (a sketch of its typical contents follows the steps below), but a few SEO-related settings still need attention. The main one concerns URLs: out of the box, Joomla inserts index.php into its URLs, and if the duplicate URL variants remain reachable, Google may index the same pages more than once. To set up the correct robots txt for Joomla:

  1. In the Joomla root folder, find the htaccess.txt file.
  2. Rename it to .htaccess (no extension).
  3. Include site name in page titles.
  4. Find metadata settings at the bottom of the global configuration screen.
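
For reference, the robots.txt that ships with Joomla contains directives along these lines (the exact list depends on the Joomla version):

    User-agent: *
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /tmp/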

Robots txt in MODX Cloud

Previously, MODX Cloud gave users the ability to control whether a robots.txt file was served, based on a switch in the dashboard. Although this was useful, it made it possible to accidentally allow indexing on staging/dev sites by toggling the option in the Dashboard, and just as easy to accidentally prohibit indexing on the production site.

Today, the service expects robots.txt files to be present in the file system, with the following exception: any domain that ends in modxcloud.com is served a Disallow: / directive for all user agents, regardless of whether the file exists. Production sites that receive real visitor traffic will need to use their own domain if the user wants the site indexed.

Some organizations use the correct Robots txt for modx to run multiple websites from the same installation using Contexts. A case where this applies would be a public marketing site combined with landing page microsites and, possibly, a non-public intranet.

Traditionally this has been difficult to do in multi-site installations, since they share the same web root. In MODX Cloud it is easy: simply upload an additional file to the website named robots-intranet.example.com.txt with content that blocks indexing, and well-behaved robots will honour it, while all other host names fall back to the standard file as long as no other name-specific files exist.
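
The original article does not reproduce the file's content; for an intranet context that should stay unindexed, it would presumably be just:

    User-agent: *
    Disallow: /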

Robots.txt is an important file that helps a site be found through Google, the other major search engines and linking websites. Located at the root of the web server, the file tells web robots how to scan the site and sets which folders they should or should not index, using the set of instructions called the Robot Exclusion Protocol. Writing an example of the correct Robots txt for all search engines is especially easy with SeoToaster: a dedicated menu has been created for it in the control panel, so the bot will never have to strain to gain access.

Source: https://habr.com/ru/post/A9247/

