In the years leading up to at least the early ’70s, information was not as readily within reach as it is now. Most of what was available came from what people had written and published in books, journals, and the like, so gathering information on a particular topic could mean going through piles of books. Manual, physical indexes helped researchers keep the process of finding information on a given topic from becoming too tedious.
Then came the Internet. The World Wide Web was proposed in 1989, building on well over a decade of work on the underlying Internet. Since the web became widely available to individuals around the globe, obtaining information has become as easy as pushing a few buttons.
What has the Internet done for our information system? In simple terms, it has taken away the stress of looking through several books in the hope of finding the information you need. Now, your searches yield mostly relevant results. But what goes on within search engines to ensure that what you seek is what you get? Several factors contribute to this accuracy, and one of them is a process called ‘crawling’. This article covers what carries out crawling, what crawling produces, and how the information gathered in the process is stored.
What Is Crawling?
Crawling is the process by which search engines like Google use software called crawlers (or spiders) to look through publicly available pages. Using existing links provided by site owners as a starting point, these crawlers follow a series of links and, by extension, move from page to page.
Data found by the crawlers is then sent to Google's servers. There, Google notes several signals available on these pages and sites, such as keywords, relevance, and how current the content is. Google stores this information in Caffeine, its search index.
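The link-following behavior described above can be sketched as a breadth-first walk over a link graph. This is only an illustration: the `crawl` function and the in-memory `links` dictionary are hypothetical stand-ins for fetching real pages, and nothing here reflects how Google's crawlers are actually implemented.

```python
from collections import deque


def crawl(seed_urls, links):
    """Visit pages breadth-first, starting from the seed URLs.

    `links` is a hypothetical in-memory map of page -> pages it links to,
    standing in for downloading a page and extracting its links.
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would fetch and store the page here
        for target in links.get(url, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site structure used only for illustration.
site_links = {
    "/home": ["/blog", "/about"],
    "/blog": ["/about", "/blog/post-1"],
    "/about": ["/home"],
}
pages_in_visit_order = crawl(["/home"], site_links)
```

Starting from a single seed page, every page reachable through links ends up in the visit list exactly once.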
What Is A Crawl List?
A crawl list is simply another term for Google's search index. This is the catalog in which every page that has been crawled and successfully indexed by bots is stored. The web content served as results for a query is therefore selected from this crawl list.
Reasons Why Your Page is not Crawled
Here are a few reasons why your page might not be crawled by Google's crawlers:
As the site owner, you could disallow the search bots from crawling a selected page or pages.
Your page may require password verification before anyone can access it. Search bots do not log in, so the contents of such a page cannot be accessed.
Pages that are copies of other pages that were crawled previously may not be crawled again.
How to Prevent Your Page from being Indexed/Crawled
Preventing one of your pages from being crawled means that the page in question will not appear in the search index/crawl list. If this is the result you desire, here are some steps you can take to make it happen:
Adopting robots.txt commands:
The main function of robots.txt rules is to regulate crawling traffic so that your server is not overwhelmed by web crawlers. They also give you some control over which pages will or will not be crawled. A robots.txt file does not hide your page from search results completely. Rather, the page URL can still be displayed, but without a description, and visual content, PDFs, and other non-HTML files on the page are left out. In addition, if other pages link to the page in question, it might appear in Google Search regardless of the existing robots.txt rule. Finally, some crawlers do not obey robots.txt rules.
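To see how a robots.txt rule is interpreted, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/drafts/` path, the `ExampleBot` user agent, and the `example.com` URLs are all assumptions made for illustration. Note that Python's parser applies the first matching rule, which may differ from how a particular crawler resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of /drafts/, allow the rest.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


draft_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/drafts/post.html")
blog_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post.html")
```

A check like this only tells you what a well-behaved crawler should do; it does not stop a crawler that ignores robots.txt.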
Adopting noindex tags:
If you include the noindex meta tag in your page, or the noindex header in its HTTP response, the page stops being served in Google Search. The next time the page is crawled and the directive is seen, the page is dropped from Google's index completely, even if other pages link to it. However, if a robots.txt rule blocks the page, an already-indexed page carrying a noindex tag may still be served, because Google needs to crawl a page before it can recognize the noindex tag.
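A quick way to confirm that a page really carries a noindex directive is to check both places it can live: the `X-Robots-Tag` response header and a robots meta tag in the HTML. The sketch below is a simplified check using only the standard library; `has_noindex` is a hypothetical helper, and it assumes you have already fetched the HTML and headers yourself.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html, headers=None):
    """Return True if the header or a robots meta tag asks for noindex."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

For example, `has_noindex('<meta name="robots" content="noindex">')` reports the directive, while a page with no robots meta tag and no header does not.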
Securing pages by using passwords:
Typically, web crawlers cannot access password-protected pages because they do not log in. Only pages that are open to everyone can be both crawled and indexed. Therefore, whenever you have pages that you do not want Google's spiders to access, you can consider moving them to password-protected directories on your site's server.
Using the URL parameters tool:
In the hands of experienced users, this tool blocks the crawling of selected URLs. However, when used wrongly, it can block a large portion of your site's URL space. This effect is not easy to debug, so the method is not advisable for beginners.
Taking the page off the web entirely:
Taking your page off the web entirely is the best way to keep its contents hidden.
What comes after crawling?
Sites that have been crawled by the bots undergo indexing next. However, not every page that is crawled is guaranteed to be indexed. The probability of your page undergoing indexing depends on the quality of content on your page and your metadata to a reasonable extent. Here are some factors that cause a crawled page to not undergo indexing:
Low/poor quality content: Content that falls below par is far from what Google strives to provide for its users. Therefore, pages that lack information or are poorly structured will not proceed to indexing.
When meta directives preventing indexing are present: If your page contains a noindex meta tag, or its HTTP response contains a noindex header, the page will be excluded from indexing. However, this only works for pages that are not blocked from crawling by a robots.txt rule, since a page that is never crawled never has its noindex directive read.
After indexing, what next?
Pages that have been successfully indexed are stored in Google's index and remain there until they are served. Serving is simply what happens when someone enters a query into the search engine. The results served are the pages with the best quality and the highest relevance available in Google's index.
If you find that your site was indexed yet does not come up in search results, there may be several reasons. It could be that, compared to other sites in the same niche, your site's content quality was low. Or several other sites may have been more relevant to the query than yours.
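As a toy illustration of serving, the sketch below ranks indexed pages by how many query terms they match and breaks ties by a quality score. The `serve` function, the index layout, and the scores are all invented for this example; real ranking relies on far more signals than these two.

```python
def serve(query, index, limit=3):
    """Return up to `limit` page URLs, most relevant first.

    `index` is a hypothetical map of URL -> {"keywords": set, "quality": float}.
    """
    terms = set(query.lower().split())
    scored = []
    for url, page in index.items():
        overlap = len(terms & page["keywords"])  # crude relevance measure
        if overlap:
            scored.append((overlap, page["quality"], url))
    # Highest overlap first; quality decides between equally relevant pages.
    scored.sort(reverse=True)
    return [url for _, _, url in scored[:limit]]


# Hypothetical index contents used only for illustration.
toy_index = {
    "/crawling-guide": {"keywords": {"crawling", "indexing"}, "quality": 0.9},
    "/crawling-news": {"keywords": {"crawling"}, "quality": 0.5},
    "/ads-overview": {"keywords": {"ads"}, "quality": 1.0},
}
results = serve("crawling indexing", toy_index)
```

A page can sit in the index and still never appear in `results` if other pages match the query better, which mirrors the situation described above.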
Let us now consider four of Google's user agents.
Googlebot is the generic name for Google's web crawlers. Some of its crawl behaviors are:
It has both a desktop crawler and a smartphone crawler, which simulate a user on the respective device. As both of these crawlers will likely crawl your page, you must see to it that your site is as well optimized on desktop as it is on mobile devices. This ensures that both crawlers see the same content across your site.
For most sites, the bot accesses pages no more than once every few seconds on average.
To reduce bandwidth consumption, crawling is performed by many machines located near the sites they crawl. As a result, your logs may show visits from several IP addresses, all carrying the Googlebot user agent.
The bot can accommodate requests for a higher crawl frequency.
Note: Blocking Googlebot in a bid to block pages from Google means that none of Google's other user agents will have access to these pages either. However, if you block any other crawler, only that crawler is affected. For example, if you would not like the videos stored in your private directory to pop up in searches, you can block the Googlebot-Video crawler using a robots.txt file.
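The video example in the note above could look like the following robots.txt group, checked here with Python's standard `urllib.robotparser`. The `/private/` directory and the `example.com` URL are assumptions; `Googlebot-Video` is the user agent token for Google's video crawler. Python's parser is used only to illustrate how the groups are matched and may not mirror Google's own matching rules exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only the video crawler from /private/.
ROBOTS_TXT = """\
User-agent: Googlebot-Video
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

video_url = "https://example.com/private/clip.mp4"
video_crawler_allowed = parser.can_fetch("Googlebot-Video", video_url)
main_crawler_allowed = parser.can_fetch("Googlebot", video_url)
```

Here the group for `Googlebot-Video` applies only to that crawler, so the main crawler falls through to the wildcard group and is still allowed.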
The AdSense crawler visits sites to gather the information required to show relevant ads on their pages. The crawl behaviors of this bot are as follows:
Although this crawler is separate from Google's main crawler, the two share a cache. Sharing a cache keeps one page from receiving requests from both crawlers, which conserves the site's bandwidth. The crawler for the Search Console is likewise separate.
This crawler indexes by URL. Pages at www.site.com and those at site.com are crawled separately, although site.com/#anchor and site.com are not counted separately.
When a page redirects to another page, the bot still visits the original page to confirm that the redirect is indeed in place. Because of this, you may find the bot's IP address in the access logs of the original page.
Unlike some other crawlers, this bot's crawl frequency cannot be controlled. Crawling by the AdSense crawler is automated, with reports that are updated weekly, so requests for a higher crawling frequency will not be accommodated. Changes made to your page may take up to 14 days to be reflected in the index.
Like Googlebot, the AdSense crawler obeys robots.txt rules, so it will not access pages that a robots.txt rule disallows for its user agent, Mediapartners-Google.
Only sites that carry Google ad tags will be crawled by this bot.
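The URL-based behavior in the list above (www and non-www hosts treated as different pages, `#anchor` fragments ignored) can be sketched as a small normalization function. `crawl_key` is a hypothetical helper written for this article, not something the AdSense crawler exposes.

```python
from urllib.parse import urlsplit, urlunsplit


def crawl_key(url):
    """Return the key under which a URL would be counted.

    The fragment is dropped, an empty path becomes "/", and the host is
    kept as-is apart from lowercasing, so www and non-www stay distinct.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


same_page = crawl_key("https://site.com/#anchor") == crawl_key("https://site.com")
different_hosts = crawl_key("https://www.site.com") != crawl_key("https://site.com")
```

Both comparisons hold under this sketch: the anchor variant collapses onto the plain URL, while the www variant keeps its own key.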
Feedfetcher is the user agent Google uses to crawl RSS or Atom feeds for Google Podcasts, PubSubHubbub, and Newsstand. Although it is meant to crawl only podcast feeds, some feeds that do not follow the RSS or Atom specifications might still be indexed. It does not discover content by following links.
Crawling Behaviors of Feedfetcher
Feedfetcher acts on requests made by human users rather than as an automated crawler. For this reason, it does not obey robots.txt rules. To prevent some or all of your feed from being crawled, you can configure your website to send an error status such as 404 or 401 to the user agent Feedfetcher-Google.
On average, Feedfetcher retrieves feeds from most sites about once an hour. For sites that are frequently updated, the crawl frequency may be higher. In some cases, network lag might also make it appear that the bot is crawling your page more frequently than it is.
Because the bot works on human requests, it might try to download links from a non-existent domain. Likewise, it might try to retrieve a feed from a private server of yours if a user who knows about that server makes the request.
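Since robots.txt rules do not apply here, blocking has to happen in the server's response. The sketch below shows the decision in isolation: `status_for_request` is a hypothetical helper, and the `/feeds/private/` prefix is an assumed location for feeds you want to hide. In practice the same check would live in your web server or application configuration.

```python
def status_for_request(user_agent, path, blocked_prefixes=("/feeds/private/",)):
    """Return the HTTP status to send for a feed request.

    Requests whose user agent contains "Feedfetcher-Google" (matched
    case-insensitively) get a 404 for any blocked path; all other
    requests are served normally.
    """
    is_feedfetcher = "feedfetcher-google" in user_agent.lower()
    if is_feedfetcher and path.startswith(tuple(blocked_prefixes)):
        return 404
    return 200
```

Matching on the user agent string is only as reliable as the string itself, so this works for the genuine bot but not for clients that hide or fake their identity.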
Duplex on the web
This user agent, DuplexWeb-Google, supports the Google Duplex on the web service. It can be identified by both its user agent token and its full user agent string.
Crawl behaviors of DuplexWeb-Google
While crawling, services using this user agent do not make purchases or carry out other significant actions.
On average, this crawler crawls a page a few times every hour or a few times a day, depending on the feature in question. The crawl rate is carefully calculated so as not to overwhelm your site's servers or disrupt your traffic.
As the results of this crawler play no part in indexing results, the crawler is not affected by noindex directives.
During crawling and analysis, page requests made by the bot are not recorded by Google Analytics.
Disabling crawling in Search Console is not enough to prevent this user agent from crawling your page. The effective way to block this crawler is a Disallow rule in robots.txt. However, while the Duplex on the web service is enabled in Search Console, which is the default, the crawler does not follow Disallow rules that sit in wildcard (*) user agent groups, so the rule must name the user agent explicitly.
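Because rules in wildcard groups may not be followed by this crawler, it can help to check whether your robots.txt names DuplexWeb-Google in a group of its own. The sketch below is a deliberately simple text scan; `explicit_agents` is a hypothetical helper and the `/checkout/` path is an assumption.

```python
def explicit_agents(robots_txt):
    """Return the lowercased user agent tokens named in a robots.txt file."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


# Hypothetical robots.txt files used only for illustration.
wildcard_only = "User-agent: *\nDisallow: /checkout/\n"
with_explicit_group = wildcard_only + "\nUser-agent: DuplexWeb-Google\nDisallow: /checkout/\n"

covered_before = "duplexweb-google" in explicit_agents(wildcard_only)
covered_after = "duplexweb-google" in explicit_agents(with_explicit_group)
```

Only the second file names the crawler explicitly, so only there would the Disallow rule be expected to apply to it.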
How to verify Google’s Crawlers
You can verify, either manually or automatically, that the crawling activity on a webpage is truly carried out by Google's crawlers.
Manual verification suffices in most cases. It works as a one-off lookup and requires command-line tools. Here is how it works:
From your logs, find the IP address that accessed your page and carry out a reverse DNS lookup on it using the host command.
Verify that the domain name returned is in either googlebot.com or google.com.
Using the domain name obtained in step 1, carry out a forward DNS lookup, again using the host command.
Confirm that the IP address in your logs is the same as the address returned in step 3.
For automatic verification, you can match the IP address of the spider that crawled your page against the published list of Googlebot IP addresses. If you need to verify other Google crawlers, check Google's list of IP addresses for those crawlers.
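The same reverse-then-forward lookup can be scripted instead of run by hand with the host command. The sketch below uses Python's standard `socket` module; `is_google_crawler` is a hypothetical helper, and the lookup functions are passed in as parameters so the logic can be exercised without network access. Note that `socket.gethostbyname_ex` resolves IPv4 addresses only, so this sketch does not cover IPv6 visitors.

```python
import socket

# Domains named in the verification steps above.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")


def is_google_crawler(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if `ip` passes the reverse and forward DNS checks."""
    try:
        hostname = reverse(ip)[0]  # steps 1 and 2: reverse lookup, then check domain
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_DOMAINS):
        return False
    try:
        addresses = forward(hostname)[2]  # step 3: forward lookup on that name
    except OSError:
        return False
    return ip in addresses  # step 4: must match the address from the logs
```

Called as `is_google_crawler(ip_from_logs)`, it performs real DNS queries. The forward lookup matters because anyone who controls reverse DNS for their own address range can make an address claim a Google hostname.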
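Matching an address against a list of published ranges can be done with Python's standard `ipaddress` module. In the sketch below, `ip_in_ranges` is a hypothetical helper and the ranges shown are reserved documentation ranges used as placeholders, not Google's real ranges; substitute the ranges from Google's published list.

```python
import ipaddress


def ip_in_ranges(ip, cidr_ranges):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Placeholder ranges (reserved for documentation), NOT Google's actual ranges.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

inside = ip_in_ranges("192.0.2.17", EXAMPLE_RANGES)
outside = ip_in_ranges("198.51.100.5", EXAMPLE_RANGES)
```

Because published ranges change over time, an automated check like this should reload the list periodically rather than hard-code it.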
Indeed, Google does a lot to ensure that its audience is served the best results, and its crawling, indexing, and serving systems keep improving with time. You can now verify that recent bot activity on your site comes from Google's crawlers.
If you no longer want a crawler's activity on your webpage, you can visit the Search Console to disable that particular crawler. You can also request a higher crawling frequency from some bots.
However, if what you want is for your page not to be crawled at all, you can use whichever of the methods outlined above is most convenient. Blocking a less relevant page of yours from being crawled helps Google focus on your more relevant pages.
Do note, however, that some crawlers interpret robots.txt rules differently, while others do not follow the robots.txt guidelines at all. If your page is not being crawled, indexed, or served, a good place to start improving would be the quality of your content and the code on your website.