I admit, in the days of sophisticated SEO tools and access to practically any data we need, the idea of downloading a server-log file might seem a bit odd.
After all, Google Search Console alone offers an incredible breadth of information that could help you understand how Google views your site. And needless to say, you could use that insight to optimize your search performance. However, it’s the server-log file that provides the most accurate account of how bots crawl and use your site.
And in this post, I want to give a quick primer on how to use a server-log analysis to improve your results.
Why You Shouldn’t Ignore Insights from the Server-Log Files
Before we dive deeper into the server-log file, let’s cover some basic concepts.
First, what specific information can a log file help you uncover?
- Time and date
- The request IP address
- Response code
- User agent (in other words, the browser or search engine bot that made the request)
- Requested file, and much more.
Here’s an example server-log file record (naturally, containing dummy data):
18.104.22.168 - - [26/Apr/2017:00:17:48 -0400] "GET /img/SEOoptimizationtips.png HTTP/1.0" 200 7342 "http://www.domain.com" "Mozilla/4.05 (Macintosh; I; PPC)"
Now, let’s break it down to understand what information it provides us:
The hit came from the “18.104.22.168” IP address, on the 26th of April 2017, just after midnight, to request a specific image. The server responded with the code 200 (which means OK) and returned 7,342 bytes. We also know the referring domain for this request, and the user agent (here, a browser running on a Macintosh) used to access the file.
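To make that breakdown concrete, here is a minimal Python sketch that splits such a record into named fields. It assumes the log uses the common Apache/NGINX "combined" format, which the sample record matches; the names `LOG_PATTERN` and `parse_line` are my own, and a custom log format would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the "combined" log format (an assumption about the server
# configuration; adjust it if your logs use a custom format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log record, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Servers write "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


sample = (
    '18.104.22.168 - - [26/Apr/2017:00:17:48 -0400] '
    '"GET /img/SEOoptimizationtips.png HTTP/1.0" 200 7342 '
    '"http://www.domain.com" "Mozilla/4.05 (Macintosh; I; PPC)"'
)
record = parse_line(sample)
print(record["ip"], record["status"], record["path"], record["user_agent"])
```

Once every line is parsed this way, the questions in the rest of this post become simple filters and counts over the resulting records.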
I admit, at first sight, it may seem that none of this is important.
But consider this - what if this record included information about a bot unable to access your site? Or revealed a broken image on a page?
And needless to say, not many SEO platforms will help detect those issues for you at such a granular level. After all, the server-log file is the only way to see every hit to a specific page - including requests for CSS files, images or JavaScript files - and detect issues at the deepest level.
Anyway, to sum it up, a server-log file contains information about every request made on your site. And so, it could easily help you understand:
- How search engines crawled your pages.
- What problems they encountered.
- What pages they couldn’t access.
- Who your visitors are and where they came from.
- What domains send you the most traffic.
- How your visitors interact with the site.
- Whether your crawl budget is spent efficiently.
How You Benefit from Analyzing the Server-Log File
#1. Discover Pages Crawled by ALL Search Bots
You know, you could look at your indexing data in two ways:
1. Focus just on Google, and use the Search Console to identify what pages the search engine crawls regularly.
2. Take a more holistic approach and look at crawl data from all search bots - Baidu, BingBot, GoogleBot, Yahoo, Yandex, and others.
Doing so will help you identify pages ALL those bots think are important. And in the process, discover where you should be putting the most effort on your site.
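As a rough illustration of the second approach, the Python sketch below groups crawled pages by bot and then finds the pages every bot requested. It assumes you have already extracted (path, user agent) pairs from the log. The `BOT_TOKENS` mapping is a simplified assumption based on substrings these crawlers commonly include in their user-agent strings; user agents can be spoofed, so treat the result as a first pass rather than a verified list.

```python
from collections import defaultdict

# Simplified, assumed mapping of bot names to user-agent substrings.
BOT_TOKENS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
    "Baidu": "baiduspider",
    "Yandex": "yandexbot",
    "Yahoo": "slurp",
}


def identify_bot(user_agent):
    """Return the bot name for a user-agent string, or None for other clients."""
    ua = user_agent.lower()
    for name, token in BOT_TOKENS.items():
        if token in ua:
            return name
    return None


def pages_by_bot(hits):
    """Map each bot seen in the log to the set of paths it requested."""
    pages = defaultdict(set)
    for path, user_agent in hits:
        bot = identify_bot(user_agent)
        if bot is not None:
            pages[bot].add(path)
    return pages


def pages_crawled_by_all(pages):
    """Return the paths requested by every bot that appears in the log."""
    if not pages:
        return set()
    return set.intersection(*pages.values())


# Dummy data for illustration only.
GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BING = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
hits = [
    ("/", GOOGLE),
    ("/pricing", GOOGLE),
    ("/", BING),
    ("/blog", BING),
    ("/", "Mozilla/5.0 (Windows NT 10.0) Chrome/58.0"),  # human visitor, ignored
]
pages = pages_by_bot(hits)
print(pages_crawled_by_all(pages))
```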
#2. Assess Your Crawl Ratio
I’m sure you know this already - bots rarely crawl an entire large site in a single visit. And it’s true, your crawl ratio (the share of your pages that bots actually request) might be quite high, particularly if you look at your site as a whole. But try segmenting your pages into different categories, and you might discover a sub-optimal crawl ratio in some of them.
OK, but why is improving the crawl ratio so important? Because, as we’ve discovered, the content that hasn’t been crawled for a long time receives materially less traffic.
It’s that simple.
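Here is a small Python sketch of that segmentation. It rests on two assumptions of mine: that "crawl ratio" means the share of known pages a bot requested at least once during the log period, and that the first path component (for example `/blog`) is a reasonable way to group pages. The list of known pages would typically come from your sitemap or CMS.

```python
from collections import Counter


def segment_of(path):
    """Group a path by its first component, e.g. '/blog/post-1' -> '/blog'."""
    parts = path.strip("/").split("/")
    return "/" + parts[0] if parts[0] else "/"


def crawl_ratio_by_segment(site_paths, crawled_paths):
    """Return {segment: share of known pages that appear in the crawl log}."""
    known = set(site_paths)
    total = Counter(segment_of(p) for p in known)
    crawled = Counter(segment_of(p) for p in known & set(crawled_paths))
    return {segment: crawled[segment] / count for segment, count in total.items()}


# Dummy data for illustration only.
site_paths = ["/blog/a", "/blog/b", "/blog/c", "/blog/d", "/products/x", "/products/y"]
crawled_paths = {"/blog/a", "/products/x", "/products/y"}

ratios = crawl_ratio_by_segment(site_paths, crawled_paths)
for segment, ratio in sorted(ratios.items()):
    print(f"{segment}: {ratio:.0%}")
```

In this made-up example the site-wide ratio is 50%, which hides the fact that only a quarter of the blog pages were crawled.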
#3. Identify Pages Bots Can’t Access
You know it so well - the fact that a page displays fine in the browser’s window doesn’t automatically mean that it’s easy for a bot to crawl too. Broken links, 404 pages, errors in .htaccess and robots.txt files or long redirect chains often prevent bots from locating and crawling your content.
The result? Your potentially most important pages aren’t getting indexed. That’s where the server-log file comes in handy too. By looking at bot activity and the error codes your server has returned for page hits, you can identify assets they consistently can’t access.
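One way to sketch that check in Python: take the (path, status code) pairs for bot requests and flag the paths where every request ended in an error. The `min_hits` threshold is an arbitrary assumption, used only to avoid flagging a page on the strength of a single failed request.

```python
from collections import defaultdict


def consistently_failing(bot_hits, min_hits=3):
    """Return paths where bots made at least min_hits requests and all failed."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for path, status in bot_hits:
        totals[path] += 1
        if status >= 400:  # 4xx and 5xx responses
            errors[path] += 1
    return sorted(
        path for path, count in totals.items()
        if count >= min_hits and errors[path] == count
    )


# Dummy data for illustration only.
bot_hits = [
    ("/old-page", 404), ("/old-page", 404), ("/old-page", 404),
    ("/pricing", 200), ("/pricing", 500), ("/pricing", 200),
    ("/tmp", 404),
]
print(consistently_failing(bot_hits))
```

Here `/pricing` failed once but usually works, and `/tmp` was requested only once, so only `/old-page` is reported.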
#4. Analyze the Reasons Behind Crawl Issues
Analyzing error codes will provide you with one more insight - an indication of why bots can’t crawl those pages. And although the log file might not reveal the actual error that caused an issue, it might set you on the right course to identifying it.
Errors from the 4xx group (e.g. 403, 404) denote a client-side error. They are typically the result of requests sent by a web browser or other user client, such as a request for a page that no longer exists. It’s worth noting that although these errors occur on the client side, they can often be eliminated on the server side too, for example by fixing the broken link or adding a redirect. Errors from the 5xx group, however, indicate a server problem. These are the first errors you should look for, as the complete resolution is in your hands.
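The grouping above can be sketched in a few lines of Python. The function names are hypothetical; the sketch simply counts error responses per path and status code and lists 5xx errors first, following the prioritization suggested above.

```python
from collections import Counter

SERVER_ERROR = "5xx server error"
CLIENT_ERROR = "4xx client error"


def status_class(status):
    """Classify an HTTP status code; returns None for non-error responses."""
    if 400 <= status < 500:
        return CLIENT_ERROR
    if 500 <= status < 600:
        return SERVER_ERROR
    return None


def error_summary(hits):
    """Count errors per (class, status, path), with 5xx errors listed first."""
    counts = Counter()
    for path, status in hits:
        group = status_class(status)
        if group is not None:
            counts[(group, status, path)] += 1
    return sorted(
        counts.items(),
        key=lambda item: (item[0][0] != SERVER_ERROR, -item[1]),
    )


# Dummy data for illustration only.
hits = [("/a", 404), ("/a", 404), ("/b", 500), ("/c", 200), ("/d", 403)]
for (group, status, path), count in error_summary(hits):
    print(group, status, path, count)
```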
#5. Find Out If Bots Crawl Pages They Shouldn’t Be Crawling
Fact: when you’re working on a small site, with tens of pages at most, crawl budget isn’t really an issue. Bots will most likely crawl the entire site in a single session, provided there aren’t any roadblocks stopping them in their tracks, of course.
Google’s Gary Illyes confirmed it too by saying:
“Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.”
But what if you indeed manage a website with thousands of pages?
Then, as we can conclude from Gary’s statement, making sure that bots visit at least the most important pages turns into quite a challenge.
In fact, as Gary said:
“Prioritizing what to crawl, when, and how much resource the server hosting the site can allocate to crawling is more important for bigger sites.”
I’ve already mentioned some of the issues that affect the crawl budget in this article:
- Having too many pages that offer low value to users (hence it’s so important to identify your most important assets, and ensure that bots can crawl them with high frequency).
- Errors and server issues (again, something you could identify with a server-log analysis)
- Low quality or spam content. And I guess this point doesn’t need any further clarification.
- Duplicate content (and particularly, faceted navigation such as filtering products by color, size or any other factors)
In an earlier post, I listed some strategies for optimizing the crawl budget on an enterprise site. At the highest level, the simplest way to make better use of the crawl budget is to block bots from accessing assets they don’t really need to crawl. Use the server-log analysis to double check you haven’t left any of those resources still available for crawling.
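As a sketch of that double check, the Python snippet below scans the paths bots requested and counts the hits that landed on URLs you meant to keep them away from. The prefixes and faceted-navigation parameters are hypothetical placeholders; replace them with whatever your own site treats as low value.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical examples of areas and facet parameters bots should not crawl.
LOW_VALUE_PREFIXES = ("/cart", "/search")
FACET_PARAMS = {"color", "size", "sort"}


def is_low_value(path):
    """True if the path is in a low-value area or carries a facet parameter."""
    parts = urlsplit(path)
    # Simple prefix match; tighten this if unrelated paths share a prefix.
    if parts.path.startswith(LOW_VALUE_PREFIXES):
        return True
    params = parse_qs(parts.query, keep_blank_values=True)
    return bool(FACET_PARAMS.intersection(params))


def wasted_bot_hits(bot_paths):
    """Count bot requests that hit URLs meant to be excluded from crawling."""
    return Counter(path for path in bot_paths if is_low_value(path))


# Dummy data for illustration only.
bot_paths = [
    "/products?color=red",
    "/products?color=red",
    "/cart",
    "/products/blue-shirt",
    "/search?q=shoes",
    "/blog?page=2",
]
wasted = wasted_bot_hits(bot_paths)
for path, count in wasted.most_common():
    print(path, count)
```

Any path that shows up here is still reachable by bots and is a candidate for blocking.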
Do you need to manually go through the entire server-log file to access all this insight?
Not necessarily. As a matter of fact, we’ve recently re-launched Bot Clarity on our platform, which allows you to spot many of the above issues right from a simple dashboard. With it, you can:
- Understand which pages on your site search engine crawlers treat as the most important.
- Optimize crawl budgets to ensure bots crawl and index as many important pages on your site as possible.
- Find broken links and errors that search engine bots have encountered while crawling your site.
- Audit your redirects.
- And tie bot activity to performance, indicating which areas of the site you should focus your efforts on.
Learn more about Bot Clarity here.