Google’s web spider, Googlebot, is constantly crawling web pages and adding them to Google’s index. When it crawls your site, it registers quite a bit of information about how your site is working.
But there’s a lot of information you can learn by following Googlebot’s activity on your site, too.
Wouldn’t it be useful to see what Google sees? After all, one of your goals as an SEO is to follow the guidelines set by Google, and if you can get a behind-the-scenes look at how it understands your website, you can adjust your strategy accordingly.
Good news: you can. Analyze bot activity to get an inside look at what the search engine picks up about your webpage – and uncover pesky spam bots along the way. Many SEOs don’t conduct log file analysis, and as a result they miss valuable insights that a regular site crawl can’t surface.
In this post, I’m going to show you how you can analyze server log files to glean these important insights and improve search performance. First, let’s cover the basics…
What is Log File Analysis?
A server log file is a file output by a web server containing ‘hits,’ or a record of all requests the server has received. Log file analysis is a tool in your tool belt that lets you dig into which pages and pieces of content on your site Google is crawling.
The information contained in a log file includes:
- Time and date
- Requesting IP address
- Response code
- User agent
- Requested file
Below is an example of what a server log file looks like (using dummy information):
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
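To make these fields concrete, here is a minimal Python sketch (the function name is my own) that parses a line in the Common Log Format like the example above:

```python
import re

# Regex for the Common Log Format fields listed above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the log fields as a dict, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_log_line(
    '127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```

Each named group maps to one of the fields in the bullet list: the requesting IP, the timestamp, the requested file (inside the quoted request), and the response code.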
Because a server log file is real information from Googlebot (and other search engine crawlers), the analysis of the log files answers questions like:
- Is my crawl budget spent efficiently?
- What accessibility errors were met during the crawl?
- Where are the areas of crawl deficiency?
- Which are my most active pages?
- Which pages does Google not know about?
These are just a few examples of the insights that you can uncover with log file analysis.
While there are ways to signal to Google how it should crawl a site (e.g. an XML sitemap or robots.txt file), finding the answers to these questions can be greatly beneficial in adjusting your strategy to alert Googlebot to your most important pages.
Challenges of Log File Analysis
There can be some inherent obstacles with log analysis. For one, it can be difficult to get your hands on the bot log files, and if you’re an enterprise company, you most likely have hundreds of thousands of pages on your site. That’s a lot of information to gather and digest.
Since log file analysis is typically separate from your SEO reporting, you have to connect the dots manually. And while it is possible to do this, there’s no reason you should be doing it this way. There’s simply TOO much data. If you were to do this manually in Excel, you’d only see log file data for one day and not the overall trend. Not to mention the time wasted trying to filter, segment, and organize the data.
You need a platform to pull this data together, because truly, it has to be aggregated for it to be meaningful.
Let me illustrate this with an example. If a website has 5,000 visitors a day who each go to 10 pages, then the server will create a log file entry of 50,000 records. It would be an incredibly cumbersome process to go through that data manually.
By bringing your bot log files into the same tool as the rest of your SEO reporting, you can more easily connect the dots and find out what this information is telling you. So, what does this process look like?
How to Analyze Log Files With seoClarity
In fact, seoClarity is the only SEO platform to deliver a robust and powerful log file analysis solution as part of its main offering.
At seoClarity, your Client Success Manager helps you set up all the appropriate files – essentially, you make the log files available, and we pull them in. Once the files are uploaded into the platform, you can use Bot Clarity, our integrated log file analyzer, to discover how bots access your site, any issues they encounter, and how your crawl budget is spent.
We do the heavy lifting so you’re left with the meaningful information.
Our log file analysis tool looks at your log files and allows for the correlation between bots, rankings, and analytics. To discover these insights, navigate to Bot Clarity within the Usability tab inside the platform.
(Bot Clarity within the seoClarity platform.)
Here, you can discover bot requests, request status, and the number of bots found crawling your site.
Since we’re mainly concerned with Googlebot, let’s filter down the results by Bot Group to get a deeper understanding of how it crawls and understands the site.
It’s also interesting to see how the different arms of Google (e.g. Google Desktop vs. Google Mobile) are crawling and understanding your site.
(Filter down to see information for specific bots.)
Next, we analyze which URLs Googlebot is crawling, and how often. Now that we know what pages of the site Google is looking at, we can download that data to have on hand. Then, look up your XML sitemap. Are the pages on your XML sitemap the same pages that Google is crawling? What pages is Google crawling that aren’t on the sitemap, where Google may be wasting its time?
(Figure out which pages Googlebot is crawling, and how often.)
But log file analysis offers more insights than just seeing what pages Googlebot is crawling. Let’s take a look at what other use cases can be applied…
Other Insights From Log File Analysis
Log data can be used across a variety of use cases. Analyzing bot log files lets you see your site how the search engine sees it, which means you can pick up on potential errors and fix them with site updates for the next time the bots come around.
Spoofed Bot Activity
Spoofed activity refers to any crawl request from a bot that declares itself as a major search engine but whose IP address doesn’t match that search engine’s. Our tool easily flags crawlers that pretend to be Googlebot while crawling your site and using up valuable resources. If you do find spam bots, you can block them so your crawl budget is optimized and your site loads faster.
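Outside of a dedicated tool, Google’s documented way to verify Googlebot is a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm the result. A rough Python sketch of that check (function names are my own, not seoClarity’s implementation):

```python
import socket

# Hostname suffixes that Google's documentation lists for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """Check whether a reverse-DNS hostname belongs to Google."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_real_googlebot(ip):
    """Reverse-DNS the IP, check the hostname, then forward-DNS to confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
        if not hostname_is_google(hostname):
            return False                            # spoofed user agent
        # Forward lookup must resolve back to the original IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False  # lookup failed; treat as unverified
```

A bot that merely sets its user agent string to "Googlebot" will fail this check, because its IP won’t reverse-resolve to a Google hostname.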
(Validated vs. spoofed activity in Bot Clarity.)
Also check the HTTP status of your website’s URLs so you know which are working properly and which are responding with errors. A 2xx response code means the request was properly received and accepted.
But 3xx, 4xx, and 5xx response codes should be addressed. For example, while a single 301 redirect (indicating the page was moved permanently) isn’t a problem, chains of multiple redirects are going to cause trouble.
Since some response codes are positive, you can filter down the results to specify which response codes you want to see. Here, I’ve filtered down the results to show 3xx and 4xx response codes.
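As an illustration of this kind of response-code filtering, here is a small Python sketch over hypothetical parsed log entries (the URLs and statuses are invented):

```python
from collections import Counter

# Hypothetical (url, status) pairs already parsed out of a log file.
entries = [
    ("/home", 200),
    ("/old-page", 301),
    ("/old-page", 301),
    ("/missing", 404),
    ("/products", 200),
]

# Keep only 3xx and 4xx responses, mirroring the filter described above.
problem_hits = [(url, status) for url, status in entries if 300 <= status < 500]

# Tally how often each problem URL was requested.
problem_counts = Counter(url for url, _ in problem_hits)
```

Repeated 3xx hits on the same URL in a tally like this are a quick way to spot redirects that bots keep hitting.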
(Response Codes for different site URLs.)
Also find out your Googlebot crawl rate over time, and how it correlates with response times and the serving of error pages.
New Content Discovery
With the log file analyzer, you can group new pages on the site through segmentation, and see exactly when these specific pages have been crawled. In a matter of days, you can be 100% certain that this new strategic content has been discovered by Google.
User Agent Filter
Use the user agent filter to select the user agents you want to analyze, or search for them by name. Filter specific user agents based on the following criteria: is, isn’t, contains, does not contain, starts with, ends with, or a regex pattern. This allows you to narrow in and discover which search bots have the highest level of activity on your site. Filtering down to the specific user agents you want to analyze also lets you see whether the search bots crawling your site coincide with the search engines you want to rank in.
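The filter criteria above map naturally onto simple string and regex checks. A hypothetical Python sketch of how such matching could work (this is my illustration, not the platform’s code):

```python
import re

def ua_matches(user_agent, criterion, value):
    """Apply one of the filter criteria listed above to a user agent string."""
    checks = {
        "is":               lambda ua: ua == value,
        "isn't":            lambda ua: ua != value,
        "contains":         lambda ua: value in ua,
        "does not contain": lambda ua: value not in ua,
        "starts with":      lambda ua: ua.startswith(value),
        "ends with":        lambda ua: ua.endswith(value),
        "regex":            lambda ua: re.search(value, ua) is not None,
    }
    return checks[criterion](user_agent)

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

With a table of predicates like this, every filter option reduces to one string operation, and the regex option covers anything the fixed criteria can’t express.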
Top Crawled Pages
As we’ve seen, log file analysis lets you see which pages a bot is crawling, and which are the top crawled pages. This allows you to verify that the most-crawled pages coincide with the site’s most important pages. You don’t want crawl budget wasted on lower-impact pages – ensure that the pages Google crawls are the high-value pages that feature the most products and drive the most sales for your organization.
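Surfacing top crawled pages from raw log data is essentially a counting exercise. A minimal Python sketch over a hypothetical list of crawled URLs (the paths are invented):

```python
from collections import Counter

# Hypothetical URLs Googlebot requested, extracted from the log file.
crawled_urls = [
    "/products/widget-a",
    "/products/widget-a",
    "/blog/old-post",
    "/products/widget-b",
    "/products/widget-a",
]

# Rank pages by how often the bot requested them.
top_pages = Counter(crawled_urls).most_common(2)
```

Comparing a ranking like this against your list of priority pages makes crawl-budget waste immediately visible.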
Lastly, discover which IP addresses Googlebot is using to crawl your site. Verify that Googlebot is correctly accessing the relevant pages and resources in each case.
Bot log files can take a little work to gather from the correct teams, but once you pipe them into seoClarity and compare them to your other SEO metrics, you are one step closer to understanding Google and how it understands your site.