Crawlability issues can sink any SEO effort. 

The most common technical SEO issues search engine spiders encounter involve crawling the site. Your SEO efforts are only effective if Googlebot can crawl and index your site. If a spider can't access your web pages, they won't be indexed. Beyond your site not being crawled by the search engine, these technical SEO issues likely affect user experience too. For instance, if spiders can't follow your website's paths, neither will your users.

Lastly, it's important that your site can be crawled efficiently to optimize crawl budget.


In this post, we'll look at some of the top crawlability problems affecting a site's SEO performance.  

Steps to Find Crawlability Issues

1. Run regular crawls and analyze the data. Use software that mimics how Google accesses your web pages.

2. Spot and fix high-priority issues, such as a misconfigured robots.txt file and server errors.

3. Prioritize mid- and lower-priority issues to make the biggest impact you can in helping Google better index your content.

How to Leverage Crawlers to Spot Crawlability Problems for SEO

I recommend running two types of crawls using a crawler tool:

1. A crawl of the site starting from the home page. Let the crawler loose on the site to mimic Google's web crawler.

2. A crawl of landing pages for SEO, ideally aligned with the XML sitemaps.

The data from these crawls will help diagnose crawl problems. More insights will come from additional crawls with further variables, such as setting the user agent to Googlebot, crawling as a mobile device to see the mobile experience, and rendering JavaScript as opposed to just the HTML. You can save time by saving these settings and scheduling future crawls, perhaps weekly; this is one of the many great capabilities of an SEO platform.

 

High Priority

URLs Blocked by Robots.txt

The first thing a bot will look for on your site is your robots.txt file. You can direct Googlebot by using “Disallow” rules to specify which pages you don’t want it to crawl.

User-agent: Googlebot

Disallow: /example/

The robots.txt file is most often the cause of a site's crawlability problems. The directives in this file could block Google from crawling your most important pages, or fail to block pages you want kept out of the index.

How to find:

  1. Google Search Console – The blocked resources report shows a list of hosts that provide resources on your site that are blocked by robots.txt rules. 
  2. Crawl – Analyze your own crawl outputs outlined above. Identify pages flagged for being blocked via the robots.txt file.

These issues often stem from a typo or a misused wildcard pattern in a rule, and small mistakes here can cause major problems.
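As a quick illustration, Python's standard library can parse robots.txt rules and report whether a given URL is blocked for a given user agent. This is a minimal sketch using the hypothetical rule set from the example above; swap in your own site's robots.txt contents:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules mirroring the example above
robots_txt = """\
User-agent: Googlebot
Disallow: /example/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether Googlebot may fetch specific URLs
blocked = parser.can_fetch("Googlebot", "https://www.example.com/example/page")
allowed = parser.can_fetch("Googlebot", "https://www.example.com/products/widget")

print(blocked)   # False - this path is disallowed
print(allowed)   # True - no rule blocks this path
```

In practice you would point the parser at the live file with `parser.set_url(...)` and `parser.read()`, then loop the check over your key landing pages to catch accidental blocks early.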

Server (5xx) and Not Found (404) Errors

Like being blocked, if Google arrives at a page and encounters these errors, it’s a big problem. A web crawler travels through the web by following links. Once the crawler hits a 404 or 500 error page, it’s a dead end for the bot. When a bot hits a large number of error pages, it will eventually give up crawling the page and, ultimately, your site.

How to find:

  1. Google Search Console – GSC reports the server errors and 404s (aka broken links) it encounters. The Fetch and Render tool also serves as a useful point solution.
  2. Analyze the outputs from regularly scheduled crawls for server errors. Also note issues such as redirect loops and meta refreshes, and any other circumstance where Google ultimately cannot access the page.
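The triage logic above can be sketched in Python. These are illustrative helpers, not part of any particular tool: one buckets crawl status codes the way a report might, and one walks a hypothetical redirect map to catch loops:

```python
def triage_status(status):
    """Bucket an HTTP status code the way a crawl report might."""
    if status >= 500:
        return "server error"
    if status == 404:
        return "not found"
    if 300 <= status < 400:
        return "redirect"
    return "ok"

def has_redirect_loop(redirects, start, max_hops=10):
    """Follow a {source: target} redirect map and detect loops."""
    seen = set()
    url = start
    while url in redirects and len(seen) < max_hops:
        if url in seen:
            return True
        seen.add(url)
        url = redirects[url]
    return url in seen  # True only if we ended back on a visited URL

# Hypothetical crawl data
statuses = {"/a": 200, "/b": 503, "/c": 404}
redirect_map = {"/old": "/new", "/new": "/old"}  # a two-step loop

print(triage_status(statuses["/b"]))            # server error
print(has_redirect_loop(redirect_map, "/old"))  # True
```

Running checks like these against every crawl export turns "find the dead ends" into a repeatable step rather than a manual scan.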

SEO Tag Errors

Look for issues with the tags that are directives to Google: canonical or hreflang to name a few. These tags could be missing, incorrect, or duplicated, potentially confusing crawlers. 

How to find:

1. Google Search Console - These issues may appear in Google Search Console but not be interpreted as errors. For example, if a site has duplicate content because of a missing canonical tag, search engines will try to index those pages. Within GSC, the "number of pages indexed" will rise, which alone is not an "error." The tag issues typically surface in the "HTML Improvements" and international sections within GSC.

2. Analyze the crawl outputs for any missing or incorrect values. Pay special attention to the key landing pages for SEO. Keep a record of the key elements for each page (directives such as "noindex") you expect to see.

Note: Platform users can set rules to flag changes in these elements via "high priority" alerts, such as "Noindex detected" on a page where it shouldn't be, which can have a major impact on the site. This is a great example of how site audit technology can scale SEO tasks.
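To illustrate what a crawler records for each page, here is a small sketch using Python's built-in html.parser to pull the canonical URL and robots meta directive out of a page's HTML (the sample markup is hypothetical):

```python
from html.parser import HTMLParser

class SeoTagAudit(HTMLParser):
    """Collect the canonical link and robots meta directive from a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")
        elif tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content")

# Hypothetical page source
html = """
<html><head>
<link rel="canonical" href="https://www.example.com/widgets/">
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>
"""

audit = SeoTagAudit()
audit.feed(html)
print(audit.canonical)  # https://www.example.com/widgets/
print(audit.robots)     # noindex, follow
```

Comparing the extracted values against the expected directives you keep on record for each landing page makes "noindex where it shouldn't be" a mechanical check instead of a manual one.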

Mid-Tier Priority

Rendering Issues

Google’s ability to render JavaScript is improving. Although Progressive Enhancement is still the recommended method (where all the content appears in the HTML source code), it’s useful to fully render pages the way Google now does, so you can experience what a searcher would find on the page.

How to find:

  1. Google Search Console – Fetch and Render Tool. If the “rendered” version does not contain the vital content on the page then there is likely a problem to address. This should also match the cached version of a page.
  2. Analyze the results of a JavaScript-rendered crawl – there may be crawl issues (missing content, broken links, etc.) unique to the rendered crawl.

Duplicate Content from Technical Issues (Spider Traps)

Some issues stem from Google or other search engines not knowing which version of the content to index because of the site's coding setup. Examples include pages with many parameters in the URL, session IDs, redundant content elements, and pagination.

How to find:

  1. Google Search Console - There is sometimes an alert for "Too Many URLs" or similar language when Google believes it's encountering more URLs and content than perhaps it should be. Check the messages and make sure you're receiving them as emails too.
  2. Crawl Results - a web crawl will identify these in a few ways. The most obvious is duplicate or missing values in areas such as the title tag or header tags - perhaps internal search pages or product category filters that don't update the meta tags. URLs that look unrecognizable (e.g. with parameters or extra characters) can signal an issue too. These pages may be a problem because they create more work than necessary for Google to access and index priority pages.

Once you find these instances on your site, find ways to either stop the pages from being created, adjust Google's access to them, or check that they carry the correct tags, such as canonical, noindex, or nofollow, so they don't interfere with your target landing pages.
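Duplicate titles in a crawl export are easy to surface programmatically. A minimal sketch, assuming your crawler can export (URL, title) pairs; the data below is hypothetical:

```python
from collections import defaultdict

# Hypothetical crawl export of (url, title) pairs
pages = [
    ("/widgets?sort=price", "Widgets | Example Store"),
    ("/widgets?sort=name",  "Widgets | Example Store"),
    ("/widgets",            "Widgets | Example Store"),
    ("/about",              "About Us | Example Store"),
]

titles = defaultdict(list)
for url, title in pages:
    titles[title].append(url)

# Titles shared by more than one URL often point to spider traps
duplicates = {t: urls for t, urls in titles.items() if len(urls) > 1}
for title, urls in duplicates.items():
    print(f"{title!r} appears on {len(urls)} URLs")
```

Here the parameterized sort URLs all share one title, the classic signature of a filter page that should be canonicalized to the clean version.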

Low-Tier Priority

Site Structure and Interlinking

How a website interlinks between related posts is important for indexation. A page that is part of a clear website structure and is interlinked within content has little barrier to indexation.

How to find:

  1. Analytics - Review your site's analytics to determine how users flow through the site. Identify ways to keep them engaged by linking to related content. Be on the lookout for pages with high bounce rates that may need a clearer nudge to more content. 
  2. Analyze advanced crawl features that show how many internal links an individual page has directed to it. Review the top performing pages for ways the site interlinks to those pages.

Be on the lookout for best-practice elements in this step, such as no internal 301 redirects, correct pagination, and complete sitemaps.
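Counting inbound internal links per page is straightforward once a crawl exports its link graph. A sketch over a hypothetical (source, target) edge list, which also flags orphan pages that sit in the sitemap but receive no internal links:

```python
from collections import Counter

# Hypothetical internal link graph from a crawl: (source, target) edges
links = [
    ("/", "/blog/"),
    ("/", "/products/"),
    ("/blog/", "/blog/crawl-budget"),
    ("/products/", "/blog/crawl-budget"),
]
sitemap_pages = {"/", "/blog/", "/products/", "/blog/crawl-budget", "/blog/orphan-post"}

# Tally how many internal links point at each page
inbound = Counter(target for _, target in links)

# Sitemap pages with zero inbound links (the home page is excluded)
orphans = sorted(p for p in sitemap_pages if inbound[p] == 0 and p != "/")

print(inbound["/blog/crawl-budget"])  # 2
print(orphans)                        # ['/blog/orphan-post']
```

Pages like the orphan above depend entirely on the sitemap for discovery; adding contextual links from related content lowers their barrier to indexation.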

Mobile Usability

Mobile usability is a key priority area for SEO with the roll-out of Google’s mobile-first index. If a site is deemed unusable on mobile devices, Google may drop it in the SERPs, resulting in lost traffic.

How to find:

  1. Google Tools - Test your key landing pages in the Google Mobile Friendly Tester tool as well as monitor mobile issues within Google Search Console.
  2. Analyze a Mobile Crawl - Review the output of a crawl run as a mobile device and ensure the site's content appears. Any issues with mobile navigation or usability should surface here if content you expect to find is missing.
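Most crawlers emulate a mobile device simply by sending a smartphone user-agent string with each request. A minimal sketch with Python's urllib; the user-agent string below is illustrative, not Googlebot's exact current value:

```python
from urllib.request import Request

# Illustrative smartphone user-agent string (not Googlebot's exact value)
MOBILE_UA = ("Mozilla/5.0 (Linux; Android 10; Pixel 4) "
             "AppleWebKit/537.36 (KHTML, like Gecko) "
             "Chrome/90.0.0.0 Mobile Safari/537.36")

def mobile_request(url):
    """Build a request that a server will treat as mobile traffic."""
    return Request(url, headers={"User-Agent": MOBILE_UA})

req = mobile_request("https://www.example.com/")
print(req.get_header("User-agent"))  # the mobile UA string above
```

Fetching the page with `urlopen(req)` and diffing the response against a desktop fetch is a quick way to spot content that only appears in one experience.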

Thin Content

If you've confirmed that your site doesn't have the issues outlined above but still isn't indexed, you may have "thin content." Google is aware of these pages; it just doesn't believe they are worthwhile to index. The content on these pages may be boilerplate, may exist elsewhere on your website, or may simply not be unique enough. It may also lack external signals validating its value or authority from news websites or other industry sites, i.e. no links to it. 

How to find:

  1. Analyze the site's content that is not indexed by Google (can proxy this by target landing pages not receiving traffic), and review the target queries for the page. Refresh the content or create new content based on keyword research to provide better value.
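Word count is a crude but useful first proxy when hunting for thin content at scale. A sketch over hypothetical extracted page text, flagging pages below a threshold you would tune per site and page type:

```python
# Hypothetical extracted body text per URL
page_text = {
    "/guide/technical-seo": "A long, detailed guide " * 200,
    "/tag/misc": "Posts tagged misc.",
    "/category/empty": "",
}

THIN_THRESHOLD = 250  # words; tune per site and page type

thin_pages = sorted(
    url for url, text in page_text.items()
    if len(text.split()) < THIN_THRESHOLD
)
print(thin_pages)  # ['/category/empty', '/tag/misc']
```

Pages flagged this way are candidates for a content refresh, consolidation into a stronger page, or a noindex tag, rather than automatic deletion.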

Conclusion

A website free of crawlability issues is a great place to be. Sites that achieve this enjoy relevant traffic from Google and other search engines and can focus on bettering the search experience rather than fixing problems. It's not easy, especially if you have limited time to dedicate to these crawlability problems. Spotting and fixing these issues can take effort from dozens of people - from a web design team to developers, content writers, and other stakeholders. That's why it's important to find the top problems affecting your performance, develop a plan to fix them, and set standards to prevent new ones in the future.

Learn more about Clarity Audits, our site audit technology with a built-in JavaScript and HTML crawler, and how it identifies crawlability issues and performs technical health checks of your site.