What is a Sitemap?
A sitemap is an XML file that includes information about your website’s pages, images, videos, and other files. This information is then presented to a search engine bot like Googlebot so the search engine can better crawl your site.
Since you provide the sitemap information, this also clues Google in on which pages and files you believe are the most important on your site. Plus, the sitemap offers the search engine valuable information about these pages and files.
With pages, for example, this additional information can include when the page was last updated, how often the page changes, and any alternate language versions of the page.
According to Google, a single sitemap cannot be larger than 50MB (uncompressed) or contain more than 50,000 URLs. If your sitemap exceeds this file size or URL count, you will need to create multiple sitemaps.
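As a rough illustration of the URL limit, the Python sketch below splits a list of URLs into groups that each fit in one sitemap. The function name chunk_urls and the example.com addresses are assumptions made for illustration. A real generator would also need to check the 50MB uncompressed size of each file it writes.

```python
MAX_URLS_PER_SITEMAP = 50_000  # URL limit quoted above


def chunk_urls(urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists that each fit in one sitemap."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


# Hypothetical site with 120,000 pages.
all_urls = [f"https://www.example.com/page-{n}" for n in range(120_000)]
chunks = chunk_urls(all_urls)

print(len(chunks))               # 3 sitemap files needed
print([len(c) for c in chunks])  # [50000, 50000, 20000]
```

Each chunk would then be written out as its own sitemap file.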
The Format of a Sitemap
Below is a simple example of an XML sitemap that shows the location of a single URL.
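A minimal sketch of such a file, generated with Python so the structure is easy to see. The helper build_sitemap and the example.com address are placeholders chosen for illustration, not part of any official tooling.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        lines.append("  <url>")
        # escape() turns &, < and > into XML entities.
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


print(build_sitemap(["https://www.example.com/"]))
# Prints:
# <?xml version="1.0" encoding="UTF-8"?>
# <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#   <url>
#     <loc>https://www.example.com/</loc>
#   </url>
# </urlset>
```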
Sitemaps.org has many examples of sitemaps, many with complex scenarios and full documentation.
Let’s break down the components of a sitemap.
<?xml version="1.0" encoding="UTF-8"?>
The first line of the sitemap above informs the search engines that they are reading an XML file. In addition to this, the search engine can pick up on the version, in this case “1.0”, which is the preferred version for sitemaps.
You’ll also see the type of encoding. It’s necessary that the encoding type is UTF-8.
The next line of the XML sitemap opens the URL set with the <urlset> tag and its namespace, http://www.sitemaps.org/schemas/sitemap/0.9. This is a container for all the URLs in the sitemap.
The last portion of the text, “/sitemap/0.9” indicates the protocol standard. In this case, a common protocol standard is used: Sitemap 0.9. Most search engine crawlers support this standard.
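To see how a parser treats that protocol namespace, here is a small Python sketch using the standard library's ElementTree. The sample document and the function name read_locs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical single-URL sitemap, inlined so the sketch runs offline.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
</urlset>
"""


def read_locs(xml_bytes):
    """Return (namespace, [loc values]) for a sitemap document."""
    root = ET.fromstring(xml_bytes)
    # ElementTree reports namespaced tags as "{namespace}localname".
    namespace = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    loc_tag = f"{{{namespace}}}loc" if namespace else "loc"
    locs = [(el.text or "").strip() for el in root.iter(loc_tag)]
    return namespace, locs


namespace, locs = read_locs(SAMPLE)
print(namespace == SITEMAP_NS)  # True
print(locs)                     # ['https://www.example.com/']
```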
Each URL entry sits inside a parent <url> tag. The location of the URL, in this case the URL leading to seoClarity’s homepage, must be within the <loc> tags.
These URLs must also be absolute, canonical URLs, not relative ones.
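The absolute-versus-relative distinction can be checked mechanically. The sketch below is one possible check using Python's urllib.parse. The function name is_absolute_url and the sample URLs are assumptions. It does not verify that a URL is canonical, since that requires fetching the page.

```python
from urllib.parse import urlsplit


def is_absolute_url(url):
    """True when the URL has both an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "https://www.example.com/blog/",  # absolute
    "/blog/",                         # relative path
    "www.example.com/blog/",          # missing scheme
]
for candidate in candidates:
    print(candidate, is_absolute_url(candidate))
```

Only the first candidate passes; the other two would need the scheme and host added before going into a <loc> tag.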
This portion of the XML sitemap can also include additional components, as seen on Sitemaps.org. For example, <lastmod> is optional here. These optional components are not crucial to your SEO, but they are there if you choose to use them:
<lastmod>: The date when the file was last modified. (This can also include the time.) It’s important to note that the W3C Datetime format must be used. That is, YYYY-MM-DD for a date alone, or a full timestamp such as YYYY-MM-DDThh:mm:ssTZD when the time is included.
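Both forms of the date can be produced with Python's standard datetime module. This is a sketch under the assumption that timestamps are stored in UTC. The function names and the sample dates are made up for illustration.

```python
from datetime import date, datetime, timezone


def lastmod_date(d):
    """Date-only form: YYYY-MM-DD."""
    return d.isoformat()


def lastmod_datetime(dt):
    """Date, time, and UTC offset: YYYY-MM-DDThh:mm:ss+hh:mm."""
    if dt.tzinfo is None:
        raise ValueError("use a timezone-aware datetime so the offset is explicit")
    return dt.replace(microsecond=0).isoformat()


print(lastmod_date(date(2024, 1, 15)))
# 2024-01-15
print(lastmod_datetime(datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)))
# 2024-01-15T09:30:00+00:00
```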
<priority>: Remember that the sitemap contains your most important URLs. The <priority> property is another opportunity to specify the importance of each individual URL relative to the other URLs in the sitemap. If you use it, pick a value between 0.0 and 1.0, where 1.0 marks the highest priority.
<changefreq>: How frequently the page is likely to change, using one of the values always, hourly, daily, weekly, monthly, yearly, or never. This gives search engine bots a hint about how often they may want to come back to recrawl the page.
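Putting the optional tags together, the following Python sketch builds one <url> entry and rejects values outside the allowed ranges. The helper build_url_entry and the sample values are assumptions for illustration, and the optional tags are written only when a value is supplied.

```python
from xml.sax.saxutils import escape

CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def build_url_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Return one <url> block; optional tags are emitted only when given."""
    lines = ["<url>", f"  <loc>{escape(loc)}</loc>"]
    if lastmod is not None:
        lines.append(f"  <lastmod>{lastmod}</lastmod>")
    if changefreq is not None:
        if changefreq not in CHANGEFREQ_VALUES:
            raise ValueError(f"unknown changefreq: {changefreq!r}")
        lines.append(f"  <changefreq>{changefreq}</changefreq>")
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError(f"priority out of range: {priority}")
        lines.append(f"  <priority>{priority:.1f}</priority>")
    lines.append("</url>")
    return "\n".join(lines)


print(build_url_entry("https://www.example.com/blog/",
                      lastmod="2024-01-15", changefreq="weekly", priority=0.8))
# Prints:
# <url>
#   <loc>https://www.example.com/blog/</loc>
#   <lastmod>2024-01-15</lastmod>
#   <changefreq>weekly</changefreq>
#   <priority>0.8</priority>
# </url>
```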
Google offers multiple general sitemap guidelines to follow when building and submitting a sitemap to the search engine.
- Google crawls URLs exactly as they are listed, so be sure to use consistent URLs in your sitemap.
- It’s recommended that you put your sitemap at the site root. The sitemap can be placed anywhere, but it affects only the files in its own directory and below, so a sitemap at the root is the one that can cover all site files.
- To avoid duplicate crawling of URLs, do not include session IDs in URLs.
- Use hreflang to alert Google to alternative language versions of a page.
- UTF-8 must be used for sitemaps.
- If you have more than one sitemap, list all the individual sitemaps on a sitemap index file and submit this single file.
- Only include self-canonical URLs in the sitemap.
- Point to only one version of a URL if a URL differs on mobile versus desktop.
- If you have media types like video and images, use sitemap extensions for these additional media types.
- Use hreflang in a sitemap or HTML tag to show URL variations for language differences.
- URLs in a sitemap can only contain ASCII characters, not upper ASCII characters or special characters. Percent-encode or entity-escape anything else.
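Two of the guidelines above, dropping session IDs and keeping URLs ASCII-only, lend themselves to a small cleanup step before URLs are written to the sitemap. The Python sketch below is one way to do it. The parameter names in SESSION_PARAMS, the function names, and the sample URL are assumptions, so adjust them to whatever your site uses. The sketch also drops URL fragments and does not handle non-ASCII host names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote
from xml.sax.saxutils import escape

# Hypothetical session parameter names; adjust to your own site.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def clean_url(url):
    """Drop session-ID query parameters and percent-encode non-ASCII characters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    # safe="/%" keeps path separators and already-encoded sequences intact.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


def loc_value(url):
    """Entity-escape the cleaned URL so it is safe inside a <loc> tag."""
    return escape(clean_url(url))


print(loc_value("https://www.example.com/café?sid=abc123&page=2&sort=asc"))
# https://www.example.com/caf%C3%A9?page=2&amp;sort=asc
```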
Common Sitemap Issues
Now that you’re familiar with the purpose of the XML sitemap and its general best practices, it’s helpful to be aware of common issues that arise with the sitemap.
- Pages included in the sitemap contain a meta noindex tag
- Soft 404 URLs included in the sitemap
- Canonicalized URLs (pages whose canonical tag points to a different URL) included in the sitemap
- Disallowed or forbidden URLs included in the sitemap
- Redirecting URLs included in the sitemap
- Low-quality pages included in the sitemap
- Sitemap is not being updated with new URLs
- Forgetting to update the sitemap after a migration
- Submitted page blocked by robots.txt file
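Some of these issues can be caught with a simple audit. As one example, the Python sketch below checks a list of sitemap URLs against robots.txt rules using the standard library's urllib.robotparser. The robots.txt content, the URLs, and the function name blocked_urls are made up and inlined so the sketch runs offline. The standard-library parser uses plain prefix matching, so its verdicts may differ from Googlebot's for rules that rely on wildcards. Checks for noindex tags, redirects, or soft 404s would require fetching each page and are not shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules and sitemap URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_URLS = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
    "https://www.example.com/private/report",
]


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the sitemap URLs that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


print(blocked_urls(ROBOTS_TXT, SITEMAP_URLS))
# ['https://www.example.com/private/report']
```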