Sitemaps
The latest version of the Sphinx Stack generates a sitemap for your documentation using the sphinx-sitemap extension.
This page goes over the nuances of configuring sitemaps, as well as how the extension must be configured in your Sphinx Stack project.
Read the Docs-generated sitemaps
Read the Docs (RTD) generates a basic sitemap that points to the index page and relies on crawlers to index the rest of the site. This is sufficient for some projects, but RTD does not generate sitemaps for subprojects.
This means any project under the Ubuntu documentation library project must generate its own sitemap.
sphinx-sitemap-generated sitemaps
The standard Sphinx Stack uses the dirhtml builder for Sphinx recipes in the
project’s Makefile.
If your project uses an older version of the Sphinx Stack or changes the builder, the
links in the generated sitemap will be malformed. Either update to the latest
version of the Sphinx Stack or ensure your project’s recipes use
the dirhtml builder, not html.
Ensure sphinx-sitemap has been added to your docs/requirements.txt file.
Add sphinx_sitemap to extensions in your configuration file (docs/conf.py):
extensions = ['sphinx_sitemap']
Sitemap configuration
The build configuration file (docs/conf.py) in the Sphinx Stack includes default
sitemap configuration.
The sphinx-sitemap extension requires an html_baseurl variable to be configured.
By default, this is set as:
html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "/")
When building on Read the Docs, this sets html_baseurl dynamically to the value of
the READTHEDOCS_CANONICAL_URL environment variable, which resolves to the full URL
of the documentation including the version and language (if applicable).
In local builds and builds on other hosts, html_baseurl defaults to /.
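As a rough illustration of that fallback, the sketch below wraps the same lookup in a small helper so both cases can be compared side by side. The helper name resolve_baseurl and the example URL are hypothetical; only the os.environ.get call comes from the default configuration.

```python
import os


def resolve_baseurl(environ=os.environ):
    """Return the canonical URL set by Read the Docs, or "/" when it is absent."""
    return environ.get("READTHEDOCS_CANONICAL_URL", "/")


# Simulated Read the Docs build (hypothetical URL).
rtd_env = {"READTHEDOCS_CANONICAL_URL": "https://example.readthedocs.io/en/latest/"}
print(resolve_baseurl(rtd_env))

# Simulated local build where the variable is not set.
print(resolve_baseurl({}))
```

A sitemap built with the "/" fallback contains relative locations, so it is mainly useful for checking the build locally rather than for publishing.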
The sitemap_url_scheme variable is set to '{link}' by default. This uses the
value of html_baseurl to generate the full URL for each page for the sitemap.
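To make the relationship between the two settings concrete, here is a minimal sketch of how a page URL might be assembled from a base URL and a URL scheme. It is an approximation for illustration, not the extension's actual code: the function page_url, its parameters, and the example URLs are all assumptions.

```python
def page_url(html_baseurl, link, url_scheme="{link}", lang="", version=""):
    """Join the base URL with the formatted scheme (illustrative only)."""
    # Normalize the base so it ends with exactly one slash.
    base = html_baseurl.rstrip("/") + "/"
    return base + url_scheme.format(lang=lang, version=version, link=link)


# The base URL already carries language and version, so "{link}" is enough.
print(page_url("https://example.readthedocs.io/en/latest/", "tutorial/"))

# With the local fallback of "/", the result is a root-relative path.
print(page_url("/", "tutorial/"))
```

Because the canonical URL already includes the language and version, a scheme of '{link}' avoids repeating them in each sitemap entry.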
Note
If you are implementing a sitemap on an RTD instance that is not a subproject, and
it uses {link} for the sitemap_url_scheme, RTD will replace your sitemap
with its own.
This is a known bug. The only current workaround is to use a different sitemap name
and a custom robots.txt pointing to it.
lastmod configuration
As of version 2.7.0, the sitemap extension supports adding a lastmod date.
Make sure that your configuration file has:
sitemap_show_lastmod = True
Exclude pages
Pages can be excluded from the sitemap by adding them to sitemap_excludes in docs/conf.py:
sitemap_excludes = [
    '404/',
    'genindex/',
    'search/',
]
Wildcards are supported. For example, _modules/* excludes the path _modules/ and
all paths such as _modules/foo/bar/. For details, see Excluding Pages.
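The wildcard behaviour can be approximated with Python's fnmatch module. This is only a sketch, assuming the patterns follow shell-style wildcard rules; is_excluded is a hypothetical helper and not part of the extension's API.

```python
from fnmatch import fnmatchcase


def is_excluded(link, patterns):
    """Return True if the page link matches any exclusion pattern."""
    # fnmatchcase treats "*" as "any run of characters", including "/".
    return any(fnmatchcase(link, pattern) for pattern in patterns)


patterns = ["404/", "genindex/", "search/", "_modules/*"]

for link in ["404/", "_modules/", "_modules/foo/bar/", "tutorial/"]:
    status = "excluded" if is_excluded(link, patterns) else "kept"
    print(f"{link}: {status}")
```

Under these assumptions, a quick check like this can help you confirm that a pattern covers the pages you intend before you rebuild the documentation.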
Validate your sitemap
A sitemap will be available at different locations, depending on how it is generated.
Read the Docs-generated sitemaps are available at the base domain of a project. Sitemaps generated with this extension are placed at the base of the URL schema in use.
For example, two sitemaps are generated for the sphinx-sitemap documentation, which is hosted on RTD:
The first is generated by RTD and is available at the root of the domain: https://sphinx-sitemap.readthedocs.io/sitemap.xml
The second is generated by the sphinx-sitemap extension and is available at the base of the URL schema used by the RTD instance: https://sphinx-sitemap.readthedocs.io/en/latest/sitemap.xml
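One way to inspect a sitemap after downloading it is to parse it with Python's standard library. The following is a minimal sketch: the function list_sitemap_entries and the sample XML are hypothetical, and a real check would read the contents of your own sitemap.xml instead.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_entries(xml_text):
    """Return (loc, lastmod) pairs for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod))
    return entries


# Hypothetical sample; replace with the text of a real sitemap.xml.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.readthedocs.io/en/latest/</loc>
    <lastmod>2024-01-01T00:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.readthedocs.io/en/latest/tutorial/</loc>
  </url>
</urlset>"""

for loc, lastmod in list_sitemap_entries(sample):
    print(loc, lastmod or "(no lastmod)")
```

Entries whose location is malformed, or that lack a lastmod value when sitemap_show_lastmod is enabled, are worth investigating before the sitemap is published.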
How to specify a sitemap
A robots.txt file dictates which sitemap is used to index a website. You can use
a custom robots.txt file by creating your own and adding it to
html_extra_path in your configuration file, so that it is copied to the root of the
built documentation. An example can be found in the
Ubuntu documentation library project.
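To confirm which sitemap a robots.txt file advertises, you can read it with Python's urllib.robotparser module. This is a sketch with hypothetical file contents and URL; RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; replace with your own file's text.
robots_txt = """\
User-agent: *
Disallow: /search/

Sitemap: https://example.readthedocs.io/en/latest/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the list of Sitemap entries, or None if there are none.
print(parser.site_maps())
```

If the result is None, the file has no Sitemap line, and crawlers fall back to whatever sitemap they discover on their own.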
Support multiple versions
The sphinx-sitemap extension doesn’t support multiple versions by default. Because search engines and other web systems crawl websites in order to index them, configuring your versioned documentation to use an appropriate version may be sufficient.
If you want sitemaps for all your documentation’s versions, you need to deploy your own
robots.txt file and sitemap index. Supporting multiple versions is recommended for
documentation with LTS releases, as it makes past versions more prominent to search
engines.
For this task, we’ll use the Sphinx Stack as an example. Let’s assume it has three
versions, 1.0, 2.0, and 3.0, and uses the URL schema of <version>/<filename>.
First, ensure each version of your documentation has a sitemap generated by this extension with the appropriate version.
Next, create a sitemapindex.xml file in the same directory as the configuration
file, and point to the sitemap files of each of your documentation sets:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://canonical-sphinx-stack.readthedocs-hosted.com/stable/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://canonical-sphinx-stack.readthedocs-hosted.com/3.0/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://canonical-sphinx-stack.readthedocs-hosted.com/2.0/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://canonical-sphinx-stack.readthedocs-hosted.com/1.0/sitemap.xml</loc>
  </sitemap>
</sitemapindex>
Create a robots.txt file in the same directory as the configuration file.
If necessary, block any paths you don’t want crawled. Google describes how to do this in How to write and submit a robots.txt file.
At the end of robots.txt, point to the future path of sitemapindex.xml:
Sitemap: https://canonical-sphinx-stack.readthedocs-hosted.com/stable/sitemapindex.xml
Lastly, add both new files to the configuration file:
html_extra_path = [
    "sitemapindex.xml",
    "robots.txt",
]
This provides a sitemapindex.xml file that points to the sphinx-sitemap-generated
sitemap for each version.
You may want to automate the generation of the sitemapindex.xml file. To see how
this is done for the Ubuntu documentation library project, which generates a sitemap
containing subproject sitemaps, see the script here.
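If you only need a simple starting point, the sketch below builds a sitemap index from a list of versions. It is not the script used by the Ubuntu documentation library project. The function build_sitemap_index is hypothetical, and the base URL and version list are taken from the example above. ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, versions):
    """Return sitemap index XML with one <sitemap> entry per version."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for version in versions:
        sitemap = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(sitemap, "loc")
        loc.text = f"{base_url.rstrip('/')}/{version}/sitemap.xml"
    ET.indent(root)  # pretty-print; available from Python 3.9
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


index_xml = build_sitemap_index(
    "https://canonical-sphinx-stack.readthedocs-hosted.com",
    ["stable", "3.0", "2.0", "1.0"],
)
print(index_xml)
```

Writing the returned string to sitemapindex.xml before each build keeps the index in step with the list of published versions.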