Content search and categorization have been difficult problems to solve since the dawn of the internet.
However, discoverability and real-time indexing at scale are as hard or even harder because of the sheer volume of content that gets published every day.
Even though search engines can do a pretty good job crawling the web, the process can be slow. That's where sitemaps can provide value to the search engines and to our sites.
In this article, we will learn how to add a sitemap to a Rails application using the "sitemap_generator" gem, how to keep it updated and what the best practices are when it comes to sitemaps.
Let's start by learning what a sitemap is and why your site could benefit from one:
What's a sitemap and why they matter
A sitemap is a file that lists the URLs within our website.
It's usually an XML file, but other formats like RSS and text files are also allowed by the sitemaps protocol.
Sitemaps act as a shortcut for web crawlers: every relevant URL for a given website, old and new, can be extracted from one place without the need to spend many resources on actually crawling the sites.
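To make the format concrete, here is a minimal sketch in plain Ruby that builds the kind of XML a crawler reads. The SitemapEntry struct and the build_sitemap helper are hypothetical names used only for illustration; they are not part of any gem, and the URLs are made up.

```ruby
require "cgi"

# Hypothetical value object describing one URL in the sitemap.
SitemapEntry = Struct.new(:loc, :lastmod, keyword_init: true)

# Builds a sitemap document following the sitemaps.org 0.9 schema.
def build_sitemap(entries)
  lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  ]
  entries.each do |entry|
    lines << "  <url>"
    # URLs must be entity-escaped (for example & becomes &amp;).
    lines << "    <loc>#{CGI.escapeHTML(entry.loc)}</loc>"
    lines << "    <lastmod>#{entry.lastmod.strftime('%Y-%m-%d')}</lastmod>" if entry.lastmod
    lines << "  </url>"
  end
  lines << "</urlset>"
  lines.join("\n")
end

puts build_sitemap([
  SitemapEntry.new(loc: "https://example.com/", lastmod: Time.utc(2024, 10, 9)),
  SitemapEntry.new(loc: "https://example.com/books?page=2&sort=title")
])
```

Each url element carries the location and, optionally, metadata such as the last modification date.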
A conceptual sitemap diagram for a basic website would look like this:
As you can see, a quick glance suffices to know about every piece of content this website published to date.
From the perspective of a search engine, sitemaps save them resources while allowing them to update their indices whenever they see fit.
From the perspective of a website owner, having a sitemap means that our content can get indexed faster and that we can have a bit more control of the way our content is indexed.
Of course, anybody can try to game search engines using a sitemap and that's why having one doesn't actually mean our content will get indexed or that every directive will be respected by search engines.
As with a lot of things, it's all about being a good citizen and establishing trust.
Now that we know what a sitemap is and why we should have one, let's see how we can add one to a Rails application:
Adding a sitemap with the 'sitemap_generator' gem
We will use an example application that has a few static URLs, like a home page, an about page, a FAQ page and a services page, and some dynamically generated URLs, like a list of books and blog articles.
The first step is to add the gem to our Gemfile:
# Gemfile
gem 'sitemap_generator'
After that, we run bundle install to install the gem, then we run the installation command:
bundle exec rake sitemap:install
This will add a config/sitemap.rb file which will contain the configuration for our sitemap generation. It's generated empty, but I usually add the following configuration:
# config/sitemap.rb
require 'rubygems'
require 'sitemap_generator'
# Your site's host. You can use Rails.credentials if you want to
SitemapGenerator::Sitemap.default_host = ENV["APPLICATION_HOST"]
# Creates a sitemap index only if more than one sitemap is generated
SitemapGenerator::Sitemap.create_index = :auto
# Compress set to true will generate an '.xml.gz' file
SitemapGenerator::Sitemap.compress = true
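If APPLICATION_HOST is missing, ENV["APPLICATION_HOST"] returns nil and the problem only surfaces later. One possible way to fail fast is sketched below; sitemap_host is a hypothetical helper, not something the gem provides, and the error messages are only suggestions.

```ruby
require "uri"

# Hypothetical helper: reads the host from the environment and raises
# early if it is missing or is not a full http(s) URL.
def sitemap_host(env = ENV)
  host = env.fetch("APPLICATION_HOST") do
    raise KeyError, "APPLICATION_HOST must be set to generate the sitemap"
  end

  uri = URI.parse(host)
  unless uri.is_a?(URI::HTTP) && uri.host
    raise ArgumentError, "APPLICATION_HOST must be a full URL such as https://example.com"
  end

  host
end
```

With a helper like this, the configuration line could become SitemapGenerator::Sitemap.default_host = sitemap_host.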
Next up, we add the following logic below our configuration in order to generate the sitemap:
SitemapGenerator::Sitemap.create do
  # We add our static URLs first:
  add root_path, changefreq: "weekly", priority: 1
  add about_us_path, changefreq: "monthly", priority: 0.1
  add faq_path, changefreq: "monthly", priority: 0.2
  add services_path, changefreq: "monthly", priority: 0.2

  # We iterate over our database-backed resources in batches:
  Book.find_each do |book|
    add book_path(book), lastmod: book.updated_at, changefreq: "weekly", priority: 0.5
  end

  Article.published.find_each do |article|
    add article_path(article), lastmod: article.updated_at, changefreq: "weekly", priority: 0.4
  end
end
This configuration will add four static pages: Home, About us, FAQ and Services. Then it will dynamically add every Book and published Article we have in our database.
Of course, we can add as many static or dynamic URLs as we need.
In case you are wondering about the parameters we pass to the add method, here's an explanation:
- Resource path: the relative path for the resource. The gem will construct the full URL with the default_host config parameter.
- Lastmod: it expects a date time that represents the last time the resource was modified. It is expected to coincide with the updated_at attributes shown on the front-end.
- Changefreq: this tells the search engines the estimated update frequency for the resource. It can be estimated with successive crawls, but we can help them at least with an approximation.
- Priority: the relative importance of a given page. Google and other search engines publicly disclose they don't give any weight to that parameter, but it doesn't hurt to include it.
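A misspelled option name is easy to miss when writing these calls. The sketch below shows one way to sanity-check the three options discussed above before passing them to add. The check_sitemap_options helper and its constants are hypothetical and only cover lastmod, changefreq and priority; the allowed changefreq values are the ones listed by the sitemaps protocol.

```ruby
# Values allowed for <changefreq> by the sitemaps protocol.
ALLOWED_CHANGEFREQ = %w[always hourly daily weekly monthly yearly never].freeze
# Only the options discussed in this article; the gem accepts more.
KNOWN_KEYS = %i[lastmod changefreq priority].freeze

# Hypothetical helper: returns a list of problems found in the options hash.
def check_sitemap_options(options)
  errors = []

  unknown = options.keys - KNOWN_KEYS
  errors << "unknown keys: #{unknown.join(', ')}" unless unknown.empty?

  if options.key?(:changefreq) && !ALLOWED_CHANGEFREQ.include?(options[:changefreq].to_s)
    errors << "invalid changefreq: #{options[:changefreq]}"
  end

  if options.key?(:priority) && !(0.0..1.0).cover?(Float(options[:priority]))
    errors << "priority must be between 0.0 and 1.0"
  end

  errors
end

p check_sitemap_options(changefreq: "weekly", priority: 0.5)
p check_sitemap_options(change_freq: "monthly", priority: 0.2)
```

The second call reports the misspelled change_freq key instead of letting it pass unnoticed.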
After adding the resources that we think are important to our sitemap we can proceed and generate the actual sitemap with the following command:
bundle exec rake sitemap:refresh
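When run for the first time, this command will create the sitemap inside our application's public directory: a sitemap.xml.gz file with the configuration above, or a sitemap.xml file if compression is disabled. Every subsequent run will overwrite the sitemap with the newest URLs.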
This means that if we want our sitemap to be up-to-date we need to do some proactive work about it.
Updating our sitemap
In order to keep our sitemap as fresh as possible, we can re-run the sitemap:refresh task as frequently as needed.
If we update our site once a day, we can run a once-a-day scheduled task to update our sitemap. We can actually run the task redundantly without losing anything so it's better to be safe than sorry.
An automated way to run the task is using the whenever gem, which allows us to set rake tasks to run as cron jobs.
So, as a first step, we add the whenever gem to our Gemfile:
gem 'whenever', require: false
Then, we install the gem and run its setup command, which creates an empty config/schedule.rb file where we define how the update process will happen:
bundle install
bundle exec wheneverize .
We can add the following code to that file to run a sitemap refresh every 30 minutes:
every 30.minutes do
  rake "sitemap:refresh"
end
Once we write the schedule to the crontab with bundle exec whenever --update-crontab, the code above produces an entry equivalent to */30 * * * * that makes sure the rake task is run every 30 minutes in silent mode.
Don't forget to verify that the task is running correctly before adding it to the crontab and calling it a day.
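If the cron syntax is unfamiliar, the following sketch shows how a step expression such as */30 in the minute field maps to concrete minutes. The expand_cron_field helper is hypothetical and deliberately handles only three forms (a wildcard, a wildcard with a step and a single number); real cron parsers support much more.

```ruby
# Hypothetical helper: expands one cron field into the values it matches.
# Supported forms: "*", "*/N" and a single integer such as "15".
def expand_cron_field(field, range)
  if field == "*"
    range.to_a
  elsif (match = field.match(/\A\*\/(\d+)\z/))
    # "*/30" means every 30th value, starting at the beginning of the range.
    range.step(match[1].to_i).to_a
  else
    [Integer(field)]
  end
end

p expand_cron_field("*/30", 0..59) # minutes at which the task runs
```

For the minute field, */30 matches minutes 0 and 30 of every hour, which is why the task runs twice per hour.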
Deciding how frequently to update our sitemap
When it comes to deciding how often your sitemap should be updated, there's really no hard rule. It depends a lot on your publication schedule or content creation speed.
Remember: the sitemap update frequency controls your side of the equation but search engines decide how and when to index your site on their own.
If search engines notice that you publish frequently they will crawl your site more frequently, but there's no guarantee that they will do it when we tell them to.
Think about it from their perspective: crawling a website requires computational resources that are limited by definition and it's estimated that around 250,000 websites are added to the internet every day.
Search engines try to optimize those resources to keep their indices updated but without unnecessary resource overspending.
Also, when Google crawls a URL it likes to follow every link in it so it's possible to have a URL indexed without it being in the sitemap as long as it's referenced from a URL on your site.
Asynchronous sitemap updates
Some hosting providers don't offer a native cron job feature. That's where solutions like the sidekiq-scheduler gem or the solid_queue recurring tasks feature can help us.
Consider that the async approach clashes with the default way the sitemap_generator gem works, because it means we have to define the sitemap generation logic outside the sitemap.rb file, which is the one the rake task uses by default.
We could use the command/service object pattern. Note that we're duplicating the code from sitemap.rb:
# app/services/sitemap_generator.rb
# NOTE: the gem already defines a SitemapGenerator module, so in a real app
# this class needs a different name (and file name), or Ruby will raise
# "TypeError: SitemapGenerator is not a class".
class SitemapGenerator
  def self.call
    SitemapGenerator::Sitemap.default_host = "YOUR_APPLICATION_HOST"

    SitemapGenerator::Sitemap.create do
      add root_path, changefreq: "weekly", priority: 1
      add about_us_path, changefreq: "monthly", priority: 0.1
      add faq_path, changefreq: "monthly", priority: 0.2
      add services_path, changefreq: "monthly", priority: 0.2

      Book.find_each do |book|
        add book_path(book), lastmod: book.updated_at, changefreq: "weekly", priority: 0.5
      end

      Article.published.find_each do |article|
        add article_path(article), lastmod: article.updated_at, changefreq: "weekly", priority: 0.4
      end
    end
  end
end
Using the sidekiq-scheduler gem
If we prefer the sidekiq gem to handle async tasks in our Rails apps, the sidekiq-scheduler gem allows us to schedule recurring tasks.
In order to use this gem we just add it to the Gemfile and install it:
bundle add sidekiq-scheduler
bundle install
We add a class that includes Sidekiq::Job and implements a perform method:
# app/workers/sitemap_refresh_job.rb
require 'sidekiq-scheduler'

class SitemapRefreshJob
  include Sidekiq::Job

  def perform
    SitemapGenerator.call
  end
end
Then we have to define the task and its schedule in the config/sidekiq.yml file, which is where sidekiq-scheduler looks for a schedule under the :scheduler: key by default. The schedule can be written in words or other formats, as the gem uses the fugit parser:
:scheduler:
  :schedule:
    sitemap_refresh:
      cron: "*/30 * * * *" # quoted, because a bare * starts a YAML alias
      class: SitemapRefreshJob
Using solid_queue recurring tasks
If our application uses the solid_queue gem, soon to be a default with Rails 8, we can add a recurring task to refresh our sitemap.
Assuming you have the gem installed and are already using it to run recurring jobs, we can either create a class that inherits from ApplicationJob to perform the task or point the recurring task at a command. Here, we add the command to the config/recurring.yml file:
production:
  sitemap_refresh:
    command: "SitemapGenerator.call"
    schedule: "*/30 * * * *"
Please note that we can also pass a job class instead of the command. In that job we could define what we did in the SitemapGenerator service object, and it would generate the same result. I decided to use the service object to reuse the code from the sidekiq-scheduler section.
Adding image and video to sitemaps
If we have an image-heavy site we can add its images to our sitemap. This is particularly useful for images that might not be found by the crawlers, mostly those loaded from JS code.
For example, if we have an Article model that has a lot of images we could do something like:
# config/sitemap.rb
# The images option belongs to add, not to the path helper:
add article_path(article), images: article.images.map { |img|
  { loc: img.url, title: img.alt_text, caption: img.caption, license: img.license }
}
In this example the images parameter adds an array of images for each Article URL. The loc represents the location or URL for the image and is the only required parameter.
The title, caption and license are optional, but if you have access to them, adding them can generate traffic to your site from image searches.
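Since only loc is required, it can help to drop the optional attributes that are missing rather than emit empty values. The following is a small sketch under assumptions: ImageRecord is a hypothetical stand-in for whatever object holds our image data, and sitemap_images is a helper name made up for this example.

```ruby
# Hypothetical stand-in for an image attached to an article.
ImageRecord = Struct.new(:url, :alt_text, :caption, :license, keyword_init: true)

# Builds the array of hashes for the images option, skipping images
# without a URL and omitting optional attributes that are nil.
def sitemap_images(images)
  images.filter_map do |img|
    next if img.url.nil?

    { loc: img.url, title: img.alt_text, caption: img.caption, license: img.license }.compact
  end
end

images = [
  ImageRecord.new(url: "https://example.com/cover.jpg", alt_text: "Book cover"),
  ImageRecord.new(alt_text: "Image without a URL")
]
p sitemap_images(images)
```

In the sitemap configuration this could be used as add article_path(article), images: sitemap_images(article.images).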
We can do the same for videos and other assets; the gem's documentation covers the options available for each of them.
When it comes to videos, the process is very similar:
add(lesson_path(lesson), video: {
  thumbnail_loc: "https://example.com/how-to-add-sitemaps-to-rails-apps-thumb.jpg",
  content_loc: "https://example.com/how-to-add-sitemaps-to-rails-apps.mp4",
  duration: 2422,
  title: "How to add sitemaps to Rails applications",
  description: "In this lesson we will learn how to add sitemaps to Rails apps",
  publication_date: "2024-10-09",
  autoplay: false,
  tags: ["sitemaps on rails", "sitemap generator", "rails seo"],
  category: "SEO",
  family_friendly: true,
  requires_subscription: false,
})
When adding images or videos to sitemaps, only those assets that are relevant to the content need to be added. There's no need to add images or videos that are used for aesthetic or other purposes.
Generating a dynamic sitemap with Rails
Certain sites that have large amounts of frequently changing content that's important for their search results can leverage dynamic sitemaps to improve their performance on search engines.
Sites like Amazon, whose inventory information, pricing and stock are constantly changing, or a news outlet that publishes breaking news stories can benefit from adding a dynamic sitemap.
They are just like a regular sitemap, but they are generated on request, so whenever search engines ask for them we are sure that they get the latest version of our site's content.
Unfortunately, the sitemap_generator gem doesn't allow for this feature so we will need to do it ourselves.
First, if we have the gem installed and generating our sitemaps, we need to define an alternative URL like:
# config/routes.rb
# The format default makes sure the request is handled as XML:
get "sitemaps/dynamic-sitemap.xml", to: "dynamic_sitemap#show", as: :dynamic_sitemap, defaults: { format: :xml }
Then, in the controller we should have something like:
# app/controllers/dynamic_sitemap_controller.rb
class DynamicSitemapController < ApplicationController
  def show
    respond_to do |format|
      format.xml
    end
  end
end
Then, we need to generate the XML. The simplest way is to do it from a Builder view, such as app/views/dynamic_sitemap/show.xml.builder:
xml.instruct! :xml, version: "1.0", encoding: "UTF-8"
xml.urlset xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9" do
  static_urls = [root_url, about_us_url, faq_url, customers_url]

  static_urls.each do |url|
    xml.url do
      xml.loc url
      xml.changefreq "monthly"
    end
  end

  Article.published.find_each do |post|
    xml.url do
      xml.loc article_url(post)
      xml.lastmod post.updated_at.strftime("%Y-%m-%d")
      xml.changefreq "weekly"
      xml.priority 0.8
    end
  end

  Book.find_each do |book|
    xml.url do
      xml.loc book_url(book)
      xml.lastmod book.updated_at.strftime("%Y-%m-%d")
      xml.changefreq "weekly"
      xml.priority 0.8
    end
  end
end
As you can see, generating a dynamic sitemap is not as hard as it might seem, but this is a very basic implementation of the feature.
You might need to do some tweaking to fit your needs. You can even extract the generation logic into an object or have user-facing customizations if you need to.
Also, consider that having a dynamic sitemap might be overkill for your application. They're most useful when the update frequency is really high and stale indexed content is an issue.
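The view above writes lastmod as a date only. The sitemaps protocol uses the W3C Datetime format for that field, which also allows a full timestamp, something that can be useful when content changes several times a day. A minimal sketch, where sitemap_lastmod is a hypothetical helper name:

```ruby
require "time"

# Hypothetical helper: formats a timestamp for the <lastmod> element,
# either as a full W3C datetime in UTC or as a plain date.
def sitemap_lastmod(time, with_time: true)
  utc = time.utc
  with_time ? utc.iso8601 : utc.strftime("%Y-%m-%d")
end

puts sitemap_lastmod(Time.utc(2024, 10, 9, 12, 30, 0))
puts sitemap_lastmod(Time.utc(2024, 10, 9, 12, 30, 0), with_time: false)
```

In the Builder view this could replace the strftime calls, for example xml.lastmod sitemap_lastmod(post.updated_at), assuming the helper is made available to the view.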
Hosting a sitemap in the cloud
Some hosting providers only provide ephemeral storage, meaning that whenever your server is restarted the files on it get deleted.
For those cases, uploading your sitemap to cloud storage services like AWS S3, Google Cloud Storage or similar services is recommended.
Luckily, the sitemap_generator gem comes with adapters that allow us to handle uploading the sitemap to these services.
Adapters define the write method, which is responsible for writing the generated sitemap to its destination.
The following adapters are included out of the box:
- FileAdapter: used by default. It writes the sitemap to disk, to the public/ directory.
- FogAdapter: it uses the fog gem, and can upload to any compatible service, including S3.
- AwsSdkAdapter: it uploads to AWS S3 using the aws-sdk-s3 gem.
- GoogleStorageAdapter: it uploads to the Google Cloud Storage service using the google-cloud-storage gem.
- WaveAdapter: it uploads to any service the carrierwave gem can upload to (S3, GCS, Rackspace Cloud Files).
We will be uploading to S3 using the AwsSdkAdapter. S3-compatible services like Digital Ocean can be used by changing the endpoint to the one the provider gave you.
In order to make it work, we need to have our AWS credentials and set the adapter we wish to use before running the code that generates the sitemap:
# config/sitemap.rb or the file where you generate the sitemap
require 'aws-sdk-s3'

SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(
  'sitemap-generator-example',
  acl: 'public-read', # Bucket permissions might be needed.
  access_key_id: 'AKIAIOSFODNN7EXAMPLE',
  secret_access_key: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region: 'us-east-1',
  endpoint: 'https://nyc1.digitaloceanspaces.com' # only needed if not uploading to S3
)
# The rest of the code in charge of generating the sitemap
Then, the next time we run bundle exec rake sitemap:refresh or our async jobs, the sitemap will be uploaded to our cloud storage provider.
The file should be accessible and public in order for you to associate it with your site from the search engine console you wish to add a sitemap to.
Submit your sitemap to the Google Search Console
After we're happy with our sitemap creation and update flow we can add it to our Google Search Console in order to associate our site with the sitemap.
We will need a GSC account and a verified website. Then we need to go to the Sitemaps page of the console under the Indexing section, which is below the overview section.
There, we just need to add our sitemap URL and submit it:
After this process, Google will validate that the URL responds with 200 OK and it will start parsing your sitemap.
Submitting a sitemap hosted in the cloud
If you followed the steps above to upload your sitemap to a cloud storage service, adding it to the Google Search Console has an extra step involved.
Google doesn't “like” that your sitemap is hosted on a domain that's not yours. Even if you verify ownership of the bucket using the HTML verification method, Google might reject your sitemap altogether.
To fix this, we will need to add a redirect from our site to our sitemap in the cloud storage:
# config/routes.rb
get "/sitemap.xml", to: redirect("https://your-bucket-name.s3.amazonaws.com/sitemap.xml")
# Or, if you're compressing the sitemap:
get "/sitemap.xml.gz", to: redirect("https://your-bucket-name.s3.amazonaws.com/sitemap.xml.gz")
Before submitting the sitemap to the search console, verify that the redirect actually points to the sitemap.
This should keep Google happy about your externally hosted sitemap.
About the search engines ping
The current version of sitemap_generator, 6.3.0 at the time of writing this, allows us to ping search engines right after our sitemap has been updated in order to let search engines know they should crawl our website again.
When we run the sitemap:refresh task, two tasks are actually run: sitemap:create and a call to the SitemapGenerator::Sitemap.ping_search_engines command.
That command is in charge of sending a request, or ping, to a URL search engines provide in order for webmasters to notify them about changes in sitemaps.
By default, the gem pings Google only, but if you want to add another search engine to ping you have to add it to the search_engines hash before the sitemap generation logic:
sitemap_url = "https://example.com/sitemap.xml"
SitemapGenerator::Sitemap.search_engines[:search_engine] = "https://www.searchengine.com/ping?sitemap=%s"
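The %s placeholder in the ping URL stands for the URL of our sitemap. As a rough illustration of that substitution, here is a plain-Ruby sketch; it is not necessarily how the gem implements it, the search engine URL is the placeholder used above and ping_url is a hypothetical helper.

```ruby
require "cgi"

# Hypothetical helper: fills the %s placeholder of a ping template with
# the URL-encoded sitemap location.
def ping_url(template, sitemap_url)
  format(template, CGI.escape(sitemap_url))
end

puts ping_url(
  "https://www.searchengine.com/ping?sitemap=%s",
  "https://example.com/sitemap.xml"
)
```

The sitemap URL is encoded because it travels as a query string value.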
However, Bing deprecated the ping feature in 2021 and Google did the same last year. Right now, it seems that no major search engine is supporting this feature.
If we don't do anything, pings will just fail and nothing else will happen, but if we want the pings to not happen at all we have to tell our refresh task to avoid pinging: rake sitemap:refresh:no_ping
Sitemap best practices
Here are some things that you should pay attention to when adding a sitemap to your Rails application:
- Use canonical URLs in the sitemap: it's very important that you use your canonical URLs within your sitemap. A contradiction between the canonicals defined in your content and the URLs defined in the sitemap can lead to unnecessary duplication and wasted crawl budget.
- URL limit: an individual sitemap can have at most 50,000 URLs or a maximum uncompressed file size of 50 MB, whichever comes first. If you have more URLs than that, you have to split your sitemap into several files and have a sitemap index point to each of them.
- Add your sitemap to robots.txt: you simply need to add Sitemap: YOUR_SITEMAP_URL to your robots.txt file.
- You can probably ignore priority and changefreq: search engines, especially Google, have come forward saying they completely ignore those attributes. You can still add them but never trust them to define a search engine's crawl frequency and priority.
- Lastmod is considered only if coherent: the lastmod value should be consistent with the resource's publicly visible modified-at or published-at date. Otherwise, it might be ignored.
- Don't add URLs just for the sake of it: even though it's tempting to have more URLs indexed by search engines, the truth is that not every type of content gets traffic or should be indexed. Legal pages, pages with thin content or pages that could be considered duplicates because of their similarity to other pages should not be added to your sitemap.
- Exclude noindex pages: make sure that pages with the <meta name="robots" content="noindex"> tag are not added to your sitemap.
- Use UTF-8: the sitemap_generator gem does it by default, but if you ever need to generate a sitemap manually don't forget to use this character encoding.
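The URL limit above is something the gem handles for us, but when generating sitemaps manually we have to split the URLs ourselves. The sketch below shows the idea in plain Ruby; split_into_sitemaps and build_sitemap_index are hypothetical helpers, and the file naming scheme in the usage example is an assumption.

```ruby
require "cgi"

MAX_URLS_PER_SITEMAP = 50_000

# Hypothetical helper: groups URLs so that no sitemap exceeds the limit.
def split_into_sitemaps(urls, limit: MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Hypothetical helper: builds the index file that points to each sitemap.
def build_sitemap_index(sitemap_urls)
  lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  ]
  sitemap_urls.each do |url|
    lines << "  <sitemap><loc>#{CGI.escapeHTML(url)}</loc></sitemap>"
  end
  lines << "</sitemapindex>"
  lines.join("\n")
end

urls = (1..120_001).map { |id| "https://example.com/books/#{id}" }
groups = split_into_sitemaps(urls)
# Assumed naming scheme: sitemap1.xml, sitemap2.xml, ...
index = build_sitemap_index(
  groups.each_index.map { |i| "https://example.com/sitemap#{i + 1}.xml" }
)
puts groups.map(&:size).inspect
puts index
```

Each group would then be written to its own sitemap file, and only the index URL needs to be submitted to search engines or referenced from robots.txt.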
Lastly, consider that submitting a sitemap is not the only way for search engines to index your content. Links from other websites and your own site structure and internal linking are arguably as important as a sitemap, if not more.
Also, you might add your sitemap and notice that search engines don't index some of your content. A sitemap is not a guarantee of search engines indexing your site.
If you're having trouble indexing parts of your site pay attention to your site's structure, the content itself and the backlinks your site might have.
Summary
Adding a sitemap matters because it allows search engines to know about our website's resources without having to actually crawl them.
Using the sitemap_generator gem, adding a sitemap to a Rails application is a matter of installing the gem, adding a bit of configuration and running a rake task to keep the sitemap updated.
We can run that task in a cron job or using recurring async jobs with libraries like Sidekiq or Solid Queue.
If we host our site on a provider with ephemeral disk storage we need to upload our sitemaps to a cloud storage provider, something that can be trivially done using the gem.
Also, sitemaps aren't just for webpages: we can also add images, videos and other assets to them as long as we comply with the sitemap specification.
Dynamic sitemaps are also very useful for sites that have a very high frequency of important content updates. We can generate them with Rails using an XML builder.
It's critical to avoid contradictions between our canonical URLs and the URLs we add to our sitemaps, otherwise that could lead to indexing problems.
All in all, as your site gets more complex or has more content, managing your sitemaps becomes more complex, but that's always a good issue to have.