Adding a sitemap to a Rails application

By Exequiel Rozas

- November 19, 2024

Content search and categorization have been difficult problems to solve since the dawn of the internet.

However, discoverability and real-time indexing at scale are as hard or even harder because of the sheer volume of content that gets published every day.

Even though search engines can do a pretty good job crawling the web, the process can be slow. That's where sitemaps can provide value to the search engines and to our sites.

In this article, we will learn how to add a sitemap to a Rails application using the "sitemap_generator" gem, how to keep it updated and which best practices to follow when it comes to sitemaps.

Let's start by learning what a sitemap is and why your site could benefit from one:

What's a sitemap and why it matters

A sitemap is a file that lists the URLs within our website.

It's usually an XML file, but other formats like RSS and plain text files are also allowed by the sitemaps protocol.

Sitemaps act as a shortcut for web crawlers: every relevant URL for a given website, old and new, can be extracted from one place without the need to spend many resources on actually crawling the site.

A conceptual sitemap diagram for a basic website would look like this:

Basic sitemap structure

As you can see, a quick glance suffices to know about every piece of content this website published to date.

From the perspective of a search engine, sitemaps save them resources while allowing them to update their indices whenever they see fit.

From the perspective of a website owner, having a sitemap means that our content can get indexed faster and that we can have a bit more control of the way our content is indexed.

Of course, anybody can try to game search engines using a sitemap and that's why having one doesn't actually mean our content will get indexed or that every directive will be respected by search engines.

As with a lot of things, it's all about being a good citizen and establishing trust.

Now that we know what a sitemap is and why we should have one, let's see how we can add one to a Rails application:

Adding a sitemap with the 'sitemap_generator' gem

We will use an example application that has a few static URLs, like a home page, an about page and a FAQ page, and some dynamically generated URLs, like a list of books and blog articles.

The first step is to add the gem to our Gemfile:

# Gemfile
gem 'sitemap_generator'

After that, we run bundle install to install the gem, then we run the installation command:

bundle exec rake sitemap:install

This will add a config/sitemap.rb file which will contain the configuration for our sitemap generation. It's generated empty, but I usually add the following configuration:

# config/sitemap.rb
require 'rubygems'
require 'sitemap_generator'

# Your site's host. You can use Rails.credentials if you want to
SitemapGenerator::Sitemap.default_host = ENV["APPLICATION_HOST"]
# Creates a sitemap index only if more than one sitemap is generated
SitemapGenerator::Sitemap.create_index = :auto
# Compress set to true will generate an '.xml.gz' file
SitemapGenerator::Sitemap.compress = true

Next, we add the following logic below our configuration in order to generate the sitemap:

SitemapGenerator::Sitemap.create do
  # We add our static URLs first:
  add root_path, changefreq: "weekly", priority: 1
  add about_us_path, changefreq: "monthly", priority: 0.1
  add faq_path, changefreq: "monthly", priority: 0.2
  add services_path, changefreq: "monthly", priority: 0.2

  # We map over our database backed resources:
  Book.find_each do |book|
    add book_path(book), lastmod: book.updated_at, changefreq: "weekly", priority: "0.5"
  end

  Article.published.find_each do |article|
    add article_path(article), lastmod: article.updated_at, changefreq: "weekly", priority: "0.4"
  end
end

This configuration adds four static pages: Home, About us, FAQ and Services. Then it dynamically adds every Book and published Article we have in our database.

Of course, we can add as many static or dynamic URLs as we need.

In case you are wondering about the parameters we pass to the add method, here's an explanation:

  • Resource path: the relative path for the resource, the gem will construct the URL with the default_host config parameter.
  • Lastmod: it expects a date time that represents the last time the resource was modified. It is expected to match the updated_at value shown for the resource on the front-end.
  • Changefreq: this tells the search engines the estimated update frequency for the resource. It can be estimated with successive crawls, but we can help them at least with an approximation.
  • Priority: the relative importance of a given page. Google and other search engines publicly disclose they don't give any weight to that parameter, but it doesn't hurt to include it.
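To make these parameters more concrete, here is a minimal sketch in plain Ruby of the kind of url entry they end up as in the XML file. The sitemap_url_entry helper and the example.com host are assumptions made for illustration only. The gem builds its entries with its own internals, so treat this as a picture of the output and not of how the gem works.

```ruby
require "cgi"
require "time"

# Allowed <changefreq> values according to the sitemaps protocol.
CHANGEFREQ_VALUES = %w[always hourly daily weekly monthly yearly never].freeze

# Hypothetical helper: renders a single <url> entry for a sitemap.
def sitemap_url_entry(host:, path:, lastmod: nil, changefreq: nil, priority: nil)
  if changefreq && !CHANGEFREQ_VALUES.include?(changefreq)
    raise ArgumentError, "unknown changefreq: #{changefreq}"
  end
  if priority && !(0.0..1.0).cover?(priority)
    raise ArgumentError, "priority must be between 0.0 and 1.0"
  end

  # The full URL is the host plus the relative path, escaped for XML.
  lines = ["<url>", "  <loc>#{CGI.escapeHTML(host + path)}</loc>"]
  lines << "  <lastmod>#{lastmod.utc.iso8601}</lastmod>" if lastmod
  lines << "  <changefreq>#{changefreq}</changefreq>" if changefreq
  lines << "  <priority>#{priority}</priority>" if priority
  lines << "</url>"
  lines.join("\n")
end

puts sitemap_url_entry(
  host: "https://example.com",
  path: "/books/1",
  lastmod: Time.utc(2024, 11, 19, 12, 0, 0),
  changefreq: "weekly",
  priority: 0.5
)
```

Only the loc element is mandatory in the protocol, which is why every other argument in the sketch is optional.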

After adding the resources that we think are important to our sitemap we can proceed and generate the actual sitemap with the following command:

bundle exec rake sitemap:refresh

When run for the first time, this command creates the sitemap inside our application's public directory: sitemap.xml.gz with the compress setting shown above, or sitemap.xml if compression is disabled. Every subsequent run overwrites the sitemap with the newest URLs.

This means that if we want our sitemap to be up-to-date we need to do some proactive work about it.
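Before scheduling anything, a quick sanity check of the generated file can be useful. The following is a minimal sketch in plain Ruby, assuming the sitemap was written compressed to public/sitemap.xml.gz. The sitemap_locs helper is hypothetical and uses a simple regular expression instead of a real XML parser, which is enough for a rough check but not for validation.

```ruby
require "zlib"

# Hypothetical helper: returns every <loc> value found in a gzipped sitemap.
def sitemap_locs(gzipped_bytes)
  xml = Zlib.gunzip(gzipped_bytes)
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten
end

# In a real application we would read the generated file instead, for example:
#   sitemap_locs(File.binread("public/sitemap.xml.gz"))
sample_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/faq</loc></url>
  </urlset>
XML

locs = sitemap_locs(Zlib.gzip(sample_xml))
puts "#{locs.length} URLs found"
puts locs
```

If the number of URLs is far from what we expect given our static pages and database records, it's a hint that something in config/sitemap.rb needs a second look.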

Updating our sitemap

In order to keep our sitemap as fresh as possible, we can re-run the sitemap:refresh task as frequently as needed.

If we update our site once a day, we can run a once-a-day scheduled task to update our sitemap. We can actually run the task redundantly without losing anything so it's better to be safe than sorry.

An automated way to run the task is using the whenever gem, which allows us to set rake tasks to run as cron jobs.

So, as a first step, we install the whenever gem:

gem 'whenever', require: false

Then, we install the gem and run its setup command:

bundle install
bundle exec wheneverize .

This will create a config/schedule.rb file that holds the logic to run our scheduled tasks. Initially it defines no jobs, but we can add the following code to run a sitemap refresh every 30 minutes:

every 30.minutes do
  rake "sitemap:refresh"
end

Once we run bundle exec whenever --update-crontab, the code above adds an entry equivalent to */30 * * * * to the crontab, which makes sure the rake task is run every 30 minutes in silent mode.
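To make the schedule more tangible, here is a small plain Ruby sketch of how a cron minute field maps to the minutes of the hour on which a job fires. The cron_minutes helper is hypothetical and only understands the simple forms shown here. It is not part of whenever or cron.

```ruby
# Hypothetical helper: expands a cron minute field such as "*/30" or "0,30"
# into the sorted list of minutes (0..59) on which the job would fire.
def cron_minutes(field)
  minutes = field.split(",").flat_map do |part|
    case part
    when "*" then (0..59).to_a
    when %r{\A\*/(\d+)\z} then (0..59).step(Regexp.last_match(1).to_i).to_a
    when /\A\d+\z/ then [part.to_i]
    else raise ArgumentError, "unsupported cron minute field: #{part}"
    end
  end
  minutes.uniq.sort
end

p cron_minutes("*/30")
p cron_minutes("0,30")
```

Both fields describe the same schedule, so either form in the crontab means the refresh runs twice per hour.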

Don't forget to verify that the task is running correctly before adding it to the crontab and calling it a day.

Deciding how frequently to update our sitemap

When it comes to deciding how often your sitemap should be updated, there's really no hard rule. It depends a lot on your publication schedule or content creation speed.

Remember: the sitemap update frequency controls your side of the equation but search engines decide how and when to index your site on their own.

If search engines notice that you publish frequently they will crawl your site more frequently, but there's no guarantee that they will do it as often as we tell them to.

Think about it from their perspective: crawling a website requires computational resources that are limited by definition and it's estimated that around 250,000 websites are added to the internet every day.

Search engines try to optimize those resources to keep their indices updated but without unnecessary resource overspending.

Also, when Google crawls a URL it likes to follow every link in it so it's possible to have a URL indexed without it being in the sitemap as long as it's referenced from a URL on your site.

Asynchronous sitemap updates

Some hosting providers don't offer a native cron job feature. That's where solutions like the sidekiq-scheduler gem or the solid_queue recurring tasks feature can help us.

Consider that the async approach clashes with the default way the sitemap_generator gem works because it means we have to define the sitemap generation logic outside the sitemap.rb file which is used by the rake task by default.

We could use the command/service object pattern, keeping in mind that we're duplicating the code from config/sitemap.rb. Note that the class can't be named SitemapGenerator because the gem already defines a module with that name, so we call it SitemapRefresher:

# app/services/sitemap_refresher.rb
class SitemapRefresher
  def self.call
    SitemapGenerator::Sitemap.default_host = "YOUR_APPLICATION_HOST"
    SitemapGenerator::Sitemap.create do
      add root_path, changefreq: "weekly", priority: 1
      add about_us_path, changefreq: "monthly", priority: 0.1
      add faq_path, changefreq: "monthly", priority: 0.2
      add services_path, changefreq: "monthly", priority: 0.2

      Book.find_each do |book|
        add book_path(book), lastmod: book.updated_at, changefreq: "weekly", priority: "0.5"
      end

      Article.published.find_each do |article|
        add article_path(article), lastmod: article.updated_at, changefreq: "weekly", priority: "0.4"
      end
    end
  end
end

Using the sidekiq-scheduler gem

If we prefer the sidekiq gem to handle async tasks in our Rails apps, the sidekiq-scheduler gem allows us to schedule recurring tasks.

In order to use this gem we just add it to the Gemfile and install it:

bundle add sidekiq-scheduler
bundle install

We add a class that includes Sidekiq::Job and implements a perform method:

# app/workers/sitemap_refresh_job.rb
require 'sidekiq-scheduler'

class SitemapRefreshJob
  include Sidekiq::Job

  def perform
    SitemapRefresher.call
  end
end

Then we have to define a config/schedule.yml file where we add the task and the schedule, which we can write as a cron expression or in plain words because the gem uses the fugit parser. The cron expression needs quotes because a YAML value can't start with a bare asterisk:

:scheduler:
  :schedule:
    sitemap_refresh:
      cron: "*/30 * * * *"
      class: SitemapRefreshJob

Using solid_queue recurring tasks

If our application uses the solid_queue gem, the default Active Job backend in Rails 8, we can add a recurring task to refresh our sitemap.

Assuming you have the gem installed and are already using it to run recurring jobs, all we need to do is add the command to the config/recurring.yml file:

production:
  sitemap_refresh:
    command: "SitemapRefresher.call"
    schedule: "*/30 * * * *"

Please note that we can also pass a job class instead of the command. In that job we could define what we did in the SitemapRefresher service object, and it would generate the same result. I decided to use the service object to reuse the code from the sidekiq-scheduler section.

Adding images and videos to sitemaps

If we have an image-heavy site we can add its images to our sitemap. This is particularly useful for images that crawlers might not find on their own, mostly those loaded from JS code.

For example, if we have an Article model that has a lot of images we could do something like:

# config/sitemap.rb
# The images option belongs to the add call, not to the path helper
add article_path(article),
  images: article.images.map { |img|
    { loc: img.url, title: img.alt_text, caption: img.caption, license: img.license }
  }

In this example the images parameter adds an array of images for each Article URL. The loc represents the location or URL for the image and is the only required parameter.

The title, caption and license are optional but if you have access to them, adding them can generate traffic to your site from image searches.

We can do the same for videos and for other assets. You can check the gem's documentation for help with those resources.

When it comes to videos, the process is very similar:

add(lesson_path(lesson), video: {
  thumbnail_loc: "https://example.com/how-to-add-sitemaps-to-rails-apps-thumb.jpg",
  content_loc: "https://example.com/how-to-add-sitemaps-to-rails-apps.mp4",
  duration: 2422,
  title: "How to add sitemaps to Rails applications",
  description: "In this lesson we will learn how to add sitemaps to Rails apps",
  publication_date: "2024-10-09",
  autoplay: false,
  tags: ["sitemaps on rails", "sitemap generator", "rails seo"],
  category: "SEO",
  family_friendly: true,
  requires_subscription: false,
})

When adding images or videos to sitemaps, only those assets that are relevant to the content need to be added. There's no need to add images or videos that are used for aesthetic or other purposes.

Generating a dynamic sitemap with Rails

Certain sites that have large amounts of frequently changing content that's important for their search results can leverage dynamic sitemaps to improve their performance on search engines.

Sites like Amazon, whose inventory information, pricing and stock are constantly changing, or a news outlet that publishes breaking news stories can benefit from adding a dynamic sitemap.

They are just like a regular sitemap, but they are generated on request, so whenever search engines ask for them we can be sure that they get the latest version of our site's content.

Unfortunately, the sitemap_generator gem doesn't allow for this feature so we will need to do it ourselves.

First, if we have the gem installed and generating our sitemaps, we need to define an alternative URL like:

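# config/routes.rb
# The format default makes sure the request is handled as XML by the controller
get "sitemaps/dynamic-sitemap.xml", to: "dynamic_sitemap#show", as: :dynamic_sitemap, defaults: { format: :xml }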

Then, in the controller we should have something like:

# app/controllers/dynamic_sitemap_controller.rb
class DynamicSitemapController < ApplicationController
  def show
    respond_to do |format|
      format.xml
    end
  end
end

Then, we need to generate the XML. The simplest way is to do it from a Builder view, which for this controller would live at app/views/dynamic_sitemap/show.xml.builder:

xml.instruct! :xml, version: "1.0", encoding: "UTF-8"
xml.urlset xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9" do
  static_urls = [root_url, about_us_url, faq_url, services_url]

  static_urls.each do |url|
    xml.url do
      xml.loc url
      xml.changefreq "monthly"
    end
  end

  Article.published.find_each do |post|
    xml.url do
      xml.loc article_url(post)
      xml.lastmod post.updated_at.strftime("%Y-%m-%d")
      xml.changefreq "weekly"
      xml.priority 0.8
    end
  end

  Book.find_each do |book|
    xml.url do
      xml.loc book_url(book)
      xml.lastmod book.updated_at.strftime("%Y-%m-%d")
      xml.changefreq "weekly"
      xml.priority 0.8
    end
  end
end

As you can see, generating a dynamic sitemap is not as hard as it might seem, but this is a very basic implementation of the feature.

You might need to do some tweaking to fit your needs. You can even extract the generation logic into an object or have user-facing customizations if you need to.

Also, consider that having a dynamic sitemap might be overkill for your application. They're most useful when the update frequency is really high and stale content in the index is an issue.

Hosting a sitemap in the cloud

Some hosting providers only offer ephemeral storage, meaning that whenever your server is restarted the files stored on it get deleted.

For those cases, uploading your sitemap to cloud storage services like AWS S3, Google Cloud Storage or similar services is recommended.

Luckily, the sitemap_generator gem comes with adapters that allow us to handle uploading the sitemap to these services.

Adapters define the write method, which is responsible for persisting the generated sitemap files to their destination.

The following adapters are included out of the box:

  • FileAdapter: used by default. It writes the sitemap to disk, to the public/ directory.
  • FogAdapter: it uses the fog gem, and can upload to any compatible service, including S3.
  • AwsSdkAdapter: it uploads to AWS S3 using the aws-sdk-s3 gem.
  • GoogleStorageAdapter: it uploads to the Google Cloud Storage service using the google-cloud-storage gem.
  • WaveAdapter: it uploads to any service the carrierwave gem can upload to (S3, GCS, Rackspace Cloud Files).

We will be uploading to S3 using the AwsSdkAdapter. S3-compatible services like DigitalOcean Spaces can be used by changing the endpoint to the one the provider gave you.

In order to make it work, we need to have our AWS credentials and set the adapter we wish to use before running the code that generates the sitemap:

# config/sitemap.rb or the file where you generate the sitemap
require 'aws-sdk-s3'
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new('sitemap-generator-example',
  acl: 'public-read', # Bucket permissions might be needed.
  access_key_id: 'AKIAIOSFODNN7EXAMPLE',
  secret_access_key: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
  region: 'us-east-1',
  endpoint: 'https://nyc1.digitaloceanspaces.com' # only needed if not uploading to S3
)

# The rest of the code in charge of generating the sitemap

Then, the next time we run bundle exec rake sitemap:refresh or our async jobs, the sitemap will be uploaded to our cloud storage provider.

The file must be publicly accessible so that you can associate it with your site in the console of each search engine you submit it to.

Submit your sitemap to the Google Search Console

After we're happy with our sitemap creation and update flow we can add it to our Google Search Console in order to associate our site with the sitemap.

We will need a GSC account and a verified website. Then we need to go to the Sitemaps page of the console under the Indexing section, which is below the overview section.

There, we just need to add our sitemap URL and submit it:

Adding a sitemap to the Google Search Console

After this process, Google will validate that the URL responds with 200 OK and it will start parsing your sitemap.

Submitting a sitemap hosted in the cloud

If you followed the steps above to upload your sitemap to a cloud storage service, adding it to the Google Search Console has an extra step involved.

Google doesn't “like” that your sitemap is hosted on a domain that's not yours. Even if you verify ownership of the bucket using the HTML verification method, Google might reject your sitemap altogether.

To fix this, we will need to add a redirect from our site to our sitemap in the cloud storage:

# config/routes.rb
get "/sitemap.xml", to: redirect("https://your-bucket-name.s3.amazonaws.com/sitemap.xml")
# Or, if you're compressing the sitemap:
get "/sitemap.xml.gz", to: redirect("https://your-bucket-name.s3.amazonaws.com/sitemap.xml.gz")

Before submitting the sitemap to the search console, verify that the redirect actually points to the sitemap.

This should keep Google happy about your externally hosted sitemap.
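One way to check the redirect without opening a browser is a small script. This is a minimal sketch using Ruby's standard library; the redirects_to? helper is hypothetical, the URLs are placeholders, and the fetch argument exists only so the check can be exercised without network access.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: true when `url` answers with an HTTP redirect whose
# Location header is exactly `expected_target`.
def redirects_to?(url, expected_target, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  response = fetch.call(URI(url))
  response.is_a?(Net::HTTPRedirection) && response["location"] == expected_target
end

# Example against a live site (placeholder URLs, replace them with your own):
#   redirects_to?(
#     "https://example.com/sitemap.xml",
#     "https://your-bucket-name.s3.amazonaws.com/sitemap.xml"
#   )
```

Net::HTTP.get_response doesn't follow redirects on its own, which is what we want here: we can inspect the redirect response itself instead of the file it points to.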

About the search engines ping

The current version of sitemap_generator, 6.3.0 at the time of writing this, allows us to ping search engines right after our sitemap has been updated in order to let search engines know they should crawl our website again.

When we run the sitemap:refresh task, actually two tasks are run: sitemap:create and a call to the SitemapGenerator::Sitemap.ping_search_engines command.

That command is in charge of sending a request, or ping, to a URL search engines provide in order for webmasters to notify them about changes in sitemaps.

By default, the gem pings Google only, but if you want to add another search engine to ping you have to add it to the search_engines hash before the sitemap generation logic:

sitemap_url = "https://example.com/sitemap.xml"
SitemapGenerator::Sitemap.search_engines[:search_engine] = "https://www.searchengine.com/ping?sitemap=%s"

However, Bing deprecated the ping feature in 2021 and Google did the same in 2023. Right now, it seems that no major search engine supports this feature.

If we don't do anything, pings will just fail and nothing else will happen, but if we want the pings to not happen at all we have to tell our refresh task to avoid pinging: rake sitemap:refresh:no_ping

Sitemap best practices

Here are some things that you should pay attention to when adding a sitemap to your Rails application:

  • Use canonical URLs in the sitemap: it's very important that you use your canonical URLs within your sitemap. A contradiction between the canonicals defined in your content and the URLs defined in the sitemap can lead to unnecessary duplication and wasted crawl budget.
  • URL limit: individual sitemaps can only have 50,000 URLs or a max uncompressed file size of 50 MB, whichever comes first. If you have more URLs than that you have to split your sitemap into several files and have a sitemap index point to them.
  • Add your sitemap to robots.txt: you simply need to add Sitemap: YOUR_SITEMAP_URL to your robots.txt file.
  • You can probably ignore priority and changefreq: search engines, especially Google, have come forward saying they completely ignore those attributes. You can still add them but never trust them to define a search engine's crawl frequency and priority.
  • Lastmod is considered only if coherent: the lastmod value should be coherent with the resource's publicly visible modified or published date. Otherwise, it might be ignored.
  • Don't add URLs just for the sake of it: even though it's tempting to have more URLs indexed by search engines, the truth is that not every type of content gets traffic or should be indexed. Legal pages, pages with thin content or pages that could be considered duplicates because of their similarity to other pages should not be added to your sitemap.
  • Exclude noindex pages: make sure that pages with a robots meta tag set to noindex are not added to your sitemap.
  • Use UTF-8: the sitemap_generator gem does it by default but if you ever need to generate a sitemap manually don't forget to use this character encoding.
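The URL limit is the least intuitive item on this list, so here is a minimal plain Ruby sketch of the splitting idea. The sitemap_generator gem takes care of this for us, so this only illustrates the concept. The helper names, file names and example.com URLs are assumptions, and the sketch only checks the URL count, not the 50 MB size limit.

```ruby
require "cgi"

MAX_URLS_PER_SITEMAP = 50_000 # limit set by the sitemaps protocol

# Hypothetical helper: splits a list of URLs into groups that respect the
# per-file URL limit. Each group would become its own sitemap file.
def split_into_sitemaps(urls, limit: MAX_URLS_PER_SITEMAP)
  urls.each_slice(limit).to_a
end

# Hypothetical helper: builds the sitemap index that points to every file.
def sitemap_index(sitemap_urls)
  entries = sitemap_urls.map do |url|
    "  <sitemap><loc>#{CGI.escapeHTML(url)}</loc></sitemap>"
  end
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{entries.join("\n")}
    </sitemapindex>
  XML
end

urls = (1..120_000).map { |id| "https://example.com/books/#{id}" }
groups = split_into_sitemaps(urls)
files = groups.each_index.map { |i| "https://example.com/sitemap#{i + 1}.xml.gz" }

puts groups.map(&:length).inspect
puts sitemap_index(files)
```

The index file is the one we would submit to search engines and reference from robots.txt, since it leads crawlers to every individual sitemap.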

Lastly, consider that submitting a sitemap is not the only way for search engines to index your content. Links from other websites and your own site structure and internal linking are arguably as important as a sitemap, if not more.

Also, you might add your sitemap and notice that search engines don't index some of your content. A sitemap is not a guarantee of search engines indexing your site.

If you're having trouble indexing parts of your site pay attention to your site's structure, the content itself and the backlinks your site might have.

Summary

Adding a sitemap matters because it allows search engines to know about our website's resources without having to actually crawl them.

Using the sitemap_generator gem, adding a sitemap to a Rails application is a matter of installing the gem, adding a bit of configuration and running a rake task to keep the sitemap updated.

We can run that task in a cron job or using recurring async jobs with libraries like Sidekiq or Solid Queue.

If we host our site on a provider with ephemeral disk storage we need to upload our sitemaps to a cloud storage provider, something that can be trivially done using the gem.

Also, sitemaps aren't just for webpages: we can also add images, videos and other assets to them as long as we comply with the sitemap specification.

Dynamic sitemaps are also very useful for sites that have a very high frequency of important content updates. We can generate them with Rails using an XML builder.

It's critical to avoid contradictions between our canonical URLs and the URLs we add to our sitemaps, otherwise that could lead to indexing problems.

All in all, as your site gets more complex or has more content, managing your sitemaps becomes more complex, but that's always a good issue to have.
