Tell HN: Robots.txt pitfalls – what I learned the hard way

120 points by pyeri 8 months ago

This applies to sites indexed on Google that hope to gain organic traffic. As an indie blogger and SEO enthusiast, I foolishly updated my robots.txt file to block crawling of certain unwanted parts of my site, leading to subtle repercussions that I couldn't have foreseen.

A few days ago, while reading about SEO, I came across the concept of a "crawl budget." Apparently, Google allocates a specific crawl budget to each indexed site, and the more useless content it has to crawl and store on its servers, the more of that budget is wasted, resulting in delays for indexing new content, updating favicons, and re-reading robots.txt.

Being a minimalist and utilitarian, I decided to block crawling of the `/uploads/` directory on my site, since it mostly contained images used in my articles. I thought blocking this "useless content" would free up more crawl budget for my primary content, i.e., the articles. So I added this directory to my site's robots.txt:

  # Group 1
  User-agent: *
  Disallow: /public/
  Disallow: /drafts/
  Disallow: /theme/
  Disallow: /page*
  Disallow: /uploads/

  Sitemap: https://prahladyeri.github.io/sitemap.xml

The way search engines work means there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it. After about a week, I noticed that my site's favicon had disappeared from SERPs on mobile browsers! Instead, there was a bland (empty) icon in its place. That's when I realized that my favicons also resided in the `/uploads/` directory. And since I had recently switched the favicon format from WEBP to PNG, Google had never been able to crawl and index the new favicon at all!

Once I realized this mistake, I removed the `/uploads/` rule from robots.txt and requested a recrawl. But who knows how long it will take for Google's systems to pick up this change and start showing the site's favicon in SERPs again! Two lessons learned:

1. The robots.txt file is highly sensitive; avoid modifying it if possible.

2. Applying SEO is like steering an extremely large ship: you pull a lever now, and the ship only moves several days later!

andrethegiant 8 months ago

> there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it

You could try moving your favicon to another dir, or root dir, for the time being, and update your HTML to match. That way it would be allowed according to the version that Google still has cached. Also, I think browsers look for a favicon at /favicon.ico regardless, so it might be worth making a copy there too.
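
For example, something like this in the <head> would point browsers and crawlers at the relocated file (the /icons/ path here is just a placeholder):

  <!-- favicon moved out of the blocked /uploads/ directory -->
  <link rel="icon" type="image/png" href="/icons/favicon.png">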

  • throwaway2016a 8 months ago

    /favicon.ico is the default, and it will be loaded if your page does not specify a different path in the metadata. But in my experience, for HTML content most clients respect the metadata and won't try to fetch the default path until after the <head> section of the page loads.

    But non-HTML content has no choice but to use the default so it's generally a good idea to make sure the default path resolves.

    • andrewxdiamond 8 months ago

      > won't try to fetch the default path until after the <head> section of the page loads for HTML content

      That's a really interesting optimization. How did you discover this? Reading source?

  • ms7892 8 months ago

    Thanks for sharing, I didn't know that browsers look for a favicon at /favicon.ico. Thanks again.

    • oniony 8 months ago

      I think it is from the Internet Explorer days. .ico is an actual icon file format on Windows, and IIRC the icons originally had to be in that format, with GIF support coming later when Netscape supported the feature.

      • colejohnson66 8 months ago

        Many browsers will accept a favicon.ico that's actually a PNG file with no issues.

dazc 8 months ago

Use X-Robots-Tag: noindex to prevent files from being indexed, and let Google determine for itself how to crawl your site.

Otherwise, a nightmare scenario can result where you have content indexed but don't allow Googlebot to crawl it. This does not end well.

https://developers.google.com/search/docs/crawling-indexing/...
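
If you serve the files through nginx, a minimal sketch of the header approach might look like this (the /uploads/ path just mirrors the OP's setup):

  # Tell crawlers not to index anything under /uploads/,
  # while still letting them fetch the files.
  location /uploads/ {
      add_header X-Robots-Tag "noindex";
  }

This also works for non-HTML files like images, which can't carry a noindex meta tag.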

  • csiegert 8 months ago

    I’ve got two questions:

    1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

    2. The linked page says to avoid Disallow in robots.txt and to rely on the noindex tag. But how can I prevent googlebot from crawling all user profiles to avoid database hits, bandwidth, etc. without an entry in robots.txt? With noindex, googlebot must visit each user profile page to see that it is not supposed to be indexed.

    • seanwilson 8 months ago

      https://developers.google.com/search/docs/crawling-indexing/...

         "Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it."
      
      It's counterintuitive, but if you want a page to never appear in Google search, you need to flag it as noindex and not block it via robots.txt.
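
      For an HTML page, that flag is just a meta tag in the head (the X-Robots-Tag response header is the equivalent for non-HTML files):

         <!-- page may be crawled, but must not appear in search results -->
         <meta name="robots" content="noindex">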

      > 1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

      It'll usually list the URL with a description like "No information is available for this page". This can happen, for example, when a page has a lot of backlinks, is blocked via robots.txt, and is missing the noindex flag.

    • dazc 8 months ago

      'But how can I prevent googlebot from crawling all user profiles to avoid database hits..'

      If user profiles are noindexed, then why should you care whether Google is crawling them, when almost every other crawler out there does not obey robots.txt anyway?

      It's not in Google's interest to waste resources on non-indexable content; you are worrying far too much about this.

hk1337 8 months ago

It's good information but...

1. Why is your favicon in the uploads directory? Usually those would be at the root of your site or in an images directory.

2. Why is there an uploads directory for a static site hosted on GitHub? I don't believe that is useful on GitHub, is it? You cannot have visitors upload files to it, right?

  • gwd 8 months ago

    Speaking for myself:

    1. I want nginx to serve static files, and everything else to be reverse proxied to the webapp

    2. The configuration file that allows /favicon.ico (and others) to be a file but / and other paths to be passed to the webapp is kind of ugly. Here's mine:

        # Serve the icon files directly from disk; everything else
        # falls through to the reverse-proxied webapp.
        location ~* ^/(favicon\.ico|apple-touch-icon\.png)$ {
            root $icons_path;
        }
    
    In my own case I've so far decided to accept the ugly config file, but as you can see, I haven't gotten around to adding even a robots.txt or any of the other files the modern web ecosystem expects; and adding them involves adding them one-by-one. I can see why someone would say, "Why make an ugly hack of an nginx config, when I can just define the favicon location in the metadata to a path easily configured to be files-only?"

seanwilson 8 months ago

How big is your site? Crawl budget is likely only relevant for huge sites, not personal blogs.

  • moribunda 8 months ago

    Exactly - this is SEO over-optimisation.

  • ccgreg 8 months ago

    Crawl budget is relevant to every site in Common Crawl.

xnx 8 months ago

The best SEO advice is to not focus on SEO and make a site that people will like.

  • dewey 8 months ago

    Technical SEO is still a very valid optimization: making sure you have all the relevant tags, a good structure, fast-loading pages, etc.

    • fhdsgbbcaA 8 months ago

      Just focus on accessibility and standards.

      • turnsout 8 months ago

        Hmm. I never would have added Schema.org support unless there was an SEO reason.
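
        For example, a rough sketch of the kind of markup I mean (all field values made up):

          <script type="application/ld+json">
          {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": "Example article title",
            "datePublished": "2024-01-01"
          }
          </script>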

        • fhdsgbbcaA 8 months ago

          Yeah. If you solve accessibility you generally solve SEO as a byproduct. Mainly because accessible sites are easy for machines to parse.

  • Retr0id 8 months ago

    Thinking about SEO too hard sucks, but having a site that nobody can find (even when they already know about it and are specifically trying to find it again!) sucks even more.

maciekpaprocki 8 months ago

You don't want to exclude your images. That can very much affect your results, as it will remove you from the image tab, but the content of articles that contain those images might also be affected.

Theodores 8 months ago

I thought that Google Search Console had tools to test robots.txt and sitemap.xml files, but it has been a while since I have needed to do that.

For those wondering why the favicon is in a directory: nowadays there are half a dozen different favicon files for different devices and situations, and there are online tools such as The Real Favicon Generator that will take a source image and make the variants for you. These come with a code snippet for the head, and the option to use a subdirectory so that you don't clutter the root.
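
The generated head snippet looks roughly like this (file names and the /favicons/ subdirectory vary by tool):

  <link rel="icon" href="/favicons/favicon-32x32.png" type="image/png" sizes="32x32">
  <link rel="icon" href="/favicons/favicon.svg" type="image/svg+xml">
  <link rel="apple-touch-icon" href="/favicons/apple-touch-icon.png">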

Maybe they should offer a robots.txt snippet too.

Fun fact: for a single page, you can base64 encode the favicon and shove it into the page, thereby not needing a separate file. Why would you want to do that? If you base64 encode all the images and inline the scripts and stylesheets as well, then you have an HTML page that you don't have to upload; you can email it to someone. This is useful when you want to share a design mockup.
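
The inline version looks like this, with the base64 payload truncated here for brevity:

  <!-- favicon embedded as a data URI; no separate file needed -->
  <link rel="icon" type="image/png" href="data:image/png;base64,iVBORw0KGg...">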

liendolucas 8 months ago

I'm completely ignorant when it comes to SEO, so what are the consequences of not having a robots.txt or a sitemap.xml at all? Will that be detrimental in a big way?

  • jamesfinlayson 8 months ago

    My understanding is that a lack of robots.txt should be fine, and the lack of a sitemap.xml shouldn't be too troublesome as long as something links to your site and all pages are linked from somewhere on your site. The sitemap helps search engines find all the links in one place, but a nav bar or an article list should work similarly. One thing a sitemap does add is a suggestion to search engines about how often they should recrawl, which I don't believe you can influence any other way.
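
    A minimal sitemap.xml carrying that recrawl hint looks something like this (URL and values made up):

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>https://example.com/some-article</loc>
          <lastmod>2024-01-01</lastmod>
          <changefreq>weekly</changefreq>
        </url>
      </urlset>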

  • 12thhandyman 8 months ago

    There are many metrics and facets of a website considered for SEO, with varying weights. The absence of either a robots.txt or a sitemap.xml has a non-negligible but relatively minor weight compared to some other metrics. The files should be present and accurate when optimizing for SEO. Ruby on Rails, for example, creates an empty robots.txt file in new projects; it does not create any sitemap.xml, however.

  • shikshake 8 months ago

    I’m also ignorant of this, and to add a question on top of yours: is it worth worrying about robots.txt for personal portfolio websites built from scratch?

    • philipwhiuk 8 months ago

      Depends on your hosting platform.

  • Brajeshwar 8 months ago

    Nope! A site without robots.txt defaults to saying: please do the default and crawl my site in its entirety.

tiffanyh 8 months ago

Does anyone have suggestions on what a proper robots.txt would be?

How about:

  User-agent: *
  Allow: /
  Sitemap: https://example.com/sitemap.xml

  • akira2501 8 months ago

    The recommendation is to use an empty "Disallow:" rule rather than a catch all "Allow:" rule.

    Otherwise that is the canonical minimal example.

    • tiffanyh 8 months ago

      Like this?

        User-agent: *
        Disallow: 
        Sitemap: https://example.com/sitemap.xml

      • akira2501 8 months ago

        Precisely.

        • turnsout 8 months ago

          Why not just this:

             Sitemap: https://example.com/sitemap.xml
          
          Won't crawlers crawl by default?

  • bragr 8 months ago

    That's a valid robots.txt, but "proper" depends entirely on what you want to achieve. If you aren't looking to treat different bots differently, and you want all of your site to be indexed, then that is exactly what you want.

dewey 8 months ago

If you don’t have millions of pages the crawl budget limitations most likely will have zero impact.

Make sure your basic technical SEO factors are all good and Search Console looks healthy, and then don't keep worrying unless you are a huge site that lives off SEO traffic.

Arech 8 months ago

TL;DR: I shot myself in the foot thinking I was shooting at something else. Don't do this!

Thanks for the useful info! /s

KateSterling 8 months ago

SEO can feel like such a balancing act: one tweak, and it's a waiting game to see the impact! Sounds like you've learned a lot about the sensitivity of robots.txt.
