Sitemap, Robots.txt, and Web Manifest

A deep technical guide to sitemap.xml, robots.txt, and site.webmanifest — how search engines crawl your site, how bots are controlled, and how websites become installable PWAs.

Modern websites are not designed only for users. They must also communicate clearly with:

  • search engine crawlers
  • indexing bots
  • AI scrapers
  • mobile operating systems
  • browsers acting as application containers

Three critical infrastructure files help with this:

| File | Purpose |
|---|---|
| sitemap.xml | Helps search engines discover pages |
| robots.txt | Controls which bots can crawl what |
| site.webmanifest | Enables installable Progressive Web Apps |

robots.txt must sit at the root of the host. sitemap.xml and site.webmanifest are conventionally placed there too, and all three are part of a well-structured production system.

Example:


https://example.com/sitemap.xml
https://example.com/robots.txt
https://example.com/site.webmanifest


1. Sitemap Files (sitemap.xml)

A sitemap is a structured XML file that lists important pages on your website so search engines can discover and crawl them efficiently.

Search engines such as Google, Bing, and DuckDuckGo use sitemaps to understand your site's structure.

Even though bots can follow links automatically, a sitemap makes it more likely that:

  • newly published pages are discovered faster
  • deep pages are not missed
  • crawl resources are used efficiently

Why Sitemaps Exist

Search engines allocate a crawl budget to each website.

Crawl budget means:


How many pages a search engine bot is willing to crawl
during a given time period.

If your website contains thousands of pages, bots might miss some of them.

A sitemap helps by explicitly listing the URLs that matter.


Basic Sitemap Structure

A sitemap follows an XML schema.

Main structure:


urlset
└── url
    ├── loc
    ├── lastmod
    ├── changefreq
    └── priority

Each <url> entry describes a single page.


Important Sitemap Tags

urlset

Root container for all URLs.


<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

The xmlns attribute declares the XML namespace of the Sitemap protocol (version 0.9), which tells parsers how to interpret the tags inside.


url

Represents a single webpage.

<url>
</url>

loc

The page URL.

<loc>https://example.com/blog</loc>

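Must be an absolute URL, including the protocol. Special characters such as & must be XML-escaped (&amp;), and the URL must be shorter than 2,048 characters.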


lastmod

Indicates the last modification date.

<lastmod>2026-01-09</lastmod>

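Use the W3C Datetime format: either a plain date (YYYY-MM-DD) or a full timestamp with a time zone offset. Search engines may prioritize recently updated pages, but only when the value is consistently accurate.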


changefreq

Tells bots how often the page changes.

<changefreq>daily</changefreq>

Possible values:

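| Value | Meaning |
|---|---|
| always | constantly changing |
| hourly | every hour |
| daily | daily updates |
| weekly | weekly updates |
| monthly | monthly updates |
| yearly | rarely updated |
| never | archived, not expected to change |

This is only a hint, not a strict instruction. Google states that it ignores changefreq.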


priority

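Relative importance of a URL compared with other URLs on the same site. The default is 0.5. It does not affect how your pages rank against other sites, and Google states that it ignores this value.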

<priority>1.0</priority>

Range:

0.0 → lowest priority
1.0 → highest priority

Example:

| Page | Priority |
|---|---|
| Homepage | 1.0 |
| Blog posts | 0.8 |
| Archive pages | 0.4 |

Full Sitemap Example

<?xml version="1.0" encoding="UTF-8"?>
 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-09T14:00:00+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
 
  <url>
    <loc>https://example.com/blog</loc>
    <lastmod>2026-01-08</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
 
</urlset>
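The structure above is simple enough to generate rather than write by hand. The following is a minimal sketch using only Python's standard library; the function name build_sitemap and the sample page list are illustrative assumptions, and changefreq and priority are left out for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as bytes for a list of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes &, < and > in text when serializing.
        ET.SubElement(url, "loc").text = loc
        # date.isoformat() yields YYYY-MM-DD, a valid lastmod value.
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


# Hypothetical page list; a real build step would read this from the router or CMS.
pages = [
    ("https://example.com/", date(2026, 1, 9)),
    ("https://example.com/blog?tag=a&page=2", date(2026, 1, 8)),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml.decode("utf-8"))
```

Generating the file through an XML library rather than string concatenation means URL escaping is handled automatically, which is the most common source of invalid sitemaps.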

Advanced Sitemap Types

Sitemaps can include specialized namespaces.

Examples:

| Type | Purpose |
|---|---|
| image sitemap | image indexing |
| video sitemap | video discovery |
| news sitemap | Google News |
| multilingual sitemap | hreflang localization |

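Example with images. The image: prefix only works if <urlset> also declares xmlns:image="http://www.google.com/schemas/sitemap-image/1.1", and each <image:image> element sits inside the <url> entry of the page that shows the image: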

<image:image>
  <image:loc>https://example.com/image.png</image:loc>
</image:image>
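To show where the image element sits and how its namespace is declared, here is a small sketch that builds one complete entry with Python's standard library. The function name image_sitemap is an assumption made for illustration; the two namespace URIs are the published ones for the Sitemap protocol and Google's image extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Ask ElementTree to emit the conventional prefixes instead of ns0, ns1.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def image_sitemap(page_url, image_urls):
    """Return sitemap XML (str) for one page and the images shown on it."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    for image_url in image_urls:
        # Each image is nested inside the <url> entry of its page.
        image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


image_xml = image_sitemap("https://example.com/blog", ["https://example.com/image.png"])
print(image_xml)
```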

Sitemap Limits

The sitemap protocol sets limits per file:

| Limit | Value |
|---|---|
| Max URLs per sitemap | 50,000 |
| Max file size | 50 MB (uncompressed) |

Sites that exceed either limit must split their URLs across several sitemap files and list those files in a sitemap index.

Example:

sitemap_index.xml
 ├── sitemap1.xml
 ├── sitemap2.xml
 └── sitemap3.xml
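The split can be sketched as follows. This is an illustration under assumptions: the helper names chunk_urls and build_sitemap_index, the 120,000 sample URLs, and the sitemapN.xml naming scheme are hypothetical, and writing the individual sitemap files is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(base_url, chunk_count):
    """Return index XML (str) pointing at sitemap1.xml ... sitemapN.xml."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, chunk_count + 1):
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/sitemap{n}.xml"
    return ET.tostring(index, encoding="unicode")


# Hypothetical site with 120,000 pages: needs three sitemap files.
urls = [f"https://example.com/page/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
index_xml = build_sitemap_index("https://example.com", len(chunks))
print(len(chunks), "sitemap files")
```

Only the index URL then needs to be submitted or referenced from robots.txt; crawlers follow it to the individual files.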

2. Robots Exclusion Protocol (robots.txt)

The robots.txt file tells bots how they are allowed to crawl your website.

Well-behaved crawlers fetch it before crawling any other URL on a host, and usually cache the result for a while rather than requesting it again on every visit.

Example request:

GET /robots.txt

The file must sit at the root of the host, and it applies only to that exact protocol, host, and port:

https://example.com/robots.txt

If the file does not exist, bots assume:

Everything is allowed.

Robots.txt Syntax

A robots file consists of directives.

Structure:

User-agent
Allow
Disallow
Crawl-delay
Sitemap

User-agent

Specifies which crawler the rule applies to.

Examples:

User-agent: *

Applies to all bots.

User-agent: Googlebot

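Applies only to Google's crawler, Googlebot.

A crawler follows only the most specific group that names it, so a bot with its own group ignores the rules listed under *.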


Disallow

Blocks crawling of every URL whose path starts with the given prefix.

Example:

Disallow: /admin/

Compliant bots will not crawl anything under:

/admin/

Allow

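Overrides a Disallow rule for a more specific path. When several rules match a URL, the rule with the longest matching path wins (RFC 9309).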

Example:

Disallow: /assets/
Allow: /assets/logo.png

Everything inside /assets/ is blocked except /assets/logo.png.
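How a crawler resolves the Allow and Disallow pair above can be sketched in a few lines. This is a simplified illustration of longest-match evaluation as described in RFC 9309, not a full parser: the function is_allowed is hypothetical, it handles plain path prefixes only, and the * and $ wildcards are deliberately left out.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled under one User-agent group.

    `rules` is a list of (directive, prefix) pairs. The longest matching
    prefix wins; if an Allow and a Disallow match with equal length,
    Allow wins. Wildcards are not handled in this sketch.
    """
    best_length = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/assets/"), ("Allow", "/assets/logo.png")]
for path in ("/assets/logo.png", "/assets/app.css", "/blog"):
    print(path, is_allowed(path, rules))
```

Because the decision depends on match length, the order of the two lines in the file does not matter to crawlers that follow the RFC. Note that Python's built-in urllib.robotparser evaluates rules in file order instead, so it is not a reference for this behavior.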


Crawl-delay

Requests bots to slow down crawling.

Example:

Crawl-delay: 10

Meaning:

Wait 10 seconds between requests.

Crawl-delay is not part of the robots.txt standard (RFC 9309), so support varies: Bing honors it, while Googlebot ignores it.


Complete Robots Example

# Instructions for all bots
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?q=
Disallow: /api/private/

# Block AI crawlers
User-agent: GPTBot
Disallow: /

# Link to sitemap
Sitemap: https://example.com/sitemap.xml
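A quick way to sanity-check a file like this before deploying it is to run it through a parser. The sketch below uses urllib.robotparser from Python's standard library on the example above; the helper can_crawl and the sample URLs are assumptions for illustration, and the standard-library parser is simpler than what large search engines run, so treat the result as a rough check.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Instructions for all bots
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?q=
Disallow: /api/private/

# Block AI crawlers
User-agent: GPTBot
Disallow: /

# Link to sitemap
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request


def can_crawl(user_agent, url):
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(can_crawl("Googlebot", "https://example.com/blog"))
print(can_crawl("Googlebot", "https://example.com/admin/users"))
print(can_crawl("GPTBot", "https://example.com/blog"))
print(parser.site_maps())  # available in Python 3.8 and later
```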

Important Robots.txt Rules

Robots.txt does NOT enforce security.

It only tells polite bots what they should avoid.

Malicious crawlers can ignore it.

Therefore:

Never rely on robots.txt to protect sensitive data.

Use authentication or server access controls instead.


3. Web Manifest (site.webmanifest)

The Web Manifest is a JSON file defined by the W3C Web Application Manifest specification and is a core building block of Progressive Web Apps.

It allows a website to be installed and launched like a native application.

Browsers use this file to determine:

  • application name
  • icon set
  • start page
  • theme colors
  • display mode

Without a manifest, browsers have no app metadata to work with, so they generally cannot offer a full install experience through:

"Add to Home Screen"

Manifest File Location

Typically stored as:

/site.webmanifest

Linked in HTML:

<link rel="manifest" href="/site.webmanifest">

Manifest Structure

The file uses JSON format.

Main fields:

| Field | Purpose |
|---|---|
| name | full app name |
| short_name | icon label |
| start_url | entry page |
| display | UI style |
| icons | app icons |
| theme_color | browser theme |

Example Manifest

{
  "name": "Vision Management System",
  "short_name": "VisionOS",
  "description": "Glassmorphism task manager",
  "start_url": "/dashboard",
  "scope": "/",
  "display": "standalone",
  "background_color": "#0a0a0a",
  "theme_color": "#ffbf00",
  "orientation": "any",
  "icons": [
    {
      "src": "/icons/icon-192.png",
      "sizes": "192x192",
      "type": "image/png"
    },
    {
      "src": "/icons/icon-512.png",
      "sizes": "512x512",
      "type": "image/png"
    },
    {
      "src": "/icons/maskable-icon.png",
      "sizes": "512x512",
      "type": "image/png",
      "purpose": "maskable"
    }
  ]
}
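A manifest like this can be linted in a build step before it reaches a browser. The following is a minimal sketch, assuming a hypothetical checklist: the function check_manifest and the set of required icon sizes are illustrative choices, not the exact installability criteria of any particular browser.

```python
import json

# Assumed checklist for this sketch, not an official requirement list.
REQUIRED_ICON_SIZES = {"192x192", "512x512"}
DISPLAY_MODES = {"fullscreen", "standalone", "minimal-ui", "browser"}


def check_manifest(manifest_text):
    """Return a list of problems found; an empty list means none were found."""
    manifest = json.loads(manifest_text)
    problems = []
    if not (manifest.get("name") or manifest.get("short_name")):
        problems.append("missing name or short_name")
    if not manifest.get("start_url"):
        problems.append("missing start_url")
    # An absent display member falls back to "browser".
    if manifest.get("display", "browser") not in DISPLAY_MODES:
        problems.append("display is not a recognised mode")
    declared = set()
    for icon in manifest.get("icons", []):
        # "sizes" may hold several space-separated values.
        declared.update(icon.get("sizes", "").split())
    for size in sorted(REQUIRED_ICON_SIZES - declared):
        problems.append(f"missing {size} icon")
    return problems


sample = '{"name": "Demo", "icons": [{"src": "/a.png", "sizes": "192x192"}]}'
print(check_manifest(sample))
```

Returning a list of messages rather than raising on the first problem lets a build script report every issue in one run.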

Display Modes

Manifest supports several UI modes.

browser

Normal browser tab.

standalone

Looks like a native app.

No URL bar.

fullscreen

Completely full screen.

minimal-ui

Small navigation controls.

Most PWAs use:

"display": "standalone"

Icon System

PWAs need icons in several sizes. Chromium-based browsers expect at least a 192x192 and a 512x512 icon.

Common sizes:

| Icon | Purpose |
|---|---|
| 192x192 | Android home screen |
| 512x512 | splash screens |
| maskable icons | adaptive shapes |

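Maskable icons let Android crop the icon into its own adaptive shape without cutting into the logo, provided the important artwork stays inside the central safe zone (a circle whose radius is 40% of the icon width).

Example:

"purpose": "maskable"

Declare maskable icons as separate entries. Combining both purposes in one entry ("any maskable") is discouraged, because the padding a maskable icon needs makes it look too small when it is used as a regular icon.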

Manifest + Service Worker

To become a true PWA you need:

Manifest
+
Service Worker
+
HTTPS

Service workers enable:

  • offline functionality
  • caching
  • push notifications

Comparison Checklist

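| Feature | Sitemap | Robots.txt | Manifest |
|---|---|---|---|
| Purpose | SEO discovery | crawl control | PWA metadata |
| Visible to users | No | No | Yes (install prompt) |
| File format | XML | Plain text | JSON |
| Location | root (recommended) | root (required) | any path linked from the HTML (root by convention) |
| SEO importance | High | Medium | Low |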

HTML Integration

Inside the <head> section:

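<meta name="robots" content="index, follow">

<link rel="manifest" href="/site.webmanifest">

<meta name="theme-color" content="#ffbf00">

The sitemap needs no tag in the HTML. Crawlers find it through the Sitemap line in robots.txt or through submission in webmaster tools; <link rel="sitemap"> is not a standard link type that major search engines document.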

For iOS support:

<meta name="apple-mobile-web-app-capable" content="yes">
 
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
 
<meta name="apple-mobile-web-app-title" content="VisionOS">
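To confirm that a rendered page actually ships these tags, the head can be scanned with a small script. The sketch below uses html.parser from Python's standard library; the class HeadTagCollector, the function inspect_head, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeadTagCollector(HTMLParser):
    """Collect the manifest link and the robots / theme-color meta tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "manifest":
            self.found["manifest"] = attributes.get("href")
        elif tag == "meta" and attributes.get("name") in {"robots", "theme-color"}:
            self.found[attributes["name"]] = attributes.get("content")


def inspect_head(html):
    """Return a dict of the integration tags found in `html`."""
    collector = HeadTagCollector()
    collector.feed(html)
    return collector.found


page = (
    '<head><meta name="robots" content="index, follow">'
    '<link rel="manifest" href="/site.webmanifest">'
    '<meta name="theme-color" content="#ffbf00"></head>'
)
print(inspect_head(page))
```

A check like this fits naturally into an end-to-end test that fetches the built HTML and fails the build when a key is missing from the result.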

Implementation Best Practices

Sitemap

Best practice:

Automatically generate it during build.

Example tools:

next-sitemap
astro-sitemap
gatsby-plugin-sitemap

Submit it to search engines via their webmaster tools.


Robots.txt

Best practice:

Block only non-public content.

Examples:

/admin/
/tmp/
/internal/
/api/private/

Never block:

/css/
/js/
/images/

Otherwise search engines cannot render your pages properly.


Web Manifest

Best practice:

Provide multiple icon sizes.

Ensure:

icons/
 ├── icon-192.png
 ├── icon-512.png
 └── maskable-icon.png
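When preparing the maskable icon, the logo has to stay inside the safe zone, a centered circle whose radius is 40% of the icon size, because the platform may crop everything outside it. The arithmetic can be sketched as below; the function fits_safe_zone is hypothetical and assumes a rectangular logo centered on a square icon.

```python
import math


def fits_safe_zone(icon_size, logo_width, logo_height):
    """True if a centered rectangular logo stays inside the maskable safe zone.

    The safe zone is a centered circle with a radius of 40% of the icon size.
    """
    safe_radius = 0.4 * icon_size
    # The point of a centered rectangle farthest from the center is a corner.
    corner_distance = math.hypot(logo_width / 2, logo_height / 2)
    return corner_distance <= safe_radius


# On a 512x512 icon the safe radius is 204.8 px.
print(fits_safe_zone(512, 280, 280))
print(fits_safe_zone(512, 400, 400))
```

A 280 px square logo fits because its corners lie about 198 px from the center, while a 400 px square logo does not because its corners lie about 283 px out.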

Testing Your Setup

Test Robots.txt

Visit:

https://example.com/robots.txt

Search engines also provide testing tools.


Test Sitemap

Submit its URL in search engine webmaster dashboards such as Google Search Console or Bing Webmaster Tools.

Also validate using sitemap validators.


Test Manifest

In Chrome:

Right Click → Inspect
Application Tab → Manifest

The panel will show:

  • icons
  • theme colors
  • installability status

Key Takeaways

  • sitemap.xml improves search engine discovery
  • robots.txt controls crawler behavior
  • site.webmanifest enables installable web apps
  • All three files belong to the core infrastructure of modern web applications

Together they ensure your site is:

  • Searchable
  • Crawlable
  • Installable