Sitemap, Robots.txt, and Web Manifest

Modern websites are not designed only for users. They must also communicate clearly with:

search engine crawlers
indexing bots
AI scrapers
mobile operating systems
browsers acting as application containers

Three critical infrastructure files help with this:

File	Purpose
sitemap.xml	Helps search engines discover pages
robots.txt	Controls which bots can crawl what
site.webmanifest	Enables installable Progressive Web Apps

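robots.txt must live at the root of your host. A sitemap conventionally lives there too, and the manifest can live at any path your HTML links to, though the root is a common choice. All three are part of a well-structured production system.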

Example:


https://example.com/sitemap.xml
https://example.com/robots.txt
https://example.com/site.webmanifest

1. Sitemap Files (sitemap.xml)

A sitemap is a structured XML file that lists important pages on your website so search engines can discover and crawl them efficiently.

Search engines such as Google, Bing, and DuckDuckGo use sitemaps to understand your site's structure.

Even though bots can follow links automatically, a sitemap helps ensure that:

newly published pages are discovered faster
deep pages are not missed
crawl resources are used efficiently

Why Sitemaps Exist

Search engines allocate a crawl budget to each website.

Crawl budget means:


How many pages a search engine bot is willing to crawl
during a given time period.

If your website contains thousands of pages, bots might miss some of them.

A sitemap helps by explicitly listing the URLs that matter.

Basic Sitemap Structure

A sitemap follows an XML schema.

Main structure:


urlset
└── url
    ├── loc
    ├── lastmod
    ├── changefreq
    └── priority

Each <url> entry describes a single page.

Important Sitemap Tags

urlset

Root container for all URLs.


<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

The xmlns attribute declares the XML namespace of the Sitemap protocol (version 0.9), which tells parsers how to interpret the tags inside.

url

Represents a single webpage.

<url>
</url>

loc

The page URL.

<loc>https://example.com/blog</loc>

The value must be an absolute URL, including the protocol (https://).

lastmod

Indicates the last modification date.

<lastmod>2026-01-09</lastmod>

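The value uses W3C Datetime format: either a date (YYYY-MM-DD) or a full timestamp with a time zone offset.

Search engines may use it to decide when to recrawl a page, so it should reflect a real content change.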

changefreq

Tells bots how often the page changes.

<changefreq>daily</changefreq>

Possible values:

Value	Meaning
always	constantly changing
hourly	every hour
daily	daily updates
weekly	weekly updates
monthly	monthly
yearly	rarely updated
never	archived pages that do not change

This is only a hint, not a strict instruction. Google has stated that it ignores this value.

priority

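Importance of this URL relative to other URLs on the same site. It does not affect how the site ranks against other sites.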

<priority>1.0</priority>

Range:

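0.0 → lowest priority
1.0 → highest priority

The default is 0.5. Like changefreq, this is only a hint, and Google has stated that it ignores it.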

Example:

Page	Priority
Homepage	1.0
Blog posts	0.8
Archive pages	0.4

Full Sitemap Example

<?xml version="1.0" encoding="UTF-8"?>
 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-09T14:00:00+00:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
 
  <url>
    <loc>https://example.com/blog</loc>
    <lastmod>2026-01-08</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
 
</urlset>
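
A sitemap like the one above is usually generated rather than written by hand. The following is a minimal sketch using only Python's standard library; the build_sitemap function and the pages list are hypothetical names chosen for illustration, not part of any particular tool.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (as a str) for a list of page dicts.

    Each dict needs "loc"; "lastmod", "changefreq" and "priority" are optional.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Emit the child tags in the order used by the example above.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page data; a real build step would read this from a router or CMS.
pages = [
    {"loc": "https://example.com/", "lastmod": "2026-01-09",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "https://example.com/blog", "lastmod": "2026-01-08",
     "changefreq": "weekly", "priority": "0.8"},
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

ElementTree escapes characters such as & in URLs automatically, which is one reason to prefer it over string concatenation.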

Advanced Sitemap Types

Sitemaps can include specialized namespaces.

Examples:

Type	Purpose
image sitemap	image indexing
video sitemap	video discovery
news sitemap	Google News
multilingual sitemap	hreflang localization

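Example with images (the image: prefix must be declared on urlset as xmlns:image="http://www.google.com/schemas/sitemap-image/1.1", and each image:image element sits inside a url entry):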

<image:image>
  <image:loc>https://example.com/image.png</image:loc>
</image:image>

Sitemap Limits

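The sitemap protocol sets limits for each file:

Limit	Value
Max URLs per sitemap	50,000
Max file size	50 MB (uncompressed)

Sites that exceed either limit must split their URLs across several sitemaps and list them in a sitemap index.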

Example:

sitemap_index.xml
 ├── sitemap1.xml
 ├── sitemap2.xml
 └── sitemap3.xml
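
As a rough sketch of how the split might be automated, the code below chunks a URL list at the 50,000 limit and builds an index that points at each child file. The function names, the 120,000 generated URLs, and the sitemapN.xml naming are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML (as a str) listing the child sitemap URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical site with 120,000 pages, which needs three child sitemaps.
all_urls = [f"https://example.com/page/{n}" for n in range(120_000)]
chunks = chunk_urls(all_urls)
child_sitemaps = [
    f"https://example.com/sitemap{i}.xml" for i in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(child_sitemaps)
print(len(chunks), "child sitemaps")
```

Each chunk would then be written to its own file with an ordinary sitemap generator, and only the index is submitted to search engines.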

2. Robots Exclusion Protocol (robots.txt)

The robots.txt file tells bots how they are allowed to crawl your website.

Well-behaved crawlers fetch it before crawling other URLs on a host, and usually cache the result for a while.

Example request:

GET /robots.txt

Location must be:

https://example.com/robots.txt

If the file does not exist (the server answers with a 404), bots assume:

Everything is allowed.

Robots.txt Syntax

A robots file consists of directives, one per line, written as Name: value. Rules are grouped under the User-agent line they apply to.

Structure:

User-agent
Allow
Disallow
Crawl-delay
Sitemap

User-agent

Specifies which crawler the rule applies to.

Examples:

User-agent: *

Applies to all bots.

User-agent: Googlebot

Applies only to Googlebot, Google's main crawler.

Disallow

Blocks a path from crawling.

Example:

Disallow: /admin/

Bots cannot crawl anything under:

/admin/

Allow

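Overrides a disallow rule for a more specific path. When several rules match a URL, crawlers that follow RFC 9309 apply the most specific (longest) matching rule.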

Example:

Disallow: /assets/
Allow: /assets/logo.png

Everything inside /assets/ is blocked except logo.png.
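
To make the override behavior concrete, here is a simplified sketch of longest-match rule resolution. It assumes plain path prefixes only (no * or $ wildcards), and is_allowed is a hypothetical helper written for this example, not a function from any library.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled under (directive, prefix) rules.

    The longest matching prefix wins; on a tie, "allow" wins.
    If no rule matches, the path is allowed.
    """
    best_directive, best_length = "allow", -1
    for directive, prefix in rules:
        # An empty prefix matches nothing in this simplified model.
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_length
        tie_goes_to_allow = len(prefix) == best_length and directive == "allow"
        if longer or tie_goes_to_allow:
            best_directive, best_length = directive, len(prefix)
    return best_directive == "allow"


rules = [("disallow", "/assets/"), ("allow", "/assets/logo.png")]

print(is_allowed("/assets/logo.png", rules))  # True: the Allow rule is longer
print(is_allowed("/assets/app.css", rules))   # False: only Disallow matches
print(is_allowed("/blog", rules))             # True: no rule matches
```

Because the decision depends on rule length rather than file order, swapping the two lines in robots.txt gives the same result for crawlers that resolve rules this way.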

Crawl-delay

Requests bots to slow down crawling.

Example:

Crawl-delay: 10

Meaning:

Wait 10 seconds between requests.

Not all bots respect this directive. It is not part of the Robots Exclusion Protocol standard (RFC 9309), and Googlebot ignores it.
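
If you write your own crawler, Python's standard urllib.robotparser module can read this value. The sketch below is illustrative only: the bot name ExampleBot and the one-second fallback are arbitrary assumptions.

```python
import urllib.robotparser

# A small robots.txt, supplied as lines instead of being fetched over HTTP.
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /admin/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_lines)

# crawl_delay() returns the delay in seconds, or None if none is declared.
delay = parser.crawl_delay("ExampleBot")

# The fallback of 1 second is an arbitrary choice for this sketch.
pause_seconds = delay if delay is not None else 1
print(pause_seconds)
```

A crawler would then call time.sleep(pause_seconds) between requests to the same host.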

Complete Robots Example

# Instructions for all bots
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?q=
Disallow: /api/private/

# Block AI crawlers
User-agent: GPTBot
Disallow: /

# Link to sitemap
Sitemap: https://example.com/sitemap.xml
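
One way to check how a file like this is interpreted is to feed it to a parser. The sketch below uses Python's standard urllib.robotparser; the bot name ExampleBot is made up, and other crawlers may resolve edge cases differently from this module.

```python
import urllib.robotparser

ROBOTS_TXT = """\
# Instructions for all bots
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?q=
Disallow: /api/private/

# Block AI crawlers
User-agent: GPTBot
Disallow: /

# Link to sitemap
Sitemap: https://example.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic bot falls under the "*" group.
print(parser.can_fetch("ExampleBot", "https://example.com/blog"))
print(parser.can_fetch("ExampleBot", "https://example.com/admin/users"))

# GPTBot has its own group, which blocks the whole site.
print(parser.can_fetch("GPTBot", "https://example.com/blog"))

# Sitemap URLs declared in the file (requires Python 3.8 or later).
print(parser.site_maps())
```

Note that a crawler with its own group, such as GPTBot here, follows only that group and not the rules listed under User-agent: *.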

Important Robots.txt Rules

Robots.txt does NOT enforce security.

It only tells polite bots what they should avoid.

Malicious crawlers can ignore it.

Therefore:

Never rely on robots.txt to protect sensitive data.

Use authentication or server access controls instead.

3. Web Manifest (site.webmanifest)

The Web Manifest is defined by the W3C Web Application Manifest specification and is one of the building blocks of Progressive Web Apps.

It lets a website be installed and launched like a native application.

Browsers use this file to determine:

application name
icon set
start page
theme colors
display mode

Without a manifest, Chromium-based browsers do not offer to install the site, so users never see the prompt:

"Add to Home Screen"

Manifest File Location

Typically stored as:

/site.webmanifest

Linked in HTML:

<link rel="manifest" href="/site.webmanifest">

Manifest Structure

The file uses JSON format.

Main fields:

Field	Purpose
name	full app name
short_name	icon label
start_url	entry page
display	UI style
icons	app icons
theme_color	browser theme

Example Manifest

{
  "name": "Vision Management System",
  "short_name": "VisionOS",
  "description": "Glassmorphism task manager",
  "start_url": "/dashboard",
  "scope": "/",
  "display": "standalone",
  "background_color": "#0a0a0a",
  "theme_color": "#ffbf00",
  "orientation": "any",
  "icons": [
    {
      "src": "/icons/icon-192.png",
      "sizes": "192x192",
      "type": "image/png",
      "purpose": "any maskable"
    },
    {
      "src": "/icons/icon-512.png",
      "sizes": "512x512",
      "type": "image/png"
    }
  ]
}
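
A manifest can be sanity-checked in a build step before deployment. The sketch below is only a rough approximation: the check_manifest function is hypothetical, and the specific checks (a name, a start_url, a known display mode, 192x192 and 512x512 icons) are assumptions loosely based on common install criteria, which vary between browsers.

```python
import json

# A trimmed copy of the example manifest, embedded so the sketch runs on its own.
MANIFEST_JSON = """
{
  "name": "Vision Management System",
  "short_name": "VisionOS",
  "start_url": "/dashboard",
  "display": "standalone",
  "icons": [
    {"src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png"},
    {"src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png"}
  ]
}
"""

KNOWN_DISPLAY_MODES = {"fullscreen", "standalone", "minimal-ui", "browser"}


def check_manifest(manifest):
    """Return a list of human-readable problems (an empty list means none found)."""
    problems = []
    if not (manifest.get("name") or manifest.get("short_name")):
        problems.append("missing name or short_name")
    if "start_url" not in manifest:
        problems.append("missing start_url")
    if manifest.get("display") not in KNOWN_DISPLAY_MODES:
        problems.append("display is missing or not a known mode")
    # "sizes" may hold several space-separated values, so collect them all.
    declared_sizes = set()
    for icon in manifest.get("icons", []):
        declared_sizes.update(icon.get("sizes", "").split())
    for required in ("192x192", "512x512"):
        if required not in declared_sizes:
            problems.append(f"no icon declared at {required}")
    return problems


print(check_manifest(json.loads(MANIFEST_JSON)))
```

This only inspects the JSON. It does not confirm that the icon files exist or have the declared dimensions, which the browser's own tooling reports more reliably.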

Display Modes

Manifest supports several UI modes.

browser

Normal browser tab.

standalone

Looks like a native app.

No URL bar.

fullscreen

Completely full screen.

minimal-ui

Small navigation controls.

Most PWAs use:

"display": "standalone"

Icon System

PWAs require multiple icon sizes.

Common sizes:

Size	Purpose
192x192	Android home screen
512x512	splash screens
maskable icons	adaptive shapes

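Maskable icons let Android crop the icon into different shapes without cutting off the logo, provided the logo stays inside the central safe zone.

Declaring both purposes on one image is valid, but a dedicated maskable image with extra padding usually looks better, because the same file rarely suits both purposes.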

Example:

"purpose": "any maskable"

Manifest + Service Worker

To become a true PWA you need:

Manifest
+
Service Worker
+
HTTPS

Service workers enable:

offline functionality
caching
push notifications

Comparison Checklist

Feature	Sitemap	Robots.txt	Manifest
Purpose	SEO discovery	crawl control	PWA metadata
Visible to users	No	No	Yes (install prompt)
File format	XML	Plain text	JSON
Location	root (by convention)	root (required)	any linked path (often root)
SEO importance	High	Medium	Low

HTML Integration

Inside the <head> section:

<meta name="robots" content="index, follow">
 
<link rel="manifest" href="/site.webmanifest">
 
<meta name="theme-color" content="#ffbf00">
 
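<!-- Optional and non-standard: major search engines find sitemaps through the Sitemap: line in robots.txt or webmaster tools, not through this tag -->
<link rel="sitemap" type="application/xml" href="/sitemap.xml">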

For iOS support:

<meta name="apple-mobile-web-app-capable" content="yes">
 
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
 
<meta name="apple-mobile-web-app-title" content="VisionOS">

Implementation Best Practices

Sitemap

Best practice:

Automatically generate it during build.

Example tools:

next-sitemap
astro-sitemap
gatsby-plugin-sitemap

Submit it to search engines via their webmaster tools.

Robots.txt

Best practice:

Block only non-public content.

Examples:

/admin/
/tmp/
/internal/
/api/private/

Never block:

/css/
/js/
/images/

Otherwise search engines cannot render your pages properly.

Web Manifest

Best practice:

Provide multiple icon sizes.

Ensure:

icons/
 ├── icon-192.png
 ├── icon-512.png
 └── maskable-icon.png

Testing Your Setup

Test Robots.txt

Visit:

https://example.com/robots.txt

Search engines also provide testing tools, such as the robots.txt report in Google Search Console.

Test Sitemap

Upload it to search engine webmaster dashboards.

Also validate using sitemap validators.
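
A basic local check can catch common mistakes before you upload anything. The following sketch is not a full schema validator; check_sitemap is a hypothetical helper that only verifies the root element, the URL count, and that each loc is absolute.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NAMESPACES = {"sm": SITEMAP_NS}
MAX_URLS = 50_000


def check_sitemap(xml_text):
    """Return a list of problems found in a sitemap document (empty if none)."""
    try:
        root = ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as error:
        return [f"not well-formed XML: {error}"]

    problems = []
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")

    locs = [(el.text or "").strip() for el in root.findall("sm:url/sm:loc", NAMESPACES)]
    if len(locs) > MAX_URLS:
        problems.append(f"{len(locs)} URLs exceeds the limit of {MAX_URLS}")

    for loc in locs:
        parts = urlparse(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"loc is not an absolute URL: {loc!r}")
    return problems


# Sample document with one deliberate mistake: a relative URL.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>/blog</loc></url>
</urlset>"""

print(check_sitemap(sample))
```

Here the relative /blog entry is reported, while the absolute homepage URL passes.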

Test Manifest

In Chrome:

Right Click → Inspect
Application Tab → Manifest

The panel will show:

icons
theme colors
installability status

Key Takeaways

sitemap.xml improves search engine discovery
robots.txt controls crawler behavior
site.webmanifest enables installable web apps
All three files belong to the core infrastructure of modern web applications

Together they ensure your site is:

Searchable
Crawlable
Installable