URL: https://dev.to/yuki0510/logging-googlebot-crawls-for-free-with-cloudflare-workers-d1-35d3

Logging Googlebot Crawls for Free with Cloudflare Workers + D1


Introduction

When doing SEO work, there are times when you need to investigate whether Googlebot is properly crawling your pages.

Google Search Console has a crawl stats feature, but the sample URLs it surfaces are limited to 1,000 entries. For tracking the crawl status of specific pages over time, it falls a bit short.

Server access logs are the ideal solution for this kind of investigation.

I use this setup on LeapRows, a browser-based CSV tool I built on Vercel.

On a self-managed VPS or on-premises server, Nginx or Apache records Googlebot requests in its access log by default.

However, with serverless PaaS platforms like Vercel, there's no server management interface — which means no direct access to access logs.

This is where Cloudflare comes in. By routing your domain's DNS through Cloudflare, you can intercept requests with a Cloudflare Worker before they ever reach Vercel.

[Standard Vercel setup]
Googlebot → Vercel → Response (no logs)

[With Cloudflare]
Googlebot → Cloudflare Worker (logs recorded here) → Vercel → Response

By saving the logs captured by the Worker into Cloudflare's D1 (a SQLite-based database), you can collect Googlebot crawl logs without touching the Vercel side at all — and it runs entirely within the free tier.

This article walks through the setup step by step.

What you can collect

  • Crawl timing per URL (when each page was crawled)
  • Status code monitoring (detecting 4xx/5xx crawl errors)
  • Cache hit rate (DYNAMIC vs HIT)
  • Bot type breakdown (InspectionTool vs Googlebot)
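
As a rough illustration of the metrics listed above, the sketch below summarizes rows shaped like the crawl_logs table defined in Step 1. The function name summarizeCrawlLogs and the sample rows are made up for this example; real rows would come from a D1 query.

```javascript
// Hypothetical helper: summarizes crawl log rows (url, status, cache, bot_type).
function summarizeCrawlLogs(rows) {
  const byBot = {};        // crawl count per bot_type
  const errors = [];       // rows with a 4xx/5xx status
  let hits = 0;            // rows served from Cloudflare's cache
  let withCacheStatus = 0; // rows that carry any CF-Cache-Status value

  for (const row of rows) {
    byBot[row.bot_type] = (byBot[row.bot_type] || 0) + 1;
    if (row.status >= 400) {
      errors.push({ url: row.url, status: row.status });
    }
    if (row.cache) {
      withCacheStatus += 1;
      if (row.cache === "HIT") hits += 1;
    }
  }

  return {
    byBot,
    errors,
    // null when no row has a cache status, to avoid dividing by zero
    cacheHitRate: withCacheStatus ? hits / withCacheStatus : null,
  };
}

// Made-up sample data for illustration only.
const sampleRows = [
  { url: "/", status: 200, cache: "HIT", bot_type: "googlebot" },
  { url: "/pricing", status: 200, cache: "DYNAMIC", bot_type: "googlebot" },
  { url: "/old-page", status: 404, cache: "DYNAMIC", bot_type: "inspection-tool" },
  { url: "/blog", status: 200, cache: "HIT", bot_type: "googlebot" },
];

const summary = summarizeCrawlLogs(sampleRows);
console.log(summary.byBot, summary.errors, summary.cacheHitRate);
```

The same aggregation could be written as SQL GROUP BY queries against D1; doing it in JavaScript here just keeps the example self-contained.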

Prerequisites

  • Your domain is managed through Cloudflare
  • Node.js and the Wrangler CLI are available
  • Estimated time: ~30 minutes

Architecture Overview

[Figure: Architecture diagram. Googlebot sends a request to a Cloudflare Worker, which intercepts it and detects the bot, forwards the request to Vercel, and asynchronously saves the crawl log to a D1 (SQLite) database using ctx.waitUntil.]

The Worker intercepts every incoming request and forwards it to the origin, but writes a row to D1 only when the User-Agent identifies a Google crawler.

ctx.waitUntil keeps the Worker alive until the log write finishes, so the D1 insert runs after the response has been handed back and does not delay Googlebot.
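
To make the ordering concrete, here is a minimal local sketch of the waitUntil pattern. It assumes a hand-written stand-in for the Workers execution context (createFakeCtx, including its pending array and settle method, is invented for this example; the real ctx is supplied by the Workers runtime) and a slow fake log writer in place of the D1 insert.

```javascript
// Stand-in for the Workers execution context (assumption for local testing).
function createFakeCtx() {
  const pending = [];
  return {
    pending,
    // Like the real ctx.waitUntil: register background work without awaiting it.
    waitUntil(promise) {
      pending.push(promise);
    },
    // Hypothetical helper so a local script can wait for the background work.
    settle() {
      return Promise.all(pending);
    },
  };
}

// Hypothetical slow writer standing in for the D1 insert.
function slowSaveLog(events) {
  return new Promise((resolve) => {
    setTimeout(() => {
      events.push("log saved");
      resolve();
    }, 20);
  });
}

// Simplified handler: the response object stands in for `await fetch(request)`.
async function handle(ctx, events) {
  const response = { status: 200 };
  ctx.waitUntil(slowSaveLog(events)); // queued, not awaited
  events.push("response returned");
  return response;
}

async function demo() {
  const events = [];
  const ctx = createFakeCtx();
  await handle(ctx, events);
  await ctx.settle();
  return events;
}

// Logs [ 'response returned', 'log saved' ]
demo().then((events) => console.log(events));
```

The response is produced first and the log write completes afterwards, which is the property the Worker relies on.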


Step 0: Install the Wrangler CLI

Install the Wrangler CLI to manage Cloudflare from your terminal. Once installed, log in to your account.

npm install -g wrangler
wrangler login

Step 1: Create the D1 Database

Create a D1 database on Cloudflare.

wrangler d1 create googlebot-logs

The output will include a database_id — make a note of it.

✅ Successfully created DB 'googlebot-logs'
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" ← copy this

Next, create the table definition file and apply it to D1.

Note: without the --remote flag, the command runs against your local D1 instance instead of the remote one — don't forget it.

# Create schema.sql
cat > schema.sql << 'EOF'
CREATE TABLE IF NOT EXISTS crawl_logs (
 id INTEGER PRIMARY KEY AUTOINCREMENT,
 ts TEXT NOT NULL,
 url TEXT NOT NULL,
 method TEXT,
 status INTEGER,
 ua TEXT,
 ip TEXT,
 country TEXT,
 cache TEXT,
 referer TEXT,
 bot_type TEXT,
 content_length INTEGER
);

CREATE INDEX IF NOT EXISTS idx_ts ON crawl_logs(ts);
CREATE INDEX IF NOT EXISTS idx_url ON crawl_logs(url);
EOF

# Apply to D1
wrangler d1 execute googlebot-logs --file=schema.sql --remote

Step 2: Create the Worker

Create a Worker project locally.

mkdir googlebot-logger && cd googlebot-logger
npm init -y

Create wrangler.toml with the following content.

name = "googlebot-logger"
main = "src/index.js"
compatibility_date = "2024-01-01"

# Domain configuration
[[routes]]
pattern = "yourdomain.com/*" # enter your domain
zone_name = "yourdomain.com" # enter your domain

# D1 binding
[[d1_databases]]
binding = "DB"
database_name = "googlebot-logs"
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # ID from Step 1

Next, create src/index.js. Since we only want to track page-level crawls, static resource files under /_next/ (JS, CSS, etc.) are excluded from logging.

export default {
  async fetch(request, env, ctx) {
    // 1. Forward the request to the origin first
    const response = await fetch(request);

    // 2. Check the User-Agent
    const ua = request.headers.get("User-Agent") || "";
    const botType = detectGoogleBot(ua);

    // 3. If Googlebot, save the log asynchronously without delaying the response
    if (botType) {
      const logResponse = response.clone(); // clone before returning
      ctx.waitUntil(saveLog(env.DB, request, logResponse, ua, botType));
    }

    return response;
  },
};

// Identify the type of Googlebot.
// The specific patterns come first because the generic /Googlebot/ pattern
// would also match "Googlebot-Image", "Googlebot-Video", and "Googlebot-News".
function detectGoogleBot(ua) {
  if (/Googlebot-Image/i.test(ua)) return "googlebot-image";
  if (/Googlebot-Video/i.test(ua)) return "googlebot-video";
  if (/Googlebot-News/i.test(ua)) return "googlebot-news";
  if (/AdsBot-Google/i.test(ua)) return "adsbot";
  if (/Google-InspectionTool/i.test(ua)) return "inspection-tool";
  if (/Googlebot/i.test(ua)) return "googlebot";
  return null; // not Googlebot
}

// Save log to D1
async function saveLog(db, request, response, ua, botType) {
  const url = new URL(request.url);
  const path = url.pathname;
  const cf = request.cf || {};

  // Exclude static resource files so that only page URLs are logged
  if (
    path.startsWith('/_next/') ||
    path.startsWith('/_vercel/') ||
    path.startsWith('/static/') ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  ) {
    return;
  }

  try {
    // If Content-Length is absent, read the body to measure its size.
    // `response` is already a clone made for logging, so it can be consumed here.
    let contentLength = parseInt(response.headers.get('Content-Length') || '0', 10);
    if (!contentLength) {
      const buf = await response.arrayBuffer();
      contentLength = buf.byteLength;
    }

    await db.prepare(`
      INSERT INTO crawl_logs (ts, url, method, status, ua, ip, country, cache, referer, bot_type, content_length)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `).bind(
      new Date().toISOString(),
      path + url.search,
      request.method,
      response.status,
      ua,
      request.headers.get('CF-Connecting-IP') || '',
      cf.country || '',
      response.headers.get('CF-Cache-Status') || '',
      request.headers.get('Referer') || '',
      botType,
      contentLength
    ).run();
  } catch (e) {
    // Logging failures should never affect site availability
    console.error('Log save failed:', e.message);
  }
}
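
One possible refinement, sketched below, is to pull the static-file check into a small pure function. The names STATIC_PREFIXES, STATIC_EXTENSIONS, and isStaticAsset are introduced here for illustration; the prefixes and extensions are the same ones used in saveLog.

```javascript
// Path prefixes and file extensions treated as static assets (same set as saveLog).
const STATIC_PREFIXES = ["/_next/", "/_vercel/", "/static/"];
const STATIC_EXTENSIONS = /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/;

// Returns true when the path points at a static resource rather than a page.
function isStaticAsset(path) {
  return (
    STATIC_PREFIXES.some((prefix) => path.startsWith(prefix)) ||
    STATIC_EXTENSIONS.test(path)
  );
}

// A few sample paths (made up for illustration).
for (const path of ["/pricing", "/_next/static/chunks/main.js", "/logo.png"]) {
  console.log(path, isStaticAsset(path));
}
```

With a helper like this, the fetch handler could test `botType && !isStaticAsset(new URL(request.url).pathname)` before calling response.clone(), so responses for static files are never cloned at all. Whether that saving matters depends on how often Google's crawlers fetch your assets.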

Step 3: Cloudflare DNS Configuration

Configure Cloudflare to route traffic through the Worker.

Verify SSL/TLS encryption mode

Go to SSL/TLS → Overview in the Cloudflare dashboard and confirm the encryption mode is set to Full.

Leaving it on Flexible and then enabling the proxy can cause an HTTPS redirect loop that takes your site down — worth checking first.

Enable proxy on your DNS record

Go to DNS → Records, find the A record for your domain, and click Edit.

[Screenshot: Cloudflare DNS Records page showing an A record for leaprows.com and its Proxy status]

Enable the Proxy status toggle and save. The icon will turn into an orange cloud, which means requests now pass through Cloudflare's proxy, where the Worker route can intercept them once the Worker is deployed.

[Screenshot: Cloudflare DNS record edit form with the Proxy status toggle]


Step 4: Deploy

With Cloudflare configured, deploy the Worker from your local project.

wrangler deploy

That's everything needed to start collecting logs.


Step 5: Verify

To confirm logs are being recorded, run a live test from Google Search Console → URL Inspection → Test Live URL.

[Screenshot: Google Search Console URL Inspection tool]

Search Console's live test uses the Google-InspectionTool User-Agent, so in our setup it will be recorded with bot_type = inspection-tool.

After the test completes, check D1 with the following command:

wrangler d1 execute googlebot-logs --remote --command="SELECT * FROM crawl_logs ORDER BY ts DESC LIMIT 5"

If you see a row with inspection-tool in the bot_type column, everything is working correctly.

[Screenshot: terminal output of a wrangler d1 execute command showing crawl log records in D1, including their bot_type values]


Free Tier

At roughly 500 bytes per record, the 5 GB free tier holds approximately 10 million records. For an indie SaaS or personal site, you're unlikely to come close to the limit.

Service            Free tier
Workers            100,000 requests / day
D1 rows written    100,000 rows / day
D1 storage         5 GB (total across all databases)
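
The arithmetic behind that estimate can be sketched as follows. The 500-byte record size and the limits are the figures quoted above; the 2,000 crawls per day is a made-up input, and estimateCapacity is a name introduced for this example.

```javascript
const BYTES_PER_RECORD = 500;              // rough per-row size from the text
const STORAGE_LIMIT_BYTES = 5e9;           // 5 GB, treated as decimal gigabytes
const ROWS_WRITTEN_PER_DAY_LIMIT = 100000; // D1 free-tier daily write limit

// Estimates how long the free tier lasts at a given crawl volume.
function estimateCapacity(crawlsPerDay) {
  const maxRecords = Math.floor(STORAGE_LIMIT_BYTES / BYTES_PER_RECORD);
  return {
    maxRecords,
    daysUntilFull: Math.floor(maxRecords / crawlsPerDay),
    withinDailyWriteLimit: crawlsPerDay <= ROWS_WRITTEN_PER_DAY_LIMIT,
  };
}

const estimate = estimateCapacity(2000); // hypothetical crawl volume
console.log(estimate);
// → { maxRecords: 10000000, daysUntilFull: 5000, withinDailyWriteLimit: true }
```

Treat the result as an upper bound: the two indexes on crawl_logs also take up storage, so the real capacity is somewhat lower.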

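If you'd like to keep things tidy, you can add a cron trigger that automatically deletes old logs. Because ts is stored in the ISO 8601 format produced by toISOString(), the cutoff is built in the same format so that the string comparison lines up exactly:

# Append to wrangler.toml
[triggers]
crons = ["0 0 * * SUN"] # runs every Sunday at 00:00 UTC

// Add to src/index.js
async function scheduled(event, env, ctx) {
  // Cutoff formatted like ts, e.g. 2025-01-01T00:00:00.000Z
  await env.DB.prepare(`
    DELETE FROM crawl_logs
    WHERE ts < strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-90 days')
  `).run();
}

// Replace the existing default export so that it registers both handlers
export default {
  async fetch(request, env, ctx) {
    // ... existing fetch handler code ...
  },
  scheduled,
};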
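
An alternative to computing the cutoff in SQL is to build it in JavaScript and bind it as a parameter. The sketch below assumes ts values written with toISOString(), as in saveLog; retentionCutoff and isExpired are names introduced for this example.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the ISO 8601 timestamp `days` days before `now`.
function retentionCutoff(now, days) {
  return new Date(now.getTime() - days * MS_PER_DAY).toISOString();
}

// ISO 8601 UTC strings of equal length sort chronologically,
// so a plain string comparison is enough.
function isExpired(ts, cutoff) {
  return ts < cutoff;
}

const cutoff = retentionCutoff(new Date("2025-04-01T00:00:00.000Z"), 90);
console.log(cutoff); // → 2025-01-01T00:00:00.000Z

// In the Worker, the cutoff could then be bound as a parameter:
//   await env.DB.prepare("DELETE FROM crawl_logs WHERE ts < ?").bind(cutoff).run();
```

Keeping both sides of the comparison in one format matters because SQLite's datetime() returns values like 2025-01-01 00:00:00, with a space where the ISO string has a T. Comparing the two as text still works at day granularity but is imprecise for rows on the boundary day.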

Conclusion

Serverless PaaS platforms like Vercel don't expose server access logs, but by placing Cloudflare in front of your site as a reverse proxy you can collect Googlebot crawl logs without any changes to your server-side code.

The D1 free tier is more than generous enough for small to mid-sized sites, making this essentially free to run.

As a next step, you could join this data with Google Search Console exports to analyze the relationship between crawl frequency and indexing status.
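
A rough sketch of such a join is shown below. The shape of the indexing data is an assumption: indexRows uses hypothetical url and indexed fields, with url already normalized to the same path form that the Worker stores (path plus query string). A real Search Console export would need to be mapped into that shape first.

```javascript
// Joins crawl log rows with per-page indexing data by URL.
function joinCrawlsWithIndexStatus(crawlRows, indexRows) {
  // Aggregate crawl count and most recent crawl time per URL.
  const stats = new Map();
  for (const row of crawlRows) {
    const entry = stats.get(row.url) || { crawls: 0, lastCrawled: null };
    entry.crawls += 1;
    // ISO 8601 UTC timestamps compare correctly as strings.
    if (entry.lastCrawled === null || row.ts > entry.lastCrawled) {
      entry.lastCrawled = row.ts;
    }
    stats.set(row.url, entry);
  }

  // Attach the crawl stats to every page in the indexing data.
  return indexRows.map((page) => {
    const entry = stats.get(page.url) || { crawls: 0, lastCrawled: null };
    return { url: page.url, indexed: page.indexed, ...entry };
  });
}

// Made-up sample data for illustration only.
const crawlRows = [
  { url: "/", ts: "2025-03-01T08:00:00.000Z" },
  { url: "/", ts: "2025-03-05T09:30:00.000Z" },
  { url: "/pricing", ts: "2025-03-02T12:00:00.000Z" },
];
const indexRows = [
  { url: "/", indexed: true },
  { url: "/never-crawled", indexed: false },
];

const joined = joinCrawlsWithIndexStatus(crawlRows, indexRows);
console.log(joined);
```

Pages that appear in the indexing data but have zero crawls in the log are often the most interesting rows to look at first.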