
Race Condition
Track Googlebot in Analytics - without touching log files
Google Analytics

December 1, 2018

Ready to go in 10 minutes

I will not start this article by explaining what log file analysis in SEO is or when and why it is needed. There are better overviews for that, courtesy of Builtvisible.

So if you know about log file analysis, you will agree that it can make a lot of sense to have the needed data available in a powerful tool such as Google Analytics instead of wading through the raw web server log files. At least the colleagues at DeepCrawl make that case in a blog post.

While this idea is not new, the question that usually pops up is how to get the data from your log files into Analytics. There are different solutions, one of which is Log Hero, a server-side plugin that is predominantly available for WordPress. In other words, such solutions usually depend on your server infrastructure, and pro plans might come with an unreasonable price tag.

Fill your Analytics without touching your server infrastructure

Now here’s a hint at what is probably the easiest solution to get running. It only recently became possible when Cloudflare introduced its Workers product, basically a way to intercept HTTP traffic and run JavaScript on client requests and/or server responses, executed on Cloudflare’s own infrastructure.

As we’re talking about Cloudflare, the following prerequisites apply:

  1. You’re willing to put your web property behind the Cloudflare CDN, i.e., you configure your domain to resolve via Cloudflare’s nameservers and route all traffic through Cloudflare. This part is free.

  2. You’re willing to upgrade your free Cloudflare plan with a Workers subscription of $5/month.

As a concrete example, I will take this Medium publication, which I put behind Cloudflare precisely to enable the tracking of Googlebot. It’s a great example because I have no other way to get at server log files for this publication: it all runs on Medium’s infrastructure.

I will not go into the details of how to set up Cloudflare for Medium in general. Please jump over to Imrat Jn’s nice article for that.

I suggest you open a new Analytics property just for bot tracking. Of course you can also use the property you already use for the domain, but then it may be better to track Googlebot activity as events rather than pageviews, depending on your preferences.
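If you go the event route, the hit type and a few event fields in the Measurement Protocol payload have to change. The following is only a minimal sketch of what such a payload could look like. The helper name buildEventPayload, the category name 'Bot Crawl' and the sample values are assumptions made for illustration and are not part of the script in this article.

```javascript
// Hypothetical helper: builds a Measurement Protocol (v1) query string
// that records a bot request as an event instead of a pageview.
function buildEventPayload(analyticsId, hit) {
  const params = new URLSearchParams({
    v: '1',            // protocol version
    t: 'event',        // hit type: event rather than pageview
    tid: analyticsId,  // property ID
    cid: hit.clientId, // anonymous client ID
    ec: 'Bot Crawl',   // event category (assumed name, pick your own)
    ea: hit.botName,   // event action: which crawler sent the request
    el: hit.url,       // event label: the requested URL
    ni: '1'            // non-interaction hit
  })
  // URLSearchParams percent-encodes every value for us
  return params.toString()
}

// Sample values, made up for this example
const payload = buildEventPayload('UA-xxxxxxxxx-x', {
  clientId: '555.1543622400',
  botName: 'Googlebot/2.1',
  url: 'https://example.com/robots.txt'
})
console.log(payload)
```

An event hit needs at least a category (ec) and an action (ea). Using the bot name as the action and the URL as the label is just one possible mapping.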

Assuming the general Cloudflare setup is finished, please head over to the Workers section of your Cloudflare domain settings and hit ‘Launch editor’.

Copy the script below into the editor and set analyticsId to the tracking ID of your property.

Hit the save button and your Worker is live. What happens now is that whenever Google sends a request to your domain, be it to grab the robots.txt or actual content, the hits will be tracked in Analytics.

Without modifying the script, four custom dimensions will be sent to Analytics:

CD1 → HTTP Status (200, 404, ...)
CD2 → Bot Name (one of Google’s Crawlers as per this disclosure)
CD3 → Request Method (GET, POST, ...)
CD4 → Hit timestamp in seconds (Unix time)
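To get a feeling for the values that end up in CD2, the bot detection can be tried outside of a Worker. The sketch below reuses the regular expression from the script. The helper name detectBot and the list of sample user agent strings are assumptions added for illustration.

```javascript
// Same pattern as in the Worker script
const botPattern = /[^\s]+\-Google[^\s;]*|Googlebot[^\s;]*/g

// Hypothetical helper: returns the first crawler token found, or null
function detectBot(userAgent) {
  const matches = (userAgent || '').match(botPattern)
  return matches ? matches[0] : null
}

// Sample user agent strings, three Google crawlers and one regular browser
const samples = [
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  'AdsBot-Google (+http://www.google.com/adsbot.html)',
  'Googlebot-Image/1.0',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
]

for (const ua of samples) {
  console.log(detectBot(ua), '<-', ua)
}
```

For these samples the helper yields Googlebot/2.1, AdsBot-Google, Googlebot-Image/1.0 and null. Note that the version number is part of the value wherever the crawler sends one, so CD2 will contain entries like Googlebot/2.1 rather than plain Googlebot.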

You must make sure that the custom dimension definitions in your Analytics property match the configuration of the script. Other than that, you’re ready to monitor Googlebot activity on your domain without ever touching the underlying server systems \o/

const analyticsId = 'UA-xxxxxxxxx-x'

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event))
})

/**
 * Check request object for Googlebot UA to send tracking data
 * @param {Event} event
 */
async function handleRequest(event) {

  const request = event.request
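  // Fall back to an empty string: a request without a User-Agent header would otherwise throw on .match()
  const ua = request.headers.get('user-agent') || ''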
  let botName;

  
  // If Googlebot then track hit in Analytics
  if ((botName = ua.match(/[^\s]+\-Google[^\s;]*|Googlebot[^\s;]*/g))) {
    const response = await fetch(request)
    event.waitUntil(analyticsHit(
      {
        uip: request.headers.get('CF-Connecting-IP'),
        dl: request.url,
        cd1: response.status,
        cd2: botName[0],
        cd3: request.method,
        cd4: Math.round(+new Date() / 1000.0)
      }
    ))
    return response
  }
  
  // or just return the original content
  return fetch(request)

}

/**
 * Send bot tracking data using Analytics Measurement Protocol
 * @param {Object} tracking
 */
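function analyticsHit(tracking) {
  // Random number plus timestamp as client ID, so every bot hit counts as a separate client
  const cid = [Math.round(Math.random() * 2147483647), Math.round(Date.now() / 1000)].join('.')
  let payload = '?v=1&t=pageview&tid=' + analyticsId + '&cid=' + cid
  for (const key in tracking) {
    // Encode each value on its own: encodeURI() on the whole URL leaves '&', '=' and '#'
    // untouched, so a crawled URL with a query string would corrupt the payload
    payload += '&' + key + '=' + encodeURIComponent(tracking[key])
  }
  return fetch('https://www.google-analytics.com/collect' + payload)
}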