Ghostbusting Google Analytics

… And exhausted to tears, surrounded by empty vials of holy water, bags that once contained salt, discarded iron bars and the burned bones of Vitaly, who no longer ruled Google, the web developer finally sealed the gateway to the beyond, banishing the ghosts from Google Analytics stats forever.

I wish. That would have been a lot more fun and made for a better story.

The true, and considerably less interesting, version is that I spent six weeks or so this spring battling ghost spammers, picking apart stats, searching the web for solutions, trying them out for a few days and adjusting or discarding them. Since I’m managing Google Analytics on 70 web sites, this consumed far too much time I needed to spend on other projects. I finally found something that works and is easier to do than a lot of the other methods out there.

What Doesn’t Work

If you start searching the web to identify and block specific ghost referrals, the how-tos that come up will each cover one bot or a group of similar bots. The standard method is to create a custom filter that targets a referral, campaign source or search term.*

The articles explain how to create a pattern to match one or a series of bots you want to block, such as this one blocking event-tracking.com:

bad filter in Google Analytics

This works a bit, but not enough, for two reasons:

  1. The spambots keep changing. Filter out 20 and 20 more will spring up. This is the same reason that Google’s bot and spider filter doesn’t work.
  2. Ghost spammers play dirty and inject bad data into multiple fields simultaneously. Even if you set up a successful referral filter, the same spam may come in as a campaign source at the same time. I ran into this trying to block event-tracking.com.
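
To see why a single-field exclude filter leaks, here is a minimal Python sketch. The hit records, the field names and the `exclude` helper are hypothetical stand-ins, not Google Analytics data structures. The point is only that a pattern applied to one field never sees spam carried in another field, or hits that carry no identifying data at all.

```python
import re

# Hypothetical hits; the field names are illustrative, not a Google Analytics schema.
hits = [
    {"referral": "event-tracking.com", "campaign_source": None},
    {"referral": None, "campaign_source": "event-tracking.com"},
    {"referral": None, "campaign_source": None},  # "direct" ghost hit, nothing to match on
    {"referral": "news.example.org", "campaign_source": None},  # a legitimate referral
]

def exclude(hits, field, pattern):
    """Drop hits whose given field matches the pattern, like a one-field exclude filter."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [h for h in hits if not (h[field] and regex.search(h[field]))]

remaining = exclude(hits, "referral", r"event-tracking\.com")
print(len(remaining), "of", len(hits), "hits survive the referral filter")
```

Only the first hit is removed. The campaign-source copy of the same spam and the bare direct hit both pass through untouched.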

When I captured the stats for a four-day period in which ghost spammers hit the European Studies site, I found out just how dirty they play.

Google recorded 153 sessions for the site during that time. Filtering out specific bots took that number down to 135. Then I compared the results to a new filter using a different method to remove bots (I’ll describe it in a moment). It showed that exactly one session of the entire 153 was legitimate. Here’s a table of the stats:

Stats for European Studies site for a four-day period, spring 2015

Data           Unfiltered   Filtered by Bot   Filtered by Hostname
All Sessions   153          135               1
Referrals      19           1                 1
Direct         134          134               0

Those 134 extra sessions the bot filter missed weren’t referrals at all, nor were they any of the other types of ghosts I’d seen. They’re in the stats as direct visits with no other data attached. Because there’s no identifier, you can’t filter them out using an exclude filter.
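
For anyone who wants to check the arithmetic, this short Python sketch reworks the numbers from the table above. The variable names are mine; the counts are the ones reported for the site.

```python
# Session counts copied from the table above.
unfiltered = {"referrals": 19, "direct": 134}
bot_filtered = {"referrals": 1, "direct": 134}
hostname_filtered = {"referrals": 1, "direct": 0}

total = sum(unfiltered.values())
removed_by_bot_filter = total - sum(bot_filtered.values())
removed_by_hostname_filter = total - sum(hostname_filtered.values())
spam_share = removed_by_hostname_filter / total

print(f"Bot filter removed {removed_by_bot_filter} of {total} sessions")
print(f"Hostname filter removed {removed_by_hostname_filter} ({spam_share:.1%})")
```

The bot filter caught 18 sessions, all of them referrals. The hostname filter caught 152, which is more than 99 percent of everything recorded.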

So what do you do to get rid of them?

What Does Work

Ironically, once you know the solution you’re looking for, the information is fairly easy to find. After I stumbled across the answer I found good articles on this technique at Megalytic and Analytics Edge, among others.

Those articles are for people who work regularly with Google Analytics. Here’s the step-by-step solution written for people who don’t spend their time thinking about bounce rates and average session durations:

1. Log into Google Analytics and go to the stats for the view you want to remove ghost spam from. In the date selector in the top-right corner, select a large date range (I went back a full year plus a few days when I was doing this).

date selector in Google Analytics

2. Now, under Audience in the left sidebar of the page, find the Technology subheading. If you expand that you’ll see Network. Go to Network. Just below the stats graph on that page you’ll see Primary Dimension links. Click on Hostname.

hostname link in Google Analytics

3. Because the spammers don’t know what sites they’re visiting — they have the Analytics key, not the site name or URL — this will be full of incorrect, fake or unset hostnames. What you need to do now is ignore them and compile a list of the ones that are good hostnames. Those would be all the ones associated with your site. If you see google.com, that’s spam, but anything from googleusercontent.com (usually from Google Translate) is probably good. Be sure to include hostnames of any social media you’re tracking with Google Analytics.

4. Using the list of good hostnames, create a filter pattern. For the European Studies site, mine was (I’m replacing the actual university name with myuniversity here):

.*(myuniversity\.edu|googleusercontent\.com).*

This means “anything that contains myuniversity.edu or googleusercontent.com; it may have other characters before or afterwards.” If you don’t know much about regular expressions, Google Analytics Help has an article on them. If you’re still stumped and have access to a web developer or other programmer, they should be able to help you figure it out.
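
If you want to try a pattern before putting it into a filter, a few lines of Python will do as a rough check. This is a sketch under two assumptions: that Python's `re` module treats this simple pattern the same way Google Analytics does, and that the sample hostnames, which I made up, resemble what you see in your own Hostname report.

```python
import re

# The pattern from step 4, with myuniversity standing in for the real name.
# IGNORECASE mirrors leaving the filter's case-sensitive option off.
pattern = re.compile(r".*(myuniversity\.edu|googleusercontent\.com).*", re.IGNORECASE)

# Hypothetical hostnames of the kind that show up in the Hostname report.
samples = [
    "www.myuniversity.edu",
    "europeanstudies.myuniversity.edu",
    "translate.googleusercontent.com",
    "google.com",
    "(not set)",
    "myuniversity-edu.example.com",
]

for host in samples:
    verdict = "keep" if pattern.search(host) else "drop"
    print(f"{verdict}: {host}")
```

The first three hostnames are kept and the last three are dropped. The backslashes before the dots matter: an unescaped dot matches any character, so without them the look-alike myuniversity-edu.example.com would slip through.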

5. Create a safety net before you continue. Go into Google Analytics admin and create a copy of the view you’re going to work on. Make sure this has all the same settings as the original. I added Unfiltered and Filtered to the names of my views so I’d recognize them immediately.

list of views in Google Analytics

6. For the view you want to remove the ghost spam from (in this case it’s European Studies Filtered), go to Filters and click the New Filter button. Give the filter a name such as Valid Hostname and select the Custom tab. Check Include and select Hostname as the Filter Field. Then add the pattern you created to Filter Pattern. Don’t make it case sensitive. Then apply the filter to the view you want spam free and save your work.

filter to clean out ghosts

7. Go in and check your work after two or three days because it’s easy to make a mistake. When I set up test views to try this out, I accidentally clicked Exclude instead of Include on one site, and the stats for that view flatlined. (That’s one reason why you want an unfiltered view as a safety net.)
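
The check itself can be as simple as comparing daily session counts from the two views. Here is a minimal Python sketch of that comparison. The function name and the sample numbers are hypothetical, and it assumes you type the counts in by hand from each view.

```python
def check_filtered_view(unfiltered, filtered):
    """Return a list of warnings from comparing daily session counts of two views."""
    warnings = []
    if sum(unfiltered) > 0 and sum(filtered) == 0:
        warnings.append("Filtered view flatlined: check Include vs. Exclude and the pattern.")
    if any(f > u for u, f in zip(unfiltered, filtered)):
        warnings.append("Filtered view exceeds unfiltered on some day: the views may not match.")
    return warnings

# Hypothetical three days of sessions, typed in by hand from each view.
flatlined = check_filtered_view([40, 38, 45], [0, 0, 0])
healthy = check_filtered_view([40, 38, 45], [12, 9, 14])
print(flatlined)
print(healthy)
```

An empty list means nothing obviously wrong turned up. It does not prove the pattern is right, so it is still worth looking at the Hostname report in the filtered view.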

Final Notes

The hostname filter is a low-maintenance solution. Once it’s in place, you only need to recheck your filter pattern after changes to your hostnames or when you add or remove social media. I’d check it a couple of times a year anyway to be sure spammers haven’t come up with something new.

Finally, this does absolutely nothing to deal with real spambots. If you’re having problems with semalt.com or any of the thousands of other bots that actually visit web sites, you’ll have to handle those separately.

If you have only a handful to block, you could put those in the .htaccess file on your server. There are Drupal modules and WordPress plugins to help with bots and other security concerns. If you need a more robust approach, look into Project Honey Pot.


*Quick definitions: referrals are outside links to your site; Campaign Source is traffic from AdWords and other ad campaigns; Search Terms are search-engine keywords that led to your site; and direct hits are visits from people who reach a site by typing in the address or using a browser bookmark.