I, robot? Don't believe your web stats

Paola Kathuria

2006

Log analysers are not accurate. They over-report visits and over-count some browsers while under-counting other browsers. They cannot accurately distinguish spiders and robots from human visitors and they do not use fool-proof techniques for counting visits and visitors.

Introduction

Spiders and robots are programs which are sent out to web sites to index them, check links or fetch content.

Your web browser is also a program: spiders and robots are essentially no different to a web browser in terms of what they can do. And, as it happens, some browsers can be set up to send out link-checking robots.

There are thousands of spiders and robots visiting web sites. We've been compiling a database of them for the past 10 years. We use it to filter out spiders and robots from our own site stats.

A staggering 90% of page requests to this web site (www.limov.com), for example, are by spiders and robots.

There are a handful of legitimate spiders but many are programs created to harvest e-mail addresses, copy content or to look for vulnerabilities in your web server. It is in the originator's interests that their programs look like regular visitors so that they gain full access to your site.

I use examples from actual logs in this article, mostly from this web site.

The difference between spiders and robots

Spiders visit web sites and follow links on a page, normally to collect content so that it can be indexed by search engines. They can also collect specific content such as e-mail addresses, images, and PDFs.

Robot is a term I use to describe programs which make single-hit visits, often hitting the same page at regular intervals. Robots don't follow links.

You don't need to be a computer expert to have your own spider. Source code is freely available online.

I will use 'bot' for the remainder of this article to refer to both spiders and robots.

A brief introduction to server logs

In this section, I'll be covering server log structure, how bots are supposed to identify themselves and how you can find rogue bots in logs. If you know all this, skip to the next section.

What server logs look like

You need to know the difference between requests and hits to be able to interpret web logs and stats. Requests refer to pages. Most web pages include a mixture of text and images. The images are included in the page as links to files on the server. If a web page includes 10 graphics, accessing the page will result in 1 request and 11 hits, with one log line per hit.

Every hit made to a web site is logged. A hit can vary from finding out whether a page or file has been updated to fetching web pages, style sheets, images and other files, such as PDFs.
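The difference between hits and requests can be sketched in a few lines of Python. This is only an illustration: the list of file extensions treated as supporting files is my assumption, and the function names are made up for the example.

```python
from urllib.parse import urlsplit

# Assumed list of extensions for supporting files; adjust for your own site.
ASSET_EXTENSIONS = (".css", ".gif", ".jpg", ".ico", ".png", ".js")

def is_page_request(url):
    """Treat anything that isn't a supporting file as a page request."""
    path = urlsplit(url).path.lower()
    return not path.endswith(ASSET_EXTENSIONS)

def count_hits_and_requests(urls):
    """Return (hits, page requests) for a list of logged URLs."""
    urls = list(urls)
    requests = sum(1 for url in urls if is_page_request(url))
    return len(urls), requests

# One page plus three of the files it pulls in
logged = ["/colour/", "/css/screen.css", "/images/l-limov.gif", "/p.gif"]
print(count_hits_and_requests(logged))  # (4, 1)
```

Every logged line counts as a hit, but only the page itself counts as a request.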

Failed requests are also logged, for example when a page has been removed or when it's password-protected. Failed hits also include hacking attempts to invoke vulnerabilities in (mostly) Microsoft Windows servers.

This information is logged for each hit:

IP address or host - where the request comes from
username - the username of an authenticated user (via .htaccess)
date/time - date and time of access
request - request type (GET / POST / HEAD)
URL - what was requested
version - HTTP version
status code - success/failure code
size - number of bytes downloaded
referrer - URL of referring page
user-agent - how the browser identifies itself

'Agent' is a term used for tools sent out to act on your behalf. Browsers and bots are agents.

Here are actual log lines resulting from a visitor displaying one web page, the Colour Selector entrance page (I've changed the IP address):

255.60.45.22 - - [04/Mar/2006:00:40:56 +0000] "GET /colour/ HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:57 +0000] "GET /colour/?ID=WQVZHB7F7N30D00 HTTP/1.1" 302 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /colour/ HTTP/1.1" 200 - "http://www.google.de/search?hl=de&client=firefox-a&rls=org.mozilla:en-US:official&q=color+scheme+library&spell=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen.css HTTP/1.1" 200 6600 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/icons/colour-favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-libr.css HTTP/1.1" 200 1087 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/screen-nav.css HTTP/1.1" 200 1560 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /css/print.css HTTP/1.1" 200 2167 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/g-libr.jpg HTTP/1.1" 200 4841 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/l-limov.gif HTTP/1.1" 200 3613 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /p.gif HTTP/1.1" 200 49 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-serv.gif HTTP/1.1" 200 751 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-port.gif HTTP/1.1" 200 565 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-abou.gif HTTP/1.1" 200 950 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] "GET /images/tb-cont.gif HTTP/1.1" 200 932 "http://www.limov.com/css/screen-nav.css" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/p-my-yc-cm.gif HTTP/1.1" 200 2280 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-cs.gif HTTP/1.1" 200 106 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-nv.gif HTTP/1.1" 200 70 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-sw.gif HTTP/1.1" 200 133 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
255.60.45.22 - - [04/Mar/2006:00:41:00 +0000] "GET /images/colour/b00-bg.gif HTTP/1.1" 200 877 "http://www.limov.com/colour/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
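Log lines like the ones above can be split into the fields listed earlier with a regular expression. The following is a minimal sketch, assuming your server writes lines in the same layout as these samples (often called the "combined" format). The pattern and the field names are my own choices for illustration.

```python
import re

# Pattern for log lines laid out like the samples in this article.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the logged fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('255.60.45.22 - - [04/Mar/2006:00:40:59 +0000] '
          '"GET /css/screen.css HTTP/1.1" 200 6600 '
          '"http://www.limov.com/colour/" '
          '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) '
          'Gecko/20060111 Firefox/1.5.0.1"')

hit = parse_line(sample)
print(hit["host"], hit["url"], hit["status"], hit["size"])
# 255.60.45.22 /css/screen.css 200 6600
```

The size is kept as text because it can be logged as "-", as in the first three sample lines.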

How to find spiders and robots in logs

By the user-agent
If it fetches robots.txt
By its behaviour

Spiders and robots usually identify themselves in the user-agent. However, there is no standard text that they're supposed to add to the user-agent (such as "I'm a spider") so that they can be found.

This means that there is no automatic way that programs which process logs to produce stats can detect them. A list of spider and robot user-agents and IP addresses must be maintained on an on-going basis so that these visitors are not included in your regular web site stats. This does not happen automatically.
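As a rough sketch of what such a filter might look like, here is a check against a hand-maintained list. The two lists are hypothetical and tiny, with names taken from bots that appear in this article. A real list would run to thousands of entries and need regular updating.

```python
# Hypothetical, hand-maintained lists; real ones need regular updating.
KNOWN_BOT_AGENT_FRAGMENTS = ["googlebot", "baiduspider", "linkwalker"]
KNOWN_BOT_HOSTS = {"202.108.22.72"}

def is_known_bot(host, user_agent):
    """True if the host or the user-agent matches an entry in the lists."""
    if host in KNOWN_BOT_HOSTS:
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in KNOWN_BOT_AGENT_FRAGMENTS)

print(is_known_bot("66.249.71.53",
                   "Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_known_bot("255.60.45.22",
                   "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                   "rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"))       # False
```

A filter like this only catches bots that are already on the list. Anything new, or anything that disguises its user-agent, passes straight through.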

Web developers can put instructions in text files for spiders to make some parts of the site off limits. This could be for performance reasons. The instructions are put on the server in a file called robots.txt. Spiders and robots are supposed to read this file at the start of each visit but there's no way to enforce that they do. Most bots ignore the file.

However, if a visitor does access robots.txt, it's most likely a spider.
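The robots.txt check is simple to sketch. The function name is made up for the example, and the (host, URL) pairs are taken from the Googlebot and Baiduspider logs in this article.

```python
def hosts_fetching_robots_txt(hits):
    """hits: iterable of (host, url) pairs.
    Returns the set of hosts that asked for robots.txt."""
    return {host for host, url in hits
            if url.split("?", 1)[0] == "/robots.txt"}

hits = [
    ("66.249.71.53", "/robots.txt"),
    ("66.249.71.53", "/contact.lml"),
    ("202.108.22.72", "/"),
]
print(hosts_fetching_robots_txt(hits))  # {'66.249.71.53'}
```

This flags only the host that made the robots.txt request. A spider that spreads its requests over several IP addresses, or never fetches the file, needs to be caught some other way.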

A visit from Google's Googlebot spider. It fetches robots.txt and has a helpful user-agent. Notice how the IP address and user-agent change.

host	date/time	requested file	user-agent
66.249.71.53	05/Mar/2006 @ 00:29:14	/robots.txt	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.53	05/Mar/2006 @ 00:29:15	/contact.lml	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.65.5	05/Mar/2006 @ 00:29:28	/projects.lml?w=0&wo=1&p=bbr-5	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.65.5	05/Mar/2006 @ 00:34:33	/other-work.lml?p=disney-1	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.71.69	05/Mar/2006 @ 00:39:31	/projects.lml?w=0&wo=1&p=oup-6	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.64.42	05/Mar/2006 @ 01:06:50	/other-work.lml?p=hp-1	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.32	05/Mar/2006 @ 01:20:26	/projects.lml?p=crash-1	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.65.5	05/Mar/2006 @ 01:56:53	/	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.64.30	05/Mar/2006 @ 02:18:11	/projects.lml?p=whm-1	Googlebot/2.1 (+http://www.google.com/bot.html)
66.249.71.45	05/Mar/2006 @ 02:19:14	/projects.lml?p=bbr-4	Googlebot/2.1 (+http://www.google.com/bot.html)

A visit from Baiduspider. It usefully identifies itself in the user-agent. It accesses our home page but doesn't fetch robots.txt.

host	date/time	requested file	user-agent
202.108.22.72	05/Mar/2006 @ 03:30:17	/	Baiduspider+(+http://www.baidu.com/search/spider.htm)
202.108.22.72	05/Mar/2006 @ 07:32:58	/	Baiduspider+(+http://www.baidu.com/search/spider.htm)
202.108.22.72	05/Mar/2006 @ 12:10:02	/	Baiduspider+(+http://www.baidu.com/search/spider.htm)

There are long gaps between visits. If Baiduspider hasn't been added to your web site stat program's filter file for spiders (assuming one exists), then this spider's visits will show up as regular one-page visits in your web site stats.

Find a spider by how it crawls

Next is a suspect series of visits. The log lines appear together in an uninterrupted block in the log file. The log lines of human visitors are usually interleaved as people take longer between requests compared to spiders.

The accesses shown below are from different IP addresses but they all refer to the same session id. The session id is a unique visitor id we add to the URL if we can't put it in a cookie. The IP addresses are from different countries.

A suspect visit: same session id but each IP address is in a different country. Stylesheets and images are only fetched for one page.

host	date/time	requested file	gap
213.61.13.68	5/Mar/2006 @ 04:37:56	/?ID=X46ZB0H3N5C00B4
213.61.13.68	5/Mar/2006 @ 04:37:57	/whatsnew.lml?ID=X46ZB0H3N5C00B4	1s
213.61.13.68	5/Mar/2006 @ 04:37:58	/projects.lml?ID=X46ZB0H3N5C00B4	1s
213.61.13.68	5/Mar/2006 @ 04:38:01	/journal/?ID=X46ZB0H3N5C00B4	3s
221.45.136.41	5/Mar/2006 @ 04:38:06	/description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4	5s
196.40.26.246	5/Mar/2006 @ 04:38:11	/contents.lml?ID=X46ZB0H3N5C00B4	5s
192.138.77.36	5/Mar/2006 @ 04:38:12	/preferences.lml?ID=X46ZB0H3N5C00B4	1s
192.138.77.36	5/Mar/2006 @ 04:38:13	/ico/pref-favicon.ico
192.138.77.36	5/Mar/2006 @ 04:38:13	/css/screen-nav.css
192.138.77.36	5/Mar/2006 @ 04:38:13	/css/screen-site.css
192.138.77.36	5/Mar/2006 @ 04:38:13	/css/print.css
192.138.77.36	5/Mar/2006 @ 04:38:13	/p.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/l-limov.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/css/screen.css
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-os-01.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/site-offsite.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-os-2.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-os-1.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-fs-s.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-lh-s.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-fs-l.gif
192.138.77.36	5/Mar/2006 @ 04:38:13	/images/p-lh-t.gif
195.113.171.76	5/Mar/2006 @ 04:38:15	/colour/tips.lml?ID=X46ZB0H3N5C00B4	3s
203.148.194.131	5/Mar/2006 @ 04:38:28	/contact.lml?ID=X46ZB0H3N5C00B4	13s
211.106.21.155	5/Mar/2006 @ 04:39:20	/description.lml?sm=1&w=1&ID=X46ZB0H3N5C00B4	52s
216.41.76.34	5/Mar/2006 @ 04:39:40	/projects.lml?w=0&p=oup-7&ID=X46ZB0H3N5C00B4	20s
219.24.170.3	5/Mar/2006 @ 04:39:49	/projects.lml?wm=1&p=oup-7&ID=X46ZB0H3N5C00B4	9s
220.84.214.190	5/Mar/2006 @ 04:39:59	/about.lml?ID=X46ZB0H3N5C00B4	10s

Is it a coincidence that different people in different countries happened to visit this web site using the same session id in the URL within a few seconds of each other, each only fetching web pages and not the images and style sheets?

I'd say this was a spider. There is nothing in the host or user-agent information which allows us to recognise it. Only its odd behaviour gives it away.

To filter out this visitor from my custom stats in future, I have to block it by the session id and/or all the IP addresses it used.
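A check for this kind of behaviour might look like the sketch below. It assumes the session id travels in an ID query parameter, as it does on this site, and the function names are made up for the example. The hits are a few of the log lines from the suspect visit above.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

def session_id_of(url):
    """Extract the ID query parameter, or None if the URL has none."""
    values = parse_qs(urlsplit(url).query).get("ID")
    return values[0] if values else None

def sessions_with_many_hosts(hits, max_hosts=1):
    """Return {session id: hosts} for ids seen from more than max_hosts hosts."""
    hosts_by_session = defaultdict(set)
    for host, url in hits:
        sid = session_id_of(url)
        if sid:
            hosts_by_session[sid].add(host)
    return {sid: hosts for sid, hosts in hosts_by_session.items()
            if len(hosts) > max_hosts}

hits = [
    ("213.61.13.68", "/?ID=X46ZB0H3N5C00B4"),
    ("221.45.136.41", "/description.lml?sm=1&w=8&ID=X46ZB0H3N5C00B4"),
    ("196.40.26.246", "/contents.lml?ID=X46ZB0H3N5C00B4"),
    ("192.138.77.36", "/css/print.css"),
]
suspects = sessions_with_many_hosts(hits)
print(sorted(suspects))                  # ['X46ZB0H3N5C00B4']
print(len(suspects["X46ZB0H3N5C00B4"]))  # 3
```

This only produces candidates to look at by hand. Human visitors behind proxies can also show up from several addresses within one session.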

How to track visits

In addition to using algorithms to process standard server logs, people can develop custom logs with extra information. They track visitors by putting a generated unique session id in the URL or writing it to a cookie. The id is read back at every request so that the request can be logged against the session id.

If you don't use session ids, you can make some guesses on which requests are from the same visitor by looking at the server logs.

The same host IP address in a short period
The referring page is from another web site
The user-agent looks like a browser

However, any of these might change within a visit.

What can happen within a visit

This is what I've found from reviewing server logs regularly.

Spiders use cookies
The IP address changes within a visit
There is no referring URL
The referring site reappears within a visit
A different referring site can appear within a visit
A long gap doesn't necessarily mean a different visitor
The user-agent can change within a visit
Cookies aren't 100% reliable
The user-agent isn't reliable
Some robots pretend to be human
Most robots spoof MSIE

Spiders use cookies

Spiders can accept cookies and send them back to a site on subsequent visits.

Actual accesses from the same IP address. We know this is a repeat visitor because a cookie holding the visit count was being maintained.

visit count	IP	date/time	requested file	user-agent	referring page
1	209.167.50.22	21-Oct-2005 @ 15:19:34	/	LinkWalker	www.emlc.org.uk/Links.htm
2	209.167.50.22	24-Oct-2005 @ 12:22:36	/	LinkWalker	www.emlc.org.uk/Links.htm
3	209.167.50.22	25-Oct-2005 @ 16:09:39	/	LinkWalker	www.emlc.org.uk/Links.htm
1	209.167.50.22	26-Oct-2005 @ 14:05:53	/	LinkWalker	www.emlc.org.uk/Links.htm
	209.167.50.22	27-Oct-2005 @ 15:19:10	/	LinkWalker	www.emlc.org.uk/Links.htm
1	209.167.50.22	28-Oct-2005 @ 14:57:17	/	LinkWalker	www.emlc.org.uk/Links.htm
2	209.167.50.22	31-Oct-2005 @ 12:26:23	/	LinkWalker	www.emlc.org.uk/Links.htm
3	209.167.50.22	01-Nov-2005 @ 12:57:48	/	LinkWalker	www.emlc.org.uk/Links.htm
1	209.167.50.22	02-Nov-2005 @ 14:34:34	/	LinkWalker	www.emlc.org.uk/Links.htm

The IP address changes within a visit

When examining raw logs, it is common to see a single visit in which each page access is from a different host. This is how visitors appear in logs when their connection is via a caching proxy.

Here is a visitor to the Colour Selector who made 17 page requests from ten different hosts during a single three-minute visit. Most log analysers will interpret these page requests as ten separate visitors.

Log lines from a visitor showing up from a variety of hosts

host address	date/time	requested file	referring page
anchovy.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:35:56	/colour/colour.html	-
mozzarella.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:36:17	/colour/216.html	/colour/colour.html
ham.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:36:46	/colour/216/33ccff.html	/colour/216.html
anchovy.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:37:24	/colour/216/3399ff.html	/colour/216.html
fides.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:37:30	/colour/216/33ffff.html	/colour/216.html
pineapple.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:37:35	/colour/216/66ffff.html	/colour/216.html
thyme.cant.ac.uk	10-Aug-2002 @ 14:37:40	/colour/216/66ccff.html	/colour/216.html
basil.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:37:45	/colour/216/6699ff.html	/colour/216.html
tomato.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:37:49	/colour/216/0099ff.html	/colour/216.html
anchovy.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:04	/colour/216/ffccff.html	/colour/216.html
mozzarella.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:12	/colour/216/ffcc33.html	/colour/216.html
anchovy.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:17	/colour/216/6600ff.html	/colour/216.html
thyme.cant.ac.uk	10-Aug-2002 @ 14:38:27	/colour/216/ccffcc.html	/colour/216.html
mozzarella.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:33	/colour/216/ccff66.html	/colour/216.html
tomato.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:37	/colour/216/ccffff.html	/colour/216.html
jalapeno.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:38:52	/colour/216/ffffff.html	/colour/216.html
oregano.ulcc.wwwcache.ja.net	10-Aug-2002 @ 14:39:03	/colour/216bg.html	/colour/216.html
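To see how a host-based count goes wrong, here is a small sketch that counts the distinct hosts in the table above, which is roughly what a naive analyser would do. The variable names are mine.

```python
# Host of each of the 17 page requests in the visit above, in order.
ULCC = ".ulcc.wwwcache.ja.net"
hosts = [
    "anchovy" + ULCC, "mozzarella" + ULCC, "ham" + ULCC, "anchovy" + ULCC,
    "fides" + ULCC, "pineapple" + ULCC, "thyme.cant.ac.uk", "basil" + ULCC,
    "tomato" + ULCC, "anchovy" + ULCC, "mozzarella" + ULCC, "anchovy" + ULCC,
    "thyme.cant.ac.uk", "mozzarella" + ULCC, "tomato" + ULCC,
    "jalapeno" + ULCC, "oregano" + ULCC,
]

page_requests = len(hosts)
visitors_by_host = len(set(hosts))  # one "visitor" per distinct host

print(page_requests, visitors_by_host)  # 17 10
```

One person, seventeen pages, ten "visitors".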

This second example is from an AOL user. Four images were viewed during a visit, each from a different host (and IP) address.

A visitor where the host address is different for every access. The source log lines of this visit were logged on the same day.

host address	time	requested file	user-agent
cache-mtc-aa09.proxy.aol.com	00:24:13	/workshops/14th/lindsay.jpg	Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-ak07.proxy.aol.com	00:24:38	/workshops/14th/frank-lindsay.jpg	Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-am07.proxy.aol.com	00:25:27	/workshops/14th/rosa2-2002-06-18.jpg	Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)
cache-mtc-ak03.proxy.aol.com	00:25:38	/workshops/14th/lindsay-size.jpg	Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1; .NET CLR 1.0.3705)

And, because of caching proxies or people who use services such as AOL, different visitors can look like the same visitor if you look only at their host address.

The referring URL is optional

The referring URL is information that is sometimes made available to the web server. It is the URL of the page which included a link to your web site which was followed by the visitor (for example, a page on your site might get listed in a search engine).

For the first page request of a visit, the referring URL will tell you the address of the page on another web site from where a link was followed. For subsequent pages, referrers will be pages within the site.

People use Back a lot - the original referrer can reappear

It isn't always the case that the referring site will only appear as the referrer for the first page in a visit because people use the browser's Back button a lot.

You might expect log lines for a visit to follow the trend of each page being the referrer of the next page accessed. The log lines of a made-up visit are shown next.

Example log of a possible visit, where each page is the referrer of the next page requested

time	requested file	referring page
12:44:05	/colour/	http://www.site.com/links.html
12:47:00	/colour/tips.lml	/colour/
12:47:16	/colour/colour.lml	/colour/tips.lml
12:47:29	/colour/browse-palettes.lml	/colour/colour.lml
12:50:05	/library/	/colour/browse-palettes.lml
12:50:12	/projects.lml	/library/
12:50:24	/colour/	/projects.lml
12:50:32	/services.lml	/colour/
12:50:45	/colour/colour.lml	/services.lml
12:51:10	/colour/tools.lml	/colour/colour.lml

However, when you look at your web server logs, you will see that people manage to get to pages from a page which doesn't include any links to the new page.

That is because they got to the new page from a link on an earlier page that was cached. When a visitor uses their browser to go back to a cached earlier page, the page isn't logged at the web server; the redisplay of a cached page means the browser gets the page from cache and not from the server.

A visit in which the referring page can be a page earlier than the last page requested. The source log lines for this visit were logged on the same date, with the same session id and user-agent.

time	requested file	referring page
12:44:05	/colour/	http://www.web-graphics.com/feature-002.php
12:47:00	/colour/tips.lml	/colour/
12:47:16	/colour/colour.lml	/colour/
12:47:29	/colour/browse-palettes.lml	/colour/
12:50:05	/library/	/colour/browse-palettes.lml
12:50:12	/projects.lml	/colour/colour.lml
12:50:24	/colour/	/library/
12:50:32	/services.lml	/projects.lml
12:50:45	/colour/colour.lml	/colour/
12:51:10	/colour/tools.lml	/colour/colour.lml

Given this, it is possible that someone can reach your site from a link on another site, explore your site for a bit but then return to the entry page through the browser's Back button.

In this event, the first page request in the visit would have a referrer of another web site. Subsequent pages would have internal referrers but then the original outside referrer would reappear in the logs.

A visit in which the referring site reappears as the referrer for a later page. This was logged on the same day, with the same session id and user-agent.

time	requested file	referring page
14:57:15	/colour/	http://uk.google.yahoo.com/bin/query_uk?p=216+colours
15:06:55	/library/	/colour/
15:07:06	/projects.lml	/library/
15:07:50	/services.lml	/projects.lml
15:07:58	/colour/	http://uk.google.yahoo.com/bin/query_uk?p=216+colours
15:08:14	/colour/tips.lml	/colour/

Another scenario that explains such a visit pattern is when a web site is accessed through multiple browsers. The choice of earlier and later pages on screen to interact with will contribute to the lack of a coherent path through the site in the logs.

In addition, anecdotal evidence suggests that people with screen resolutions higher than 800x600 browse with multiple windows. In the case of a site like this, where the session id is carried around in the URL, this behaviour becomes apparent when the log includes visits from the same referrer and from the same host address and user-agent.

The source log lines of this visit were logged on the same day and with the same user-agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)

session id	time	page requested	referring page
7D9JHV75DE9H9G6	15:06:48	/inetuk/notice.lml	-
GEL8G594EHL5SH4	15:07:06	/inetuk/notice.lml	-
	15:07:55	/ico/hide1-favicon.ico	-
GEL8G594EHL5SH4	15:08:05	/inetuk/links.lml	/inetuk/notice.lml
GEL8G594EHL5SH4	15:08:21	/services.lml	/inetuk/links.lml
7D9JHV75DE9H9G6	15:08:28	/inetuk/about.lml	/inetuk/notice.lml
GEL8G594EHL5SH4	15:08:48	/projects.lml	/services.lml
7D9JHV75DE9H9G6	15:11:04	/projects.lml	/inetuk/about.lml
	15:11:14	/ico/port-favicon.ico	-
GEL8G594EHL5SH4	15:11:43	/projects.lml	/services.lml
GEL8G594EHL5SH4	15:12:17	/projects.lml?s=t	/projects.lml
7D9JHV75DE9H9G6	15:12:22	/inetuk/about.lml	/inetuk/notice.lml
7D9JHV75DE9H9G6	15:14:00	/inetuk/notice.lml	/inetuk/about.lml

A different referring site can appear within a visit

It is possible for a different referring site to appear during a visit.

The next visit is of a visitor to this site via Google but with a different referring site for the sixth page request.

A visit with a referring site for the first page access but a different referring site later on in the visit. This was logged on the same day, with the same session id and user-agent.

time	page requested	referring page
01:17:36	/colour/	http://www.google.com/search?q=color+palettes
01:18:07	/library/guidelines.lml	/colour/
01:18:39	/library/promotion.lml	/library/guidelines.lml
01:18:43	/journal/	/library/promotion.lml
01:22:12	/library/promotion.lml	/library/guidelines.lml
01:22:27	/journal/	http://www.thestudyofdesign.com/links_magazines_l.asp

During their visit, they created a link to our site from theirs (complete with the session id in the URL) and then presumably tested the link which explains the appearance of the second outside referrer.

People take breaks - a long gap doesn't necessarily mean a different visitor

You can't rely on time between requests to decide if it's a new visit. This is because people might start something at work in the afternoon - go home without closing their browser - then come back and expect to carry on with whatever's in their browser. In this event, a gap between page requests could easily be 17 hours. It is not uncommon to see gaps of an hour or two in logs.

If accesses from the same host have a gap of more than 30 mins, WebTrends counts them as from different visitors.

A visit which includes several long gaps. This was logged on the same date, with the same session id and user-agent.

time	gap	requested file	referring page
10:41:13		/colour/	http://www.google.com/search?q=color+selector
10:41:23	0:00:10	/colour/mix.lml?c=9CF	/colour/
10:41:41	0:00:18	/colour/mix.lml?c=3CF	/colour/mix.lml?c=9CF
10:41:48	0:00:07	/colour/mix.lml?c=6FF	/colour/mix.lml?c=3CF
10:41:55	0:00:07	/colour/mix.lml?c=0FF	/colour/mix.lml?c=6FF
10:42:01	0:00:06	/colour/mix.lml?c=F93	/colour/mix.lml?c=0FF
10:42:12	0:00:11	/colour/mix.lml?c=FC6	/colour/mix.lml?c=F93
10:42:55	0:00:43	/colour/mix.lml?c=F66	/colour/mix.lml?c=FC6
11:17:28	0:34:33	/colour/mix.lml?c=F63	/colour/mix.lml?c=F66
11:18:01	0:00:33	/colour/mix.lml?c=F60	/colour/mix.lml?c=F63
50 page requests not shown - gap range: 2 secs - 11 mins (average: 1 min)
12:18:08	0:00:09	/colour/swatch.lml?c=3F9	/colour/swatch.lml?c=6F6
12:42:41	0:24:33	/colour/swatch.lml?c=3F6	/colour/swatch.lml?c=3F9
12:42:47	0:00:06	/colour/swatch.lml?c=3F3	/colour/swatch.lml?c=3F6
12:47:42	0:04:55	/colour/swatch.lml?c=3C3	/colour/swatch.lml?c=3F3
12:48:39	0:00:57	/colour/swatch.lml?c=3C6	/colour/swatch.lml?c=3C3
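The effect of a 30-minute rule on this visit can be sketched as follows. This is my own minimal version of a gap-based rule, not WebTrends' actual code, and it uses four of the request times from the table above.

```python
from datetime import datetime, timedelta

def split_into_visits(times, timeout=timedelta(minutes=30)):
    """Group sorted request times into visits, starting a new visit
    after any gap longer than the timeout."""
    visits = []
    for t in times:
        if visits and t - visits[-1][-1] <= timeout:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits

def at(clock):
    """Turn an HH:MM:SS string from the table into a datetime."""
    return datetime.strptime(clock, "%H:%M:%S")

requests = [at("10:42:12"), at("10:42:55"), at("11:17:28"), at("11:18:01")]
visits = split_into_visits(requests)
print(len(visits))  # 2
```

The gap of 34 minutes and 33 seconds is enough to turn one visitor into two, even though the session id shows it was the same person carrying on where they left off.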

The user-agent can change within a visit

It is possible for the user-agent to change within a visit. When it happens, it's usually a robot visitor but it can also happen with human visitors.

The log lines below are a 183-page visit from the same IP address. This can be recognised as a spider by the quick requests in a short space of time.

A single visit in which the user-agent changes.

host IP	date/time	requested file	user-agent
63.144.65.58	18/Apr/2001 @ 01:00:23	/inetuk/providers.html	Mozilla/4.03 [en] (Win95; I)
63.144.65.58	18/Apr/2001 @ 01:02:16	/inetuk/providers/akhter.html	Mozilla/4.03 [en] (Win95; I)
63.144.65.58	18/Apr/2001 @ 01:02:19	/inetuk/providers/agent-cd.html	Mozilla/3.01Gold (Win95; I; 16bit)
63.144.65.58	18/Apr/2001 @ 01:02:21	/inetuk/providers/andover.html	Mozilla/3.01Gold (Win95; I; 16bit)
63.144.65.58	18/Apr/2001 @ 01:02:21	/inetuk/providers/angel.html	Mozilla/2.0 (compatible; MSIE 3.02; Windows 95)
63.144.65.58	18/Apr/2001 @ 01:02:22	/inetuk/providers/aladdin.html	Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
63.144.65.58	18/Apr/2001 @ 01:02:22	/inetuk/providers/apanet.html	Mozilla/3.0 (Win16; I)
63.144.65.58	18/Apr/2001 @ 01:02:22	/inetuk/providers/amity.html	Mozilla/4.03 [en] (Win95; I)
63.144.65.58	18/Apr/2001 @ 01:02:22	/inetuk/notify.html	Mozilla/3.0 (Win16; I)
63.144.65.58	18/Apr/2001 @ 01:02:22	/inetuk/catch/	Mozilla/2.0 (compatible; MSIE 3.02; Windows 95)

A short visit with a changing user-agent, the first references MSIE

host IP	date/time	requested file	user-agent
njproxy4.avaya.com	30/Apr/2001 @ 14:34:18	/colour/navigate.lml	Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; DigExt; WebSite-Watcher (unreg.) http://aignes.net)
njproxy4.avaya.com	30/Apr/2001 @ 14:36:48	/colour/navigate.lml	Mozilla/3.01 (compatible;)

Log lines from a 324-hit visit, all from agent.lisco.com. The user-agent changes within the visit.

hit #	date/time	requested file	user-agent
1	29/May/2001 @ 14:59:58	/innovations/images/b-home.gif	Mozilla/3.01 (compatible;)
4	29/May/2001 @ 14:59:58	/innovations/library/requirements.html	Mozilla/4.77 [en] (Win95; U)
5	29/May/2001 @ 14:59:58	/innovations/images/d-structure.gif	Mozilla/3.01 (compatible;)
9	29/May/2001 @ 15:00:02	/innovations/innovate.css	Mozilla/4.77 [en] (Win95; U)
10	29/May/2001 @ 15:00:02	/innovations/images/g-libr.jpg	Mozilla/3.01 (compatible;)
14	29/May/2001 @ 15:43:55	/innovations/favicon.ico	Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
15	29/May/2001 @ 16:12:41	/~paola/	Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
16	29/May/2001 @ 16:12:41	/~paola/pictures/icons/paola.jpg	Mozilla/3.01 (compatible;)
23	29/May/2001 @ 16:12:41	/~paola/paola.css	Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
24	29/May/2001 @ 16:12:42	/~paola/pictures/icons/contents.gif	Mozilla/3.01 (compatible;)

Users can change the user-agent sent to web servers in browsers (e.g., Internet Explorer, Mozilla, Opera, Konqueror and Lynx) and spiders.

If these visitors aren't known to stats programs, their visits will be counted in browser stats.

Cookies can be enabled, disabled or removed

On average, 10% of human visitors to this web site can't or won't accept cookies.

With a cookie-enabled browser, web users have control of how cookies are used.

They can accept all cookies
They can only accept cookies from certain domains
They can only accept certain cookies from certain domains
They can reject third-party cookies, those from a domain different to the current site
They can remove one or more cookies
They can edit the cookie contents

Regardless of what browser is used, it is always possible to remove cookies within a visit. Cookies may be removed deliberately, perhaps in a big clearout, or become corrupted or lost during a disk crash.

One can't assume that a) cookies get written or that b) they'll remain on visitors' computers.

StatMarket's HitBox counts visitors by use of third-party cookies. People can easily configure their browsers to reject third-party cookies - those not originating from the web site they're visiting. If someone visits a site with a HitBox counter and their browser rejects the cookie, HitBox will count every page request as a new visit. HitBox over-counts visits.

Example: imagine a site that had 2,000 actual visits in one day with three requests on average. The true visit count is 2,000. If our 10% non-cookie figure is typical, HitBox would correctly count 1,800 (90% of 2,000) of the visits. However, it would process the 600 (10% of 2,000 visits x 3 pages) page requests as visits, producing an incorrect total of 2,400 visits.
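The arithmetic in this example can be written out as a short calculation. The function and parameter names are made up for the illustration, and the model is the simplified one described above: every page request without a cookie becomes its own visit.

```python
def cookie_counter_visits(actual_visits, pages_per_visit, no_cookie_percent):
    """Visits reported by a counter that treats every cookieless
    page request as a new visit."""
    no_cookie_visits = actual_visits * no_cookie_percent // 100
    counted_correctly = actual_visits - no_cookie_visits
    miscounted = no_cookie_visits * pages_per_visit
    return counted_correctly + miscounted

reported = cookie_counter_visits(2000, 3, 10)
print(reported)         # 2400
print(reported - 2000)  # 400 visits that never happened
```

With these figures the reported total is 20% higher than the true one, and the error grows with the number of pages per visit.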

The user-agent isn't reliable

A lot of web sites are optimised for Internet Explorer because it's easier for developers to ignore other browsers; until a couple of years ago, the Marks and Spencer web site turned away Mozilla users, telling them to get a better browser.

To get around this problem, modern browsers let users set the user-agent to something else. This is usually MSIE, since so many sites are optimised for IE.

Some robots pass themselves off as humans

Sometimes the only indication that a visitor is a robot is the time between accesses. Our sites have repeat visitors that accept cookies and have a user-agent that looks like a normal browser.

What gives them away as robots are:

They visit at regular intervals and access the same pages, such as all the links on the home page
They access all the links on a web page and in the order they appear
They access 5-10 pages within a second

The last behaviour is how they can be spotted in the logs as their log lines will appear in clumps.
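Clumps like this can be looked for automatically. Below is a rough sketch that flags a visitor when five or more page requests fall within a one-second window. The threshold, the window and the function names are assumptions for the example. The first set of times comes from the 63.144.65.58 visit shown earlier and the second from one of the human visits.

```python
from datetime import datetime, timedelta

def has_burst(times, min_requests=5, window=timedelta(seconds=1)):
    """True if at least min_requests page requests fall within any window."""
    times = sorted(times)
    start = 0
    for end, t in enumerate(times):
        # Slide the start of the window forward until it fits.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= min_requests:
            return True
    return False

def at(clock):
    """Turn an HH:MM:SS string into a datetime."""
    return datetime.strptime(clock, "%H:%M:%S")

spider = [at("01:02:21"), at("01:02:21"), at("01:02:22"), at("01:02:22"),
          at("01:02:22"), at("01:02:22"), at("01:02:22")]
human = [at("12:47:00"), at("12:47:16"), at("12:47:29")]

print(has_burst(spider))  # True
print(has_burst(human))   # False
```

Only page requests should be fed to a check like this. A browser fetching the images and style sheets for a single page also produces many hits within a second.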

Most robots spoof MSIE

Because so many sites are optimised for Microsoft Internet Explorer (MSIE), bots send an MSIE user-agent. If they aren't detected as bots, they'll over-represent the proportion of IE users, misleading the site's developers into thinking they made the right decision to turn away other browsers.

Below are log lines from a single IP address to this site. It's a robot specifically designed to show in the logs with certain referring sites. This has become a trend in robots once blogs, for example, started publishing trackback links to referring sites. These bots are basically getting other sites to publish links to their sites.

All the accesses came from 166-82-31-14.quickclick.ctc.net with the user-agent Mozilla/4.0 (compatible; MSIE 5.01; Windows 98). I've edited the referrers.

date/time	requested file	referring page
30/Dec/2005 @ 18:20:50	/journal/?ID=H64SBVK4NKN00F8&jm=1&e=1061	http://www.adsense-xpress.falling.net/forex777.htm
30/Dec/2005 @ 18:20:51	/journal/?ID=XQKSBV74NKM00DR&jm=1&e=1061	http://www.adsense-xpress.falling.net/swapclix.htm
24/Jan/2006 @ 12:54:45	/inetuk/interop96.lml	http://www.tvinfomercials.com/
24/Jan/2006 @ 12:54:45	/inetuk/interop96.lml	http://www.7dayplan.war-q.com
17/Feb/2006 @ 06:30:53	/inetuk/interop96.lml	http://www.bugtraininginfo.com/
20/Feb/2006 @ 11:51:05	/inetuk/interop96.lml	http://www.phoneconferences247.com/
20/Feb/2006 @ 11:51:06	/inetuk/interop96.lml	http://www.bugtraininginfo.com
26/Feb/2006 @ 06:32:18	/inetuk/interop96.lml	http://www.catcast2006.com/
03/Mar/2006 @ 14:21:49	/inetuk/interop96.lml	http://www.war-q.com
03/Mar/2006 @ 14:21:51	/inetuk/interop96.lml	http://www.200-free-4resale-products.numbers.com

Don't rely on stats from log analysers

I've spent more time than I care to admit looking at server logs and discovered unexpected behaviour by both human and spider visitors.

Because unique visitors can't be accurately detected, some browsers end up being over- or under-counted by stats.

There are still more complications. For example, many hit counters collect stats via accesses to a GIF placed on your web pages. Text-only browsers and screen-readers don't access the GIF and so are never included in the browser stats. This means that some disabled visitors are not included at all in browser stats.

I've concluded that you mustn't believe your web stats if they're based on log analysis - they'll tend to tell you good news when the reality is likely to be discouraging.