I, robot? Don't believe your web stats
Log analysers are not accurate. They over-report visits and over-count some browsers while under-counting others. They cannot accurately distinguish spiders and robots from human visitors, and they do not use fool-proof techniques for counting visits and visitors.

Introduction
Spiders and robots are programs which are sent out to web sites to index them, check links or fetch content. Your web browser is also a program: spiders and robots are essentially no different to a web browser in terms of what they can do. And, as it happens, some browsers can be set up to send out link-checking robots. There are thousands of spiders and robots visiting web sites. We've been compiling a database of them for the past 10 years. We use it to filter out spiders and robots from our own site stats. A staggering 90% of page requests to this web site (www.limov.com), for example, are by spiders and robots. There are a handful of legitimate spiders but many are programs created to harvest e-mail addresses, copy content or look for vulnerabilities in your web server. It is in the originator's interests that their programs look like regular visitors so that they gain full access to your site. I use examples from actual logs in this article, mostly from this web site.

The difference between spiders and robots
Spiders visit web sites and follow links on a page, normally to collect content so that it can be indexed by search engines. They can also collect specific content such as e-mail addresses, images and PDFs. Robot is a term I use to describe programs which make single-hit visits, often hitting the same page at regular intervals. Robots don't follow links. You don't need to be a computer expert to have your own spider. Source code is freely available online. I will use 'bot' for the remainder of this article to refer to both spiders and robots.
A brief introduction to server logs
In this section, I'll be covering server log structure, how bots are supposed to identify themselves and how you can find rogue bots in logs. If you know all this, skip to the next section.

What server logs look like
You need to know the difference between requests and hits to be able to interpret web logs and stats. Requests refer to pages. Most web pages include a mixture of text and images. The images are included in the page as links to files on the server. If a web page includes 10 graphics, accessing the page will result in 1 request and 11 hits (one for the page itself and one for each graphic), with one log line per hit. Every hit made to a web site is logged. A hit can vary from finding out whether a page or file has been updated to fetching web pages, style sheets, images and other files, such as PDFs. Failed requests are also logged, for example when a page has been removed or when it's password-protected. Failed hits also include hacking attempts to exploit vulnerabilities in (mostly) Microsoft Windows servers. This information is logged for each hit:
'Agent' is a term used for tools sent out to act on your behalf. Browsers and bots are agents. Here are actual log lines resulting from a visitor displaying one web page, the Colour Selector entrance page (I've changed the IP address): |
|
|
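The distinction between hits and requests can be sketched in a few lines of Python. This is only an illustration: it assumes the server writes the common Apache/NCSA Combined Log Format, the list of "asset" extensions is a hypothetical rule of thumb, and the sample lines are made up rather than taken from this site's logs.

```python
import re
from collections import Counter

# Assumption: Apache/NCSA Combined Log Format. Adjust the pattern for other layouts.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical rule of thumb: files with these extensions support a page
# rather than being a page themselves.
ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def count_hits_and_requests(lines):
    """Count every parsed line as a hit and every non-asset path as a page request."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that do not fit the assumed format
        counts["hits"] += 1
        path = entry["path"].split("?", 1)[0].lower()
        if not path.endswith(ASSET_EXTENSIONS):
            counts["requests"] += 1
    return counts


# Made-up sample: one page and the two images it links to.
AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
PAGE = "http://www.example.com/colour/index.html"
sample = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET /colour/index.html HTTP/1.1" 200 2326 "-" "%s"' % AGENT,
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET /images/logo.gif HTTP/1.1" 200 512 "%s" "%s"' % (PAGE, AGENT),
    '192.0.2.1 - - [10/Oct/2005:13:55:37 +0000] "GET /images/banner.jpg HTTP/1.1" 200 4096 "%s" "%s"' % (PAGE, AGENT),
]

counts = count_hits_and_requests(sample)
print(counts["hits"], "hits,", counts["requests"], "request(s)")
```

Classifying by file extension is crude (a PDF fetched directly is arguably a request in its own right), so treat the extension list as something to tune for your own site.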
How to find spiders and robots in logs
Spiders and robots usually identify themselves in the user-agent. However, there is no standard text that they're supposed to add to the user-agent (such as "I'm a spider") so that they can be found. This means that programs which process logs to produce stats have no automatic way to detect them. A list of spider and robot user-agents and IP addresses must be maintained on an on-going basis so that these visitors are not included in your regular web site stats. This does not happen automatically. Web developers can put instructions in text files for spiders to make some parts of the site off limits. This could be for performance reasons. The instructions are put on the server in a file called robots.txt. Spiders and robots are supposed to read this file at the start of each visit but there's no way to enforce that they do. Most bots ignore the file. However, if a visitor does access robots.txt, it's most likely a spider. |
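The two clues described above (a hand-maintained user-agent list and requests for robots.txt) could be combined roughly as follows. This is a sketch under assumptions: the fragment list, the function names and the sample entries are all hypothetical, and entries are assumed to have already been reduced to (host, path, user-agent) tuples.

```python
# Hypothetical, hand-maintained list of lower-case user-agent fragments.
# In practice this list has to be updated on an on-going basis.
KNOWN_BOT_FRAGMENTS = ("googlebot", "baiduspider")


def is_known_bot(user_agent):
    """True if the user-agent contains any fragment from the maintained list."""
    agent = user_agent.lower()
    return any(fragment in agent for fragment in KNOWN_BOT_FRAGMENTS)


def flag_bot_hits(entries):
    """Return the entries that look like bot traffic.

    An entry is flagged if its user-agent is on the list, or if its host
    requested /robots.txt anywhere in the batch being processed.
    """
    entries = list(entries)
    robots_hosts = {host for host, path, _agent in entries if path == "/robots.txt"}
    return [e for e in entries if is_known_bot(e[2]) or e[0] in robots_hosts]


# Made-up sample entries: (host, path, user_agent).
entries = [
    ("192.0.2.10", "/robots.txt", "ExampleCrawler/1.0"),
    ("192.0.2.10", "/index.html", "ExampleCrawler/1.0"),
    ("192.0.2.20", "/", "Baiduspider"),
    ("192.0.2.30", "/index.html", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
]

flagged = flag_bot_hits(entries)
for host, path, agent in flagged:
    print(host, path, agent)
```

Matching on the host alone will miss a spider whose IP address changes between requests, so this catches only the co-operative cases; a bot that sends a browser-like user-agent and never fetches robots.txt passes straight through.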
|
A visit from Google's Googlebot spider. It fetches robots.txt and has a helpful user-agent. Notice how the IP address and user-agent change.
A visit from Baiduspider. It usefully identifies itself in the user-agent. It accesses our home page but doesn't fetch robots.txt.
|
|
There are long gaps between visits. If Baiduspider hasn't been added to your web site stats program's filter file for spiders (assuming one exists), then this spider's visits will show up as regular one-page visits in your web site stats.

Find a spider by how it crawls
Next is a suspect series of visits. The log lines appear together in an uninterrupted block in the log file. The log lines of human visitors are usually interleaved with those of other visitors, as people take longer between requests than spiders do. The accesses shown are from different IP addresses but they all refer to the same session id. The session id is a unique visitor id we add to the URL if we can't put it in a cookie. The IP addresses are from different countries. |
|
A suspect visit: same session id but each IP address is in a different country. Stylesheets and images are only fetched for one page.
|
|
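A pattern like the one above (one session id arriving from several IP addresses) can be surfaced with a simple grouping. The sketch below assumes a custom log that records the session id alongside the IP address; the function name, the threshold and the sample data are hypothetical.

```python
from collections import defaultdict


def sessions_with_many_ips(entries, max_ips=1):
    """Map each session id seen from more than `max_ips` addresses to those addresses.

    entries: iterable of (ip_address, session_id) pairs from a custom log.
    """
    ips_by_session = defaultdict(set)
    for ip, session_id in entries:
        ips_by_session[session_id].add(ip)
    return {sid: ips for sid, ips in ips_by_session.items() if len(ips) > max_ips}


# Made-up sample: session "abc123" turns up from three addresses,
# session "zzz999" from only one.
entries = [
    ("203.0.113.5", "abc123"),
    ("198.51.100.7", "abc123"),
    ("192.0.2.9", "abc123"),
    ("192.0.2.44", "zzz999"),
    ("192.0.2.44", "zzz999"),
]

suspects = sessions_with_many_ips(entries)
for session_id, ips in suspects.items():
    print(session_id, sorted(ips))
```

A flagged session is a candidate for review rather than proof of a spider: as this article goes on to show, human visitors behind proxies can also change IP address within a visit, so the other clues (timing, missing image and style sheet fetches, country of origin) still have to be checked by eye.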
Is it a coincidence that different people in different countries happened to visit this web site using the same session id in the URL within a few seconds of each other, each only fetching web pages and not the images and style sheets? I'd say this was a spider. There is nothing in the host or user-agent information which allows us to recognise it. Only its odd behaviour gives it away. To filter out this visitor from my custom stats in future, I have to block it by the session id and/or all the IP addresses it used.

How to track visits
In addition to using algorithms to process standard server logs, people can develop custom logs with extra information. They track visitors by putting a generated unique session id in the URL or writing it to a cookie. The id is read back at every request so that the request can be logged against the session id. If you don't use session ids, you can make some guesses about which requests are from the same visitor by looking at the server logs.
However, any of these might change within a visit.

What can happen within a visit
This is what I've found from reviewing server logs regularly.
Spiders use cookies
Spiders can accept cookies and allow them to be reread by a site on subsequent visits. |
|
Actual accesses from a single IP address, recognisable as a repeat visitor because a cookie holding a visit count was being maintained.
|
The IP address changes within a visit
When examining raw logs, it is common to see a single visit in which each page access is from a different host. This is how visitors appear in logs when their connection is via a caching proxy. Here is a visitor to the Colour Selector who made 17 page requests from ten different hosts during a single three-minute visit. Most log analysers will interpret these page requests as ten separate visitors. |
|
Log lines from a visitor showing up from a variety of hosts
|
|
This second example is from an AOL user. Four images were viewed during a visit, each from a different host (and IP) address. |
|
A visitor where the host address is different for every access. The source log lines of this visit were logged on the same day.
|
|
And, because of caching proxies or people who use services such as AOL, different visitors can look like the same visitor if you look only at their host address.

The referring URL is optional
The referring URL is information that is sometimes made available to the web server. It is the URL of the page containing the link that the visitor followed to reach your web site (for example, a search engine results page on which a page from your site is listed). For the first page request of a visit, the referring URL will tell you the address of the page on another web site from where a link was followed. For subsequent pages, referrers will be pages within the site.

People use Back a lot - the original referrer can reappear
It isn't always the case that the referring site will only appear as the referrer for the first page in a visit because people use the browser's Back button a lot. You might expect log lines for a visit to follow the trend of each page being the referrer of the next page accessed. The log lines of a made-up visit are shown next. |
|
Example log of a possible visit, where each page is the referrer of the next page requested
|
|
However, when you look at your web server logs, you will see that people manage to get to pages from a page which doesn't include any links to the new page. That is because they got to the new page from a link on an earlier page that was cached. When a visitor uses their browser to go back to a cached earlier page, the page isn't logged at the web server; the redisplay of a cached page means the browser gets the page from cache and not from the server. |
|
A visit in which the referring page can be a page earlier than the last page requested. The source log lines for this visit were logged on the same date, with the same session id and user-agent.
|
|
Given this, it is possible that someone can reach your site from a link on another site, explore your site for a bit but then return to the entry page through the browser's Back button. In this event, the first page request in the visit would have a referrer of another web site. Subsequent pages would have internal referrers but then the original outside referrer would reappear in the logs. |
|
A visit in which the referring site reappears as the referrer for a later page. This was logged on the same day, with the same session id and user-agent.
|
|
Another scenario that explains such a visit pattern is when a web site is accessed through multiple browser windows. Switching between earlier and later pages on screen contributes to the lack of a coherent path through the site in the logs. In addition, anecdotal evidence suggests that people with screen resolutions higher than 800x600 browse with multiple windows. In the case of a site like this, where the session id is carried around in the URL, this behaviour becomes apparent when the log includes visits from the same referrer and from the same host address and user-agent. |
|
The source log lines of this visit were logged on the same day and with the same user-agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
|
A different referring site can appear within a visit
It is possible for a different referring site to appear during a visit. The next visit is of a visitor to this site via Google but with a different referring site for the sixth page request. |
|
A visit with a referring site for the first page access but a different referring site later on in the visit. This was logged on the same day, with the same session id and user-agent.
|
|
During their visit, they created a link to our site from theirs (complete with the session id in the URL) and then presumably tested the link, which explains the appearance of the second outside referrer.

People take breaks - a long gap doesn't necessarily mean a different visitor
You can't rely on time between requests to decide if it's a new visit. This is because people might start something at work in the afternoon, go home without closing their browser, then come back and expect to carry on with whatever's in their browser. In this event, a gap between page requests could easily be 17 hours. It is not uncommon to see gaps of an hour or two in logs. If accesses from the same host have a gap of more than 30 minutes, WebTrends counts them as coming from different visitors. |
|
A visit which includes several long gaps. This was logged on the same date, with the same session id and user-agent.
|
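To see how a time-out rule splits a visit like this, here is a minimal sketch of gap-based visit counting. It is a generic illustration using the 30-minute figure quoted above, not WebTrends' actual code; the function name and the timestamps are made up.

```python
from datetime import datetime, timedelta


def split_into_visits(timestamps, timeout=timedelta(minutes=30)):
    """Group request times into visits, starting a new visit after any gap longer than `timeout`."""
    visits = []
    for ts in sorted(timestamps):
        if visits and ts - visits[-1][-1] <= timeout:
            visits[-1].append(ts)  # close enough to the previous request: same visit
        else:
            visits.append([ts])  # first request, or the gap was too long: new visit
    return visits


# Made-up requests from one person: two before going home,
# two the next morning after a 17-hour break.
requests = [
    datetime(2005, 10, 10, 16, 50),
    datetime(2005, 10, 10, 17, 0),
    datetime(2005, 10, 11, 10, 0),
    datetime(2005, 10, 11, 10, 5),
]

visits = split_into_visits(requests)
print(len(visits), "visits counted for one actual visitor")
```

With a session id in the URL or a cookie, all four requests can be tied to the same visitor; with the time-out rule alone, the visitor is counted twice.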
The user-agent can change within a visit
It is possible for the user-agent to change within a visit. When it happens, it's usually a robot visitor but it can also happen with human visitors. The log lines below are a 183-page visit from the same IP address. This can be recognised as a spider by the quick requests in a short space of time. |
|
A single visit in which the user-agent changes.
|
|
|
|
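Rapid requests in a short space of time are something a script can look for. The sketch below flags hosts that make several requests inside a short sliding window; the window length, the threshold and the sample data are hypothetical and would need tuning against real logs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hosts_with_bursts(entries, window=timedelta(seconds=10), threshold=5):
    """Return hosts that made at least `threshold` requests within any `window`.

    entries: iterable of (host, datetime) pairs, one per page request.
    """
    times_by_host = defaultdict(list)
    for host, ts in entries:
        times_by_host[host].append(ts)

    flagged = set()
    for host, times in times_by_host.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans no more than `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(host)
                break
    return flagged


base = datetime(2005, 10, 10, 12, 0, 0)
entries = (
    # Made-up spider: six page requests one second apart.
    [("192.0.2.50", base + timedelta(seconds=i)) for i in range(6)]
    # Made-up human: three page requests forty seconds apart.
    + [("192.0.2.60", base + timedelta(seconds=40 * i)) for i in range(3)]
)

flagged = hosts_with_bursts(entries)
print(sorted(flagged))
```

This only counts page requests per host. A human page view also triggers image and style sheet hits in quick succession, so those hits should be excluded before applying a check like this.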
Users can change the user-agent sent to web servers in browsers (e.g., Internet Explorer, Mozilla, Opera, Konqueror and Lynx) and spiders. If these visitors aren't known to stats programs, their visits will be counted in browser stats.

Cookies can be enabled, disabled or removed
On average, 10% of human visitors to this web site can't or won't accept cookies. With a cookie-enabled browser, web users have control of how cookies are used.
Regardless of what browser is used, it is always possible to remove cookies within a visit. Cookies may be removed deliberately, perhaps in a big clearout, or become corrupted or lost during a disk crash. One can't assume that a) cookies get written or that b) they'll remain on visitors' computers. StatMarket's HitBox counts visitors by use of third-party cookies. People can easily configure their browsers to reject third-party cookies - those not originating from the web site they're visiting. If someone visits a site with a HitBox counter and their browser rejects the cookie, HitBox will count every page request as a new visit. HitBox over-counts visits.
Example: imagine a site that had 2,000 actual visits in one day, with three page requests per visit on average. The true visit count is 2,000. If our 10% non-cookie figure is typical, HitBox would correctly count 1,800 (90% of 2,000) of the visits. However, it would treat each of the 600 page requests from the remaining visits (10% of 2,000 visits x 3 pages) as a separate visit, producing an incorrect total of 1,800 + 600 = 2,400 visits.

The user-agent isn't reliable
A lot of web sites are optimised for Internet Explorer because it's easier for developers to ignore other browsers; until a couple of years ago, the Marks and Spencer web site turned away Mozilla users, telling them to get a better browser. To get around this problem, modern browsers let users set the user-agent to something else. This is usually MSIE, since so many sites are optimised for IE.

Some robots pass themselves off as humans
Sometimes the only indication that a visitor is a robot is the time between accesses. Our sites have repeat visitors which accept cookies and have a user-agent that looks like a normal browser. What gives them away as robots are:
The last behaviour is how they can be spotted in the logs, as their log lines will appear in clumps.

Most robots spoof MSIE
Because so many sites are optimised for Microsoft Internet Explorer (MSIE), bots send an MSIE user-agent. If they aren't detected as bots, they'll inflate the proportion of IE users, misleading the site's developers into thinking they made the right decision to turn away other browsers. Below are log lines from a single IP address to this site. It's a robot specifically designed to show up in the logs with certain referring sites. This has become a trend in robots since blogs, for example, started publishing trackback links to referring sites. These bots are basically getting other sites to publish links to their sites. |
|
All the accesses came from 166-82-31-14.quickclick.ctc.net with the user-agent Mozilla/4.0 (compatible; MSIE 5.01; Windows 98). I've edited the referrers.
|
Don't rely on stats from log analysers
I've spent more time than I care to admit looking at server logs and discovered unexpected behaviour by both human and spider visitors. Because unique visitors can't be accurately detected, some browsers end up being over- or under-counted by stats. There are still more complications. For example, many hit counters collect stats via accesses to a GIF placed on your web pages. Text-only browsers and screen-readers don't access the GIF and so are never included in the browser stats. This means that some disabled visitors are not included at all in browser stats. I've concluded that you mustn't believe your web stats if they're based on log analysis - they'll tend to tell you good news when the reality is likely to be discouraging.

Further reading
|
|
|