I have recently come across two online articles on Web-usage analysis that
cast serious doubt on the validity of attempting to identify user sessions
from the type of data currently recorded in Web server logs. User-session
identification is made difficult by a number of factors, including caching,
load balancing (which can assign multiple IP addresses to the same user
during a single session), and requests from spiders. One of these
critical articles is by Stephen Turner (Cambridge University) [1], the
other is from Susan Haigh and Janette Megarity (National Library of Canada)
[2].
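To make the load-balancing problem concrete, here is a minimal sketch (my
own illustration, not taken from either article) of the common IP-plus-
inactivity-timeout heuristic that log analysers use, and how a single proxy
IP reassignment mid-visit splits one real session into two. The IP
addresses, timestamps, and 30-minute timeout are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical log records: (ip, timestamp) pairs for ONE real user whose
# proxy rotates the apparent IP address mid-visit (as AOL's proxies do).
records = [
    ("10.0.0.1", datetime(2001, 7, 7, 12, 0, 0)),
    ("10.0.0.1", datetime(2001, 7, 7, 12, 1, 30)),
    ("10.0.0.2", datetime(2001, 7, 7, 12, 2, 10)),  # IP reassigned mid-session
    ("10.0.0.2", datetime(2001, 7, 7, 12, 3, 0)),
]

def count_sessions(records, timeout=timedelta(minutes=30)):
    """Common heuristic: a session is a run of requests from one IP with
    no gap longer than `timeout` between consecutive requests."""
    last_seen = {}  # ip -> timestamp of that IP's most recent request
    sessions = 0
    for ip, ts in sorted(records, key=lambda r: r[1]):
        if ip not in last_seen or ts - last_seen[ip] > timeout:
            sessions += 1
        last_seen[ip] = ts
    return sessions

print(count_sessions(records))  # prints 2, although only one visit occurred
```

The heuristic over-counts: one visit is reported as two sessions, which is
exactly the kind of error the articles describe.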
Haigh and Megarity describe user-session figures as "at best, gross
estimates". It seems to me that what is needed is a systematic validation
of the efficacy of the various Web-analysis algorithms currently available.
This could be done by simulating log-file data from known transactions and
measuring how well an algorithm recovers those transactions from the data.
The exercise should be repeated across a wide range of hypothetical
scenarios, such as very frequent load balancing (as occurs in practice with
AOL users).
Does anyone know if such a validation has been done?
Richard
References
---------------
[1] S. Turner. "Analog 5.03: How the Web Works".
http://www.analog.cx/docs/webworks.html [7 July 2001]
[2] S. Haigh, J. Megarity. "Measuring Web Site Usage: Log File Analysis".
http://www.nlc-bnc.ca/9/1/p1-256-e.html [4 August 1998]
-------------------------------
Richard Dybowski, 143 Village Way, Pinner, Middlesex HA5 5AA, UK
Tel (mobile): 079 76 25 00 92
This archive was generated by hypermail 2b29 : Tue Oct 02 2001 - 14:26:58 PDT