Richard Dybowski wrote:
> I have recently come across two on-line articles on Web-usage analysis that
> throw a lot of doubt on the validity of attempting to identify user
> sessions from the type of data that is currently recorded in Web server
> logs. User-session identification is made difficult by a number of causes,
> including caching, load balancing (which assigns multiple IP addresses
> during the same user session), and the use of spiders. One of these
> critical articles is by Stephen Turner (Cambridge University) [1], the
> other is from Susan Haigh and Janette Megarity (National Library of Canada)
> [2].
>
> Does anyone know if such a validation has been done?
>
One way to avoid the issue is to log at the application server layer,
so that the sessions you analyze are guaranteed to match the sessions
the application itself maintains for each user, rather than being
reconstructed from webserver hits. For example, see
http://robotics.Stanford.EDU/~ronnyk/goodBadUglyKDDItrack.pdf
http://robotics.Stanford.EDU/~ronnyk/integratingEcom.pdf
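
To make the idea concrete, here is a minimal sketch of application-layer
logging, not the method from the papers above. It assumes the application
already knows which session each request belongs to; the names
new_session_id, log_event, and group_by_session, and the JSON-lines record
layout, are hypothetical and chosen only for illustration.

```python
import json
import time
import uuid


def new_session_id():
    """Return a random session ID (hypothetical; a real app server issues its own)."""
    return uuid.uuid4().hex


def log_event(out, session_id, url, now=None):
    """Write one JSON line per request, tagged with the application's session ID."""
    record = {
        "ts": time.time() if now is None else now,
        "session": session_id,
        "url": url,
    }
    out.write(json.dumps(record) + "\n")
    return record


def group_by_session(lines):
    """Rebuild sessions by reading the recorded ID; no guessing from IP addresses."""
    sessions = {}
    for line in lines:
        record = json.loads(line)
        sessions.setdefault(record["session"], []).append(record["url"])
    return sessions
```

Because the session ID is written at logging time, caching proxies and
changing IP addresses do not affect how the requests are grouped later.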
Sessionizing from webserver logs is impossible to do perfectly. If you
can't log at the application server layer, you might try client-side
logs or use heuristics. There are several articles on the topic from
the WEBKDD workshops:
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2001/index.html
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2000/index.html
A good article from the SIAM workshop is
"Measuring the Accuracy of Sessionizers for Web Usage Analysis" by
Berendt, Mobasher, Spiliopoulou, and Wiltshire, in the Proceedings of
the Web Mining Workshop at the First SIAM International Conference on
Data Mining, 2001.
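
As an illustration of the kind of heuristic meant here, below is a
minimal sketch of an inactivity-timeout sessionizer. It is not taken
from the article above. It assumes each hit has already been parsed
into a (visitor_key, timestamp, url) tuple, that the visitor key is
something like IP address plus user agent, and that a 30-minute gap
ends a session; all three are assumptions, and the function name
sessionize is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group hits into sessions per visitor key using an inactivity timeout.

    hits: iterable of (visitor_key, timestamp, url) tuples.
    Returns a list of sessions; each session is a list of hits in time order.
    """
    # Bucket hits by the (assumed) visitor key, e.g. "IP|user-agent".
    by_visitor = defaultdict(list)
    for hit in hits:
        by_visitor[hit[0]].append(hit)

    sessions = []
    for visitor_hits in by_visitor.values():
        visitor_hits.sort(key=lambda h: h[1])
        current = [visitor_hits[0]]
        for hit in visitor_hits[1:]:
            # A gap longer than the timeout starts a new session.
            if hit[1] - current[-1][1] > timeout:
                sessions.append(current)
                current = [hit]
            else:
                current.append(hit)
        sessions.append(current)
    return sessions
```

A heuristic like this will still merge users behind a shared proxy and
split users whose IP address changes mid-visit, which is why it can
only approximate the real sessions.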
-- Ronny