RESEARCH PRODUCT

Time series clustering with different distance measures to tell Web bots and humans apart

Grażyna Suchacka

subject

Web session, Time series, Unsupervised classification, Web bot detection, Internet robot, Similarity measure, Web bot, Clustering, Distance measure

description

The paper addresses the problem of distinguishing Web sessions of bots from those of human users by observing characteristics of their traffic as it arrives at the Web server. We propose an approach that clusters bot and human sessions represented as time series. First, sessions are expressed as sequences of HTTP requests arriving at the server at specific timestamps; then, they are preprocessed to form time series of limited length. The time series are clustered, and clustering performance is evaluated in terms of the ability to partition bots and humans into separate clusters. The proposed approach is applied to real server log data and validated with different time series distance measures and clustering algorithms. Results show that the choice of distance measure and clustering method significantly affects clustering efficiency. The best results for the considered scenario were achieved with distance measures based on nonparametric spectral estimators and with the Euclidean distance with a complexity correction factor.
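As a rough illustration of the pipeline described above, the following Python sketch turns a session's request timestamps into a fixed-length time series and compares two series with a Euclidean distance scaled by a complexity correction factor. It is a minimal sketch under stated assumptions, not the paper's implementation: the per-bin request counting, the parameters bin_seconds and length, and all function names are hypothetical choices made for this example, and the correction factor follows one widely used formulation (the complexity-invariant distance), which may differ from the exact variant evaluated in the paper.

```python
import math


def requests_to_series(timestamps, bin_seconds=10.0, length=6):
    """Count requests per fixed-width time bin (assumed preprocessing).

    The series is limited to `length` bins; requests falling past the
    last bin are dropped, and short sessions are zero-padded.
    """
    series = [0.0] * length
    if not timestamps:
        return series
    start = min(timestamps)
    for t in timestamps:
        idx = int((t - start) // bin_seconds)
        if idx < length:
            series[idx] += 1.0
    return series


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complexity(series):
    """Complexity estimate: length of the series 'stretched out' as a line."""
    return math.sqrt(
        sum((series[i + 1] - series[i]) ** 2 for i in range(len(series) - 1))
    )


def cid(a, b, eps=1e-12):
    """Euclidean distance multiplied by a complexity correction factor.

    factor = max(CE(a), CE(b)) / min(CE(a), CE(b)); `eps` is an assumed
    guard against division by zero when one series is perfectly flat.
    """
    ce_a, ce_b = complexity(a), complexity(b)
    hi, lo = max(ce_a, ce_b), min(ce_a, ce_b)
    factor = 1.0 if hi == 0.0 else hi / max(lo, eps)
    return euclidean(a, b) * factor


# Two toy sessions: a bursty one and a steadier one (made-up timestamps).
bursty = requests_to_series([0, 1, 2, 3, 40, 41, 42], bin_seconds=10.0, length=6)
steady = requests_to_series([0, 11, 22, 33, 44, 55], bin_seconds=10.0, length=6)
distance = cid(bursty, steady)
```

A pairwise matrix of such distances could then be passed to any clustering algorithm that accepts precomputed dissimilarities, for example hierarchical clustering. The correction factor increases the distance between series whose complexities differ, which counteracts the tendency of the plain Euclidean distance to rate a complex series as closer to a simple one than to another complex series.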

https://doi.org/10.7148/2022-0303