RESEARCH PRODUCT

Bot recognition in a Web store: An approach based on unsupervised learning

Francesco Masulli, Grażyna Suchacka, Stefano Rovetta

subject

Unsupervised classification; Web bot detection; Computer Networks and Communications; Computer science; Internet robot; 02 engineering and technology; Machine learning; Web traffic; Web server; 0202 electrical engineering, electronic engineering, information engineering; Artificial neural network; Supervised learning; 020206 networking & telecommunications; Perceptron; Web application security; Web bot; Computer Science Applications; Support vector machine; Generative model; ComputingMethodologies_PATTERNRECOGNITION; Hardware and Architecture; Supervised classification; Unsupervised learning; 020201 artificial intelligence & image processing; Artificial intelligence

description

Abstract: Web traffic on e-business sites is increasingly dominated by artificial agents (Web bots), which pose a threat to website security, privacy, and performance. To develop efficient bot detection methods and discover reliable e-customer behavioural patterns, traffic generated by legitimate users must be accurately separated from traffic generated by Web bots. This paper proposes a machine learning solution to the problem of bot and human session classification, with a specific application to e-commerce. The approach explores the use of unsupervised learning (k-means and Graded Possibilistic c-Means) followed by supervised labelling of clusters, a generative learning strategy that decouples modelling the data from labelling them. Its efficiency is evaluated through experiments on real e-commerce data, in realistic conditions, and compared to that of supervised learning classifiers (a multi-layer perceptron neural network and a support vector machine). Results demonstrate that the classification based on unsupervised learning is very efficient, achieving a performance level similar to that of fully supervised classification. This is an experimental indication that the bot recognition problem can be dealt with successfully using methods that are less sensitive to mislabelled data or missing labels. A very small fraction of sessions remains misclassified in both cases, so an in-depth analysis of the misclassified samples was also performed. This analysis exposed the superiority of the proposed approach, which correctly recognized more bots and identified more camouflaged agents that had been erroneously labelled as humans.
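The two-stage strategy described in the abstract (cluster the sessions without labels, then label the clusters) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain k-means only (Graded Possibilistic c-Means is not shown), it assumes that clusters are labelled by majority vote over a small labelled subset, and the two session features (requests per minute, fraction of image requests) and all data values are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    # Index of the centroid closest to the given point.
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=50, seed=0):
    # Stage 1 (unsupervised): model the data without using any labels.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids

def label_clusters(centroids, labelled_points, labels, default="human"):
    # Stage 2 (supervised): majority vote of the labelled sessions per cluster.
    # Majority voting is an assumption of this sketch.
    votes = [Counter() for _ in centroids]
    for p, y in zip(labelled_points, labels):
        votes[nearest(p, centroids)][y] += 1
    return [v.most_common(1)[0][0] if v else default for v in votes]

def classify(point, centroids, cluster_labels):
    # A new session inherits the label of its nearest cluster.
    return cluster_labels[nearest(point, centroids)]

# Hypothetical sessions: [requests per minute, fraction of image requests].
sessions = [
    [2.0, 0.60], [3.0, 0.55], [1.5, 0.70], [2.5, 0.65],
    [40.0, 0.05], [45.0, 0.02], [38.0, 0.10], [50.0, 0.00],
]
centroids = kmeans(sessions, k=2)

# Only three sessions carry a (hypothetical) label.
labelled = [[2.0, 0.60], [40.0, 0.05], [45.0, 0.02]]
cluster_labels = label_clusters(centroids, labelled, ["human", "bot", "bot"])

print(classify([42.0, 0.03], centroids, cluster_labels))
print(classify([2.2, 0.50], centroids, cluster_labels))
```

Because the clusters are fitted without labels, a few wrong or missing labels in the voting subset change the outcome only if they flip a cluster's majority, which is the sense in which such a scheme is less sensitive to mislabelled data. In practice the features would need scaling before clustering, since here the request rate dominates the distance.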

10.1016/j.jnca.2020.102577

https://hdl.handle.net/11567/999600