RESEARCH PRODUCT

Modeling a non-stationary bots’ arrival process at an e-commerce Web site

Daria Wotzka, Grażyna Suchacka

subject

Web server; General Computer Science; Computer science; Internet robot; Real-time computing; 02 engineering and technology; E-commerce; computer.software_genre; Session (web analytics); Theoretical Computer Science; Web traffic characterization; Web traffic; 0202 electrical engineering electronic engineering information engineering; Traffic generation model; Web traffic analysis and modeling; business.industry; ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS; 020206 networking & telecommunications; Web bot; Heavy-tailed distribution; Modeling and Simulation; 020201 artificial intelligence & image processing; The Internet; Web log analysis software; Log file analysis; Data mining; business; Regression analysis; computer

description

Abstract: The paper concerns the modeling and generation of a representative Web workload for Web server performance evaluation through simulation experiments. Web traffic has been analyzed for two decades, usually on the basis of Web server log data. However, while the character of overall Web traffic has been extensively studied and modeled, relatively few studies have been devoted to the analysis of Web traffic generated by Internet robots (Web bots). Moreover, the overwhelming majority of studies concern traffic on non-e-commerce websites. In this paper we address the problem of modeling a realistic arrival process of bots' requests at an e-commerce Web server. Based on real log data for an online store, sessions generated by bots were reconstructed and their key features were analyzed, including the interarrival time of bot sessions, the number of HTTP requests per session, and the interarrival time of requests within a session. To deal with the non-stationarity of the Web traffic, chunks associated with times of day were distinguished based on the intensity of bot session arrivals, and the features of sessions in individual time chunks were then analyzed separately. Using regression analysis, a mathematical model of the bots' traffic features was developed and implemented in a bot traffic generator. Our findings confirm the presence of heavy tails in the distributions of bot traffic features. The bots' session interarrival times and request interarrival times are best modeled by a Weibull and a sigmoid distribution, respectively, while the model proposed for the number of requests per bot session is based on a hybrid function combining one exponential and two normal distribution functions. The suitability of the model fit was confirmed by the high correlation between the real and model data. Furthermore, a visual inspection of the simulation results showed that the estimated values represent distributions close to those of the empirical data.
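
As a rough illustration of the idea described in the abstract (Weibull-distributed session interarrival times, with separate parameters for each time-of-day chunk), the following Python sketch generates bot session arrival timestamps. It is not the authors' generator: the chunk boundaries, the Weibull shape and scale values, and the names `CHUNKS`, `chunk_params` and `generate_session_arrivals` are all hypothetical and chosen only for demonstration. As a simplification, each interarrival time is drawn with the parameters of the chunk in which the previous session arrived.

```python
import random

# Hypothetical time-of-day chunks: (start_hour, end_hour, weibull_shape, weibull_scale_seconds).
# These values are illustrative assumptions, NOT parameters reported in the paper.
CHUNKS = [
    (0, 8, 0.70, 40.0),
    (8, 18, 0.80, 15.0),
    (18, 24, 0.75, 25.0),
]

SECONDS_PER_DAY = 86400


def chunk_params(t_seconds):
    """Return (shape, scale) of the chunk that contains simulation time t_seconds."""
    hour = (t_seconds % SECONDS_PER_DAY) / 3600.0
    for start, end, shape, scale in CHUNKS:
        if start <= hour < end:
            return shape, scale
    raise ValueError("CHUNKS must cover the whole day")


def generate_session_arrivals(duration_seconds, seed=0):
    """Generate session arrival times (seconds from start) over the given duration.

    Each gap is Weibull-distributed; the parameters are looked up from the
    chunk in which the previous arrival occurred (a simplifying assumption).
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        shape, scale = chunk_params(t)
        # random.weibullvariate takes (scale, shape) in that order.
        t += rng.weibullvariate(scale, shape)
        if t >= duration_seconds:
            break
        arrivals.append(t)
    return arrivals


if __name__ == "__main__":
    arrivals = generate_session_arrivals(SECONDS_PER_DAY, seed=42)
    print(len(arrivals), "sessions generated in one simulated day")
```

A shape parameter below 1 gives a distribution with a heavier tail than the exponential, which is in line with the heavy tails mentioned in the abstract. The per-session request counts and the within-session request interarrival times would need their own models, which this sketch does not attempt.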

https://doi.org/10.1016/j.jocs.2017.05.017