Notes
This paper proposes a machine learning solution to the problem of bot and human session
classification, with a specific application to e-commerce.
The work presented in this paper addresses the issue of distinguishing artificial agents or bots from human visitors to a Web shop under realistic conditions.
Its efficiency is evaluated through experiments on real e-commerce data, in realistic conditions,
and compared to that of supervised learning classifiers (a multi-layer perceptron neural network
and a support vector machine).
Result: Results demonstrate that the classification based on unsupervised learning is very efficient, achieving a performance level similar to that of the fully supervised classification.
Bots do: The efficient functioning of the electronic marketplace is largely possible due to Web bots, which explore the Web on a regular basis and automate the execution of many tedious, recurrent, and routine tasks.
Bots are: A Web bot, also called a Web crawler, Internet robot, or intelligent agent, is a software tool that performs specific actions on computers connected in a network, without the intervention of human users, by following hyperlinks.
Good Bots: Search engine indexers, monitoring bots, link checkers, and feed fetchers are examples of “good bots” – they usually have legitimate goals and comply with directives placed by website maintainers in the robots.txt file to prevent or limit access to specific page subsets.
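A minimal sketch of how such a compliant bot can honor robots.txt, using Python's standard urllib.robotparser; the site URL and user-agent name below are placeholder assumptions, not from the paper:

```python
# Minimal sketch: a "good bot" checking robots.txt before fetching pages.
# The site URL and user-agent name are placeholders, not from the paper.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt directives

USER_AGENT = "ExampleIndexerBot"
for url in ("https://example.com/products/1", "https://example.com/admin/"):
    if rp.can_fetch(USER_AGENT, url):
        print("allowed:", url)   # a compliant bot may crawl this page
    else:
        print("skipped:", url)   # directives forbid it; a good bot backs off
```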
Bad Bots: Advanced robots can operate at the application layer and imitate the way in which legitimate users interact with online applications via their browsers, which makes them hard to detect. Such bots are often used to gain undue advantage in online business.
Motivation and Approach: The main motivation for our study was the need for reliable
identification of automatically generated visits in online stores.
Parts of the problem: First, the ability to identify HTTP bot traffic allows a website administrator to obtain accurate measurements of actual site popularity and other business-related metrics. Second, this ability is fundamental for reliable and solid e-customer behaviour characterisation and pattern discovery.
Research: The objective of our research is to explore a machine learning (ML) solution to the problem of bot and human session classification, with a specific application to e-commerce and under a realistic scenario.
Research Question: The main research question is whether it is possible to achieve good
recognition rates in the task of distinguishing between sessions of legitimate, human users and
Web robots using computational intelligence techniques rather than hand-engineered filtering
criteria.
Issue: One issue that is commonly faced in real-world applications is the lack, or limited
availability, of labelled data. Labelling is almost invariably an expensive task. Additionally, in the
specific case of interest there is no solid criterion for labelling all possible bots, so, even among
available labels, a fraction may contain unreliable information.
As opposed to traditional, network-layer DDoS attacks, which are relatively easy to detect, HTTP-based application-layer attacks are extremely hard to cope with.
Ability of supervised: The ability of the inferred function to determine correct class labels for new, unseen samples is assessed on a test dataset. Many supervised learning techniques have demonstrated their efficiency in the classification of bots and humans, e.g., decision trees, support vector machines, neural networks, and k-nearest neighbours.
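A sketch of this supervised setting with the classifier families named above (scikit-learn, on a tiny hypothetical session-feature dataset used purely for illustration, not the paper's data):

```python
# Sketch: supervised bot/human classification with the classifier families
# named above. Features and data are hypothetical, for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical per-session features: [pages requested, duration (s), bytes]
X = np.array([[3, 40, 1.2e5], [250, 30, 9.0e6], [7, 300, 4.1e5],
              [400, 15, 2.0e7], [5, 120, 2.5e5], [310, 20, 1.1e7]])
y = np.array([0, 1, 0, 1, 0, 1])  # 0 = human, 1 = bot (labels assumed reliable)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                          random_state=0, stratify=y)
for base in (DecisionTreeClassifier(), SVC(), MLPClassifier(max_iter=2000),
             KNeighborsClassifier(n_neighbors=1)):
    clf = make_pipeline(StandardScaler(), base)  # scale features, then classify
    clf.fit(X_tr, y_tr)  # infer a decision function from labelled sessions
    print(type(base).__name__, clf.score(X_te, y_te))  # accuracy on unseen data
```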
Disadvantages: All supervised learning approaches, however, share a common disadvantage: the difficulty of preparing a reliable training dataset, in particular of assigning accurate class labels to sessions of camouflaged robots.
Previous Works: In Alam et al. (2014), Particle Swarm Optimization (PSO) was applied to distinguish robots from genuine Web users based on three session features: total transfer volume, number of pages, and session duration. The underlying assumption was that bots are outliers. However, as previously noted, the percentage of bot traffic has dramatically increased, so this assumption is no longer valid. As a consequence, only some kinds of bots might be detected with this method; for the experimental scenario in Alam et al. (2014), these are the bots requesting many pages and transferring large data volumes.
Unsupervised works: Two unsupervised neural network learning algorithms, the Self-Organizing Map
(SOM) and Modified Adaptive Resonance Theory 2 (Modified ART2), were applied in Stevanovic et al.
(2013) with two goals: to obtain a better insight into the types and distribution of Web clients, and to
investigate the relative differences and similarities between malicious bots and other non-malicious
clients. Four client groups were considered: humans, well-behaved bots, malicious bots, and unknown.
Conclusion: A conclusion was that bots, in particular the malicious ones, display a range of browsing
strategies; moreover, as much as 52% of malicious bots exhibit very “human-like” behaviour.
Point: The analysis of the literature shows that Web traffic reveals properties discriminating bots from humans to a large extent, though some sophisticated bots can actually impersonate legitimate users, thus remaining hard to detect.
Since some related works, e.g. Stevanovic et al. (2013), reported considerable diversification of bad bot navigation strategies, we examine a “microclustering” approach, similar in spirit to that used in DBSCAN, using a wide range of centroid counts so that possibly complex class boundaries can be mapped with smaller convex components. Classification efficiency is evaluated on real e-commerce log data: the classifier is developed on a training dataset, and its ability to generalize to new observations is verified on a test dataset. The achieved performance metrics are compared with those obtained for the supervised learning classifiers. Moreover, the features of individual sessions that were misclassified by the unsupervised and/or supervised approaches are analysed thoroughly to inspect possible reasons for the residual errors.
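A minimal sketch of the microclustering idea, using k-means centroids as a stand-in (the paper's exact algorithm and parameters may differ): many small clusters are fitted, each is labelled by majority vote of the training sessions it attracts, and a new session inherits the label of its nearest centroid.

```python
# Sketch of microclustering: cover each class region with many small convex
# cells (k-means centroids), label each cell by majority vote, then classify
# new sessions by nearest centroid. Illustrative only; not the paper's code.
import numpy as np
from sklearn.cluster import KMeans

def fit_microclusters(X_train, y_train, n_centroids=50, seed=0):
    km = KMeans(n_clusters=n_centroids, n_init=10, random_state=seed).fit(X_train)
    # Majority class label among the training sessions in each microcluster
    cell_labels = np.array([
        np.bincount(y_train[km.labels_ == c], minlength=2).argmax()
        for c in range(n_centroids)
    ])
    return km, cell_labels

def classify(km, cell_labels, X):
    return cell_labels[km.predict(X)]  # label of the nearest microcluster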
Problem Formulation:
Starting point: The input for our approach is one or more logs containing traffic data for an e-commerce website over some period of time. We assume access to at least the information contained in the NCSA Combined log format (client IP address, identity, user ID, timestamp, HTTP request line, status code, bytes transferred, referrer, and user-agent string).
Important term: A Web session is defined as a sequence of HTTP requests coming from a client during a single visit.
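A sketch of how such sessions can be reconstructed from Combined-format log lines; the grouping key (IP address plus user-agent) and the 30-minute inactivity timeout are common heuristics assumed here, not necessarily the paper's exact rules:

```python
# Sketch: parse NCSA Combined log lines and group requests into sessions.
# Keying on (IP, user-agent) with a 30-minute inactivity timeout is a common
# heuristic, assumed here; the paper's sessionization rules may differ.
import re
from datetime import datetime, timedelta

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def sessions(lines, timeout=timedelta(minutes=30)):
    open_sessions = {}  # (ip, agent) -> list of (timestamp, request line)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed entries
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        key = (m["ip"], m["agent"])
        if key in open_sessions and ts - open_sessions[key][-1][0] > timeout:
            yield key, open_sessions.pop(key)  # inactivity gap ends a session
        open_sessions.setdefault(key, []).append((ts, m["request"]))
    yield from open_sessions.items()  # flush sessions still open at end of log

line = ('1.2.3.4 - - [10/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0"')
for (ip, agent), requests in sessions([line]):
    print(ip, agent, len(requests))
```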
Research Question: The research question addressed by this work is whether the task of recognizing
bots offline is (1) learnable with standard ML methods, and (2) characterized by intrinsic differences
between the behaviour of legitimate users and that of automatic software agents, such that the
unsupervised analysis is able to reveal significant, interesting information.
Research methodology:
Data: logs were recorded in April 2014 according to the NCSA Combined format with a 1-second resolution and contained 1,397,838 entries.
2) Second paper: detecting malicious social bots
This paper presents a novel method of detecting malicious social bots that combines feature selection based on the transition probability of clickstream sequences with semi-supervised clustering. The method not only analyzes the transition probability of user behavior clickstreams, but also considers the time feature of behavior.
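A sketch of the transition-probability idea for clickstreams (the action names are made up for illustration; the paper's set of click types and its exact feature construction differ):

```python
# Sketch: estimate first-order transition probabilities P(next | current)
# from clickstream sequences. Action names here are illustrative only.
from collections import Counter, defaultdict

def transition_probabilities(clickstreams):
    counts = defaultdict(Counter)
    for stream in clickstreams:
        for cur, nxt in zip(stream, stream[1:]):
            counts[cur][nxt] += 1  # count observed action-to-action moves
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

human = ["login", "browse", "comment", "browse", "logout"]
bot = ["login", "post", "post", "post", "post"]  # repetitive, bot-like pattern
print(transition_probabilities([human, bot]))
# Inter-click times can be kept alongside each transition to capture the
# time feature of behavior mentioned above.
```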
Findings from our experiments on real online social network platforms demonstrate that the detection accuracy of this method for different types of malicious social bots increases by an average of 12.8%, in comparison to a detection method based on quantitative analysis of user behavior.
User behavior is the most direct manifestation of user intent, as different users have different habits,
preferences, and online behavior (e.g., the way one clicks or types, as well as the speed of typing).
In other words, we may be able to mine and analyze the information hidden in a user's online behavior to profile and identify different users.
In order to distinguish social bots from normal users accurately, detect malicious social bots, and reduce the harm they cause, we need to acquire and analyze situation awareness of user behavior, and compare and understand the differences between malicious social bots and normal users.
Aim of paper: Specifically, in this paper, we aim to detect malicious social bots on social network platforms in real time, by (1) proposing transition probability features between user clickstreams based on situation awareness; and (2) designing an algorithm for detecting malicious social bots based on spatiotemporal features.
To identify potential malicious social bots in online social networks in real time, we analyze the situation awareness behavior of users in online social networks. We also evaluate user behavior features and select the transition probability of user behavior on the basis of general behavior characteristics.
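A minimal sketch of semi-supervised (seeded) clustering in this spirit: a few labelled sessions pin one centroid per class, and unlabelled sessions are assigned iteratively. This stands in for the paper's algorithm, whose details are not reproduced here.

```python
# Sketch: seeded k-means as a simple form of semi-supervised clustering.
# Labelled seed sessions anchor one centroid per class; unlabelled sessions
# receive the class of their final nearest centroid. Illustrative only.
import numpy as np

def seeded_kmeans(X_lab, y_lab, X_unlab, n_iter=10):
    classes = np.unique(y_lab)
    centroids = np.array([X_lab[y_lab == c].mean(axis=0) for c in classes])
    X = np.vstack([X_lab, X_unlab])
    seed_idx = np.searchsorted(classes, y_lab)  # fixed assignments for seeds
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)         # nearest-centroid assignment
        assign[:len(y_lab)] = seed_idx    # labelled seeds never change cluster
        centroids = np.array([X[assign == k].mean(axis=0)
                              for k in range(len(classes))])
    return classes[assign[len(y_lab):]]   # inferred labels for unlabelled data
```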
In this experiment, four malicious social bots that each serve a different specific purpose, and two malicious social bots with mixed behavior, are set up.
We find that (1) the precision of the semi-supervised clustering method for the detection of the same type of malicious social bots based on the transition probability features and mixed features is higher than that of the semi-supervised clustering method based on the quantitative features (important).