1 Introduction
Testing high-load Fintech systems is a complex task, given frequent code modifications, parallel and concurrent execution, and the massive volume of produced traces. However, the financial and reputational impact of risks associated with undiscovered errors motivates using every opportunity for effective QA.
One of the most widely used approaches is passive testing, where the natural behavior of the system (under production load or in a test environment) is observed through its traces, which are analyzed (1) to check compliance with expectations and (2) to gain a human understanding of what is happening in the system.
In the area of code error discovery, a promising technique is to observe the error output (STDERR or similar) produced by the system, as intuitively it should contain traces of a considerable percentage of errors, including those not yet known to users or testers. This is especially true for new errors that have never appeared before and are therefore immune to regression testing.
The challenges here stem from the massiveness of the logs, typically up to millions of log lines per working day. Still, a human QA engineer would argue that not all of these logs are of interest, and many log lines represent essentially the same error.
2 Literature overview
Modern large-scale systems describe their current state, as well as the errors and warnings arising from various parts of the system pipeline, in their logs [1,2]. The amount and quality of generated logs vary depending on the application, and so do the means of processing and analysis.
Most log processing pipelines start with pre-processing, whose goal is to prepare the raw data for further analysis by cleansing, filtering, and unifying the internal representation for the other parts of the pipeline [3,4]. If the problem at hand allows for a thorough offline analysis, one can use template extraction [5–7] and various NLP (natural language processing) techniques, such as in [8], where the task of separating a template from variables is considered as sequential data labeling. Various embedding techniques, such as word2vec [9] and its variations [10], may also be applied [11].
In large-scale applications, online approaches are used to provide shallow analysis, e.g., to detect anomalies, extract the source, or classify the logs. The algorithm called "Spell" proposed in [12] is one of the streaming-type algorithms that utilize the longest common subsequence approach. Another streaming algorithm for online log processing, based on a fixed depth tree, is proposed in [13,14]; it uses directed acyclic graphs to divide massive logs into disjoint log groups.
There are several approaches focused on log data enrichment, ranging from a semi-supervised approach [15] that uses a language model and topic modeling as tools for rapid processing, to suggestions for improving log messages in the code [16,17], as this significantly affects the quality of analysis [18].
A thorough analysis of existing solutions for parsing logs can be found in review [19], where both research and production tools were analyzed. According to the report, many algorithms successfully cope with representing unstructured logs in a structured manner, provided that the message structure is simple. However, as the complexity of the log structure increases, the quality of separating the template parts of the messages from the parameters drops significantly, which means that a universal algorithm able to adapt to an arbitrary message structure is still far from reality. Moreover, the authors note that the most significant bottleneck in log analysis systems is the slow performance of parsing algorithms.
Typical applications of log analysis include error analysis [20], system behavior understanding [21], failure diagnostics [16], and anomaly detection [22].
For anomaly detection, three categories of methods are generally used: PCA-based approaches [23], invariant mining-based methods [24,25], and workflow-based methods [26]. A comprehensive review of existing methods for log-based anomaly detection may be found in [27], where an attempt was made to formalize the anomaly detection process starting from the parsing of raw logs received from the system.
Recently, methods based on deep learning have been actively used for log analysis [10,22,28]. Neural networks emerge here as a promising tool for fast and efficient anomaly prediction and log classification.
If error classes are perfectly distilled, an engineer will mark some (usually most) of them as "not interesting", as they appear normally, for example, on system start. Others will be well known to the engineer, and the rest will be new.
A minimally viable product should allow answering the following questions:
– What new error classes (not known before) appeared today?
– What is the general picture of the day: what error classes appeared and how many instances of each?
– What is the story of a particular error class: its first appearance and basic appearance statistics over time?
In the following parts of this article, we describe the means and decisions that allow us to capture error classes through clustering as close to the human understanding of error classes as possible.
As the primary result, we arrive at a clustering that provides clusters that are:
– Stable over time. When we add logs from the following days and update the clusters, we need to be sure that Cluster X today represents the same error as Cluster X two weeks ago.
– Big enough to make the engineer's work feasible, even with a million log lines per day.
– Close to the human understanding of error classes. An error class may span several machine-generated clusters, but not vice versa.
In the closing part of this article, we describe the functioning of the resulting clustering-based tool used by QA engineers to explore the error logs.
4 Data structure
In this chapter, we describe the log structure of both applications that our tool works with.
For the experiments, we retrieved a subset of logs dated from November 2018 to January 2019 inclusive for one (A) data set and July 2019 for the other (B). The A data set contains 7.2M messages with lengths of up to 15,000 characters (including error stack traces). The B data set contains 0.37M messages with a mean message length of about 9 words (no error stack traces). Log messages are accompanied by supplementary information on the system, with up to 42 fields in each message.
The information comprising each message (see Fig. 1) may be divided into three main categories:
– Timeline information. For every log message, this includes a timestamp; however, on the timeline such logs may form various structures based on periodicity or event triggering. In some cases, statistical estimates of log appearance time may be derived.
– Log text itself. It may include a stack trace, some system details, error codes, or even information on the erroneous executables, modules, or lines of code. This is subject to further detailed analysis by template or text mining techniques.
daily or weekly patterns in it. However, we would like to illustrate (see Fig. 2d) that high load peaks may occur in different time windows during the day or may not occur at all, due to the stochastic nature of the log data.
Fig. 2. General overview of the data. a) Cumulative number of logs with length not
exceeding N . b) Data availability plot. c) Data diversity bar plot. d) Number of logs
grouped in 15-minute intervals as a function of time for a few trading days.
The most important insight we want to highlight is that while the log messages are dissimilar in time occurrences, label availability, and classes, most of them (up to 84%) are short, with a length not exceeding 600 characters. This encourages us to use more robust and straightforward data processing techniques based on string comparison algorithms in our analysis. For this reason, only the text messages from the B data set were used in further experiments.
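A minimal sketch of this filtering step is given below, assuming the logs have been exported to a table with a text column; the file name "logs_B.csv" and the column name "message" are illustrative, not the project's actual layout.

```python
# Minimal sketch of the filtering step described above: keep only the short
# messages (<= 600 characters) that dominate the data. The file name and the
# "message" column are illustrative assumptions, not the project's real layout.
import pandas as pd

logs = pd.read_csv("logs_B.csv")                      # hypothetical export of the B data set
short = logs[logs["message"].str.len() <= 600]        # short messages only

print(f"{len(short) / len(logs):.0%} of messages are short")   # ~84% in our data
```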
As a first step, we tried a rather straightforward approach to obtain a visible and usable result as fast as possible. On the one hand, this would deliver business benefit quickly. On the other, it laid the foundation for establishing an evidence-based dialogue with users, necessary before trying more sophisticated methods.
We implemented K-means on raw (non-tokenized) strings with the number of clusters around 100, which seemed reasonable from an expert point of view. The daily launched algorithm would learn on the previous three days and report any log line distant enough from all existing clusters.
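A minimal sketch of this first iteration is shown below. Since K-means requires numeric vectors, we assume a character-level TF-IDF representation of the raw strings; this representation and the distance threshold are illustrative assumptions rather than the production configuration.

```python
# Sketch of the first, naive iteration: K-means with ~100 clusters over raw
# log lines, flagging new lines that are far from every learned centroid.
# The character-level TF-IDF vectorization and the distance threshold are
# illustrative assumptions; the production setup may differ.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_daily_model(history_lines, n_clusters=100):
    # Vectorize the raw (non-tokenized) strings via character n-grams.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X = vectorizer.fit_transform(history_lines)
    model = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return vectorizer, model

def report_outliers(vectorizer, model, new_lines, threshold=0.9):
    # Report lines whose distance to the nearest centroid exceeds the threshold.
    X_new = vectorizer.transform(new_lines)
    nearest = model.transform(X_new).min(axis=1)      # distance to the closest centroid
    return [line for line, d in zip(new_lines, nearest) if d > threshold]
```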
Although this approach led to fast results, it showed the following limitations:
– Clustering did not allow for user assessment. Because of the large size (hundreds of thousands of records each day), there was no means for a human to judge the quality of the clustering and how it related to the human understanding of an "error class".
In view of these limitations, for the next iteration we decided to:
– Tokenize the log messages (a minimal tokenizer sketch is shown after this list);
– Allow for the addition of new data to the existing set of clusters;
– Make clusters compact enough so that one cluster represents only one error class;
– Make a hierarchy so that a user can comfortably browse clusters, marking some of them as unimportant and others as important.
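A minimal tokenizer sketch, with illustrative masking rules for decimal and hexadecimal values (not the exact rules used in the project), could look as follows:

```python
# Minimal tokenizer sketch: split a raw log line into tokens and mask obvious
# variable parts (decimal and hex values) so that lines of the same error
# class yield similar token sequences. The masking rules are illustrative.
import re

def tokenize(line: str) -> list[str]:
    tokens = re.split(r"[^\w.:/-]+", line.strip())
    masked = []
    for tok in tokens:
        if not tok:
            continue
        if re.fullmatch(r"0x[0-9a-fA-F]+|\d+", tok):
            masked.append("<NUM>")       # numeric or hexadecimal value
        else:
            masked.append(tok.lower())
    return masked

# tokenize("Order 4711 rejected by gateway 0xBEEF")
# -> ['order', '<NUM>', 'rejected', 'by', 'gateway', '<NUM>']
```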
6 Clustering comparisons
Here, we focus on short logs (less than 600 characters), which make up the majority of all messages. The approach, which can be seen as an intermediate step between the log-hashing approach [29] and regular expressions, is based on a brute-force string comparison over all the logs. We consider two messages to belong to two different templates if they differ in at least 20% of their content (measured as the total length of common non-intersecting substrings longer than three characters). A log message that is sufficiently different from all its predecessors forms a new template. This results in a greedy approach that runs through all the messages and compares them with the templates already found.
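The sketch below illustrates this greedy template extraction. It uses difflib as one way to obtain common non-intersecting substrings and treats two messages as sharing a template when substrings longer than three characters cover at least 80% of the longer message; the exact matching routine and normalization in the project may differ.

```python
# Sketch of the greedy template extraction: two messages share a template when
# common non-intersecting substrings longer than three characters cover at
# least 80% of the longer message. difflib is used here as one way to obtain
# such substrings; the exact routine and normalization in the project may differ.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    common = sum(block.size for block in blocks if block.size > 3)
    return common / max(len(a), len(b), 1)

def extract_templates(messages, threshold=0.8):
    templates = []                       # greedy list of representative messages
    for msg in messages:
        if not any(similarity(msg, t) >= threshold for t in templates):
            templates.append(msg)        # sufficiently different: start a new template
    return templates
```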
While simple yet time-consuming for large data sets, this approach results in a set of templates that can be rapidly compared with each new message using the aforementioned string comparison or regular expressions. In total, we found 109 templates for the short messages; see Fig. 3 for a visualization.
Fig. 3. Dendrogram for the templates found. The distance metric used here is a weighted length of common non-intersecting substrings larger than 3 characters.
Fig. 4. Visualization of the obtained clusters using t-SNE [33] (right) and the silhouette score as a function of the number of clusters (left). a) 50 clusters. b) 90 clusters.
Both approaches result in approximately 100 clusters in the data. The two clustering algorithms are in good correspondence with each other, with an adjusted Rand index [34] of 0.896 for 90 NLP-based clusters and 0.904 for 50 NLP-based clusters. This outcome supports the intuition that there are actually around 100 substantially different message types in the human sense. Further, this result ensures that a practically "good enough" classification is obtained for the initial human understanding of the content of the logs. That understanding is to be refined in further steps of the methodology development.
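The agreement check between the two clusterings can be computed as in the sketch below; the label lists are toy placeholders standing in for the actual per-message cluster assignments of the template-based and NLP-based approaches.

```python
# Sketch of the agreement check between the two clusterings via the adjusted
# Rand index [34]. The label lists are toy placeholders for the per-message
# cluster ids of the template-based and NLP-based clusterings.
from sklearn.metrics import adjusted_rand_score

template_labels = [0, 0, 1, 1, 2, 2]     # placeholder template-based assignment
nlp_labels      = [0, 0, 1, 2, 2, 2]     # placeholder NLP-based assignment

print(f"Adjusted Rand index: {adjusted_rand_score(template_labels, nlp_labels):.3f}")
```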
In this chapter, we outline the final clustering algorithm implemented in the tool. It addresses all the challenges, including cluster stability and correspondence to the human understanding of an error. The aim was to develop an algorithm that allows retraining on new data without changing old clusters unless the user suggests otherwise (to unite or divide clusters).
The approach is the same for both log series and is based on a greedy clustering algorithm with the Jaccard coefficient as the metric. The Jaccard coefficient was chosen because the structure of a log message is often unlike natural language. We decided to represent each message as a set of 1-, 2-, and 3-grams and compare messages by the Jaccard index over these sets.
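To make the metric concrete, the sketch below (ours, not taken from the project code) builds the set of 1-, 2-, and 3-grams for an already tokenized message and computes the Jaccard index between two such sets.

```python
# Sketch of the similarity measure used by the final algorithm: each tokenized
# message is represented by the set of its 1-, 2-, and 3-grams, and two
# messages are compared via the Jaccard index over these sets.
def ngram_set(tokens, n_max=3):
    """Return the set of all n-grams (n = 1..n_max) of a token sequence."""
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams

def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: jaccard(ngram_set(["connection", "refused"]),
#                  ngram_set(["connection", "reset"])) == 0.2
```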
In the first step, the data set is split into a training (historical) part and a test (new day) part. Each message is tokenized and transformed into a set of 1-, 2-, and 3-grams. Next, on the historical set, the pairwise similarity matrix is calculated using the Jaccard index, and messages (sets) are collected into clusters according to the indicated threshold. If a message cannot be attributed to any cluster (there is no pair with similarity of at least the threshold), it is assigned to the "lonely rangers" group. The density inside a cluster cannot fall below the threshold. At the end of this step, a nested list of cluster members is obtained. The same steps are then applied to the new-day set; statistics on active and new clusters are also collected. Prior "lonely rangers" can form new clusters. Fig. 5 outlines the working loop. The difference between the number of clusters and "lonely rangers" for the full A data set and for its split into one historical data set and new-day batches is presented in Fig. 6. According to the results, the proposed algorithm neither significantly reduces the number of clusters nor crucially increases the number of "lonely rangers" during incremental learning.
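The condensed sketch below illustrates the working loop of Fig. 5 under our assumptions: it reuses the ngram_set() and jaccard() helpers sketched above, keeps clusters as lists of n-gram sets, and uses an illustrative similarity threshold of 0.7 (the actual value is not stated here). Bookkeeping of cluster identifiers, statistics, and user feedback is omitted.

```python
# Condensed sketch of the working loop: a new message joins the first cluster
# whose members are all at least `threshold`-similar to it (so cluster density
# stays above the threshold); otherwise it may pair up with an earlier
# "lonely ranger" to form a new cluster, or become a lonely ranger itself.
def assign(message_grams, clusters, lonely, threshold=0.7):
    for cluster in clusters:
        if all(jaccard(message_grams, member) >= threshold for member in cluster):
            cluster.append(message_grams)
            return
    # Try to pair up with an earlier lonely ranger to form a new cluster.
    for i, other in enumerate(lonely):
        if jaccard(message_grams, other) >= threshold:
            clusters.append([lonely.pop(i), message_grams])
            return
    lonely.append(message_grams)

def run_loop(historical, new_day, threshold=0.7):
    clusters, lonely = [], []
    for tokens in historical:             # step 1: build clusters on the historical set
        assign(ngram_set(tokens), clusters, lonely, threshold)
    for tokens in new_day:                # step 2: add the new day incrementally
        assign(ngram_set(tokens), clusters, lonely, threshold)
    return clusters, lonely
```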
8 Conclusion
Fig. 7. Appearance of the developed tool. a) Historical clusters list. b) Detailed content
of the cluster. c) New clusters at a new day. d) Active clusters at a new day.
In this work, we have applied different data analysis methods to the project of testing clearing and settlement systems. The main results are as follows:
– Logs unify massive amounts of records produced by many modules. Modules and configurations evolve and impact the output, making the log files a changing system that is useful but difficult to analyze.
– The naive approach of clustering raw log data produces clusters of unclear quality that cannot be assessed by a human. Moreover, these clusters tend to drift when new logs are added over time. Therefore, signature extraction and clustering into dense and stable clusters is a crucial part of the analysis process.
– Clustering parameters should be adjusted to make sure that a single error class may span several clusters, but all errors within one cluster represent the same error class. This is accomplished through close interaction with the system users.
– For large systems, the number of such "good" clusters can be large. In this case, they may be grouped into larger sets ("user clusters") to make them comprehensible. Some user clusters will then probably be moved out of attention as not interesting.
The project is in its active phase. Our future plans include extending the
functionality of the log analysis framework as well as developing a consolidated
user interface, which will result in a stand-alone tool for user-assisted quality
control of distributed applications from different domains.
References
11. C. Bertero, M. Roy, C. Sauvanaud, and G. Trédan, “Experience report: Log mining
using natural language processing and application to anomaly detection,” in Soft-
ware Reliability Engineering (ISSRE), 2017 IEEE 28th International Symposium
on. IEEE, 2017, pp. 351–360.
12. M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in Data Mining
(ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 859–864.
13. P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach
with fixed depth tree,” in Web Services (ICWS), 2017 IEEE International Con-
ference on. IEEE, 2017, pp. 33–40.
14. P. He, J. Zhu, P. Xu, Z. Zheng, and M. R. Lyu, “A directed acyclic graph approach
to online log parsing,” arXiv preprint arXiv:1806.04356, 2018.
15. G. Li, P. Zhu, and Z. Chen, “Accelerating system log processing by semi-supervised
learning: A technical report,” arXiv preprint arXiv:1811.01833, 2018.
16. D. Yuan, S. Park, P. Huang, Y. Liu, M. M.-J. Lee, X. Tang, Y. Zhou, and S. Sav-
age, “Be conservative: Enhancing failure diagnosis with proactive logging.” OSDI,
vol. 12, pp. 293–306, 2012.
17. Q. Fu, J. Zhu, W. Hu, J.-G. Lou, R. Ding, Q. Lin, D. Zhang, and T. Xie, “Where do
developers log? an empirical study on logging practices in industry,” in Companion
Proceedings of the 36th International Conference on Software Engineering. ACM,
2014, pp. 24–33.
18. D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, “Improving software diagnos-
ability via log enhancement,” ACM Transactions on Computer Systems (TOCS),
vol. 30, no. 1, p. 4, 2012.
19. J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, “Tools and bench-
marks for automated log parsing,” arXiv preprint arXiv:1811.03509, 2018.
20. K. Nagaraj, C. Killian, and J. Neville, “Structured comparative analysis of systems
logs to diagnose performance problems,” in Proceedings of the 9th USENIX con-
ference on Networked Systems Design and Implementation. USENIX Association,
2012, pp. 26–26.
21. K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols,
D. Grant, G. Loihle, and G. Hunt, “Debugging in the (very) large: ten years of
implementation and experience,” in Proceedings of the ACM SIGOPS 22nd sym-
posium on Operating systems principles. ACM, 2009, pp. 103–116.
22. M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and di-
agnosis from system logs through deep learning,” Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, pp. 1285–1298,
2017.
23. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, “Detecting large-scale
system problems by mining console logs,” in Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles. ACM, 2009, pp. 117–132.
24. J.-G. Lou, Q. Fu, S. Yang, J. Li, and B. Wu, “Mining program workflow from inter-
leaved traces,” in Proceedings of the 16th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2010, pp. 613–622.
25. J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, “Mining invariants from console logs
for system problem detection.” in USENIX Annual Technical Conference, 2010,
pp. 23–25.
26. X. Yu, P. Joshi, J. Xu, G. Jin, H. Zhang, and G. Jiang, “Cloudseer: Workflow mon-
itoring of cloud infrastructures via interleaved logs,” in ACM SIGPLAN Notices,
vol. 51, no. 4. ACM, 2016, pp. 489–502.
27. S. He, J. Zhu, P. He, and M. R. Lyu, “Experience report: system log analysis for
anomaly detection,” in Software Reliability Engineering (ISSRE), 2016 IEEE 27th
International Symposium on. IEEE, 2016, pp. 207–218.
28. A. Brown, A. Tuor, B. Hutchinson, and N. Nichols, “Recurrent neural network at-
tention mechanisms for interpretable system log anomaly detection,” ACM HPDC
(First Workshop On Machine Learning for Computer Systems), 2018.
29. Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in distributed
systems through unstructured log analysis,” in Data Mining, 2009. ICDM’09.
Ninth IEEE International Conference on. IEEE, 2009, pp. 149–158.
30. B. Luque, L. Lacasa, F. Ballesteros, and J. Luque, “Horizontal visibility graphs:
Exact results for random time series,” Physical Review E, vol. 80, no. 4, p. 046103,
2009.
31. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman,
“Indexing by latent semantic analysis,” Journal of the American society for infor-
mation science, vol. 41, no. 6, pp. 391–407, 1990.
32. P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis,” Journal of computational and applied mathematics, vol. 20,
pp. 53–65, 1987.
33. L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine
learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
34. L. Hubert and P. Arabie, “Comparing partitions,” Journal of classification, vol. 2,
no. 1, pp. 193–218, 1985.