1 Introduction
Testing high-load Fintech systems is a complex task, given frequent code modifications, parallel and concurrent execution, and the massive volume of produced traces. However, the financial and reputational impact of risks associated with undiscovered errors motivates using every opportunity for effective QA.
One of the most widely used approaches is passive testing, where the natural behavior of the system (under production load or in a test environment) is observed through its traces, which are analyzed (1) to check compliance with expectations and (2) to gain a human understanding of what is happening in the system.
In the area of code error discovery, a promising technique is to observe the error output (STDERR or similar) produced by the system, as intuitively it should contain traces of a considerable percentage of errors, including those not yet known to users or testers. This is especially true for new errors that have never appeared before and are therefore immune to regression testing.
The challenges here stem from the massiveness of the logs, typically up to millions of log lines per working day. Still, a human QA engineer would argue that not all of these logs are of interest, and many log lines represent essentially the same error.
2 Literature overview
Modern large-scale systems describe their current state, as well as the errors and warnings arising from various parts of the system pipeline, in their logs [1,2]. The amount and quality of generated logs vary depending on the application, and so do the means of processing and analysis.
Most log processing pipelines start with pre-processing, whose goal is to prepare the raw data for further analysis by cleansing, filtering, and unifying the internal representation for the other parts of the pipeline [3,4]. If the problem at hand allows for a thorough offline analysis, one can use template extraction [5–7] and various NLP (natural language processing) techniques, such as in [8], where the task of separating a template from variables is considered as sequential data labeling. Various embedding techniques, such as word2vec [9] and its variations [10], may also be applied [11].
In large-scale applications, online approaches are used to provide shallow analysis, e.g., to detect anomalies, extract the source, or classify the logs. The algorithm called "Spell" proposed in [12] is one of the streaming-type algorithms that utilize the longest common subsequence approach. Another streaming algorithm for online log processing, based on a fixed depth tree, is proposed in [13,14]; it uses directed acyclic graphs to divide massive logs into disjoint log groups.
There are several approaches focused on log data enrichment, ranging from a semi-supervised approach [15] that uses a language model and topic modeling as tools for rapid processing, to suggestions for improving log messages in the code [16,17], as this significantly affects the quality of analysis [18].
A thorough analysis of existing solutions for parsing logs can be found in review [19], where both research and production tools were analyzed. According to the report, many algorithms successfully cope with representing unstructured logs in a structured manner, provided that the message structure is simple. However, as the complexity of the log structure increases, the quality of separating the template parts of the messages from the parameters drops significantly, which means that a universal algorithm able to adapt to an arbitrary message structure is still far from reality. Moreover, the authors note that the most significant bottleneck in log analysis systems is the slow performance of parsing algorithms.
Typical applications of log analysis include error analysis [20], system behavior understanding [21], failure diagnostics [16], and anomaly detection [22].
For anomaly detection, three categories of methods are generally used: PCA-based approaches [23], invariant mining-based methods [24,25], and workflow-based methods [26]. A comprehensive review of existing methods for log-based anomaly detection may be found in [27], where an attempt was made to formalize the anomaly detection process starting from the parsing of raw logs received from the system.
Recently, methods based on deep learning have been actively used for log analysis [10,22,28]. Neural networks emerge here as a promising tool for fast and efficient anomaly prediction and log classification.
If error classes are perfectly distilled, an engineer will mark some (usually most) of them as "not interesting", as they appear normally, for example, on system start. Others will be well known to the engineer, and the rest will be new.
A minimally viable product should allow answering the following questions:
– What new error classes (not known before) appeared today?
– What is the general picture of the day: what error classes appeared and how many instances of each?
– What is the story of a particular error class: its first appearance and basic appearance statistics over time?
In the following parts of this article, we describe the means and decisions that allow us to capture error classes through clustering as close to the human understanding of error classes as possible.
As the primary result, we arrive at a clustering that provides clusters that are:
– Stable over time. When we add logs from the following days and update the clusters, we need to be sure that Cluster X today represents the same error as Cluster X two weeks ago.
– Big enough to make the engineer's work feasible, even with a million log lines per day.
– Close to the human understanding of error classes. An error class may span several machine-generated clusters, but not vice versa.
In the closing part of this article, we describe the functioning of the resulting clustering-based tool used by QA engineers to explore the error logs.
4 Data structure
In this chapter, we describe the log structure of both applications that our tool works with.
For the experiments, we retrieved a subset of logs dated from November 2018 to January 2019 inclusive for one (A) data set and July 2019 for the other (B). The A data set contains 7.2M messages with lengths of up to 15,000 characters (including error stack traces). The B data set contains 0.37M messages with a mean message length of about 9 words (no error stack traces). Log messages are accompanied by supplementary information on the system, with up to 42 fields in each message.
The information comprising each message (see Fig. 1) may be divided into three main categories:
– Timeline information. For every log message, this includes a timestamp; however, on the timeline such logs may form various structures based on periodicity or event triggering. In some cases, statistical estimates of log appearance time may be derived.
– Log text itself. It may include a stack trace, some system details, error codes, or even information on the erroneous executables, modules, or lines of code. This is subject to further detailed analysis by template or text mining techniques.
daily or weekly patterns in it. However, we would like to illustrate (see Fig. 2d) that high load peaks may occur in different time windows during the day or may not occur at all, due to the stochastic nature of the log data.
Fig. 2. General overview of the data. a) Cumulative number of logs with length not
exceeding N . b) Data availability plot. c) Data diversity bar plot. d) Number of logs
grouped in 15-minute intervals as a function of time for a few trading days.
The most important insight we want to highlight is that while the log messages are dissimilar in time occurrences, label availability, and classes, most of them (up to 84%) are short, with a length not exceeding 600 characters. This encourages us to use more robust and straightforward data processing techniques based on string comparison algorithms in our analysis. For this reason, only the text messages from the B data set were used in further experiments.
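A minimal sketch of this filtering step is given below, assuming the logs have been exported to a table with a text column; the file name "logs_B.csv" and the column name "message" are illustrative, not the project's actual layout.

```python
# Minimal sketch of the filtering step described above: keep only the short
# messages (<= 600 characters) that dominate the data. The file name and the
# "message" column are illustrative assumptions, not the project's real layout.
import pandas as pd

logs = pd.read_csv("logs_B.csv")                      # hypothetical export of the B data set
short = logs[logs["message"].str.len() <= 600]        # short messages only

print(f"{len(short) / len(logs):.0%} of messages are short")   # ~84% in our data
```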
As a first step, we tried a rather straightforward approach to obtain a visible and usable result as fast as possible. On the one hand, this would deliver business benefit quickly. On the other, it laid the foundation for establishing an evidence-based dialogue with users, necessary before trying more sophisticated methods.
We implemented K-means on raw (non-tokenized) strings with the number of clusters around 100, which seemed reasonable from an expert point of view. The daily launched algorithm would learn on the previous three days and report any log line distant enough from all existing clusters.
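A minimal sketch of this first iteration is shown below. Since K-means requires numeric vectors, we assume a character-level TF-IDF representation of the raw strings; this representation and the distance threshold are illustrative assumptions rather than the production configuration.

```python
# Sketch of the first, naive iteration: K-means with ~100 clusters over raw
# log lines, flagging new lines that are far from every learned centroid.
# The character-level TF-IDF vectorization and the distance threshold are
# illustrative assumptions; the production setup may differ.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_daily_model(history_lines, n_clusters=100):
    # Vectorize the raw (non-tokenized) strings via character n-grams.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X = vectorizer.fit_transform(history_lines)
    model = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return vectorizer, model

def report_outliers(vectorizer, model, new_lines, threshold=0.9):
    # Report lines whose distance to the nearest centroid exceeds the threshold.
    X_new = vectorizer.transform(new_lines)
    nearest = model.transform(X_new).min(axis=1)      # distance to the closest centroid
    return [line for line, d in zip(new_lines, nearest) if d > threshold]
```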
Although this approach led to fast results, it showed the following limitations:
– Clustering did not allow for user assessment. Because of the large size (hundreds of thousands of records each day), there was no means for a human to judge the quality of the clustering and how it related to the human understanding of an "error class".
In view of these limitations, for the next iteration we decided to:
– Tokenize the log messages (a minimal tokenizer sketch is shown after this list);
– Allow for the addition of new data to the existing set of clusters;
– Make clusters compact enough so that one cluster represents only one error class;
– Make a hierarchy so that a user can comfortably browse clusters, marking some of them as unimportant and others as important.
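A minimal tokenizer sketch, with illustrative masking rules for decimal and hexadecimal values (not the exact rules used in the project), could look as follows:

```python
# Minimal tokenizer sketch: split a raw log line into tokens and mask obvious
# variable parts (decimal and hex values) so that lines of the same error
# class yield similar token sequences. The masking rules are illustrative.
import re

def tokenize(line: str) -> list[str]:
    tokens = re.split(r"[^\w.:/-]+", line.strip())
    masked = []
    for tok in tokens:
        if not tok:
            continue
        if re.fullmatch(r"0x[0-9a-fA-F]+|\d+", tok):
            masked.append("<NUM>")       # numeric or hexadecimal value
        else:
            masked.append(tok.lower())
    return masked

# tokenize("Order 4711 rejected by gateway 0xBEEF")
# -> ['order', '<NUM>', 'rejected', 'by', 'gateway', '<NUM>']
```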
6 Clustering comparisons
Here, we focus on short logs (less than 600 characters), which make up the majority of all messages. The approach, which can be seen as an intermediate step between the log-hashing approach [29] and regular expressions, is based on a brute-force string comparison over all the logs. We consider two messages to belong to two different templates if they differ in at least 20% of their content (measured as the total length of common non-intersecting substrings longer than three characters). A log message that is sufficiently different from all its predecessors forms a new template. This results in a greedy approach that runs through all the messages and compares them with the templates already found.
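The sketch below illustrates this greedy template extraction. It uses difflib as one way to obtain common non-intersecting substrings and treats two messages as sharing a template when substrings longer than three characters cover at least 80% of the longer message; the exact matching routine and normalization in the project may differ.

```python
# Sketch of the greedy template extraction: two messages share a template when
# common non-intersecting substrings longer than three characters cover at
# least 80% of the longer message. difflib is used here as one way to obtain
# such substrings; the exact routine and normalization in the project may differ.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    common = sum(block.size for block in blocks if block.size > 3)
    return common / max(len(a), len(b), 1)

def extract_templates(messages, threshold=0.8):
    templates = []                       # greedy list of representative messages
    for msg in messages:
        if not any(similarity(msg, t) >= threshold for t in templates):
            templates.append(msg)        # sufficiently different: start a new template
    return templates
```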
While simple yet time-consuming for large data sets, this approach results in a set of templates that can be rapidly compared with each new message using the aforementioned string comparison or regular expressions. In total, we found 109 templates for the short messages; see Fig. 3 for a visualization.
Fig. 3. Dendrogram for the templates found. The distance metric used here is a weighted length of common non-intersecting substrings larger than 3 characters.
Fig. 4. Visualization of the obtained clusters using t-SNE [33] (right) and the silhouette score as a function of the number of clusters (left). a) 50 clusters. b) 90 clusters.
Both approaches result in approximately 100 clusters in the data. The two clustering algorithms are in good correspondence with each other, with an adjusted Rand index [34] of 0.896 for 90 NLP-based clusters and 0.904 for 50 NLP-based clusters. This outcome supports the intuition that there are actually around 100 substantially different message types in the human sense. Further, this result ensures that a practically "good enough" classification is obtained for the initial human understanding of the content of the logs. That understanding is to be refined in further steps of the methodology development.
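The agreement check between the two clusterings can be computed as in the sketch below; the label lists are toy placeholders standing in for the actual per-message cluster assignments of the template-based and NLP-based approaches.

```python
# Sketch of the agreement check between the two clusterings via the adjusted
# Rand index [34]. The label lists are toy placeholders for the per-message
# cluster ids of the template-based and NLP-based clusterings.
from sklearn.metrics import adjusted_rand_score

template_labels = [0, 0, 1, 1, 2, 2]     # placeholder template-based assignment
nlp_labels      = [0, 0, 1, 2, 2, 2]     # placeholder NLP-based assignment

print(f"Adjusted Rand index: {adjusted_rand_score(template_labels, nlp_labels):.3f}")
```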
In this chapter, we outline the final clustering algorithm implemented in the tool. It addresses all the challenges, including cluster stability and correspondence to the human understanding of an error. The aim was to develop an algorithm that allows retraining on new data without changing old clusters unless the user suggests otherwise (to unite or divide clusters).
The approach is the same for both log series and is based on a greedy clustering algorithm with the Jaccard coefficient as the metric. The Jaccard coefficient was chosen because the structure of a log message is often unlike natural language. We decided to represent each message as a set of 1-, 2-, and 3-grams and compare messages by the Jaccard index over these sets.
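To make the metric concrete, the sketch below (ours, not taken from the project code) builds the set of 1-, 2-, and 3-grams for an already tokenized message and computes the Jaccard index between two such sets.

```python
# Sketch of the similarity measure used by the final algorithm: each tokenized
# message is represented by the set of its 1-, 2-, and 3-grams, and two
# messages are compared via the Jaccard index over these sets.
def ngram_set(tokens, n_max=3):
    """Return the set of all n-grams (n = 1..n_max) of a token sequence."""
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams

def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: jaccard(ngram_set(["connection", "refused"]),
#                  ngram_set(["connection", "reset"])) == 0.2
```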
In the first step, the data set is split into a training (historical) part and a test (new day) part. Each message is tokenized and transformed into a set of 1-, 2-, and 3-grams. Next, on the historical set, the pairwise similarity matrix is calculated using the Jaccard index, and messages (sets) are collected into clusters according to the indicated threshold. If a message cannot be attributed to any cluster (there is no pair with similarity of at least the threshold), it is assigned to the "lonely rangers" group. The density inside a cluster cannot fall below the threshold. At the end of this step, a nested list of cluster members is obtained. The same steps are then applied to the new-day set; statistics on active and new clusters are also collected. Prior "lonely rangers" can form new clusters. Fig. 5 outlines the working loop. The difference between the number of clusters and "lonely rangers" for the full A data set and for its split into one historical data set and new-day batches is presented in Fig. 6. According to the results, the proposed algorithm neither significantly reduces the number of clusters nor crucially increases the number of "lonely rangers" during incremental learning.
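The condensed sketch below illustrates the working loop of Fig. 5 under our assumptions: it reuses the ngram_set() and jaccard() helpers sketched above, keeps clusters as lists of n-gram sets, and uses an illustrative similarity threshold of 0.7 (the actual value is not stated here). Bookkeeping of cluster identifiers, statistics, and user feedback is omitted.

```python
# Condensed sketch of the working loop: a new message joins the first cluster
# whose members are all at least `threshold`-similar to it (so cluster density
# stays above the threshold); otherwise it may pair up with an earlier
# "lonely ranger" to form a new cluster, or become a lonely ranger itself.
def assign(message_grams, clusters, lonely, threshold=0.7):
    for cluster in clusters:
        if all(jaccard(message_grams, member) >= threshold for member in cluster):
            cluster.append(message_grams)
            return
    # Try to pair up with an earlier lonely ranger to form a new cluster.
    for i, other in enumerate(lonely):
        if jaccard(message_grams, other) >= threshold:
            clusters.append([lonely.pop(i), message_grams])
            return
    lonely.append(message_grams)

def run_loop(historical, new_day, threshold=0.7):
    clusters, lonely = [], []
    for tokens in historical:             # step 1: build clusters on the historical set
        assign(ngram_set(tokens), clusters, lonely, threshold)
    for tokens in new_day:                # step 2: add the new day incrementally
        assign(ngram_set(tokens), clusters, lonely, threshold)
    return clusters, lonely
```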
8 Conclusion
Fig. 7. Appearance of the developed tool. a) Historical clusters list. b) Detailed content
of the cluster. c) New clusters at a new day. d) Active clusters at a new day.
In this work, we have applied different data analysis methods to the project of testing clearing and settlement systems. The main results are as follows:
– Logs unify massive amounts of records produced by many modules. Modules and configurations evolve and impact the output, making the log files a changing system that is useful but difficult to analyze.
– The naive approach of clustering raw log data produces clusters of unclear quality that cannot be assessed by a human. Moreover, these clusters tend to drift when new logs are added over time. Therefore, signature extraction and clustering into dense and stable clusters is a crucial part of the analysis process.
– Clustering parameters should be adjusted to make sure that a single error class may span several clusters, but all errors within one cluster represent the same error class. This is accomplished through close interaction with the system users.
– For large systems, the number of such "good" clusters can be large. In this case, they may be grouped into larger sets ("user clusters") to make them comprehensible. Some user clusters will then probably be moved out of attention as not interesting.
The project is in its active phase. Our future plans include extending the
functionality of the log analysis framework as well as developing a consolidated
user interface, which will result in a stand-alone tool for user-assisted quality
control of distributed applications from different domains.
References
11. C. Bertero, M. Roy, C. Sauvanaud, and G. Trédan, “Experience report: Log mining
using natural language processing and application to anomaly detection,” in Soft-
ware Reliability Engineering (ISSRE), 2017 IEEE 28th International Symposium
on. IEEE, 2017, pp. 351–360.
12. M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in Data Mining
(ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 859–864.
13. P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach
with fixed depth tree,” in Web Services (ICWS), 2017 IEEE International Con-
ference on. IEEE, 2017, pp. 33–40.
14. P. He, J. Zhu, P. Xu, Z. Zheng, and M. R. Lyu, “A directed acyclic graph approach
to online log parsing,” arXiv preprint arXiv:1806.04356, 2018.
15. G. Li, P. Zhu, and Z. Chen, “Accelerating system log processing by semi-supervised
learning: A technical report,” arXiv preprint arXiv:1811.01833, 2018.
16. D. Yuan, S. Park, P. Huang, Y. Liu, M. M.-J. Lee, X. Tang, Y. Zhou, and S. Sav-
age, “Be conservative: Enhancing failure diagnosis with proactive logging.” OSDI,
vol. 12, pp. 293–306, 2012.
17. Q. Fu, J. Zhu, W. Hu, J.-G. Lou, R. Ding, Q. Lin, D. Zhang, and T. Xie, “Where do
developers log? an empirical study on logging practices in industry,” in Companion
Proceedings of the 36th International Conference on Software Engineering. ACM,
2014, pp. 24–33.
18. D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, “Improving software diagnos-
ability via log enhancement,” ACM Transactions on Computer Systems (TOCS),
vol. 30, no. 1, p. 4, 2012.
19. J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, “Tools and bench-
marks for automated log parsing,” arXiv preprint arXiv:1811.03509, 2018.
20. K. Nagaraj, C. Killian, and J. Neville, “Structured comparative analysis of systems
logs to diagnose performance problems,” in Proceedings of the 9th USENIX con-
ference on Networked Systems Design and Implementation. USENIX Association,
2012, pp. 26–26.
21. K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols,
D. Grant, G. Loihle, and G. Hunt, “Debugging in the (very) large: ten years of
implementation and experience,” in Proceedings of the ACM SIGOPS 22nd sym-
posium on Operating systems principles. ACM, 2009, pp. 103–116.
22. M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and di-
agnosis from system logs through deep learning,” Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, pp. 1285–1298,
2017.
23. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, “Detecting large-scale
system problems by mining console logs,” in Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles. ACM, 2009, pp. 117–132.
24. J.-G. Lou, Q. Fu, S. Yang, J. Li, and B. Wu, “Mining program workflow from inter-
leaved traces,” in Proceedings of the 16th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM, 2010, pp. 613–622.
25. J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, “Mining invariants from console logs
for system problem detection.” in USENIX Annual Technical Conference, 2010,
pp. 23–25.
26. X. Yu, P. Joshi, J. Xu, G. Jin, H. Zhang, and G. Jiang, “Cloudseer: Workflow mon-
itoring of cloud infrastructures via interleaved logs,” in ACM SIGPLAN Notices,
vol. 51, no. 4. ACM, 2016, pp. 489–502.
27. S. He, J. Zhu, P. He, and M. R. Lyu, “Experience report: system log analysis for
anomaly detection,” in Software Reliability Engineering (ISSRE), 2016 IEEE 27th
International Symposium on. IEEE, 2016, pp. 207–218.
28. A. Brown, A. Tuor, B. Hutchinson, and N. Nichols, “Recurrent neural network at-
tention mechanisms for interpretable system log anomaly detection,” ACM HPDC
(First Workshop On Machine Learning for Computer Systems), 2018.
29. Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in distributed
systems through unstructured log analysis,” in Data Mining, 2009. ICDM’09.
Ninth IEEE International Conference on. IEEE, 2009, pp. 149–158.
30. B. Luque, L. Lacasa, F. Ballesteros, and J. Luque, “Horizontal visibility graphs:
Exact results for random time series,” Physical Review E, vol. 80, no. 4, p. 046103,
2009.
31. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman,
“Indexing by latent semantic analysis,” Journal of the American society for infor-
mation science, vol. 41, no. 6, pp. 391–407, 1990.
32. P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis,” Journal of computational and applied mathematics, vol. 20,
pp. 53–65, 1987.
33. L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine
learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
34. L. Hubert and P. Arabie, “Comparing partitions,” Journal of classification, vol. 2,
no. 1, pp. 193–218, 1985.