Practical Experiences with Chronics Discovery in Large Telecommunications Systems
Chronics are recurrent problems that fly under the radar of
operations teams because they do not perturb the system
enough to set off alarms or violate service-level objectives.
The discovery and diagnosis of never-before-seen chronics
pose new challenges because they are not detected by traditional
threshold-based techniques, and because many chronics can be present
in a system at once, each starting and ending at a different time.
In this paper, we describe our experiences diagnosing chronics
using server logs on a large telecommunications service.
Our technique uses a scalable Bayesian distribution learner
coupled with an information-theoretic measure of distance
(KL divergence) to identify the attributes that best distinguish
failed calls from successful calls. Our preliminary results
demonstrate the usefulness of our technique by providing
examples of actual instances where we helped operators
discover and diagnose chronics.
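To make the approach concrete, the sketch below ranks call-record attributes by how sharply their value distributions differ between failed and successful calls, using Laplace-smoothed (Bayesian) estimates and KL divergence. The attribute names, smoothing constant, and helper functions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: rank call-record attributes by how well their value
# distributions separate failed calls from successful ones, using
# Laplace-smoothed (Bayesian) estimates and KL divergence.
import math
from collections import Counter

def smoothed_dist(values, support, alpha=1.0):
    """Dirichlet/Laplace-smoothed categorical distribution over `support`."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p if p[v] > 0)

def rank_attributes(records, label_key="failed"):
    """Score each attribute by KL(failed || successful) over its values."""
    failed = [r for r in records if r[label_key]]
    success = [r for r in records if not r[label_key]]
    attrs = {k for r in records for k in r if k != label_key}
    scores = {}
    for attr in attrs:
        support = {r[attr] for r in records}
        p = smoothed_dist([r[attr] for r in failed], support)
        q = smoothed_dist([r[attr] for r in success], support)
        scores[attr] = kl_divergence(p, q)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy call records with invented attributes: failures cluster on gateway gw3.
calls = [
    {"gateway": "gw3", "codec": "g711", "failed": True},
    {"gateway": "gw3", "codec": "g729", "failed": True},
    {"gateway": "gw1", "codec": "g711", "failed": False},
    {"gateway": "gw2", "codec": "g729", "failed": False},
]
print(rank_attributes(calls))  # "gateway" should outrank "codec"
```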
BLR-D: Applying Bilinear Logistic Regression to Factored Diagnosis Problems
In this paper, we address a pattern of diagnosis problems in which
each of J entities produces the same K features,
yet we are only
informed of overall faults from the ensemble. Furthermore, we
suspect that only certain entities and certain features are leading to
the problem. The task, then, is to reliably identify which entities
and which features are at fault. Such problems are particularly
prevalent in the world of computer systems, in which a datacenter
with hundreds of machines, each with the same performance
counters, occasionally produces overall faults. In this paper, we
present a means of using a constrained form of bilinear logistic
regression for diagnosis in such problems. The bilinear treatment
allows us to represent the scenarios with J+K instead of
JK parameters,
resulting in more easily interpretable results and far
fewer false positives compared to treating the parameters independently.
We develop statistical tests to determine which features
and entities, if any, may be responsible for the labeled faults,
and use false discovery rate (FDR) analysis to ensure that the resulting
significance values are meaningful. We show results in comparison to ordinary
logistic regression (with L1 regularization) on two scenarios: a
synthetic dataset based on a model of faults in a datacenter, and a
real problem of finding problematic processes/features based on
user-reported hangs.
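The sketch below illustrates the rank-1 ("bilinear") constraint: the J x K weight matrix is forced to be an outer product of a per-entity vector u and a per-feature vector v, so only J + K parameters are fit. The data shapes, learning rate, and toy fault model are our own assumptions, not the authors' code or statistical tests.

```python
# Hypothetical sketch of a rank-1 ("bilinear") logistic model: the JxK weight
# matrix is constrained to W = outer(u, v), so only J + K parameters are fit.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_bilinear_lr(X, y, lr=0.05, epochs=2000, seed=0):
    """X: (N, J, K) per-sample entity-by-feature measurements, y: (N,) in {0,1}."""
    rng = np.random.default_rng(seed)
    N, J, K = X.shape
    u = rng.normal(scale=0.1, size=J)   # per-entity weights
    v = rng.normal(scale=0.1, size=K)   # per-feature weights
    b = 0.0
    for _ in range(epochs):
        logits = np.einsum("njk,j,k->n", X, u, v) + b
        err = sigmoid(logits) - y                       # (N,)
        grad_u = np.einsum("n,njk,k->j", err, X, v) / N
        grad_v = np.einsum("n,njk,j->k", err, X, u) / N
        u -= lr * grad_u
        v -= lr * grad_v
        b -= lr * err.mean()
    return u, v, b

# Toy data: entity 2's feature 1 drives the fault label.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5, 4))
y = (X[:, 2, 1] > 0.5).astype(float)
u, v, b = fit_bilinear_lr(X, y)
print(np.argmax(np.abs(u)), np.argmax(np.abs(v)))  # typically recovers 2 and 1
```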
Mining Temporal Invariants from Partially Ordered Logs
A common assumption made in log analysis research is
that the underlying log is totally ordered. For concurrent
systems, this assumption constrains the generated log
to either exclude concurrency altogether, or to capture a
particular interleaving of concurrent events. This paper
argues that capturing concurrency as a partial order is
useful and often indispensable for answering important
questions about concurrent systems. To this end, we motivate
a family of event ordering invariants over partially
ordered event traces, give three algorithms for mining
these invariants from logs, and evaluate their scalability
on simulated distributed system logs.
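As a concrete illustration of one such invariant family, the sketch below mines "a always happens before b" from traces whose events carry vector clocks, so only the happens-before partial order is used. The event names, clocks, and brute-force mining loop are illustrative; the paper's invariant types and algorithms are richer and more scalable than this.

```python
# Hypothetical sketch: mine "a always happens before b" invariants from
# partially ordered traces whose events carry vector clocks.

def happens_before(vc1, vc2):
    """True if vector clock vc1 precedes vc2 in the partial order."""
    return all(a <= b for a, b in zip(vc1, vc2)) and vc1 != vc2

def mine_always_precedes(traces):
    """Return (a, b) pairs such that in every trace, every b-event is
    preceded by some a-event under happens-before."""
    event_types = {etype for trace in traces for etype, _ in trace}
    invariants = set()
    for a in event_types:
        for b in event_types:
            if a == b:
                continue
            holds = all(
                any(happens_before(vc_a, vc_b)
                    for etype_a, vc_a in trace if etype_a == a)
                for trace in traces
                for etype_b, vc_b in trace if etype_b == b
            )
            if holds:
                invariants.add((a, b))
    return invariants

# Two nodes: "send" on node 0 always precedes "ack" on node 1, while
# "log-flush" is sometimes concurrent with both, so no invariant involves it.
traces = [
    [("log-flush", (0, 1)), ("send", (1, 0)), ("ack", (1, 2))],
    [("send", (1, 0)), ("ack", (1, 1)), ("log-flush", (1, 2))],
]
print(mine_always_precedes(traces))  # expect {('send', 'ack')}
```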
Adaptive Event Prediction Strategy with Dynamic Time Window for Large-Scale HPC Systems
In this paper, we analyse messages generated by different large-scale
HPC systems in order to extract sequences of correlated events,
which we later use to predict the normal and faulty behaviour of
the system. Our method uses a dynamic window strategy that is
able to find frequent sequences of events regardless of the time delay
between them. Most of the current related research narrows
the correlation extraction to fixed and relatively small time windows
that do not reflect the whole behaviour of the system. The
events generated by a machine change constantly over its lifetime,
so we consider it important to update the sequences at
runtime, modifying them after each prediction phase according
to the forecast's accuracy and the difference between what
was expected and what actually happened. Our experiments show
that our analysis system is able to predict around 60% of events
with a precision of around 85%, at a lower event granularity than
previous approaches.
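The sketch below illustrates the dynamic-window idea: each correlated pair (a -> b) keeps its own time window derived from the gaps actually observed, and the window is adjusted after a prediction round. The support threshold, percentile cut-off, and adaptation rule are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of a per-rule dynamic time window: instead of one fixed
# window, each correlated pair keeps a window estimated from observed gaps.
from collections import defaultdict

def mine_rules(events, min_support=3):
    """events: list of (timestamp, event_type), sorted by timestamp.
    Returns {(a, b): window_seconds} for pairs where b tends to follow a."""
    gaps = defaultdict(list)
    for i, (t_a, a) in enumerate(events):
        for t_b, b in events[i + 1:]:
            if b != a:
                gaps[(a, b)].append(t_b - t_a)
                break  # only the next different event after each occurrence
    rules = {}
    for pair, observed in gaps.items():
        if len(observed) >= min_support:
            observed.sort()
            # window = ~95th percentile of observed gaps, not a fixed constant
            rules[pair] = observed[min(len(observed) - 1,
                                       int(0.95 * len(observed)))]
    return rules

def update_rule(rules, pair, new_gap, alpha=0.2):
    """After a prediction round, nudge the window toward the latest gap."""
    if pair in rules:
        rules[pair] = (1 - alpha) * rules[pair] + alpha * new_gap

# Toy log with invented HPC event types and timestamps in seconds.
log = [(0, "fan_warn"), (40, "temp_high"), (100, "fan_warn"), (170, "temp_high"),
       (300, "fan_warn"), (330, "temp_high"), (500, "disk_scrub")]
rules = mine_rules(log)
print(rules)                                            # {('fan_warn', 'temp_high'): 70}
update_rule(rules, ("fan_warn", "temp_high"), new_gap=120)
print(rules)                                            # window widened toward 120
```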
Mining large distributed log data in near real time
Analyzing huge amounts of log data is often a difficult task,
especially if it has to be done in real time (e.g., fraud detection)
or when large amounts of stored data are required
for the analysis. Graphs are a data structure often used in
log analysis. Examples are clique analysis and communities
of interest (COI). However, little attention has been paid
to large distributed graphs that allow a high throughput of
updates with very low latency.
In this paper, we present a distributed graph mining system
that is able to process around 39 million log entries per
second on a 50-node cluster while providing processing latencies
below 10 ms. We validate our approach by presenting
two example applications, namely telephony fraud detection
and internet attack detection. A thorough evaluation demonstrates
the scalability and near-real-time properties of our system.
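To give a feel for the kind of per-entry update such a system must sustain, the sketch below maintains a community-of-interest style call graph with decaying edge weights, hash-partitioned so each update touches a single shard. The partition count, decay model, and phone numbers are invented for illustration and do not reproduce the paper's distributed architecture.

```python
# Hypothetical sketch: a community-of-interest (COI) call graph where each log
# entry (caller -> callee) bumps a decaying edge weight, and edges are
# hash-partitioned so each update is handled by exactly one worker shard.
import math
import time
from collections import defaultdict

class CoiPartition:
    """One worker's shard of the caller -> callee graph."""
    def __init__(self, half_life_s=3600.0):
        self.decay = math.log(2) / half_life_s
        self.edges = defaultdict(lambda: (0.0, 0.0))  # (weight, last_update_ts)

    def update(self, caller, callee, ts):
        w, last = self.edges[(caller, callee)]
        w = w * math.exp(-self.decay * (ts - last)) + 1.0  # decay, then add call
        self.edges[(caller, callee)] = (w, ts)
        return w

class CoiGraph:
    """Client-side router: hash-partition edges by caller."""
    def __init__(self, n_partitions=4):
        self.partitions = [CoiPartition() for _ in range(n_partitions)]

    def record_call(self, caller, callee, ts=None):
        ts = time.time() if ts is None else ts
        shard = hash(caller) % len(self.partitions)
        return self.partitions[shard].update(caller, callee, ts)

g = CoiGraph()
for _ in range(5):
    w = g.record_call("+15551234", "+15559876", ts=time.time())
print(round(w, 2))  # ~5.0: five calls within one half-life keep ~full weight
```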
Web Analytics and the Art of Data Summarization
Web Analytics has become a critical component of many
business decisions. With an ever-growing number of
transactions happening through web interfaces, the ability
to understand and introspect website activity is essential.
In this paper, we describe the importance and intricacies
of summarization for analytics and report generation
on web log data. We specifically elaborate on how
summarization is exposed in Splunk and discuss analytics
search design trade-offs.
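As a generic illustration of why summarization matters for reporting (this is not Splunk's summary-indexing mechanism; the field names and aggregation are invented), the sketch below pre-aggregates raw access-log events into hourly counts so a report can be answered from the small summary instead of rescanning every raw event.

```python
# Generic illustration of log summarization: raw web-access events are rolled
# up into hourly (url, status) counts, and reports query only the summary.
from collections import Counter
from datetime import datetime

def summarize(events):
    """events: iterable of dicts with 'ts' (ISO string), 'url', 'status'."""
    summary = Counter()
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).strftime("%Y-%m-%d %H:00")
        summary[(hour, e["url"], e["status"])] += 1
    return summary

def report_errors(summary):
    """Hourly 5xx counts, answered from the summary alone."""
    hourly = Counter()
    for (hour, _url, status), n in summary.items():
        if status >= 500:
            hourly[hour] += n
    return dict(hourly)

events = [
    {"ts": "2012-09-14T10:05:00", "url": "/checkout", "status": 500},
    {"ts": "2012-09-14T10:20:00", "url": "/checkout", "status": 200},
    {"ts": "2012-09-14T11:02:00", "url": "/search", "status": 503},
]
print(report_errors(summarize(events)))
# {'2012-09-14 10:00': 1, '2012-09-14 11:00': 1}
```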
Panel: Assessing and improving the quality of program logs
PAL: Propagation-aware Anomaly Localization for Cloud Hosted Distributed Applications
Distributed applications running in the cloud are prone to
performance anomalies for various reasons, such as insufficient
resource allocations, unexpected workload increases,
or software bugs. These applications often consist
of multiple interacting components, and an anomaly in one
component may cause its dependent components to exhibit
anomalous behavior as well, making it challenging to identify
the truly faulty components among the many components of a
distributed application. In this paper, we present a Propagation-aware
Anomaly Localization (PAL) system that can pinpoint the
faulty source components in distributed applications by extracting
anomaly propagation patterns. PAL provides a
robust critical change point discovery algorithm to accurately
capture the onset of anomaly symptoms at different
application components. We then derive the propagation
pattern by sorting all critical change points in chronological
order. PAL is completely application-agnostic and non-intrusive,
relying only on system-level metrics. We have
implemented PAL on top of the Xen platform and tested it
on a production cloud computing infrastructure using the
RUBiS online auction benchmark application and the IBM
System S data stream processing application with a range
of common software bugs. Our experimental results show
that PAL can pinpoint faulty components in distributed
applications with high accuracy and low overhead.
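The sketch below illustrates the overall workflow with a deliberately simple mean-shift change-point score standing in for PAL's more robust algorithm: find each component's strongest change point in a system-level metric, then order components by onset time; the earliest is the suspected source. The metrics, thresholds, and component names are illustrative assumptions.

```python
# Hypothetical sketch of propagation-aware localization: detect a change point
# per component (simple mean-shift score here), then sort components by the
# time of their change point to obtain a candidate propagation pattern.
import statistics

def change_point(series, min_shift=2.0):
    """Return the index where the mean shifts most (None if the shift is small)."""
    best_idx, best_score = None, 0.0
    pooled = statistics.pstdev(series) or 1.0
    for i in range(2, len(series) - 2):
        left, right = series[:i], series[i:]
        score = abs(statistics.mean(right) - statistics.mean(left)) / pooled
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx if best_score >= min_shift else None

def propagation_order(metrics):
    """metrics: {component: [(timestamp, value), ...]} -> components ordered
    by change-point time; the first component is the suspected source."""
    onsets = {}
    for comp, samples in metrics.items():
        idx = change_point([v for _, v in samples])
        if idx is not None:
            onsets[comp] = samples[idx][0]
    return sorted(onsets, key=onsets.get)

# Toy system-level metrics: the db anomaly starts first, then propagates to app.
metrics = {
    "db":  [(t, 10 + (90 if t >= 3 else 0)) for t in range(10)],   # shifts at t=3
    "app": [(t, 20 + (60 if t >= 5 else 0)) for t in range(10)],   # shifts at t=5
    "web": [(t, 30) for t in range(10)],                           # no anomaly
}
print(propagation_order(metrics))  # ['db', 'app'] -> db is the suspected source
```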