Using Neural Networks for Proactive Triaging
Machine learning reaches its full potential only when it is combined with rich, relevant and reliable data. In application performance monitoring, a rich collection of data is imperative, but data alone is not enough: building efficient capabilities that leverage Artificial Intelligence also requires domain expertise, statistical learning, and robust underlying mathematical and machine learning models. In this blog, we talk about how we use neural networks to give CA Application Performance Management (APM) the ability to learn and recognize complex patterns formed by multiple metrics, and to inform users in advance about critical situations and the need to take action. The beauty of the solution lies in its human-interpretable view of a situation, formed by accounting for different aspects of the system holistically instead of the single-metric analysis that has been implemented in the past.
CA APM customers can access the status of multiple metrics that the APM agents report to the Enterprise Manager. These metrics are typically collected every 15 seconds and describe different aspects of application performance. When any of these metrics behaves abnormally, CA APM raises an alert/alarm. Typically, APM solutions capture anomalous behavior in two ways:
- Static Threshold Based Alarms: When any metric value goes past a preset, user-configurable threshold, an alarm is raised.
- Univariate Statistical Analysis: Based on the historical behavior of each metric, an alarm is raised whenever the metric value goes past the nth percentile value.
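To make the two conventional approaches concrete, here is a minimal sketch in Python. The function names, the 95th-percentile default and the sample values are our own illustrative assumptions, not CA APM settings or APIs.

```python
from statistics import quantiles


def static_threshold_alarm(value, threshold):
    """Static alarm: fire when the value goes past a user-configured threshold."""
    return value > threshold


def percentile_alarm(history, value, pct=95):
    """Univariate alarm: fire when the value exceeds the pct-th percentile
    of the metric's own history (pct=95 is an illustrative default)."""
    # quantiles(..., n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cut_points = quantiles(history, n=100, method="inclusive")
    return value > cut_points[pct - 1]


# Hypothetical history of one metric, e.g. response times in milliseconds.
history = list(range(1, 101))
print(static_threshold_alarm(91, threshold=90))
print(percentile_alarm(history, 99))
```

Both checks look at one metric at a time, which is exactly the limitation the rest of this post addresses.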
However, both approaches require APM users to set thresholds for different metrics in order to identify anomalous situations and generate alerts. This creates extra work for end users and leads to unwanted alarms, poorly informed thresholds and an overall degradation of the user experience. In addition, an alarm raised in isolation on a single metric may not indicate any important situation at all.
It is important to note that certain critical anomalies are not captured by the crossing of a threshold but are instead identified by patterns or trends evolving in the observed data. These patterns are not a function of a single metric; they arise when multiple metrics move together to reveal an anomalous condition. In other words, numerical magnitude (which is what a threshold measures) does not necessarily capture the existence of such anomalies. It is the joint behavior of the different metrics that signals the approach or occurrence of the patterns that indicate them. Users cannot set up thresholds for such patterns, so they must be proactively informed when such patterns are building up in the system.
The questions are: How do we capture patterns that indicate useful anomalies in real time? How do we make the process of capturing these anomalies seamless and automated?
The key aim of this work is to release users from the burden of setting up threshold-based alarms, and thereby to achieve “Cognitive Triaging” based on pattern recognition over multivariate data. Multivariate anomaly detection helps to capture such anomalies and address them proactively.
With the help of machine learning based data analysis, we recognize patterns formed by multiple metrics in real time. Each pattern indicates the existence and severity of a problem on a scale. This becomes equivalent to a single metric: a “smarter” metric that identifies the existence of the pattern/problem. It signals the existence or approach of a problematic situation and automatically raises alarms that tell users what the possible problem is and (in most cases) what its expected root cause is.
This approach allows us to raise alerts with varying degrees of confidence and severity while the problem is still intensifying, which gives the APM user a seamless way to capture problems intelligently without manual intervention. It not only performs crude triaging by accounting for multiple metrics together and combining them into a single-scale metric of the problem, but also captures anomalies that reflect an objective situation such as a memory leak, thread stagnation or an infinite loop.
From a scientific point of view, we call our approach “Multivariate Anomaly Detection” (MVAD). MVAD works in the space of pattern recognition with artificial neural networks. It captures the patterns formed by a relevant set of metrics chosen in advance and learns how these patterns emerge in the application. As these patterns emerge, partially or fully, it notifies an APM triaging engine, namely “Assisted Triage”, that an anomalous situation is occurring. Assisted Triage traverses the graph model of the application components to identify the nodes possibly affected by the anomalous situation and the root cause of the problem. It then presents the user with the tentative culprit. The end-user experience therefore improves markedly through an intelligent, proactive and robust multivariate anomaly detection system based on pattern recognition.
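As a rough illustration of the graph traversal idea, the sketch below walks a toy component graph backwards from the component where an anomaly was reported and collects everything that depends on it. The graph shape, the component names and the `affected_components` function are hypothetical and chosen only for this example; they do not describe how Assisted Triage is actually implemented.

```python
from collections import deque


def affected_components(calls, anomalous):
    """Return the components that directly or transitively call `anomalous`.

    `calls` maps each component to the components it calls (an assumed,
    simplified stand-in for an application graph model).
    """
    # Invert the edges so we can walk from a component to its callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first search upstream from the anomalous component.
    seen = set()
    queue = deque([anomalous])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen and caller != anomalous:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical application topology.
calls = {
    "frontend": ["orders", "catalog"],
    "orders": ["jvm-1"],
    "catalog": ["db"],
    "jvm-1": [],
}
print(sorted(affected_components(calls, "jvm-1")))
```

In this toy setup, the component where the pattern was detected is the candidate culprit and the returned set is the possible blast radius.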
Example: Memory Leak Pattern Detection
An important use case is the memory anomaly pattern. Suppose the JVM® hosting the app shows an increase in memory usage and a decrease in available free memory, together with a concurrent increase in the frequency and duration of garbage collection cycles. This may indicate a memory anomaly, but only when load behavior is relatively normal. If load is increasing, the pattern may still be normal memory behavior. However, if load is constant or decreasing, it is a potential memory leak.
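The reasoning above can be written down as a simple hand-coded rule. This is only a sketch to make the logic explicit: it uses a subset of the signals (heap usage, garbage collection time and load), a crude end-to-end trend test, and a 5% tolerance that we picked arbitrarily. The solution described in this post learns the pattern with a neural network rather than hard-coding it.

```python
def trend(series, tolerance=0.05):
    """Classify a series as 'up', 'down' or 'flat' from its relative
    end-to-end change (tolerance is an illustrative assumption)."""
    first, last = series[0], series[-1]
    base = abs(first) if first != 0 else 1.0
    change = (last - first) / base
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


def looks_like_memory_leak(heap_used, gc_time, load):
    """Memory and GC activity rise while load does not rise."""
    memory_pressure = trend(heap_used) == "up" and trend(gc_time) == "up"
    load_explains_it = trend(load) == "up"
    return memory_pressure and not load_explains_it


# Hypothetical samples: heap and GC time climb while load stays flat.
print(looks_like_memory_leak(
    heap_used=[100, 120, 150, 190],
    gc_time=[10, 12, 15, 20],
    load=[50, 51, 50, 50],
))
```

Note that no single series here needs to cross a threshold; it is the combination of trends that matters.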
The memory leak example explained above is one of the high-value use cases for MVAD to solve. To detect memory leaks and anomalies in Sun and IBM® JVMs, we use a machine learning approach that recognizes the patterns formed by the time-series metric data captured by CA APM.
The metrics we consider when forming the patterns all relate to memory and are identified through domain analysis with the experts. This lets us skip the feature selection process and keeps the solution lightweight.
The ML module initially baselines the app load to adapt to the “normal” behavior of the app. Once baselining is done, a change detector based on density behavior kicks in to flag data that has not been seen before. The ML module wakes up every X minutes and collects the metric data points. It cleans the data and extracts the trend using moving average smoothing. From the smoothed data, we extract the features that are the key indicators of the pattern, such as incremental slope, differential slope and a monotonicity identifier. These features are fed into a neural net, along with labels from a data annotator, for supervised learning. The trained model then identifies patterns, including the memory leak pattern, as and when they occur. Based on past actions taken on its indicators, the model corrects itself and retrains on the feedback-enriched data to learn better over time.
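The smoothing and feature extraction steps can be sketched as follows. The exact feature definitions are not spelled out in this post, so the ones below are our assumptions for illustration: “incremental slope” is taken as the mean step between consecutive smoothed points, “differential slope” as the change between the last and first step, and the monotonicity identifier as a flag for a non-decreasing trend. The window size of 3 is also arbitrary.

```python
def moving_average(values, window=3):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def extract_features(values, window=3):
    """Turn one metric's raw samples into trend features (assumed definitions)."""
    smooth = moving_average(values, window)
    if len(smooth) < 2:
        raise ValueError("need at least window + 1 samples")
    steps = [b - a for a, b in zip(smooth, smooth[1:])]
    return {
        "incremental_slope": sum(steps) / len(steps),   # average rate of change
        "differential_slope": steps[-1] - steps[0],     # is the rise accelerating?
        "monotonic_up": all(step >= 0 for step in steps),
    }


# Hypothetical heap usage samples with some noise.
heap_used = [10, 12, 11, 14, 16, 19, 23]
print(extract_features(heap_used))
```

Feature dictionaries like this one, computed per metric and paired with labels, would form the training rows for a supervised classifier.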
Figure 1: Neural network classifiers are typically used for image classification. CA APM recognizes patterns across multiple metrics, much as a human brain would interpret an image, in order to detect anomalous situations.
The key features of our solution are as follows:
- Lightweight solution: We have selected not only the “features” but also the “points of time” at which their values are relevant. Hence, we do not collect data that does not contribute to the pattern, which makes the solution much more lightweight (as opposed to “leak-hunter”).
- Product Agnostic: Any JVM that can provide at least the required metrics can be monitored for leaks with this approach.
- Speed Adaptive: Except for very fast leaks, all kinds of leaks can be caught with this approach.
Our approach is outlined as follows:
Based on these observations, we report the severity of the leak and our confidence in the detection.
Figure 2: Sample Kibana® dashboard
Pattern recognition in the monitoring space is a major breakthrough because it relies on learning not only from data but also from experience. The wide range of experience CA has accumulated over its years in this domain has helped us devise this unique and robust system. Our solution is built on the motto of making our products “Behave like humans and think like machines.”
To learn more about our unique approach, read the white paper AIOps Essentials of Root Cause Analytics.