The beauty of artificial intelligence lies in its power to augment human intelligence with machine-scale computational capacity. AIOps-enabled solutions derive logic and decisions from the assessment of high-quality data. While AI has found relevance in many arenas, one of its important applications is context generation. As data availability increases, it becomes imperative to help end users filter noise out of that data based on context, using “relative noise reduction” to make humans more efficient.
CA’s Digital Operational Intelligence combines powerful statistical methods with machine learning (ML) to give users the right “microscope,” accurately focused on their systems. It not only predicts system behavior and helps fix problems before they occur, but also provides a contextual, noise-free view into those systems.
The disparate monitoring platforms in enterprises can be tuned to be conservative or liberal when alerting the user about possible situations or issues. Typically, an alarm is raised when a metric value crosses a threshold, and each alarm is treated as a cause for concern. However, not every alarm indicates a new incident, and not every alarm warrants a midnight call to an admin! What we want is “incremental liberalism” in the assessment of alarms to make the end user’s life easier. This means the end user should be exposed not to every alarm raised because of conservative thresholds, but to the broad issues that are present in the system. This is imperative not only for reducing the effort of sifting through many alarms but also for enabling quick and efficient root cause analysis. In CA Digital Operational Intelligence, we use machine learning-based modules to help the end user reduce the noise generated by alarms and view system health through data-driven context generation.
Let’s first label the alarms that do not point to a new incident as Noise Alarms. There are different kinds of noise alarms:
- Short Term Flapping Alarms: These are alarms that repeatedly transition between the alarm state and the normal state within a short period of time. They are typically caused by random noise and/or disturbances on the metrics configured with alarms, especially when those metrics operate close to their thresholds.
- Long Term Flapping Alarms: This is another type of repeating alarm, one that transitions between the alarm and non-alarm states at regular (possibly large) intervals. These can be induced by repeated on–off actions on devices or by regular oscillatory disturbances in metrics.
- Standing Alarms: Another set of noise alarms are standing alarms, or alarms that remain in an active state for a prolonged duration. The major reason for these alarms is typically inefficiency in operations and maintenance.
- Alarm Storms/Floods: Finally, alarm floods occur when an abnormal situation at one entity in the environment spreads to many other places through the interconnections between devices and process units. These are the most important groups of alarms: they signify an abnormality in the environment and contain the root cause along with a large number of alarms from the affected components.
It is important to separate out noise alarms and to identify alarm storms and floods so that the operator can focus on the important issues.
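To make the first three categories concrete, here is a minimal sketch of how a single alarm's state history could be labeled with simple rules. It is an illustration only, not the ML-based module described in this article; the `Transition` type, the `classify_alarm` function, and every threshold (`flap_window`, `flap_count`, `standing_after`, `period_tolerance`) are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Transition:
    timestamp: float  # seconds since some reference point
    active: bool      # True = alarm raised, False = alarm cleared


def classify_alarm(transitions, now, flap_window=600.0, flap_count=4,
                   standing_after=86400.0, period_tolerance=0.1):
    """Label one alarm's history. All thresholds are illustrative assumptions."""
    if not transitions:
        return "normal"
    events = sorted(transitions, key=lambda t: t.timestamp)

    # Standing: the alarm is still active and has been for a prolonged duration.
    last = events[-1]
    if last.active and now - last.timestamp >= standing_after:
        return "standing"

    # Keep only real state changes (drop repeated reports of the same state).
    changes = [e for i, e in enumerate(events)
               if i == 0 or e.active != events[i - 1].active]

    # Short-term flapping: many state changes inside one short sliding window.
    times = [e.timestamp for e in changes]
    start = 0
    for end in range(len(times)):
        while times[end] - times[start] > flap_window:
            start += 1
        if end - start + 1 >= flap_count:
            return "short_term_flapping"

    # Long-term flapping: the alarm is raised again at nearly regular intervals.
    raises = [e.timestamp for e in changes if e.active]
    gaps = [b - a for a, b in zip(raises, raises[1:])]
    if len(gaps) >= 3 and mean(gaps) > 0:
        if pstdev(gaps) / mean(gaps) <= period_tolerance:
            return "long_term_flapping"

    return "normal"
```

Alarm storms and floods are deliberately left out of this sketch, because recognizing them requires looking across many alarms and the system topology rather than at one alarm's history.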
We treat alarms as the skeleton on which the story of an issue is built. Alarms are simply the episodes that have occurred within the broader story, and using artificial intelligence, we are able to build out that story to tell the end users about the issue that the episodes represent (and not the details of the episodes unless asked for). Our mechanisms ensure that all the intricacies and nuances arising in the episodes (individual alarms) are covered “just enough” in our story to help users identify the state of their system based on the context of the examination.
Figure 1: Steps to Reduce Alarm Noise in Digital Operational Intelligence Platform
In architecting this solution, we evaluated multiple paradigms, ranging from sequence mining and sequence mining with temporal context (such as WINEPI and MINEPI) to static clustering methods. However, in the field of system monitoring, any solution faces pragmatic constraints imposed by scale. We therefore constructed a novel ensemble method that addresses the key requirements while ensuring scalability.
To tackle the alarm noise reduction problem, we use a combination of the textual, temporal and topological properties of an alarm. Our ML modules assign a token/feature vector to each alarm in the history. This feature vector decorates the alarm with information such as the text relevance of the alarm, its time of origin, its proximity to other partially or completely similar alarms, and the system topology in which it occurs. This smart decoration is then used to define groups of similar alarms using dynamic clustering. The dynamic clustering reduces the relevant entropy among the clusters and ensures that the alarms clustered together share textual, topological and temporal similarity. In real time, as new alarms enter the system, each one is placed under the right existing issue, so that the end user sees relevant issues rather than individual alarms. Conversely, when alarms arrive that are unrelated to any existing issue, new issues are created so that every relevant new issue in the system is captured. Novelty detection is therefore one of the key paradigms in the successful delivery of context generation. Furthermore, with association rule mining, we provide capabilities that allow alarms to be predicted and agglomerated, which suppresses them from surfacing unnecessarily.
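The idea of combining textual, temporal and topological similarity, and of opening a new issue when nothing matches, can be sketched with a simple greedy single-pass grouping. This is a stand-in for illustration, not the ensemble method or the dynamic clustering used in the product; the `Alarm` and `Issue` types, the `neighbors` adjacency map, the weights, the time scale, and the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    text: str
    timestamp: float  # seconds
    device: str


@dataclass
class Issue:
    alarms: list = field(default_factory=list)


def text_similarity(a, b):
    """Jaccard overlap of lower-cased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def time_similarity(t1, t2, scale=300.0):
    """1.0 for simultaneous alarms, decaying as they move apart in time."""
    return 1.0 / (1.0 + abs(t1 - t2) / scale)


def topology_similarity(d1, d2, neighbors):
    """1.0 for the same device, 0.5 for directly connected devices, else 0.0."""
    if d1 == d2:
        return 1.0
    if d2 in neighbors.get(d1, ()) or d1 in neighbors.get(d2, ()):
        return 0.5
    return 0.0


def similarity(a, b, neighbors, weights=(0.4, 0.3, 0.3)):
    """Weighted mix of the three signals; the weights are arbitrary assumptions."""
    w_text, w_time, w_topo = weights
    return (w_text * text_similarity(a.text, b.text)
            + w_time * time_similarity(a.timestamp, b.timestamp)
            + w_topo * topology_similarity(a.device, b.device, neighbors))


def assign(alarm, issues, neighbors, threshold=0.6):
    """Attach the alarm to its most similar issue, or open a new issue (novelty)."""
    best_issue, best_score = None, 0.0
    for issue in issues:
        score = max(similarity(alarm, other, neighbors) for other in issue.alarms)
        if score > best_score:
            best_issue, best_score = issue, score
    if best_issue is not None and best_score >= threshold:
        best_issue.alarms.append(alarm)
        return best_issue
    new_issue = Issue([alarm])
    issues.append(new_issue)
    return new_issue
```

For example, two "link down" alarms raised 30 seconds apart on the same switch would land in one issue, while a disk-usage alarm on an unconnected host hours later would open a second one. A production system would need a far more scalable comparison than scanning every alarm in every issue.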
Figure 2: The before and after visualization of alarm volumes using t-SNE. The top figure shows the different alarms projected on two-dimensional space and the figure below indicates the alarms clustered together on the basis of multiple dimensions. The colors indicate the cluster number and dark violet indicates noise.
Our approach to alarm noise reduction has multiple benefits. It allows users to look at the issues affecting their systems instead of at individual alarms, thereby setting the “microscope” right. It speeds up root cause analysis and makes it more accurate, because the irrelevant information is removed. It also helps to generate the right context for the user by tackling situations like alarm storms, alternating alarms or sequential alarms, preserving the stories of alarms that occur as sequences. This context grouping provides insights into several aspects of system health that are otherwise not visible. In particular, the sequence of alarm occurrences within one or multiple devices provides a solid correlation across the devices affected by a situation.
Figure 3: These results show the noise reduction in the case of flapping alarms. We found that the number of alarms after our alarm noise reduction module was significantly lower than the original number of alarms.
Using the AIOps-powered alarm noise reduction module with CA Digital Operational Intelligence, we have been able to achieve a significant reduction in alarm noise, thereby helping our users get a much better handle on their systems and avoid unnecessary red alerts.
Figure 4: Overview of Alarm noise reduction in CA DOI and impact on MTTR.
To learn more about our unique approach, read the EMA™ QUICK TAKE: CA Digital Operational Intelligence: A Fresh Take on Autonomic Computing. EMA comments that the CA solution shines, since the platform applies machine learning to leverage domain knowledge from customers’ existing operations management tools to automatically create topology maps used as the source for its AI-driven risk analysis.