Machine Learning For The Rest Of Us: Maximizing The Value Of AIOps By Cultivating The Citizen Data Scientist

Today’s IT environments present unprecedented challenges for IT teams. Doing more of the same won’t cut it if IT teams are to address all of the urgent imperatives of their organizations. One of the ways IT teams can address their increasing demands is through the use of artificial intelligence for IT operations (AIOps).

As outlined in our prior post, AIOps can yield significant benefits and have fundamental implications for people, processes and technology. However, like machine learning more generally, the true promise of AIOps will only be realized when it moves out of labs and into the business.

While AIOps is emerging as a key imperative for many organizations, the reality is that many haven’t broadly deployed or fully operationalized AIOps, meaning its vast potential isn’t being fully realized.

A Methodology For Mass AIOps Adoption: The Citizen Data Scientist

For some time now, Gartner has been reporting on the concept of the citizen data scientist. At a high level, this approach refers to capabilities and practices that allow users to extract insights from data without needing to be as skilled and technically sophisticated as expert data scientists.

This is an important objective because leveraging data science is becoming ever more critical for businesses that want to compete and improve their results. However, around the world, expert data scientists are in short supply, and it looks like that will remain the case for quite some time to come.

In a number of reports, Gartner offers insights for cultivating the development of citizen data scientists. By establishing processes and capabilities in support of the citizen data scientist, organizations can more quickly unleash the power of machine learning and artificial intelligence to fuel business gains. While this citizen data scientist concept applies to data science in general, it’s very well aligned with AIOps more specifically.

Fully Leveraging AIOps: Four Keys To Success

As teams set out to cultivate and support citizen data scientists in their organizations, they’ll be well served by applying a number of core principles. Most importantly, they’ll need to develop processes and frameworks that enable consistent use and sharing of data across the organization. This is fundamental to data science and to AIOps. When it comes to maximizing the utility of AIOps across the organization, here are four key recommendations to follow:

1. Expand the variety of data accessible for analysis. To make the most of AIOps, the widest range of data sources and types needs to be aggregated. When considering an AIOps implementation, teams are therefore well-served by taking an all-of-the-above approach. This includes aggregating and correlating data from across environments, architectural layers, data types and technology domains.

2. Increase the range of analytics capabilities available. Today’s teams need tools that offer maximum support in facilitating analytics. This includes leveraging capabilities for data preparation and self-service analytics. Teams need unified tools that offer a complete set of prepackaged capabilities, such as dashboards and interfaces, that make it fast and efficient to normalize and aggregate disparate data sets.

3. Make advanced analytics accessible to a wider audience. To foster maximum success across an organization, AIOps solutions must enable the broadest number of business users to leverage machine learning, without requiring the assistance of expert data scientists. Toward this end, it is important to leverage prepackaged algorithms and unified capabilities. With these capabilities, advanced AIOps solutions make machine learning much more broadly accessible.

4. Maximize analytics agility. It is important to recognize that the analytics life cycle is composed of a number of essential steps, including acquiring, organizing, analyzing, delivering and measuring data. It is vital that advanced AIOps platforms offer strong support for each of these areas. The more teams can leverage a single unified platform for all these efforts, the better equipped they’ll be to gain optimized agility and operational efficiency in their AIOps administration.


The demand for AIOps is high, and intensifying. In order to maximize the benefits of AIOps, however, these capabilities can’t be restricted to the select few. When teams align their AIOps implementation with the cultivation and support of citizen data scientists, they’ll be poised to establish the pervasive AIOps capabilities that fuel optimized efficiency and service levels.

A Unified Data Model as the Foundation for Advanced Analytics

Learn how adopting a flexible, open-ended unified data model, built dynamically from a time-journaled directed graph of attributed objects, gives you the analytical foundation to collect, group, correlate and visualize complex performance conditions spanning applications, infrastructure and networks.
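To make the phrase "time-journaled directed graph of attributed objects" more concrete, here is a minimal Python sketch of one way such a structure could look. It is an illustration only, not the product's implementation: the class name JournaledGraph, its methods and the sample vertex names are all hypothetical, and the sketch assumes changes are recorded in time order.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple


@dataclass
class JournaledGraph:
    """Hypothetical sketch: a directed graph whose vertex attributes and
    edges keep a journal of changes, so it can be queried "as of" a time.
    Assumes every change is appended in increasing timestamp order."""

    # vertex -> attribute name -> [(timestamp, value), ...]
    attrs: Dict[str, Dict[str, List[Tuple[float, Any]]]] = field(default_factory=dict)
    # (source, target) -> [(timestamp, edge_present), ...]
    edges: Dict[Tuple[str, str], List[Tuple[float, bool]]] = field(default_factory=dict)

    def set_attr(self, vertex: str, name: str, value: Any, ts: float) -> None:
        self.attrs.setdefault(vertex, {}).setdefault(name, []).append((ts, value))

    def set_edge(self, src: str, dst: str, ts: float, present: bool = True) -> None:
        self.edges.setdefault((src, dst), []).append((ts, present))

    def attr_at(self, vertex: str, name: str, ts: float) -> Any:
        """Latest value recorded at or before ts, or None if none exists."""
        value = None
        for t, v in self.attrs.get(vertex, {}).get(name, []):
            if t > ts:
                break
            value = v
        return value

    def neighbors_at(self, src: str, ts: float) -> Set[str]:
        """Targets that src pointed to as of time ts."""
        result: Set[str] = set()
        for (s, d), journal in self.edges.items():
            if s != src:
                continue
            present = False
            for t, state in journal:
                if t > ts:
                    break
                present = state
            if present:
                result.add(d)
        return result


# Example: a service is upgraded and later fails over to another database.
g = JournaledGraph()
g.set_attr("checkout", "version", "1.0", ts=10)
g.set_attr("checkout", "version", "1.1", ts=20)
g.set_edge("checkout", "db-1", ts=10)
g.set_edge("checkout", "db-1", ts=30, present=False)
g.set_edge("checkout", "db-2", ts=30)
print(g.attr_at("checkout", "version", 15), g.neighbors_at("checkout", 35))
```

Because nothing is overwritten, the same graph can answer both "what does the topology look like now?" and "what did it look like when the incident started?", which is the property that makes correlation across a changing environment possible.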

What is a Unified Data Model and Why Would You Use It

Managing modern application environments is hard. A unified data model can make it easier. Here’s how.

The nature of modern app environments

Modern distributed application systems are growing increasingly complex. Not only are they larger and spread across scale-out environments, but they are also composed of more layers, due especially to the trend toward software-defined networking, storage and everything else. The environments are also highly dynamic, with configurations that are auto-updated on a recurring basis.

Add to this picture microservices architectures and hybrid clouds, and things get even more complex.

Whereas in the past you would typically have run a monolithic application in a static environment on a single server, today you probably have containerized microservices distributed across clusters of servers, using software-defined networks and storage layers. Even if you have simpler virtual machines, your infrastructure is still likely to be highly distributed, and your machine images might move between host servers.

This complexity makes it difficult to map, manage and integrate multiple tools within your environment, especially when each tool uses its own data model. It creates multiple issues for DevOps practitioners and developers alike.

What is a Unified Data Model?

This is why organizations are increasingly adopting unified data models. A unified data model creates an opportunity for an organization to analyze data from multiple sources in the context of shared business initiatives. 

A unified data model forces your DevOps and development teams to determine the methods, practices, and architectural patterns that correlate to the best outcomes in your organization. It will also force your institution to future-proof your data architecture by leveraging new technology data types and attributes.

As systems grow more complex, maintaining a separate data model for each one yields diminishing returns and erodes our ability to maintain and monitor web applications. Modeling each system individually creates a contextual gap with respect to the overarching infrastructure. CA Technologies’ white paper on the Essentials of Root Cause Analytics goes into much greater detail on this.

A unified data model acts as a bridge between your different ecosystems, allowing you to contextualize data sources across multiple services. It acts as a foundation upon which data can be consistently consumed, combined and correlated, which allows machine learning to be applied across different data sets. It could be argued that this is the future of DevOps monitoring and maintenance.

Lastly, a unified data model will allow for refactoring and migration of data across your infrastructure. As a result, careful consideration should be given to the flexibility of the data components in your organization’s ecosystem, and design should be addressed with future-proofing in mind. Every data layer and every data source should serve to increase the understanding of your overarching data model and ecosystem. 

AI-Driven IT Operations – Secrets to Success Beyond Great Math

Once upon a time we had visibility across IT Infrastructure. We had physical data centers and lovingly nurtured our servers and networks. Of course, the applications under our control became increasingly complicated, but we could always get under the hood when things went wrong.

But consider this.

Thanks to all things cloud, most folks entering the tech workforce today will never get to see a physical server, play with a patch panel or configure a router. They’ll never need to acquire that sysadmin “sixth-sense” knowledge of what’s needed to keep the systems up and running.

So, what’s needed to fill the void? Well, two things — data and analytics.

There’s no shortage of data or Big Data in IT operations. Acquire a new cloud service or dip your toes into serverless computing and IoT and you get even more data — more sensor data, logs and metrics to supplement the existing overabundance of application component maps, clickstreams and capacity information.

But what’s missing from this glut of data are analytics and AI-driven IT operations (AIOps for short). It’s tragic that organizations rich in recorded information lack the ability to derive knowledge and insights from this information. It’s kind of like owning the highest-grade gold-bearing ore but not having the tools to extract it, or worse, not even realizing you have the gold at all.

Most organizations understand there’s “gold in them thar hills” and are employing methods to mine it. In the last few years, we’ve seen fantastic strides in data gathering and instrumentation, with many new monitoring tools appearing almost as fast as each new tech and data source. So, as organizations sign up for a new cloud service, there always seems to be another monitoring widget or open-source dashboard to go with it, along with the other 20+ dashboards already existing. That’s like ordering a burger and being offered free fries. Immediate visual gratification, yes, but it’ll only add maintenance pounds in the long term.

This isn’t to say that these new tools aren’t useful. Visualizing data is a great starting point for any analytics journey. The problem, however, is that these offerings present information in a narrow observational range. Plus, by multiplying the number of metrics under watch, they can increase the likelihood that something critical will be missed.

It’s not surprising, then, that some organizations sauce their fries with automated alerting. At its crudest, this involves setting static baselines with binary pass/fail conditions. All fine for predictable legacy systems, but in more fluid and modern applications it only increases noise levels and false positives. Or worse, false negatives: those “everything looked hunky dory” moments just before the major web application fell off the cliff. Plus, these mechanisms only analyze one variable (they’re univariate in math speak), so they cannot always deliver a complete picture of performance.

To address this and other issues, many IT operations teams are turning to math and data science, which have been used to great effect in many facets of business, ranging from fraud and credit risk to marketing and retail pricing. Surely, if a team of archeologists can apply some neat math to find lost cities in the Middle East, IT operations teams can apply similar thinking. After all, you have data on tap and, unlike archeologists, you don’t have to decipher ancient clay tablets and scrolls.

So why has IT operations lagged behind in the application of analytics? Here are three factors impacting success:


It’s tempting to get sucked into data science and start applying algorithms in a piecemeal fashion. When things go wrong, it’s easy to blame the math without recognizing that it’s never the math that fails, but its application in the correct IT operations context. This can be especially troubling with the sub-optimal use of machine learning, where the lack of actionable alerting and workflows means critical events are missed and the system starts teaching itself that a prolonged period of poor customer experience is the new normal.


Many organizations start an analytics program by looking at single metrics using standard deviations and probability models: basically, if a metric falls outside a range, there’s an alert. This seems reasonable, but what if 100 metrics are each being captured every minute over a twenty-four-hour period? This nets out to 144,000 observations and probably 1,000+ alerts (assuming events are triggered at the second standard deviation), too many for anyone to reasonably process.
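The arithmetic behind that estimate can be checked with a short Python sketch. It assumes (and this is an assumption, not something stated above) that each metric is roughly normally distributed and behaving normally all day, so every alert counted here is pure noise. Under that assumption about 4.55% of observations fall outside two standard deviations, which puts the expected alert count well above 1,000.

```python
import math

METRICS = 100
SAMPLES_PER_METRIC = 24 * 60          # one sample per minute for a day
observations = METRICS * SAMPLES_PER_METRIC


def two_sided_tail(k: float) -> float:
    """Probability that a normal variable lies more than k standard
    deviations from its mean, in either direction."""
    return math.erfc(k / math.sqrt(2))


# Assumption: healthy, normally distributed metrics and a 2-sigma rule.
expected_alerts = observations * two_sided_tail(2.0)

print(f"observations: {observations}")
print(f"share outside 2 sigma: {two_sided_tail(2.0):.4f}")
print(f"expected noise alerts per day: {expected_alerts:.0f}")
```

Widening the band to three standard deviations cuts the volume sharply, but it does nothing about the deeper limitation that each metric is still being judged in isolation.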

Furthermore, these models fail to accommodate multi-metric associations. One example is the non-linear relationship between CPU utilization across a container cluster and application latency, where a small increase in utilization can massively impact responsiveness. Nuances like these suggest new approaches are needed in data modeling (vs. simplistic tagging), where metrics can be automatically collected, aggregated and correlated into groups. This is foundational and extremely beneficial, since algorithms gain context across previously siloed information and can be applied more powerfully.
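One way to see why the utilization-to-latency relationship is non-linear is the textbook M/M/1 queueing formula, in which mean response time equals service time divided by (1 - utilization). The sketch below uses that formula purely as an illustration: real container clusters are not single M/M/1 queues, and the 10 ms service time is a made-up figure.

```python
def mm1_response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue (illustrative model only).

    service_time: mean time to serve one request, in seconds
    utilization:  fraction of capacity in use, 0 <= utilization < 1
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


SERVICE_TIME = 0.010  # hypothetical 10 ms per request

for rho in (0.50, 0.80, 0.90, 0.95):
    latency_ms = mm1_response_time(SERVICE_TIME, rho) * 1000
    print(f"utilization {rho:.0%}: mean latency {latency_ms:.0f} ms")
```

In this model, moving from 90% to 95% utilization doubles mean latency, while the same five-point step from 50% to 55% barely registers. A per-metric threshold on CPU alone cannot express that; the two metrics have to be analyzed together.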


The complexity of modern applications and customer engagement means there’ll be many unexpected conditions: the unknown unknowns. What becomes critical, therefore, is the ability to capture, group and analyze elements across the entire application stack, without which unanticipated events will tax even the best algorithms.

Take, for example, a case where an organization has experienced declining revenue capture from a web or mobile application. The initial response could be to correlate revenue with a series of performance metrics such as response times, page load times, and API and back-end calls. But what if the actual cause is an unintentional blacklisting of outbound customer emails by a service provider? Unless that metric, along with other indicators across the tech stack, is captured and correlated, there’ll be longer recovery times and unnecessary finger pointing across teams.


Analytics can fall short if it can’t help staff (or systems) make decisions as to what to do when an anomalous pattern or condition is detected. Key then is a workflow-driven approach where the system not only collects, analyzes and contextualizes, but also injects decision making processes such as self-healing into the model. To do this the best systems will be those that leverage existing knowledge and learnings of staff and solution providers with significant experience working in complex IT environments.

Today, cloud and modern tech stacks are the only way to conduct business at scale. While we might bemoan the loss of visibility, what we can win back is far more beneficial — analytical insights based on the digital experience of our customers. Doing this requires great math, but also demands fresh approaches to its application in context of business goals, underpinning technologies, and the people and processes needed to support them.

By analyzing increased data volumes and solving more complex problems, AIOps equips teams to speed delivery, gain efficiencies and deliver a superior user experience. So if you’re hungry for some AI-driven insights, attend the AIOps Virtual Summit, Wednesday June 20 from 11 a.m. to 3 p.m. ET, where leading thinkers will share their learnings in AI, machine learning, modern monitoring and analytics.

Finding the Right Microscope: Using AIOps for Context Generation and Alarm Noise Reduction

The beauty of artificial intelligence lies in its power to enhance human intelligence with machine-like computation capacity. AIOps-enabled solutions derive logic and decisions from the assessment of high-quality data. While AI has found relevance in multiple arenas, one of its important applications lies in context generation. As data availability increases, it becomes imperative to help end users reduce noise in this data based on context, using “relative noise reduction” to enhance efficiency for humans.

CA’s Digital Operational Intelligence combines powerful statistical methods with machine learning (ML) to give users the right “microscope,” accurately focused on their systems: one that not only predicts system behavior and helps fix problems before they occur, but also provides a contextual, noise-free view into those systems.

The disparate monitoring platforms in enterprises have natural flexibility to be conservative or liberal when alerting the user to possible situations or issues. Typically, alarms are raised when metric values cross certain thresholds, and they are a cause for concern. However, not every alarm indicates a new incident, and not every alarm warrants calling an admin at midnight! What we want is “incremental liberalism” in the assessment of alarms to make the end user’s life easier. This means the end user should be exposed not to every alarm raised because of conservative thresholds, but rather to the broad issues present in the system. This is imperative not only for reducing the end user’s effort of sifting through multiple alarms, but also for enabling efficient and quick root cause analysis. For CA Digital Operational Intelligence, we use machine learning-based modules to help the end user reduce the noise generated by alarms and look at their system health with data-driven context generation.

Let’s first identify these alarms as Noise Alarms. There are different kinds of noise alarms:

  • Short Term Flapping Alarms: These are alarms that repeatedly transition between the alarm state and the normal state in a short period of time. They are typically generated by random noise and/or disturbances on metrics configured with alarms, especially when the metrics are operating close to their thresholds.
  • Long Term Flapping Alarms: These are another type of repeating alarm, one that transitions between alarm and non-alarm states at regular (possibly large) intervals. They can be induced by repeated on–off actions on devices or by regular oscillatory disturbances in metrics.
  • Standing Alarms: Another set of noise alarms are standing alarms, or alarms that remain in an active state for a prolonged duration. The major reason for these alarms is typically inefficiency in operations and maintenance.
  • Alarm Storms/Floods: Finally, there are alarm floods, which occur when an abnormal situation arises at some entity in the environment and the fault spreads to many other places through interconnections between devices and process units. These are the most important group of alarms: they signify a real abnormality in the environment and contain the root cause along with a huge number of alarms from the affected components.

It is important to segregate noise alarms and identify alarm storms and floods to help the operator focus on the important issues.
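As a rough illustration of how two of these categories might be separated by simple rules, the Python sketch below labels a single alarm from its history of state transitions. It is not the product's logic: the function name classify_alarm, the window sizes and the thresholds are all assumptions chosen for the example, and long-term flapping and alarm floods (which need periodicity and cross-device analysis) are deliberately left out.

```python
from typing import List, Tuple

# One alarm's history: time-ordered (timestamp_seconds, is_active) transitions.
Transitions = List[Tuple[float, bool]]


def classify_alarm(
    transitions: Transitions,
    now: float,
    flap_window: float = 600.0,      # assumed: 10 minutes
    flap_count: int = 3,             # assumed: 3 raises within the window
    standing_after: float = 86400.0  # assumed: active for a day or more
) -> str:
    """Label an alarm as short-term flapping, standing, or other."""
    raises = [ts for ts, active in transitions if active]

    # Short-term flapping: flap_count raises packed into flap_window seconds.
    for i in range(len(raises) - flap_count + 1):
        if raises[i + flap_count - 1] - raises[i] <= flap_window:
            return "short-term flapping"

    # Standing: still active, and has been for a prolonged duration.
    if transitions:
        last_ts, last_active = transitions[-1]
        if last_active and now - last_ts >= standing_after:
            return "standing"

    return "other"


flapping = [(0, True), (60, False), (120, True), (180, False), (240, True)]
standing = [(0, True)]
print(classify_alarm(flapping, now=300))
print(classify_alarm(standing, now=100_000))
```

Fixed rules like these are easy to reason about, but they need tuning per metric, which is part of the motivation for the learning-based approach described next.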

We consider alarms as basically skeletons to build out the story of issues. Alarms are simply the episodes that have occurred within the broader story, and using artificial intelligence, we are able to build out that story to tell the end users about the issue that the episodes represent (and not the details of episodes unless asked for). Our mechanisms ensure that all the intricacies and nuances arising in the episodes (individual alarms) are covered “just enough” in our story to help the user identify the state of their system based on context of the examination.


Figure 1: Steps to Reduce Alarm Noise in Digital Operational Intelligence Platform

In architecting this solution, we considered multiple paradigms, ranging from sequence mining and sequence mining with temporal context (such as WINEPI and MINEPI) to static clustering methods. However, in the field of system monitoring, scaling imposes pragmatic constraints on any solution. We constructed a novel ensemble method that assimilates the key aspects of the requirements while ensuring scalability.

To tackle the alarm noise reduction problem, we use a combination of textual, temporal and topological properties of an alarm. Our ML modules assign a token/feature vector to each alarm generated in history. This feature vector decorates the alarm with information such as text relevance of alarms, the time of origin of the alarm, its proximity with other partially or completely similar alarms and the system topology where they occur. This smart decoration of the alarm is then used to define groups of alarms that are similar in nature using dynamic clustering. The dynamic clustering reduces relevant entropy among the clusters and ensures that the alarms clustered together share textual, topological and temporal similarity. In real time, as and when new alarms enter the system, they are identified to be placed under the right issue thereby providing relevant issues to the end user. Consequently, as new issues arise based on unrelated alarms, they are created to ensure capturing of all relevant new issues in the system. Novelty detection is one of the key paradigms in providing successful delivery of context generation. Furthermore, with association rules mining, we provide capabilities that allow prediction and agglomeration of the alarms to suppress them from surfacing unnecessarily.
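The general idea of grouping alarms by combined textual, temporal and topological similarity, and opening a new issue when nothing is similar enough, can be sketched in a few lines of Python. This is a simplified stand-in, not the ensemble method described above: the equal weighting, the Jaccard text measure, the exponential time decay, the 0.5 threshold and all names (Alarm, Issue, assign) are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Alarm:
    text: str
    timestamp: float  # seconds
    node: str         # device or service that raised the alarm


@dataclass
class Issue:
    alarms: List[Alarm] = field(default_factory=list)


def text_sim(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def time_sim(a: float, b: float, scale: float = 300.0) -> float:
    """Decays from 1 toward 0 as the alarms move apart in time."""
    return math.exp(-abs(a - b) / scale)


def topo_sim(a: str, b: str, topology: Dict[str, Set[str]]) -> float:
    """1.0 for the same node, 0.5 for direct neighbors, else 0.0."""
    if a == b:
        return 1.0
    if b in topology.get(a, set()) or a in topology.get(b, set()):
        return 0.5
    return 0.0


def similarity(x: Alarm, y: Alarm, topology: Dict[str, Set[str]]) -> float:
    # Assumed: equal weight for the three kinds of similarity.
    return (text_sim(x.text, y.text)
            + time_sim(x.timestamp, y.timestamp)
            + topo_sim(x.node, y.node, topology)) / 3.0


def assign(alarm: Alarm, issues: List[Issue],
           topology: Dict[str, Set[str]], threshold: float = 0.5) -> Issue:
    """Place the alarm in its most similar issue, or open a new one."""
    best, best_score = None, 0.0
    for issue in issues:
        score = max(similarity(alarm, other, topology) for other in issue.alarms)
        if score > best_score:
            best, best_score = issue, score
    if best is None or best_score < threshold:
        best = Issue()          # novelty: nothing similar enough exists
        issues.append(best)
    best.alarms.append(alarm)
    return best


topology = {"web-1": {"db-1"}}  # hypothetical: web-1 depends on db-1
issues: List[Issue] = []
assign(Alarm("db connection timeout", 0, "db-1"), issues, topology)
assign(Alarm("db connection timeout on pool", 30, "web-1"), issues, topology)
assign(Alarm("disk usage high", 5000, "backup-9"), issues, topology)
print([len(i.alarms) for i in issues])
```

In this toy run the two database alarms land in one issue because they share wording, occur within seconds of each other and sit on neighboring nodes, while the unrelated disk alarm opens a second issue. A production system would learn the weights and threshold from history rather than fixing them by hand.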

Figure 2: The before and after visualization of alarm volumes using t-SNE. The top figure shows the different alarms projected on two-dimensional space and the figure below indicates the alarms clustered together on the basis of multiple dimensions. The colors indicate the cluster number and dark violet indicates noise.

Our approach to alarm noise reduction has multiple benefits. It allows users to look at the issues marring their systems instead of individual alarms, thereby setting the “microscope” right. It speeds up and sharpens root cause analysis because irrelevant information is removed. It also helps generate the right context for the user by tackling situations like alarm storms, alternating alarms or sequential alarms, preserving the stories of alarms that occur as sequences. This context grouping provides insights into several aspects of system health that are otherwise not visible. In particular, the sequence of alarm occurrences within one or multiple devices provides a solid correlation across devices affected by a situation.


Figure 3: These results show the noise reduction in the case of flapping alarms. We found that the number of alarms after our alarm noise reduction module was significantly lower than the original number of alarms.

Using the AIOps-powered alarm noise reduction module with CA Digital Operational Intelligence, we have been able to achieve a significant reduction in alarm noise, thereby helping our users get a much better handle on their systems and avoid unnecessary red alerts.


Figure 4: Overview of Alarm noise reduction in CA DOI and impact on MTTR.

To learn more about our unique approach, read the EMA™ QUICK TAKE: CA Digital Operational Intelligence: A Fresh Take on Autonomic Computing. EMA comments that the CA solution shines, since the platform applies machine learning to leverage domain knowledge from customers’ existing operations management tools to automatically create topology maps used as the source for its AI-driven risk analysis.