AIOps Root-Cause Analysis Best Practices

As infrastructures and software environments grow ever more complex, detecting the root cause of performance or availability problems is becoming ever more challenging. Fortunately, there is a new class of tools and a new type of strategy to meet the challenge: AIOps.

In this blog, we take a look at what AIOps-assisted root-cause analysis means and how best to perform root-cause analysis with the help of AIOps tools.

What is Root-Cause Analysis?

In IT, root-cause analysis is the process of determining the core underlying cause of a hardware or software problem.

Root-cause analysis is important because in many cases, there are multiple possible causes for a problem, and it’s not obvious from the problem itself what the cause is. For example, if an application starts responding slowly, it’s hard to know from that information alone whether the cause of the problem is poorly written code in the application itself, a problem with the operating system hosting the application, an issue with the file system that the application is using, a problem with the networking or storage infrastructure on which the application depends, or something else. It’s also possible that there are multiple underlying issues at play.

Why Root-Cause Analysis is Especially Challenging Today

Once upon a time, root-cause analysis was relatively simple because IT teams had fewer layers of hardware and software to manage. There were also few abstractions between physical infrastructure and software environments. Thus, if your monitoring software detected a performance problem with a disk, you could be relatively certain that either the disk itself or the file system used to format it was the root problem.

Today, however, we rely on highly dynamic, multi-layered, software-defined environments. Mapping the relationships between all of the components in these environments is difficult, especially because configurations change constantly. It’s also very hard to interpret how a problem that manifests itself at one layer of the environment relates to the other layers. Today, the root cause of a storage performance problem could lie not only in a physical disk or local file system, but also in the networked or distributed file system that makes the storage available to remote systems, or in the virtualized network that delivers the storage.
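To make the layering concrete, here is a minimal Python sketch of a dependency map for the storage example. The component names and the DEPENDS_ON structure are hypothetical, invented for illustration; a real AIOps tool would discover and update this map automatically rather than rely on a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it directly depends on.
DEPENDS_ON = {
    "app": ["storage-volume"],
    "storage-volume": ["distributed-fs", "virtual-network"],
    "distributed-fs": ["physical-disk"],
    "virtual-network": ["physical-nic"],
    "physical-disk": [],
    "physical-nic": [],
}


def upstream_components(depends_on, start):
    """Return every component that `start` depends on, directly or
    indirectly, in breadth-first order (nearest layers first)."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Everything that could sit behind a slow storage volume in this toy map.
print(upstream_components(DEPENDS_ON, "storage-volume"))
```

Even in this six-component toy, a slow storage volume has four possible places to look. In a real environment the map is far larger and changes constantly, which is why maintaining it by hand does not scale.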

Making the Most of Root-Cause Analysis with AIOps

It is due in part to the difficulty of root-cause analysis in modern environments that AIOps has become so important. By using machine learning to map and interpret complex environments and causal relationships automatically, AIOps helps IT teams get to the root of performance or availability problems much more quickly than they could when relying on manual analysis alone. Simply having AIOps tools in place will do much to improve your root-cause analysis abilities.

That said, there are steps you can take to ensure that you are getting the very most out of AIOps-assisted root-cause analysis. They include the following.

1. Remember that configurations change quickly — and so can root causes

One of the tricky things about root-cause analysis in modern, fast-changing environments is that what constitutes a root problem at one moment could change the next. The root cause of a slow-performing application could be network congestion at one moment, but change to I/O bottlenecks the next, as network traffic patterns and storage system loads fluctuate.

AIOps tools can help to sort out these changes, but for human engineers, the important thing is to remember that root causes can change. Don’t assume that your core issues will always be the same.
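As a rough illustration of why a one-time diagnosis goes stale, the sketch below re-evaluates the most saturated resource for each time window. The sample values, the resource names, and the 0.8 threshold are all assumptions made up for this example, and "most saturated resource" is only a simple stand-in for the richer causal analysis an AIOps tool would perform.

```python
# Hypothetical utilization samples (0.0 to 1.0) for two consecutive windows.
WINDOWS = [
    {"window": "10:00", "network": 0.93, "disk_io": 0.41},
    {"window": "10:05", "network": 0.52, "disk_io": 0.96},
]


def dominant_bottleneck(sample, threshold=0.8):
    """Return the most saturated resource in one window, or None if
    no resource reaches the (assumed) saturation threshold."""
    metrics = {name: value for name, value in sample.items() if name != "window"}
    name, value = max(metrics.items(), key=lambda item: item[1])
    return name if value >= threshold else None


# Re-run the check per window instead of trusting an earlier answer.
timeline = [(s["window"], dominant_bottleneck(s)) for s in WINDOWS]
print(timeline)
```

Here the leading suspect shifts from the network to disk I/O within five minutes, so a conclusion reached at 10:00 would already be misleading at 10:05.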

2. Consider automated response

Another key feature of AIOps is that it lets software tools take automated action to solve a problem. Automated response is not the right solution in every situation; you may want human engineers to review high-stakes changes before they are made, for instance. For simpler issues, though, it can help ensure that you not only identify root causes quickly, but also resolve them before they cause serious problems for your end users.
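One way to picture this split is a small routing rule that applies low-risk fixes automatically and holds high-stakes ones for review. The following Python sketch assumes a hypothetical Remediation type and a high_stakes flag that someone has assigned in advance; it does not reflect any particular AIOps product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    name: str
    high_stakes: bool  # assumption: each action is classified up front


def route(remediation):
    """Decide whether a proposed fix runs automatically or waits for a human."""
    return "needs_human_review" if remediation.high_stakes else "auto_apply"


# Hypothetical actions, for illustration only.
restart_worker = Remediation("restart a stuck worker process", high_stakes=False)
failover_db = Remediation("fail over the primary database", high_stakes=True)

for action in (restart_worker, failover_db):
    print(f"{action.name}: {route(action)}")
```

Keeping the classification explicit makes it easy to start conservatively, with most actions held for review, and to widen automation as confidence in the tool grows.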

3. Don’t assume there is a single root cause

As noted above, it’s possible that multiple issues lie at the root of a software or hardware problem. An application that has stopped responding might have done so because of poorly written code that failed to allow the application to recover from an unexpected networking error; in that case, both the application code and the networking problem would be root causes of the issue.

The key takeaway here is that, while you should strive to separate ancillary problems from root causes when performing root-cause analysis, you should not exclude the possibility that two or more core underlying problems are at play.
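A simple heuristic shows how both goals can coexist: treat an unhealthy component as ancillary if something it directly depends on is also unhealthy, and as a root-cause candidate otherwise. The Python sketch below uses a hypothetical dependency map and set of unhealthy components. It checks direct dependencies only, which is a deliberate simplification and not how any specific tool works.

```python
# Hypothetical dependency map and health status, for illustration only.
DEPENDS_ON = {
    "web-frontend": ["api"],
    "api": ["database", "network"],
    "database": [],
    "network": [],
}
UNHEALTHY = {"web-frontend", "api", "database", "network"}


def split_root_and_ancillary(depends_on, unhealthy):
    """Split unhealthy components into root-cause candidates (no unhealthy
    direct dependency) and ancillary problems (at least one)."""
    roots, ancillary = [], []
    for component in sorted(unhealthy):
        deps = depends_on.get(component, [])
        if any(dep in unhealthy for dep in deps):
            ancillary.append(component)
        else:
            roots.append(component)
    return roots, ancillary


roots, ancillary = split_root_and_ancillary(DEPENDS_ON, UNHEALTHY)
print("root-cause candidates:", roots)
print("ancillary problems:", ancillary)
```

The function returns a list, not a single answer: in this toy data both the database and the network come back as candidates, while the API and the front end are flagged as downstream symptoms.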

4. Strive for environment-agnostic root-cause analysis

Ideally, your root-cause analysis workflow should be effective for any type of infrastructure or environment. This won’t happen if you depend on monitoring or analytics tools that support only a certain type of environment or infrastructure, such as those from a particular cloud vendor or ones designed for only one operating system.

The lesson here is that you should look for AIOps tools that can assist in root-cause analysis on any type of infrastructure. Where possible, steer clear of those that lock you in.