AI-Driven IT Operations – Secrets to Success Beyond Great Math

Once upon a time we had visibility across IT Infrastructure. We had physical data centers and lovingly nurtured our servers and networks. Of course, the applications under our control became increasingly complicated, but we could always get under the hood when things went wrong.

But consider this.

Thanks to all things cloud, most folks entering the tech workforce today will never get to see a physical server, play with a patch panel or configure a router. They’ll never need to acquire that sysadmin “sixth-sense” knowledge of what’s needed to keep the systems up and running.

So, what’s needed to fill the void? Well, two things — data and analytics.

There’s no shortage of data or Big Data in IT operations. Acquire a new cloud service or dip your toes into serverless computing and IoT and you get even more data — more sensor data, logs and metrics to supplement the existing overabundance of application component maps, clickstreams and capacity information.

But what’s missing from this glut of data is analytics and AI-driven IT operations (AIOps for short). It’s tragic that organizations rich in recorded information lack the ability to derive knowledge and insights from it. It’s like owning the highest-grade gold-bearing ore but not having the tools to extract it or, worse, not even realizing you have the gold at all.

Most organizations understand there’s “gold in them thar hills” and are employing methods to mine it. In the last few years, we’ve seen fantastic strides in data gathering and instrumentation, with new monitoring tools appearing almost as fast as each new technology and data source. So, as organizations sign up for a new cloud service, there always seems to be another monitoring widget or open source dashboard to go with it, on top of the 20+ dashboards they already have. That’s like ordering a burger and being offered free fries. Immediate visual gratification, yes, but it’ll only add maintenance pounds in the long term.

This isn’t to say that these new tools aren’t useful. Visualizing data is a great starting point for any analytics journey. The problem, however, is that these offerings present information in a narrow observational range. Plus, by greatly increasing the number of metrics under watch, they raise the likelihood that something critical will be missed.

It’s not surprising then that some organizations sauce their fries with automated alerting. At its crudest, this involves setting static baselines with binary pass/fail conditions. That is fine for predictable legacy systems, but in more fluid, modern applications it only increases noise levels and false positives. Or worse, it produces false negatives: those “everything looked hunky-dory” moments just before the major web application fell off the cliff. Plus, these mechanisms analyze only one variable at a time (they’re univariate, in math speak), so they cannot always deliver a complete picture of performance.

To address this and other issues, many IT operations teams are turning to math and data science, which have been used to great effect in many facets of business, ranging from fraud and credit risk to marketing and retail pricing. Surely, if a team of archeologists can apply some neat math to find lost cities in the Middle East, IT operations teams can apply similar thinking. After all, you have data on tap and, unlike archeologists, you don’t have to decipher ancient clay tablets and scrolls.

So why has IT operations lagged behind in the application of analytics? Here are three factors impacting success:


It’s tempting to get sucked into data science and start applying algorithms in a piecemeal fashion. When things go wrong, it’s easy to blame the math without recognizing that it’s never the math that fails, but how it is applied in an IT operations context. This can be especially troubling with the sub-optimal use of machine learning, where the lack of actionable alerting and workflows means critical events are missed and the system starts teaching itself that a prolonged period of poor customer experience is the new normal.
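The “new normal” trap can be sketched in a few lines. This is a minimal illustration, not a real AIOps algorithm: the exponentially weighted baseline, the 50 percent tolerance, the function name count_alerts and the latency figures are all assumptions made up for the example.

```python
def count_alerts(samples, alpha=0.2, tolerance=0.5, learn_from_anomalies=True):
    """Count samples that exceed an adaptive baseline by more than `tolerance`.

    The baseline is an exponentially weighted moving average (an assumed,
    deliberately simple stand-in for a self-learning model).
    """
    baseline = samples[0]
    alerts = 0
    for value in samples[1:]:
        anomalous = value > baseline * (1 + tolerance)
        if anomalous:
            alerts += 1
        # A model that keeps learning from anomalous samples absorbs the
        # degradation into its baseline and eventually stops alerting.
        if learn_from_anomalies or not anomalous:
            baseline = alpha * value + (1 - alpha) * baseline
    return alerts


# Hypothetical latency series: 10 healthy minutes, then 50 degraded minutes.
latency_ms = [100] * 10 + [300] * 50

print(count_alerts(latency_ms))                              # keeps learning
print(count_alerts(latency_ms, learn_from_anomalies=False))  # baseline frozen on anomalies
```

With these made-up numbers, the model that keeps learning raises only a handful of alerts before it accepts 300 ms as normal, while the variant that stops learning during an anomaly keeps alerting for the whole degraded period. Freezing the baseline is only one possible guard; the broader point is that someone, or some workflow, has to act on those first alerts.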


Many organizations start an analytics program by looking at single metrics using standard deviations and probability models: basically, if a metric falls outside a range, there’s an alert. This seems reasonable, but what if 100 metrics are each being captured every minute over a 24-hour period? This nets out to 144,000 observations and probably 1,000+ alerts (assuming events are triggered at the second standard deviation), too many for anyone to reasonably process.
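The arithmetic is easy to check. The sketch below assumes, purely for illustration, that every metric is normally distributed and that observations are independent; real metrics rarely behave that neatly, so treat the alert count as a rough order of magnitude.

```python
from statistics import NormalDist

metrics = 100
minutes_per_day = 24 * 60
observations = metrics * minutes_per_day  # one sample per metric per minute

# Probability that a normally distributed value lands more than two
# standard deviations from its mean, in either direction.
p_outside_2_sigma = 2 * (1 - NormalDist().cdf(2.0))

expected_alerts = observations * p_outside_2_sigma

print(f"observations per day: {observations:,}")
print(f"share outside 2 sigma: {p_outside_2_sigma:.2%}")
print(f"expected alerts per day: {expected_alerts:,.0f}")
```

Under those assumptions roughly 4.5 percent of observations fall outside the range, which works out to several thousand alerts a day, comfortably past the 1,000+ mark.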

Furthermore, these models fail to accommodate multi-metric associations. Consider, for example, the non-linear relationship between CPU utilization across a container cluster and application latency, where a small increase in utilization can massively impact responsiveness. Nuances like these suggest new approaches are needed in data modeling (vs. simplistic tagging), where metrics can be automatically collected, aggregated and correlated into groups. This is foundational and extremely beneficial, since algorithms gain context across previously siloed information and can be applied more powerfully.
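To get a feel for how non-linear that utilization-to-latency relationship can be, here is a small sketch. It borrows the textbook single-server queueing (M/M/1) formula as a stand-in; that model, the function name mm1_latency_ms and the 10 ms service time are assumptions for illustration, and a real container cluster will not follow the curve exactly.

```python
def mm1_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time for an M/M/1 queue: service time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1.0 - utilization)


# Hypothetical service that takes 10 ms when the system is idle.
for utilization in (0.50, 0.80, 0.90, 0.95):
    latency = mm1_latency_ms(10.0, utilization)
    print(f"{utilization:.0%} busy -> about {latency:.0f} ms")
```

In this toy model, going from 50 to 80 percent busy adds about 30 ms, while the last five points from 90 to 95 percent add about 100 ms. A static CPU threshold watched in isolation says nothing about which side of that knee the application is on.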


The complexity of modern applications and customer engagement means there’ll be many unexpected conditions — the unknown unknowns. What becomes critical therefore is the ability to capture, group elements and analyze across the entire application stack, without which unanticipated events will tax even the best algorithms.

Take, for example, a case where an organization has experienced declining revenue capture from a web or mobile application. The initial response could be to correlate revenue with a series of performance metrics such as response times, page load times, and API and back-end calls. But what if the actual cause is an unintentional blacklisting of outbound customer emails by a service provider? Unless that metric, along with other indicators across the tech stack, is captured and correlated, there’ll be longer recovery times and unnecessary finger pointing across teams.


Analytics can fall short if it can’t help staff (or systems) decide what to do when an anomalous pattern or condition is detected. Key, then, is a workflow-driven approach where the system not only collects, analyzes and contextualizes, but also injects decision-making processes such as self-healing into the model. The best systems will be those that leverage the existing knowledge and learnings of staff and solution providers with significant experience working in complex IT environments.
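A workflow-driven approach can be pictured as a playbook that maps a detected condition to an action, with a human fallback when nothing matches. The sketch below is hypothetical throughout: the Anomaly class, the condition names and the remediation functions are invented for illustration and return strings rather than touching any real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Anomaly:
    kind: str    # what was detected, e.g. "cpu_saturation"
    target: str  # which component it was detected on


def restart_service(anomaly: Anomaly) -> str:
    # Placeholder for a real self-healing step.
    return f"restarted {anomaly.target}"


def scale_out(anomaly: Anomaly) -> str:
    # Placeholder for a real capacity change.
    return f"added capacity to {anomaly.target}"


# The playbook is where staff knowledge gets encoded (hypothetical entries).
PLAYBOOK: Dict[str, Callable[[Anomaly], str]] = {
    "service_unresponsive": restart_service,
    "cpu_saturation": scale_out,
}


def handle(anomaly: Anomaly, playbook: Dict[str, Callable[[Anomaly], str]] = PLAYBOOK) -> str:
    """Run the matching remediation, or escalate when the condition is unknown."""
    action = playbook.get(anomaly.kind)
    if action is None:
        return f"escalated {anomaly.kind} on {anomaly.target} to on-call staff"
    return action(anomaly)


print(handle(Anomaly("cpu_saturation", "checkout-cluster")))
print(handle(Anomaly("email_blacklisted", "mail-gateway")))
```

The second call falls through to escalation because no one has written a playbook entry for it yet, which is exactly where the experience of staff and solution providers feeds back into the system.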

Today, cloud and modern tech stacks are the only way to conduct business at scale. While we might bemoan the loss of visibility, what we can win back is far more beneficial: analytical insights based on the digital experience of our customers. Doing this requires great math, but also demands fresh approaches to its application in the context of business goals, underpinning technologies, and the people and processes needed to support them.

By analyzing increased data volumes and solving more complex problems, AIOps equips teams to speed delivery, gain efficiencies and deliver a superior user experience. So if you’re hungry for some AI-driven insights, attend the AIOps Virtual Summit, Wednesday, June 20 from 11 a.m. to 3 p.m. ET, where leading thinkers will share their learnings in AI, machine learning, modern monitoring and analytics.