For years, we’ve bemoaned how difficult it is to maintain performance across complicated application architectures. From transactional application behemoths to client-server to microservices and now stateless functions, each new wave has added more cognitive load to an already burdened IT operation.
However, we’ve always tended to cope. “Keeping up” has been hardwired into our operational mantra, plus the infrequency of major change has been kind to us, providing more time to build requisite skills and knowledge. And however complicated we have considered applications to be, their performance has generally been dependent on a small number of tightly coupled components.
From Complicated To Complex Systems
As we embrace microservice-style development and API-centric everything, application complexity increases substantially. More components mean more points of failure. Add to that the faster change cycles that DevOps practices and cloud computing support, and there’s less time to gain control. As a result, valuable staff can spend too much time responding to problems when they should be driving software improvements.
But we now face a far greater challenge, one that renders the old break-fix model of IT operations defunct: managing complex adaptive systems. In these systems, application performance is difficult to model or predict because of the dynamic relationships and dependencies within and across ephemeral components.
The Next Frontier Of Resilience Engineering
The dynamic nature of modern IT systems means we should aim to engineer those processes that enable them to withstand inevitable problems. However, since these systems behave in unpredictable ways, traditional monitoring approaches such as manually setting performance thresholds become ineffective. Furthermore, because these systems have many dependencies, they can exhibit contagion-like properties where minor performance conditions spread very quickly — similar to the way a virus spreads through a human population.
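To make the point about manually set thresholds concrete, here is a minimal sketch that compares a fixed threshold with a rolling statistical baseline. The latency readings, the function names, the window size and the three-sigma rule are all illustrative assumptions, not a description of how any particular AIOps product works.

```python
from collections import deque
from statistics import mean, stdev


def static_anomalies(samples, threshold):
    """Indices of samples above a manually chosen, fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def adaptive_anomalies(samples, window=5, k=3.0):
    """Indices of samples more than k standard deviations away from
    the mean of the previous `window` samples (a rolling baseline)."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            if spread > 0 and abs(value - baseline) > k * spread:
                flagged.append(i)
        history.append(value)
    return flagged


# Hypothetical response times in milliseconds for one service.
latencies = [100, 102, 98, 101, 99, 180, 101, 100, 99, 102, 98]

print(static_anomalies(latencies, threshold=250))  # []
print(adaptive_anomalies(latencies))               # [5]
```

The fixed 250 ms threshold never fires, while the rolling baseline flags the jump to 180 ms because it is far outside what the service has recently been doing. Note that in this simple version the spike itself inflates the baseline for the next few samples; a production system would likely need something more robust.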
Not surprisingly, organizations are shifting the focus to analytical methods that not only speed problem identification but also reveal deeper insights about inner application workings: the unknown unknowns. Armed with this information, such as where we can cost-optimize cloud workloads and which application code correlates with better customer outcomes, teams can pivot from old rule-based practices fixated on reliability, stability and control toward those that build resilience.
Rethinking IT Operations Analytics
Unfortunately, many analytical approaches purported to be modern are inextricably linked to outdated software engineering and operations practices. When they should be imparting the critical system-level insights our cross-functional engineering teams use to increase resilience, they instead focus attention on discrete system elements and problems. True, we gain increased problem clarity, albeit through a narrow lens, but system-level improvements are neglected due to lack of context.
Gaining context across complex systems requires new approaches, especially artificial intelligence for IT operations (AIOps). These modern software platforms leverage big data and machine learning to support and enhance many IT operations and functions. AIOps is beneficial because it both assists and augments, not only quickly and automatically identifying data patterns staff would normally obtain manually but also correlating system events, patterns and behaviors to reveal deeper insights.
Charting A Successful AIOps Journey
As with any emerging technology, AIOps will attract much hype and hyperbole. So how do you find business value in all the noise, and what practical steps can you take to implement such a technology? It’s a topic beyond the scope of this small piece, but here are some key pointers:
• Automate data supply and demand: AIOps shouldn’t become a process where staff spend disproportionate amounts of time manually collecting, cleaning and arranging data before any meaningful analysis is conducted. You should seek out methods to automate all the heavy lifting, automatically ingesting cross-domain data on a massive scale (including logs, metrics, time series and business data) to build an operational data lake that everyone can drink from (e.g., IT operations, support analysts, cloud architects, LOB, developers).
• Gain context to win the battle: Data lakes are all fine, but their potential is truly maximized when connections and dependencies between application components are automatically established. Just like marketers use graph analytics to better understand social network influences, you should apply the same thinking with AIOps. For example, by automatically mapping dependencies within and across applications, you can detect faults faster, determine application performance blind spots and better understand the business impact of problematic conditions.
• Reach low, aim high: Eliminating false positives (alarms), detecting false negatives and finding the root cause of problems in a haystack of needles means less operational cost and more productive staff. However, this shouldn’t be your ultimate AIOps goal. Massive event correlation and alarm reduction, even in the most complex systems, are table stakes for basic AIOps today; they are an assistance capability. What we also require is deeper machine learning to tell us what we can’t easily find out ourselves (e.g., the optimum mix of cloud instances based on future consumption and capacity). Insights become actionable, especially when attached to dollars and customer experience.
• Develop mature AIOps capabilities: Today, and for good reason, AIOps has been centered on real-time and historical data ingestion and analysis. In time, however, these systems will become far more useful when the learnings gained initiate automated actions for users; they’ll become bi-directional. Exciting, too, is the point where AIOps systems become trainers themselves, providing automated advice and guidance to staff and processes.
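The dependency-mapping pointer above can be illustrated with a small graph traversal. This is only a sketch under stated assumptions: the service names and the hand-written `DEPENDS_ON` map are hypothetical, whereas a real AIOps platform would discover such relationships automatically from traces, logs and configuration data.

```python
from collections import deque

# Hypothetical map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database", "cache"],
    "search": ["cache"],
    "database": [],
    "cache": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk upward through the dependents.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("database", DEPENDS_ON)))
# ['checkout', 'inventory', 'payments']
```

Even in this toy example, a fault in one shared component reaches a customer-facing service two hops away, which is the kind of business-impact context that isolated, per-component alarms do not provide.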
We’re long past the point where IT operations staff can realistically keep on top of modern application environments, and like it or not, the future success of a digital business is inextricably linked to building and operating highly complex systems.
It’s time to embrace that future with AIOps.