Three Essentials for Service-Driven Autonomous Remediation in AIOps Platforms

In my last blog post, I talked about how Point, Reactive Monitoring and Automation Tools Don’t Work For Today’s Digital Businesses and why you need service-driven autonomous remediation. Here are three essentials for this capability to work:

Predictively identify potential risks to services

IT teams that rely on traditional, reactive monitoring tools and approaches lack the insights needed to predict issues before a business service or application is disrupted. Given the criticality of delivering a phenomenal user experience, these teams need an AIOps platform that offers algorithmic or machine-learning-based insights for detecting abnormal behaviors and predicting potential issues. It’s also essential that AIOps platforms can map issues to the services they affect, so IT teams can prioritize troubleshooting and remediation efforts based on which issues have the biggest potential business impact. For example, if two issues arise and administrators can see that one is affecting a payroll service that isn’t currently running, while the other is hitting an e-commerce service that runs 24/7 and accounts for the bulk of the company’s revenue, they can prioritize their efforts accordingly.
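To make the payroll versus e-commerce example concrete, here is a minimal Python sketch of impact-based prioritization. Everything in it is an assumption made for illustration: the `Service` and `Issue` types, the `revenue_share` and `severity` fields, and the scoring formula are hypothetical and do not describe how any particular AIOps platform ranks issues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    revenue_share: float  # hypothetical: fraction of company revenue (0..1)
    running_now: bool     # is the service actively in use right now?


@dataclass(frozen=True)
class Issue:
    issue_id: str
    service: Service      # the business service this issue maps to
    severity: float       # hypothetical anomaly score (0..1)


def impact_score(issue: Issue) -> float:
    """Assumed formula: severity weighted by revenue share and current activity."""
    activity_weight = 1.0 if issue.service.running_now else 0.1
    return issue.severity * issue.service.revenue_share * activity_weight


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Return issues ordered from highest to lowest estimated business impact."""
    return sorted(issues, key=impact_score, reverse=True)


payroll = Service("payroll", revenue_share=0.05, running_now=False)
ecommerce = Service("e-commerce", revenue_share=0.80, running_now=True)

issues = [
    Issue("ISS-1", payroll, severity=0.9),
    Issue("ISS-2", ecommerce, severity=0.6),
]

ranked = prioritize(issues)
print([issue.issue_id for issue in ranked])  # ['ISS-2', 'ISS-1']
```

Even though the payroll issue has the higher raw severity in this sketch, the e-commerce issue ranks first because of the service it is mapped to.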

Automate root cause analysis across domains and technologies

Even with the best predictive tools in place, downtime and performance issues may still arise, whether due to an administrator’s configuration error, external service outages or a host of other causes.

Within many IT organizations, when these performance issues or downtime occur, operators struggle to determine why. While a single issue may be the culprit, it can generate large numbers of redundant or false alerts, making it difficult for administrators to filter through the noise and identify the issue that needs to be addressed. At the same time, when operators see that a service is experiencing issues, it may be difficult to determine whether, and how, the issue is affecting business services.

To combat these challenges, operators need timely, targeted insights that enable fast, automated root cause analysis. AIOps platforms therefore need to provide machine-learning-driven intelligence that can automatically identify the probable root cause. To support this machine learning, these platforms must also offer a topology analytics service that automatically discovers and maps key IT assets and stores topology information in a graph database. This service needs to consume data and correlate intelligence from multiple architectural layers to effectively determine the probable cause.
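As a rough illustration of why topology matters here, the Python sketch below uses a small dependency map to cut a burst of alerts down to a probable root cause. It is a simple graph heuristic, not machine learning, and the `TOPOLOGY` map, the node names and the `probable_root_causes` function are all hypothetical. The assumption is that an alerting node is a probable root cause only if nothing it depends on, directly or indirectly, is also alerting.

```python
from collections import deque

# Hypothetical topology: each node maps to the nodes it depends on.
TOPOLOGY = {
    "checkout-service": ["web-app"],
    "web-app": ["app-server", "database"],
    "app-server": ["vm-1"],
    "database": ["vm-2", "storage-array"],
    "vm-1": [],
    "vm-2": [],
    "storage-array": [],
}


def probable_root_causes(topology: dict, alerting: set) -> set:
    """Return alerting nodes that have no alerting dependency beneath them."""
    roots = set()
    for node in alerting:
        seen = set()
        queue = deque(topology.get(node, []))
        has_alerting_dependency = False
        # Breadth-first walk over everything this node depends on.
        while queue:
            dep = queue.popleft()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                has_alerting_dependency = True
                break
            queue.extend(topology.get(dep, []))
        if not has_alerting_dependency:
            roots.add(node)
    return roots


# Four alerts fire, but three are symptoms of the storage problem.
alerts = {"checkout-service", "web-app", "database", "storage-array"}
print(probable_root_causes(TOPOLOGY, alerts))  # {'storage-array'}
```

In this sketch the same map also shows business impact: the storage array sits beneath the checkout service, so an operator can see which business service the underlying issue threatens.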

Establish comprehensive, contextual automated remediation

Once an issue has been identified, whether predictively or through automated root cause analysis, IT teams need comprehensive, intelligent capabilities that can automatically execute the remediation tasks required in a complex, dynamic enterprise environment. To ensure success, AIOps platforms need to provide scalable, flexible and easy-to-use automation that can be aligned with fast-changing business and technology environments. AIOps platforms must be able to orchestrate the delivery of services in business, application and infrastructure layers, across on-premises, cloud and hybrid environments. This automation should seamlessly support complex, organization-specific processes. For example, an AIOps platform may detect an impending storage issue in an Amazon Web Services EC2 instance and trigger the provisioning of an additional instance. This server provisioning may need approval from a budgetary, compliance or business perspective, and these approval workflows should be easily accommodated. By leveraging these contextual automated remediation capabilities, IT teams can ensure that service requests aren’t just logged but are acted upon before the end user has a negative experience.
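The provisioning example above can be sketched as an approval-gated remediation request. This is a minimal Python illustration under stated assumptions: the `RemediationRequest` class, the role names and the `provision_instance` placeholder are hypothetical, and the placeholder does not call any real cloud API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RemediationRequest:
    description: str
    action: Callable[[], str]       # the remediation task to execute
    required_approvals: frozenset   # roles that must sign off first
    granted: set = field(default_factory=set)

    def approve(self, role: str) -> None:
        """Record an approval from one of the required roles."""
        if role not in self.required_approvals:
            raise ValueError(f"{role!r} is not an approver for this request")
        self.granted.add(role)

    def pending(self) -> set:
        """Roles that have not approved yet."""
        return set(self.required_approvals) - self.granted

    def run(self) -> Optional[str]:
        """Execute the action only once every required approval is in."""
        if self.pending():
            return None
        return self.action()


def provision_instance() -> str:
    # Placeholder: a real implementation would call the cloud provider's API.
    return "provisioned"


request = RemediationRequest(
    description="Add an instance ahead of a predicted storage shortfall",
    action=provision_instance,
    required_approvals=frozenset({"budget", "compliance"}),
)

print(request.run())   # None, because approvals are still pending
request.approve("budget")
request.approve("compliance")
print(request.run())   # provisioned
```

Keeping the approval check inside `run` means the remediation stays fully automated once the organization-specific sign-offs arrive, which is the behavior the example describes.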

At CA Technologies, we recently delivered new capabilities in our AIOps platform that use service-driven autonomous remediation to proactively resolve complex enterprise issues before user experience suffers. You can read our recent press announcement for more information.