Why SREs are Protectors of the User Experience

Being a site reliability engineer isn’t easy. As described by Andrew Widdowson, “it’s like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100 mph.”

Known as the “automators,” SREs are often asked to observe application environments and manage incidents… at all hours of the day. After all, when your app is down, so is your business.

The SRE’s job is to secure a flawless user experience and to deliver site reliability. SREs bridge Dev and Ops, ensuring new releases improve the product rather than break it.

The Challenge

The trouble with monitoring application environments is that there are hundreds of thousands of monitoring data points. How do you prioritize which data points are useful, and which can be ignored? Alarm storms aren’t helpful. They prompt panic, instead of resolution.

…And when a crucial incident does occur, how do you quickly mitigate it? The common SRE approach is to spend a ton of time and energy manually sifting through data – often at the expense of other initiatives, or worse, personal time (e.g. responding to the dinner-time incident alert).

What if you could get to that Aha! moment faster? What if instead of the typical hair-on-fire response, you had a trusted guide that could quickly lead you to the source of the incident?

Automation.ai as your trusted guide

What if you could empower SREs with the insights needed to drive improvements? What if instead of the typical war rooms and on-call burn out, SREs had a trusted guide to quickly fix problems?

Broadcom provides a “just add water” approach that can help your IT teams automate incident response through Automation.ai, our AI-driven, self-healing platform. Drawing on our deep domain expertise, we can help your SRE teams prevent alert fatigue by continuously triaging alerting rules using a combination of notification rules, process changes, dashboards and machine learning (ML). The aim is to proactively monitor the four golden signals of SRE and measure what really matters for customer experience.
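For readers new to the term, the four golden signals are commonly listed as latency, traffic, errors and saturation. Here is a minimal Python sketch of how one monitoring window might be boiled down to those four numbers. It is illustrative only and is not Automation.ai code: the `Request` record, the `golden_signals` function and the choice of CPU utilization as the saturation measure are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # how long the request took to serve
    ok: bool            # False when the request failed


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one monitoring window into the four golden signals.

    cpu_utilization (0.0 to 1.0) stands in for saturation here; a real
    system might use memory, queue depth or connection pool usage instead.
    """
    count = len(requests)
    if count == 0:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    rank = math.ceil(count * 95 / 100)
    failures = sum(1 for r in requests if not r.ok)

    return {
        "latency_p95_ms": durations[rank - 1],   # latency
        "traffic_rps": count / window_seconds,   # traffic
        "error_rate": failures / count,          # errors
        "saturation": cpu_utilization,           # saturation
    }
```

Alerting on a handful of numbers like these, rather than on every raw data point, is one way to keep alarm storms in check.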

Site Reliability Engineering Is a Kind of Magic

A site reliability engineer (SRE) can be considered the IT equivalent of a wizard, or, as Andrew Widdowson, an SRE at Google, described it, “Like being part of the world’s most intense pit crew… changing the tires of a race car as it’s going 100 mph.”

So how is site reliability engineering different from traditional IT operations, and can a discipline that originated in the world of web-scale, cloud-native unicorns ever apply to the steady-as-she-goes state of enterprise IT?

Yes, it can. The scale-out way is really the new way of managing enterprise IT. The notion that enterprise IT lives behind closed walls no longer holds. Now, the only way to create and conduct business at scale is through engineering reliability managed in an unprecedented manner. The demand for mobile experiences and the advent of complex cloud architectures have shifted the operational focus. It’s no longer about keeping the lights on. It’s instead about performance. The apps have to work well, the experience has to be great and the infrastructure behind it needs continual monitoring.

Reliability, like any feature, isn’t something that’s retrofitted after deployment; it’s established and enhanced as software is developed, tested and released. That means establishing a new discipline, which Ben Treynor, Google’s original SRE lead, describes as “what happens when a software engineer is tasked with what used to be called operations.”

A Sobering Reality

It’s easy to throw out yet another three-letter acronym and claim it’s a magical elixir for all the problems involved with running complex IT systems. In reality, engineering reliability into distributed systems with thousands of containerized applications and microservices is a tough gig. That’s partly because of all the moving parts, but also because any preconceived notions about predictable system behavior no longer apply.

Take, for example, keeping watch over a modern software application. This might consist of business logic written in several languages and linked to the legacy ERP system (custom built, packaged or both). There’ll also be a raft of databases: traditional relational for transactional support, yes, but more likely a smorgasbord of NoSQL data stores (in-memory, graph or document), perhaps fronted by recently adopted Node.js.

Some of this componentry will be on-premises, while some will be containerized and moved to the public cloud. That might mean Docker and Kubernetes on AWS, or maybe Azure and Mesos. Heck, why not both for some hybrid-style resilience?

But like the old Monty Python sketch, “you’ll be lucky” if this is all you ever have to manage. Depending on the nature of the business, there’ll also be a glut of third-party services — including payment processing and reconciliation. That’s not to mention all the new web and mobile apps interacting with the core business systems through an API gateway and possibly some analytics horsepower delivered by the likes of Hadoop and Elasticsearch. It’ll take a lot of operational wizardry to keep all that performant.

Fortune Favors the Bold

In a wonderful talk at SREcon, Julia Evans from Stripe described the realities of managing today’s complex distributed systems. What was refreshing about her presentation was the open admission that she often finds the work difficult and how there’s always a ton of new stuff to learn. As she says in her abstract, she doesn’t always feel like a wizard (echoing the protestations of Harry Potter).

This honesty illustrates what’s exciting about being an SRE. With systems like the ones described above causing any number of thorny problems, it’ll be the inquisitive and brave that keep business on track. Being an SRE isn’t for the faint-hearted or those happy with a fire-fighting status quo. It’s for those within our ranks who get bored easily: the super sleuths who keep asking reliability questions, crafting improvements and learning as they go.

Consider a typical business-critical problem that could affect our modern application. Let’s say a latency issue is causing an increasing number of mobile app users to abandon a booking service. How would teams address it? Problems like this might go unnoticed for some time, or there could be a deluge of alarms. Even when a problem is identified, where do teams look for the root cause? Is it a problem with a new code release or at the API gateway? Is it down to some weird microservices auto-scaling issue? Was that earlier CPU increase, which we thought was OK, actually really bad?

With an SRE-style approach, business critical problems are never addressed in knee-jerk fashion. Using modern tooling in areas such as AIOps and application performance monitoring, SREs can observe the real-time behavior of applications, with systems collecting and correlating information from all related components. Rather than react after the fact, these solutions continuously identify anomalous patterns (like those mobile app abandonments) and compare them to historical trends — meaning SREs are alerted well before the business is impacted.
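To make "compare them to historical trends" concrete, here is a minimal Python sketch of one simple way to do it: flag a new reading when it sits more than a few standard deviations away from its own history (a z-score test). This is an illustration under stated assumptions, not how any particular AIOps or APM product works; the function name `is_anomalous`, the threshold of 3.0 and the sample abandonment rates are all made up for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True when `current` deviates from the historical mean
    by more than `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical daily abandonment rates (percent) for a booking flow.
past_abandonment = [2.0, 2.5, 1.5, 2.2, 1.8, 2.1, 1.9]

print(is_anomalous(past_abandonment, 2.3))  # within the normal range
print(is_anomalous(past_abandonment, 6.0))  # far outside the normal range
```

Production systems typically go further, for example by accounting for daily and weekly seasonality, but the principle of judging the present against a learned baseline is the same.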

But beyond exposing “new normal” application weirdness and “unknown-unknowns,” modern tools also encourage and stimulate more of the SRE detective work, which is the really valuable stuff. These tools won’t just detect anomalies and then leave teams scrambling to find the needle in a haystack of needles. Instead, they’ll analytically gather all the evidence and lead teams in fact-based fashion towards a solution. For example, an SRE-inspired monitoring service can detect a performance anomaly introduced with a new software build and then trace it to the actual code causing the problem.

Like Harry Potter, operations professionals might have a hard time accepting they’re wizards. But ask yourself this: do you want to remain a silly muggle getting burnt out by constant fire-fighting? Of course not; it’s career-limiting and it sucks. Time, then, for some SRE magic: gaining the skills and tools needed to adopt new tech like containers and microservices, and becoming an essential part of future-proofing your business.

What is SRE?

Site Reliability Engineering (SRE) is a discipline that takes aspects of software engineering and applies them to IT operations problems. The main goal is to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.” Essentially, SRE combines the IT Ops roles of architect, tools developer and operations into one. An SRE is responsible for all of these roles, with a focus on automation and on developing the tooling that reduces reactive work.

Empowering SREs With Insights

Now, the only way to create and conduct business at scale is through engineering reliability managed in an unprecedented manner. The demand for digital experiences and the advent of complex cloud architectures have shifted the operational focus. It’s no longer about keeping the lights on. It’s instead about performance. The apps have to work well, the experience has to be great and the infrastructure behind it needs continual monitoring.

The trouble with monitoring application environments is that there are hundreds of thousands of monitoring data points. Containers and cloud platforms are blurring the lines between applications and infrastructure, while DevOps increases the pace of delivery. In particular, containers now package an application and all of its dependencies in an abstracted layer that requires a combined view of infrastructure and applications. How do you prioritize which data points are useful and which can be ignored? Alarm storms aren’t helpful. They prompt panic and fatigue instead of resolution and improvement.

And when a crucial incident does occur, how do you quickly mitigate it? The necessary SRE approach is to optimize and automate monitoring for every team, freeing up their valuable time to improve code, apps and business services. In fact, one of the core principles mandates that SREs spend no more than 50 percent of their time on Ops work. As much of their time as possible should be spent writing code and building systems to improve performance and operational efficiency.

AIOps: Helping SREs Predict the Future?

As a kid I grew up reading a lot of science fiction. My forbearing parents let me check out the maximum number of books the library would allow each week (30; I still remember that number). And each week I would go back for more. Given this constant consumption of augury, you would think something I read would have prepared me for the future we now face within the Operations space.

While there are definitely some inklings in the science fiction canon about computer systems constructed at such scale that they would be hard for humans to understand, there is precious little attention paid to what it would take to operate them in production. Welcome to my world (and your reality, too, I bet).

At the AIOps Virtual Summit we discussed two separate approaches to handling this level of complexity and how they intersect. The first is the engineering discipline known as Site Reliability Engineering (SRE) which aims to engineer failure out of the system. The second, AIOps, is a newly coined term for the application of a class of advanced algorithms to the massive corpus of operational data we are now accumulating just as part of the ordinary day-to-day activity of running all of these systems and services.

One goal of the former is to construct a set of operational practices that allow us to navigate the tricky path between a desired feature velocity (iterating the software as fast as possible to deliver the features a business needs to offer its customer base) and a desired level of operational stability (keeping the system available for those customers). This is trickier than it sounds for at least three reasons:

  1. There are often completely different sets of people working on these problems.
  2. They have very different incentives around the work.
  3. Communication between these groups is often, shall we say, a little dicey.

SRE, like many other engineering disciplines, is a data-driven approach. It uses data (in ways we’ll talk about in the upcoming session) to foster productive conversations and make decision making easier between these different groups.
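One widely used SRE practice for turning the velocity-versus-stability debate into a data question is the error budget, which this post does not name explicitly. The Python sketch below is an assumption-laden illustration of the idea, not a description of what the session covers: the function names, the 99.9 percent target and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent.

    slo_target is the agreed success ratio, e.g. 0.999 for "99.9% of
    requests succeed". A result of 1.0 means nothing has been spent;
    0.0 or below means the budget is exhausted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship_risky_change(slo_target, total_requests, failed_requests):
    """Simple shared rule: ship freely while budget remains,
    prioritize stability work once it is gone."""
    return error_budget_remaining(
        slo_target, total_requests, failed_requests) > 0.0


# Hypothetical month: one million requests against a 99.9% target.
print(error_budget_remaining(0.999, 1_000_000, 250))   # about 0.75
print(can_ship_risky_change(0.999, 1_000_000, 1_500))  # False
```

Because both groups agree on the target in advance, the number gives developers and operators a common, impersonal basis for deciding when to push features and when to slow down.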

AIOps similarly tries to use operational data to provide a big win for an organization. It attempts to address the hard problem of “we have all of this data on the operational status and performance of our infrastructure, what can we learn from it?”

Can the record of the past help us understand how things are working in the present or even help predict the future? Is there information in the data I have already that might provide some insight into how my systems are behaving? For example:

  • Is this just a spike in traffic or an indication my systems are about to experience a tailspin into failure?
  • Are there any difficult-to-see patterns in the load in my system that could help me optimally provision my resources so I don’t pay more than I need to?
  • Have we ever seen an outage like the one we are experiencing? (And how did we deal with it last time?)
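The last question in the list above can be approached, in its simplest form, as a similarity search over past incidents. The Python sketch below ranks earlier incidents by how many symptoms they share with the current one, using Jaccard similarity. It is a toy illustration under stated assumptions: the incident IDs, symptom tags and resolutions are invented, and real AIOps tooling would work from far richer data than hand-assigned tags.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_incidents(current_symptoms, past_incidents, top_n=3):
    """Return up to top_n (score, id, resolution) tuples, best match first.

    Incidents with no symptoms in common are left out.
    """
    scored = [(jaccard(current_symptoms, inc["symptoms"]), inc)
              for inc in past_incidents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, inc["id"], inc["resolution"])
            for score, inc in scored[:top_n] if score > 0]


# Hypothetical incident history.
history = [
    {"id": "INC-1", "symptoms": {"high_latency", "db_timeouts"},
     "resolution": "increased connection pool size"},
    {"id": "INC-2", "symptoms": {"disk_full"},
     "resolution": "rotated logs"},
    {"id": "INC-3", "symptoms": {"high_latency", "db_timeouts", "cpu_spike"},
     "resolution": "rolled back release"},
]

for match in most_similar_incidents(
        {"high_latency", "db_timeouts", "cpu_spike"}, history):
    print(match)
```

Even a crude match like this answers both halves of the question at once: whether something similar has happened before, and what fixed it.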

Some of this is real today, some of it is easily imagined. There are definitely limits on what AIOps can offer our operations practices, but we surely haven’t taken it to its full potential yet.

Watch a replay of Todd Palino, a senior SRE at LinkedIn, and me discussing How AI is Helping Site Reliability Engineers Automate Incident Response. We cover both approaches and their potential to bring a little bit of the future into your present.

How AI is Helping Site Reliability Engineers Automate Incident Response

In this AIOps webcast, Kieran Taylor, Head of AIOps Product Marketing for the Enterprise Software Division at Broadcom, is joined by Todd Palino, a Senior SRE at LinkedIn, and David Blank-Edelman, an SRE author and SREcon founder.