Most businesses might not have a Site Reliability Engineer now, but Todd Palino insists many companies will want one in the near future. Palino knows a thing or two about bringing seemingly opposing worlds together.
To start, Palino worked as a computer salesperson on the East Coast while in high school and, after studying computer science, navigated the high-tech world of Silicon Valley with ease. In his current role at LinkedIn, he combines his developer skills with his IT Ops experience to fill the role known on the West Coast as a Site Reliability Engineer, or SRE.
While SRE may not be a household (or rather datacenter) name just yet, Palino is confident IT organizations around the world will soon be clamoring to find the person who can bring to their team skills that run the gamut of operations, development and leading-edge technology.
In this Q&A, Palino shares his take on the must-have new technology role, why businesses need an SRE and how he makes use of his varied skills to better serve his team at LinkedIn.
Todd Palino: It’s a new title to a lot of people. Site Reliability Engineer is very much a West Coast invention, and if you’re not in the Bay Area, you tend not to know what an SRE is. Or you may have heard of it, but you’re not really sure what an SRE does and how it’s different from a traditional operations role.
There are a lot of different ways to describe it, but I consider it to be a particular discipline of DevOps. Essentially, it combines roles that many of us in the operations fields were already doing—architect, tools developer, and operations—into one. An SRE is responsible for all of these things, with a focus on automation and developing the tooling so that you stop doing the reactive work. You’re constantly focusing on the proactive work instead.
How does your role as SRE help LinkedIn overall?
SRE is the glue that binds the entire organization at LinkedIn together. You have product teams and you have developers, but SREs are the people who know how the pieces all fit together. They create the pipelines that the developers use. You can have developers who want to use DevOps practices, but they can’t do so without a proper toolset. That’s one of the things that SRE brings to the table: developing and maintaining that toolset.
Does the SRE role lessen the amount of IT firefighting that needs to be done?
We need to reduce IT firefighting because the systems that we’re working on are getting larger and larger. I run in excess of 2,000 servers with a team of only three or four engineers in the U.S. That’s my SRE team, and we are running petabytes of data per day through Apache Kafka. We have numerous services that we run, but we’re doing it with only a few people because we have automation to support us.
What would you say to those who fear automation technology will replace humans and lead to fewer jobs?
Honestly, this is a transition that’s been going on for years now—manual jobs being eliminated by technology and automation. Technology is increasingly taking over roles that were traditionally done by people and we’ve seen it in nearly every industry. Manufacturing is a big example, but now we’re seeing it in fast food. Fast food workers are disappearing, and computers are taking over the job. You’re going to see the rideshare business get decimated by self-driving cars. The fact of the matter is, this is all being driven by improvements in technology. For me, this is exciting to see from a personal career standpoint but it’s driving change in numerous industries.
The way to weather this change is to continually challenge ourselves, and there’s no difference whether we are talking about a manufacturing job or automating a systems administration process. We should not be trying to halt this progress, but rather we should be the ones creating and embracing it. Then we make our work about maintaining the automation, and finding the next challenge.
How does an SRE align with the technology change happening across industries?
When working in SRE, you must develop the mindset to let things go. A lot of IT ops people with a fixed mindset, often people in long-standing systems administration roles, want to hold on to the processes too tightly. They believe that the manual work they do is critical to the existence of their job, so they can’t let a developer take it over. SREs are not only developers; we also have the operational expertise to develop with a sense of what’s going to work in production. Especially in an embedded SRE role, you are working with developers who don’t necessarily have that experience, and you’re helping to inform their development process by telling them how their application is going to work in production. This comes from the experience we’ve had previously in operations. We also help them to get their applications deployed so they can focus on developing the code and not on figuring out how to use the DevOps tool chain: how to get hardware, how to get resources assigned, how to get firewall rules, and everything else that you need to make the software work.
Does it take a certain work culture to bring an SRE on board?
Yes, and it was difficult for me at first because I was an East Coaster for a very long time before I joined LinkedIn, and the culture is completely different. The open and honest culture of LinkedIn drives a blameless culture. It drives an environment where you know that when something gets messed up—when you make a mistake—that you’re not going to be personally blamed for it. We’re very careful when we’re creating incident tickets: we don’t put names in the comments for those tickets. We know who caused the problem, but we don’t care that they caused the problem. We care about why that problem happened. Were they following a process that wasn’t good? Did they not follow the process? Why didn’t they follow the process? How can we improve that the next time? Can we improve the situation with technology? It really is all about not blaming people and keeping people involved in the discussion.
How is the success of an SRE or a team of SREs measured?
Everything’s about the data. If you have fewer site issues, then you’re being successful. If you are getting your releases out faster, you’re being successful. These are things that you can measure and they’re things that we do measure. LinkedIn tends to be a very data-driven organization, to the point that we have hourly reports that go out to executives with all kinds of metrics on site growth, stability and many other aspects of the platform. When it comes to measuring whether or not SRE is successful, it’s data as well. Most of it is quantifiable while some parts, like culture and morale, aren’t as easy to measure.
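As a rough illustration of the two measures mentioned above (fewer site issues and faster releases), here is a minimal Python sketch. The record layout, the sample values, and the function names `incidents_per_month` and `median_lead_time_hours` are all hypothetical and chosen only for this example; they do not describe LinkedIn's actual reporting.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample data for illustration only; not real LinkedIn records.
incidents = [
    {"opened": datetime(2024, 1, 3)},
    {"opened": datetime(2024, 1, 19)},
    {"opened": datetime(2024, 2, 7)},
]
releases = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 2, 9)},
    {"committed": datetime(2024, 1, 10, 8), "deployed": datetime(2024, 1, 10, 20)},
    {"committed": datetime(2024, 2, 1, 10), "deployed": datetime(2024, 2, 1, 16)},
]


def incidents_per_month(records):
    """Count incidents per calendar month, keyed as 'YYYY-MM'."""
    counts = {}
    for record in records:
        key = record["opened"].strftime("%Y-%m")
        counts[key] = counts.get(key, 0) + 1
    return counts


def median_lead_time_hours(records):
    """Median time from commit to deployment, in hours."""
    hours = [
        (r["deployed"] - r["committed"]).total_seconds() / 3600
        for r in records
    ]
    return median(hours)


if __name__ == "__main__":
    print(incidents_per_month(incidents))
    print(median_lead_time_hours(releases))
```

Tracked over time, a falling incident count and a shrinking lead time would be the kind of quantifiable signal described here; culture and morale would still need separate feedback.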
Are team members happier or more fulfilled working in this type of culture?
Speaking from my current position with LinkedIn’s SRE team, I would say that on the whole, the company’s SRE organization is quite happy. We have a healthy culture, enjoy the work we are doing and the colleagues that we are doing it with. Of course, all organizations have turnover, but we choose to celebrate the “next play”, fully supporting teammates who move on to their next challenge, whether it is within LinkedIn or at another company.
It’s very difficult to bring a culture like SRE into entrenched organizations that don’t have some of the building blocks. It really is a matter of having top-down support in the organization for that type of culture. Developer happiness is hard to measure, but you can still source feedback from teams.
Automation.ai as your trusted guide
What if you could empower SREs with the insights needed to drive improvements? What if instead of the typical war rooms and on-call burnout, SREs had a trusted guide to quickly fix problems?
Broadcom provides a “just add water” approach that can help your IT teams automate incident response through our AI-driven, self-healing platform, automation.ai. Leveraging our deep domain expertise, we can help your SRE teams prevent alert fatigue by continuously triaging alerting rules using a combination of notification rules, process changes, dashboards and machine learning (ML) to proactively monitor the four golden signals of SRE and measure what really matters for customer experience.