Site Reliability Engineer (SRE): What is an SRE?
A site reliability engineer, or SRE, is a person who works across both software and operations/infrastructure. The role draws heavily on DevOps and operations. Site reliability engineering also refers to a set of practices and principles that are applied across different service offerings.
The term was first used at Google in 2003, when a site reliability team was formed. At that time, the team was made up of people who worked on software. Since then, the idea of site reliability engineering has evolved and spread across the whole software development industry. It is now a job role in its own right in many businesses.
Site reliability engineers connect the dots between operations and the people who write software. When it comes to what a site reliability engineer does, there isn't a one-size-fits-all answer. In general, the job can include a wide range of tasks, such as managing and monitoring the availability, latency, performance, and efficiency of an organisation's services, as well as incident response and capacity planning. Let's look at the job in more detail and at how it works in businesses.
What is Site Reliability Engineering?
Site reliability engineering is where the traditional IT or system administrator role and DevOps come together. In the past, organisations may have had a team of system administrators who took care of very complicated systems. Their main responsibility was to make sure that software was installed correctly and that users got a good service. They also dealt with any problems that arose after the software was installed.
However, system administrators don't work on the software itself. This is where the roles of developers and system administrators can be at odds. Developers are more concerned with building software and getting it into the hands of users than with the effects of deploying it. It is at this point that the site reliability engineer role comes into play.
A site reliability engineer works on making software both scalable and reliable. This includes making sure that development work is efficient and dependable, so that when the finished product is ready for production, there are no surprises.
What does a Site Reliability Engineer have to do?
Site reliability engineering is about dividing your time between development and operations. On the operations side, a site reliability engineer might deal with help desk tickets, on-call incidents, manual tasks, and so on. That isn't all they work on, though. They might also automate tasks and improve system reliability, in an effort to cut down on manual work and make sure all the parts needed to keep software deployments running (infrastructure/hardware, middleware, software) are working well.
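As a rough illustration of turning a manual task into an automated one, here is a minimal Python sketch of a disk usage check that someone might otherwise do by hand. The function names and the 90% threshold are assumptions made for this example, not part of any particular SRE toolset.

```python
import shutil


def is_over_threshold(used_bytes: int, total_bytes: int, threshold: float) -> bool:
    """Return True when the used fraction of a disk meets or exceeds the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def disk_needs_attention(path: str = "/", threshold: float = 0.9) -> bool:
    """Check the disk that holds `path` (hypothetical helper; 0.9 is an assumed limit)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, and free bytes
    return is_over_threshold(usage.used, usage.total, threshold)


if __name__ == "__main__":
    if disk_needs_attention("/"):
        print("Disk usage is above the threshold; clean up or add capacity.")
    else:
        print("Disk usage is within limits.")
```

In practice a check like this would usually run on a schedule and feed a monitoring or alerting tool rather than print to the screen.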
Do all SREs do the same things?
There are many different types of SRE roles at different companies, but for the most part, SREs are in charge of everything that comes with their services. They may have one, several, or all of the tasks below.
- Planning Capacity
- Availability
- Performance
- Monitoring
- Responding to incidents
- On-call help
- Post-Mortem
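To make the availability item a little more concrete, the following Python sketch converts an availability target into the amount of downtime it allows over a period. The function name, the 99.9% target, and the 30-day window are assumptions chosen for illustration; real targets vary by service and organisation.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Return how much downtime a target (e.g. 0.999) permits within a period."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # The permitted unavailability is the part of the period not covered by the target.
    return period * (1.0 - availability_target)


if __name__ == "__main__":
    budget = allowed_downtime(0.999, timedelta(days=30))
    print(f"A 99.9% target over 30 days allows about {budget.total_seconds() / 60:.1f} minutes of downtime.")
```

A calculation like this can help a team judge whether a planned change or a run of incidents still fits within what the service has promised.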
The job of an SRE is to be good at many things. If you work as an SRE, you might have to set up storage in AWS or talk to customers or write Python code for a new project. A lot of it comes down to what day it is.
SRE tools
From one company to the next, the tools and software that site reliability engineers use can vary a lot. Larger companies usually have more people on their SRE teams. This means that the responsibilities and scope are split up among the team members, which makes each SRE's job more focused. It also cuts down on the number of tools and platforms each person needs to use. At a big company, for example, an SRE might just work in Jenkins all day, every day.
Site reliability engineers who work for a smaller company may have to wear many different hats because there are fewer people to share the work. This means that their toolset has to cover everything from configuration management platforms and automated incident response systems to monitoring and analytics tools. You may already know some of the tools that an SRE works with, such as Docker, Terraform, Prometheus, and Kibana.
Where can I learn more about Site Reliability?
It was Ben Treynor Sloss who came up with the term "Site Reliability Engineer." He is now a Vice President of Engineering at Google. In 2003, he was asked to set up and manage a team of seven engineers, which led him to come up with the new title. Many great online resources written by Ben and other members of the Google engineering team cover everything from the principles and tenets of SREs to how the role of Site Reliability Engineering has changed over time and how it fits into DevOps today. There's no better way to learn about site reliability engineering than from the person or group who came up with the job in the first place.
To conclude: What is a Site Reliability Engineer (SRE)?
We have talked about how an SRE is more than just a typical operations engineer or system administrator. An SRE with a lot of experience and knowledge can help automate their software services and make both the services and the company more efficient. A good SRE is, by and large, a good problem solver. These people don't have to be the best at everything they do, but it's important that they know how to deal with problems when they happen. When they work on tasks and projects, they also need to know how different people in their company work together. It's like constantly putting together a big, complicated puzzle with a great many pieces. There are times when it is very difficult and frustrating, and sometimes pieces go missing. When you finish it, though, there is a lot of pride and happiness.