Hrushikesh shares his journey into SRE and his thoughts on the future of this space

1. How did you become an SRE?

When I joined Vuclip in 2015, I was involved with a project that worked on designing a completely new platform for the product. During this time, I got more involved with automation, infrastructure planning and monitoring. This is when I felt that I was not just working with two teams but with two completely different mindsets.

I was later approached by the Head of Operations to join his team. He was on a mission to implement SRE as a culture within the organization. This was when I was first introduced to the term SRE.

We then worked together to take small, incremental steps towards building an SRE culture: implementing monitoring tools, finding tasks to automate, and templatizing services, among other things.

2. What's the most challenging part of your job?

The most challenging part of SRE is getting people to understand that there is a problem with the way things are done in the ops world today. This cultural shift can happen only when they understand the power of automation and how it can make processes more reliable.

3. What processes, tools and techniques can't you live without?

If you are planning to move into the SRE space, some things to keep in mind are:

Segregate mission-critical functionalities from value-added ones.
Validate all the services that are mission-critical, and raise clear risk items with the product/project managers and their respective engineering teams.
Define realistic uptime goals and map out potential risks by collaborating with the product/project managers and engineering teams.

4. What according to you is the future of SRE?

In my view, SRE solves many of the problems that traditional operations thinking creates for companies, such as:

Addressing the silos between development and operations: Business and product teams think that reliability is the responsibility of developers and operations. Developers think that reliability is the responsibility of the operations team. The operations team thinks that developers should also be responsible for the reliability of the systems they build. As a result, even making a change to the system becomes a hassle when its reliability is at stake.

SRE as a culture eliminates this problem by equipping the product and engineering teams with tools and techniques and by making reliability part of the product requirements. When reliability is part of the product requirements, engineers take more responsibility for the code they ship.

Usually, in traditional operations, downtime, reliability and quality of service are based entirely on assumptions. SRE helps quantify uptime and provides a process to link it with feature releases. Solving these issues reduces downtime and lets organizations focus on scaling their businesses instead of just maintaining them.
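One common way to link quantified uptime with feature releases is an error budget. The interview does not name a specific mechanism, so the sketch below is an assumption: the function names, the request-based measurement and the "release only while budget remains" rule are all illustrative.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumption: reliability is measured as the share of successful requests.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        # No budget exists (no traffic, or a 100% SLO): any failure overspends it.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical release gate: ship features only while budget remains."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
print(can_release(0.999, 1_000_000, 400))    # budget left, release can proceed
print(can_release(0.999, 1_000_000, 1_500))  # budget overspent, hold the release
```

A gate like this turns "is the system reliable enough to ship?" from an assumption into a number that product and engineering teams can both read.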

Clear ownership, reliability and the ability to be more process-driven are a few of the reasons I see SRE catching on even with smaller businesses and startups.

5. Any productivity hacks that you would give to new SREs?

Segregate the functionalities of your product into minimum viable products (MVPs). Also make sure you have fallbacks for any functionality you can’t do without.

Make sure you have tight SLOs (99.99% uptime) for services that your MVPs are dependent on, to ensure that your SLAs are not breached.
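To make the 99.99% figure concrete, here is a minimal sketch of the arithmetic. The 30-day window, the function names and the assumption that dependencies fail independently are illustrative and not from the interview.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an uptime SLO permits over the given window."""
    return (1.0 - slo) * window_days * 24 * 60


def serial_availability(availabilities: list[float]) -> float:
    """Best-case availability of a flow that needs every dependency to be up.

    Assumption: dependencies fail independently of each other.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# 99.99% uptime over 30 days leaves about 4.32 minutes of downtime.
print(round(allowed_downtime_minutes(0.9999), 2))

# Three such dependencies in series already fall below 99.99% combined.
print(serial_availability([0.9999, 0.9999, 0.9999]))
```

This is why the SLOs of dependencies need to be tighter than the SLA promised for the flow that relies on them: each additional hard dependency eats into the combined availability.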

Designing fallbacks should be an integral part of designing and developing your service.
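As a rough illustration of building a fallback into the service itself, the sketch below wraps a primary call with a degraded alternative. The helper `with_fallback` and the recommendation functions are hypothetical names, not something the interview describes.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[..., T], fallback: Callable[..., T]) -> Callable[..., T]:
    """Return a callable that tries `primary` and uses `fallback` if it raises."""
    def call(*args, **kwargs) -> T:
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade gracefully instead of failing the whole request.
            return fallback(*args, **kwargs)
    return call


def personalised_recommendations(user_id: str) -> list[str]:
    # Stand-in for a dependency that is currently unavailable.
    raise TimeoutError("recommendation service unavailable")


def default_recommendations(user_id: str) -> list[str]:
    # Static result that needs no external dependency.
    return ["top-10"]


get_recommendations = with_fallback(personalised_recommendations, default_recommendations)
print(get_recommendations("u1"))
```

In a real service the fallback would typically be a cache, a static default or a simpler code path, chosen while the feature is being designed rather than after an outage.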

For your value-added services, the flow should be separated from the MVP flows. For example, site load metrics and logging services should be completely isolated from the main application, so that any downtime in the logging service does not impact site load.
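One way to sketch this isolation in application code is to push value-added work onto a separate worker pool and swallow its failures. The names `fire_and_forget`, `render_page` and `record_metrics` are hypothetical, and a production setup would more likely isolate at the process or network level as well.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

# Separate, small pool so value-added work never runs on the request path.
_side_pool = ThreadPoolExecutor(max_workers=2)


def fire_and_forget(task, *args) -> Future:
    """Run a value-added task in the background; log failures, never re-raise."""
    def guarded() -> None:
        try:
            task(*args)
        except Exception:
            logging.getLogger(__name__).warning("value-added task failed", exc_info=True)
    return _side_pool.submit(guarded)


def render_page(user_id: str, record_metrics) -> str:
    """Mission-critical flow: the result does not depend on metrics succeeding."""
    fire_and_forget(record_metrics, user_id)
    return f"page for {user_id}"


def broken_metrics(user_id: str) -> None:
    # Stand-in for a metrics or logging backend that is down.
    raise ConnectionError("metrics backend is down")


print(render_page("u1", broken_metrics))
```

The page still renders when the metrics call fails. Note that the executor's queue is unbounded, so a slow (rather than failing) backend would also need timeouts or a bounded queue.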

6. What are some of the things people get wrong about this role?

There is a common notion that if you know how to automate things, you are automatically an SRE. Automation as a skill is not limited to just SREs.

Automation as an approach to solving technical problems has been popular with many engineering and DevOps teams. Site reliability best practices define a methodology for most of these actions and use it to avoid a linear increase in operations work as systems scale with business growth. And this can be learnt. Anyone with a mindset for building scalable systems reliably can be an SRE.

The other myth in this space is that only the SRE team is responsible for the uptime of a product or service. However, site reliability engineering provides a way to collaborate with product and development teams and ensure that reliability is kept in mind from the start of the design and development phase.

Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.

Written By:

Prakya Vasudevan

March 5, 2020
