The views discussed in this article are personally held by the author/interviewee and do not in any way represent those of their employer.
Nishant Singh is an SRE at LinkedIn based in Bangalore. Currently, he is working on building and maintaining applications that improve the overall MTTD (mean time to detect) and MTTR (mean time to recover) of the site. He likes to build services and play with the latest technologies. Before LinkedIn, Nishant worked as a DevOps engineer for a few companies in the security and e-commerce domains, where he was primarily responsible for building infrastructure, deployment pipelines and security.
I have had a keen interest in computers since my early school days. The first language I learned in school was C++, which inspired me to explore computing further. Initially, I fiddled with my Windows PC using C and C++, and then got a taste of shell scripting, which took me into the world of basic computer hacking and networking. I started by writing exploits and logic bombs, which mostly ended up crashing my system. I developed a taste for backend and distributed systems during my early days at college. A few years later, I got my first job: an internship with a security company that worked with AWS and ran a multi-tenant system for its customers. My boss’s boss, who interviewed me, had revolutionary ideas for the company as a whole, and I was at the center of it all. I saw the whole shift towards the DevOps mindset in this organization and understood how essential it was for a company to keep technology aligned with the business. As I put in the time to learn more about the craft, I learned about SRE practices from the way Google runs its production systems. All of this played a huge part in my drive to find the right people and place to work with, which led me to LinkedIn.
The technology world is vast, and SRE makes up a large subset of it. There is an abundance of great problems to solve, and the solutions are just as interesting. One can’t assign a set duration to pick up all of this. Today, there is no foolproof course that lets you graduate and feel like a rockstar SRE. It helps a lot to learn on the job and keep grinding towards becoming cloud agnostic, learning more about application development, maintenance and scaling infrastructure on any cloud. Another interesting challenge is constantly making sure that the production environment remains stable. In most cases, even though the SRE is responsible for the service’s reliability, it is the application developers who own the actual application logic. The downside is that you may miss minor details that change with every release of the application, since you don’t directly contribute to the code. Ultimately, it falls to the SRE to learn the application and business logic, which then helps you pitch ideas during the design phase of application development.
Automation plays a central role in my work. Right now, I rely on Python for most things. Apart from that, I spend a good amount of time with Terraform, Ansible, Azure and Kubernetes (K8s) on the daily grind.
I believe that having a good monitoring stack backed by an effective logging system is a huge advantage. Oftentimes, teams do not invest much in logging systems because of the operational overhead of maintaining them. However, if logging systems are configured correctly, they can reduce MTTR quite significantly.
The craziest on-call I was a part of involved a misbehaving NIC, triggered by a configuration issue on the top-of-rack switch. This disrupted the service for an entire region.
We then narrowed the problem down to an issue in the network configuration (and no, it wasn’t DNS) that caused the application communication issue. The interesting part about this story was that the issue was never observed during the usual manual debugging. This experience taught me that it can help to think from a machine’s perspective while debugging, however hard that may be.
I think the role of SRE will become more specialized and streamlined with newer applications and technologies. A few technologies to keep an eye on are machine learning, neural networks and image processing. The future will require the SRE function to adopt more skills than just software engineering and operational knowledge.
The world is moving to a more skill-based economy, which means that an SRE will be expected to keep pace with the skills of a domain expert and help them architect their stack more efficiently, on top of what is already expected today.
Prioritize your stuff at the beginning of each day. I generally have a list of tasks that I write on a piece of paper to keep as a reminder of things that I need to finish that day. This helps me focus on the important tasks at hand.
The other thing that you must learn is working with multiple displays and organizing your terminal. Choose a tool like tmux or anything else you prefer, especially if you spend most of your time SSH’ing into boxes. Also, stick with an IDE of your choice that you find fun and effective to use.
SRE definitely involves writing code, at times a lot of it, depending on the team you are a part of. A common misconception is that the role is more focused on operations than on actually writing code. A good SRE culture strikes the right balance between writing code and doing operational work.
At first, for someone coming from a pure operations background, code looks daunting, but I think one should focus on the logic of solving a problem rather than on how to do something in a specific language. Languages come and go; it's the logic underneath that lets you solve the problem in whatever language you write code in.
Some of the best practices I have learned are:
Apart from the regular tech books, one book I read recently was “The Phoenix Project”. The book talks about the journey of a company and the challenges involved in building and maintaining its numerous departments, such as security, IT and engineering.
The book makes the case for a DevOps cultural shift by taking you through a fictional working environment, with a focus on breaking down silos. For anyone who is new to the role, I definitely recommend reading it. Another great book is “Clean Code” by Robert C. Martin, aka Uncle Bob, which I highly recommend for learning the basics of writing clean, production-grade code.
I think a good SRE is someone who is systematic in their approach to solving a problem, but a great SRE is someone who also stays focused while solving it. It takes time to reach the latter, usually years of experience.
The initial years are spent mostly fighting fires and getting panic attacks. However, all of these hard experiences will push you to develop the skill to do it efficiently.
Along with this, it is important for an SRE to be a team player and push to inspire everyone in the team to do the ‘right’ things even in tough situations.
Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.