Mark Henderson has been a Site Reliability Engineer at Stack Overflow since 2015. Before this he worked as the sole systems administrator at a small software company in Sydney, Australia. These days, he lives in South Australia with his wife and two children and works from home.
I started off in Australia with a typical IT career: doing retail, call center and help desk work at a variety of companies, including Cisco. When I graduated from university, I told my employer at the time that I was planning on leaving and getting a “real” job. They referred me to the person who designed their help desk software, and I worked for him for around 9 years. I started by doing application development and eventually built out a small datacenter. I was the sole employee when I started, but just one member of a larger team when I left.
I moved to New York City to join Stack Overflow in 2015, which was my first job that had “SRE” in the title. I went from being the sole systems administrator in a very small company to being a part of one of the most efficient SRE teams in the world. I got to learn from some of the best in the industry: George Beech, Tom Limoncelli, Kyle Brandt, Nick Craver, and many others. I’ve worked on virtually every part of Stack Overflow: the public infrastructure that serves over 40 million developers with less than one rack of hardware, the logging infrastructure that ingests and analyses over half a terabyte of logs every day, and the CI/CD pipelines that keep Stack Overflow updated and in check.
Currently, I work on the Azure infrastructure and tooling for Stack Overflow Enterprise, a totally private version of Stack Overflow that we can run for you. It is meant for the proprietary code questions that you can’t ask on the internet.
SREs love to work on SLAs, SLOs, monitoring, and metrics. “Measure everything” is one of the tenets of SRE work - but it’s just one. It’s hard to get out of the mindset of just measuring everything and start looking at the other things SREs should be doing, such as reducing organisational silos. It’s very easy to fall into the trap of working only on easily monitored metrics (such as a latency budget) instead of the intangible ones (such as increasing cross-team collaboration).
There is no single tool that I can’t live without. Ask 5 SREs what their toolsets are and you’ll get 5 different answers. The fact is that the tools we use are secondary to the goals we’re trying to achieve. However, one thing that is not negotiable to me is having a quiet space to work. Right now I work from home - which for me is wonderful. But even when I worked in an office, having a private office with a door that closes was worth everything. Having a private space means not having to fight against the cacophony of an open-plan office or the dull drabness of a cubicle. Particularly in western culture, we need to stop thinking “Private office == more status or higher rank” and start thinking “Private office == ability to focus”, and give people access to the work environment that makes them the most productive.
Use your calendar to its full potential. This isn’t really an SRE-specific hack, just generally good life advice. It makes scheduling things with your coworkers so much easier. Don’t be afraid to schedule a meeting with yourself on your calendar to give yourself some actual work time if you start to get overwhelmed with meetings.
Put your personal items on the calendar too - even which recycling bin goes out onto the street on which day (mark them as private if you don’t want your coworkers to see the details). If you have coworkers in other time zones, add those time zones to your calendar so you can see at a glance what time it is there.
To give some actual SRE advice: validate that your SLOs are meaningful. If you find out that your SLO was pulled out of thin air, then perhaps your error budgets or latency budgets are needlessly strict. Find out what they should actually be and you might find out that your job becomes a lot easier.
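As a rough illustration of why the target matters, here is a minimal sketch that converts an availability SLO into the downtime it allows over a window. The function name, the 30-day window, and the example targets are assumptions for illustration only, not figures from Stack Overflow.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability an availability SLO allows.

    slo_target is a fraction such as 0.999 (i.e. 99.9%).
    window_days is the length of the SLO window (30 days is an assumption).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The error budget is whatever fraction of the window the SLO does not cover.
    return (1.0 - slo_target) * total_minutes


if __name__ == "__main__":
    # Hypothetical targets: compare how much room each one leaves.
    for target in (0.999, 0.9999):
        budget = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days allows about {budget:.1f} minutes of downtime")
```

Going from three nines to four nines shrinks the budget from roughly 43 minutes to roughly 4 minutes per 30 days, which is why an arbitrarily chosen target can make the job much harder than it needs to be.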
SRE should not be a silo on its own. SRE is not just a drop-in replacement for traditional systems administration. It is not a replacement for DevOps. SRE should encompass the pillars of DevOps: sharing responsibilities with developers, working in small batches, and not placing blame.
If you have an existing systems administration team, you can’t just rebrand them as “SRE”, hand them a copy of the Google SRE book and let them loose. Migrating from traditional systems administration to the SRE mindset requires organisational change. Doing SRE successfully means that your SREs need to start breaking down walls with the development teams that they are supporting. That means working with the devs and, just as importantly, having the devs work with the SREs. It’s not an overnight change, and although there are many wrong ways to do SRE, there is no one right way.
Secondly, you do not need to be an amazing developer to do SRE. A large part of SRE is automation - but that does not mean you have to write everything from scratch yourself. If you understand how to read a JSON file, you can write a Terraform configuration. If you understand how to write a PowerShell cmdlet, then you can implement Octopus Deploy. You have to be willing to do some coding, but you do not need to be a world-class developer.
It sounds like nepotism, but I really enjoy the talk given by Tom Limoncelli (who happens to be my manager) on “DevOps Where You Wouldn't Have Expected”. He applies some critical thinking to the DevOps principles that we use so often in SRE and takes those principles to places outside of systems administration and SRE - such as new employee onboarding.
Of course, there’s always the venerable The Phoenix Project. It really is a must-read for anyone working in SRE.
On a more recent front, “Retrospectives for Humans (a Crash Course)” by Courtney Eckhardt was probably my favourite talk at SRECon 2019 Asia/Pacific (the most recent conference I attended). She gives an excellent perspective on retrospective/post-mortem analysis by comparing it to a real-world disaster investigation and the language that was used to describe the cause of the accident. Because of the nature of the English language, the real cause of the disaster was never actually found. By being aware of the faults in our tools (in this case study, the tool was the English language) we can reach much more useful conclusions.
Follow the journey of more such inspiring SREs from around the globe through our SRE Speak Series.