Making your organization more transparent is not always an easy process. In our latest blog post, Adam Hammond shares some tips and tools that can help you get started when it comes to keeping your teams and customers in the loop during downtime. The core message is that you need to make communication a cultural pillar of your organization.
Communication is key. This is true in all aspects of life, and it especially applies to managing critical incidents in your company. Communicating effectively with your customers helps maintain goodwill and keeps them using your product; conversely, failing to keep them informed can result in angry customers and lost business. Building and maintaining good communication channels, both within your company and with your customers, is essential to ensuring your product continues to be patronised. Done properly, this limits the impact (both technological and emotional) of the outages that will inevitably occur.
In a technical sense, proper communication channels help technical teams who may be unaware of how each other operates to work together efficiently and resolve problems quickly. In an emotional sense, business teams, management, and customers will be happy to know that their data and technology are in the hands of people who know what they’re doing. These stakeholders are more comfortable when they are included in the process, have their concerns proactively acknowledged, and are treated as equals by technical teams.
Many “techies” enjoy communicating using complex, domain-specific language when managing their services, and this rarely poses a problem in the day-to-day of their jobs. However, when a service-impacting incident is underway, it’s not just you and your team members who are fixing it: everyone is. In the same way spectators at a football game cheer on their home team, your managers and customers are there to provide you with the support you need to resolve issues. But when you speak to them in complex language that is difficult to understand, you unintentionally gate-keep, which prevents those same people from assisting you because they simply don’t understand what you’re saying. Similarly, product managers and marketing teams have a tendency to “sanitise” public communication; by the time information makes its way into customers’ inboxes, it’s functionally useless.
Consider the following: “Our internet service provider has published incorrect routing information, which means that everything on our internal network does not know how to reach the internet. We can fix it by temporarily overriding the incorrect routing information, but our ISP will need to correct their configuration”. This explanation uses clear, common phrases to explain the issue: internet service provider, internal network, internet, routing information. It contains no overtly technical information, and it also describes how the issue can be resolved. Unfortunately, I have very rarely seen communication this clear from technical staff. Usually, a manager comes along with a question like “why is the network down?”, and the following answers are given:
The first response lacks any clarifying information and doesn’t add any context to the question. The last two are so technical that anyone outside your team, including people in other technical teams, will not understand them without domain-specific knowledge. This example was adapted from the service outage review conducted by Cloudflare for their outage on the 17th of July 2020. I’m a Cloudflare customer, and the way they clearly communicated their understanding of the issue, the steps they took to resolve it, and their post-mortem gave me the confidence to remain one. If they had responded with “the service is out, we’re looking into it”, I would have moved to another provider. This is usually what happens when simplistic, pointless updates are given: people lose trust in the service provider’s ability to actually do its job.
Hot tip: customers already know the service is out. They don’t need confirmation that the outage is occurring; they need assurance that the service is being fixed.
These pointless updates usually originate with technical teams who provide little to no context to product managers and external communications teams. In turn, those teams have little to pass on to customers. Conversely, it is possible to be “too communicative”, notifying customers of outages in individual pieces of infrastructure that are redundant or will not affect them in a material way.
There are a few keys to communicating with your customers that will ensure that they remain customers:
These four behaviours ultimately result in a better experience overall, developing and enhancing the relationship you have with your customers. They change the dynamic from “us vs. them” to “we are in this together”.
Now that we understand what we need to communicate, we need to know how to convey that information to those who need it. During an outage, there are four methods of communication, each building on the last, which provide each audience with the information they need to do their jobs.
Whether this is in person, via chat, or on a conference call, this communication happens directly between the people “on the ground” fixing the issues. It should be highly technical, to give technical staff the information they need to fix technical problems.
A War Room is a place where technical and non-technical staff come together to provide updates and discuss a critical outage. Generally, updates should only be provided when the status of an incident changes (e.g. the cause is discovered or service restoration is beginning). An Incident Commander (IC) should also provide non-technical updates in the incident notes to ensure that appropriate communications can be drafted for all stakeholders.
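The rule that updates should only go out when the status of an incident changes can be sketched in a few lines of Python. This is a minimal illustration, not part of any real tool: the `Status` values and the `IncidentLog` class are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    # Hypothetical incident stages, loosely based on the examples in the text.
    INVESTIGATING = "investigating"
    CAUSE_IDENTIFIED = "cause identified"
    RESTORING = "restoring service"
    RESOLVED = "resolved"


@dataclass
class IncidentLog:
    """Hypothetical in-memory record of War Room updates."""

    current: Optional[Status] = None
    updates: List[Tuple[Status, str]] = field(default_factory=list)

    def report(self, status: Status, plain_summary: str) -> bool:
        """Record an update only if the incident status has changed.

        Returns True when the update was recorded, False when it was
        skipped because the status is the same as the last one reported.
        """
        if status == self.current:
            return False
        self.current = status
        self.updates.append((status, plain_summary))
        return True


log = IncidentLog()
log.report(Status.INVESTIGATING, "We are looking into why the network is unreachable.")
# Same status as before, so this is skipped rather than sent as noise.
log.report(Status.INVESTIGATING, "Still looking.")
log.report(Status.CAUSE_IDENTIFIED, "Our ISP published incorrect routing information.")
```

After these three calls, `log.updates` holds two entries: the repeated "investigating" message was dropped because it carried no change in status.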
A public Status Page should be made available to all customers and potential customers of your service. This is vital as it ensures that your service is fully transparent, and it also provides a central place for your customers to find information in the case of an outage. The information here should be non-technical in nature and give customers what they need to make critical decisions regarding their own services.
Within 48 hours of an incident being resolved, you should provide a full postmortem of the incident on your blog and/or status page. This provides your customers with a full understanding of why the incident occurred, how it was resolved, and what actions can be taken to limit similar outages in the future. It should be a blend of non-technical and technical content, with a business summary at the start followed by a technical analysis. Customers should also be invited to ask questions about the incident on social media channels to ensure that any concerns they have are addressed.
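If you want to track the 48-hour target automatically, the deadline is simple to compute from the time of resolution. The sketch below uses only the Python standard library; the function names and the example timestamp are illustrative assumptions, not taken from any particular incident tool.

```python
from datetime import datetime, timedelta, timezone

# The 48-hour window suggested above for publishing a postmortem.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_due(resolved_at: datetime) -> datetime:
    """Return the latest time the postmortem should be published."""
    return resolved_at + POSTMORTEM_WINDOW


def is_overdue(resolved_at: datetime, now: datetime) -> bool:
    """True if the postmortem window has already passed."""
    return now > postmortem_due(resolved_at)


# Hypothetical resolution time, used only for illustration.
resolved = datetime(2020, 7, 17, 22, 0, tzinfo=timezone.utc)
due = postmortem_due(resolved)  # 2020-07-19 22:00 UTC
```

Using timezone-aware timestamps avoids ambiguity when the technical and communications teams work in different locations.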
Squadcast has many features that can enable you to keep your customers engaged and informed during an outage. Implementing these in your incident management process is very simple, and they integrate with your existing Squadcast usage.
Incident Notes (previously War Room) is an excellent tool for keeping everyone up-to-date. Use this effectively to drive inter-team communication by having an Incident Commander (IC) who can translate between technical and non-technical staff. Ensure that all communications are addressed to teams or individuals so that nothing is missed. Finally, be sure to send your account managers, support staff, and management to War Rooms for updates; your IC should be providing non-technical updates as the status of your incident changes.
StatusPage is Squadcast’s tool for providing public updates to your customers. StatusPage allows you to provide updates from within Squadcast’s Incident Page, reducing the need for your team to jump between tools to provide customer updates. Users can simply select the option to Update the StatusPage, provide a status and message for the incident and publish it to customers. Having such an easily accessible solution for support staff means that communication processes can be augmented without adding burden or extra work. It’s all conveniently located in one central place.
The Incidents Page should be your one-stop-shop for all information pertaining to an incident. Your post-mortem should derive all of its information from this page, and staff should be encouraged to ensure that technical and non-technical updates are adequately managed within an incident. By doing this, technical staff can be easily removed from the external communications process (which they probably find boring) and communications staff know they can rely on the information they can obtain via the incident.
Making your organisation more transparent is not always an easy process, but the tips and tools in this article should give you an idea of how to begin. The core message is that you need to make communication a cultural pillar of your organisation. Don’t just write a procedure that says that “staff should communicate with each other”; encourage communication in every part of your organisation. When outages occur, get everyone on a phone call. Have your communications teams sit with technical staff to understand how the business runs. Encourage customers to follow up with your team for information following an outage. There are many things you can do to get started, but the most important thing is that you do something!
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.