Life On The Critical Path

Arran McCabe

Sometimes, teams tell me that they don’t need to focus on availability yet because they are "moving fast and breaking things". This can be the right strategy early on. You don’t have many customers, and most of them know they are working with an early-stage startup and have set their expectations accordingly. This approach breaks down when your service sits on your customer's critical path. If your outage impacts their revenue or brings their business to a halt, they won’t be so understanding. Your customer's trust is hard-won and easily lost. Repeated outages can drive customers to your competitors because, ultimately, we all want to do business with people we can depend on.

Are You On The Critical Path?

How long would it take your customers to notice if we turned off your infrastructure right now? What would be the material impact for them? If the answer is measured in hours and the impact on the end user is nothing more than a mild inconvenience, then you are firmly in the “move fast and break things” category. Very few companies are in this position. Even those that are need to manage the reputational consequences of repeated outages. We have all seen it: customers take to Twitter and Reddit whenever an outage happens to answer the question “Is it just me?”. The following day, there is a good chance the media will pick it up. You want to make the front page of Hacker News or TechCrunch because of how cool your new feature is, not because of how severe or how long your outage was.

Let’s consider the matrix below. The examples attempt to capture where each company sits in terms of customer impact and time to detection. Netflix, in the top left quadrant, has millions of users all over the world. If it had an outage, users would know within seconds. Movie night would be ruined, but users would be otherwise unaffected. As mentioned above, the principal impact is on Netflix’s reputation. Fiix, in the bottom right quadrant, is used to schedule maintenance for industrial equipment. If it failed to notify engineers of a critical task, the failure could take days to detect and could affect the physical safety of the equipment operators. Although these businesses are extremely different, both have incentives to drive down their outage minutes.

Most B2B SaaS offerings fall towards the centre-left of the grid, much like Stripe. These services are under continuous load, and their downtime usually translates into lost revenue for the impacted customers. These markets are also some of the most competitive. Stripe has dozens of competitors. How many outages would it take before a customer began to evaluate these alternatives? Stripe may have a superior product, but if it cannot be depended on, customers will sacrifice features and UX for availability every time.

We Are On The Critical Path; What Now?

As we discussed above, if your customers depend on you to run their business, you have every reason to minimize your monthly outage minutes. There are two key metrics in this space: MTTR (mean time to resolution) and MTBF (mean time between failures). Simply put, we can shorten outages (lower MTTR) and make them less frequent (higher MTBF). The following is a non-exhaustive list of steps to drive down customer impact while maintaining your innovation rate. We will explore each of these in more detail in later posts.
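
To make the two metrics concrete, here is a minimal sketch of how they could be computed from an incident log. The `Incident` record and both function names are hypothetical, and MTBF is taken here as the average gap between one incident's resolution and the next incident's start; other definitions exist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime   # when customer impact began
    resolved: datetime  # when customer impact ended


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolution: the average duration of an incident."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean time between failures: the average healthy gap between incidents."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    # Gap from the end of one incident to the start of the next.
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Illustrative data only, not real incidents.
log = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 11, 10, 30), datetime(2024, 1, 11, 11, 30)),
]
print(mttr(log), mtbf(log))
```

Tracking both numbers month over month shows whether outages are getting shorter, rarer, or neither.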

Measure Customer Experience

Ensure you collect telemetry that reflects how customers experience your product (Service Level Indicators). This may include error rates, latencies, throughput, etc. Some of these metrics can be collected server-side, while others are best measured from the client. Canaries/Synthetics, which issue simulated requests against your system at regular intervals, are a great way to collect this data without instrumenting your clients and processing the associated data streams.
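
As a rough illustration of the canary idea, the sketch below times a simulated request and records whether it succeeded. `ProbeResult`, `run_probe` and `error_rate` are hypothetical names, and the `request` callable stands in for whatever call best represents a customer action in your system.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeResult:
    ok: bool                # did the simulated request succeed?
    latency_seconds: float  # how long the request took


def run_probe(request: Callable[[], None]) -> ProbeResult:
    """Issue one simulated request and record success and latency."""
    start = time.monotonic()
    try:
        request()
        ok = True
    except Exception:
        # Any exception counts as a failed probe in this sketch.
        ok = False
    return ProbeResult(ok=ok, latency_seconds=time.monotonic() - start)


def error_rate(results: list[ProbeResult]) -> float:
    """Fraction of probes that failed, usable as a Service Level Indicator."""
    if not results:
        raise ValueError("no probe results")
    return sum(not r.ok for r in results) / len(results)
```

In practice, `request` would wrap an HTTP call or similar, and a scheduler would invoke `run_probe` at a fixed interval and ship the results to your metrics store.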

Once this data is being collected, you have a reliable mechanism to identify customer impact minutes. This allows you to set availability goals with your team (Service Level Objectives). These goals are usually expressed as percentage uptime per calendar month, e.g. 99.9% availability (roughly 43 impact minutes in a 30-day month). These goals help all the stakeholders get on the same page and identify troubling trends. They are also a precursor to Service Level Agreements. SLAs are formal contracts between customers and vendors that outline how the customer will be compensated if the vendor does not maintain the agreed level of uptime. These agreements are commonplace in industries such as telecoms and cloud computing.
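
The arithmetic behind an availability goal is simple enough to sketch. The function names below are hypothetical, and a 30-day month is assumed unless stated otherwise.

```python
def allowed_impact_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Impact minutes a month can absorb while still meeting the SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def measured_availability(impact_minutes: float, days_in_month: int = 30) -> float:
    """Availability percentage achieved, given the observed impact minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (1 - impact_minutes / total_minutes)


# 99.9% over a 30-day month leaves roughly 43 minutes of customer impact.
print(round(allowed_impact_minutes(99.9), 1))
```

Each extra "nine" cuts the budget by a factor of ten, which is a useful way to frame the cost of a tighter goal with stakeholders.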

On-call and Paging

Once an issue is detected, it's important that engineers are engaged to address it as soon as possible. An important step towards this is creating an on-call rota. When operations is everyone's job, it's nobody's job. Creating explicit on-call shifts ensures folks can plan their work and personal lives accordingly so that someone is always ready to answer a page. It also ensures everyone has time to recover after a busy shift.

Regarding paging, I have noticed a trend of smaller teams relying exclusively on Slack/MS Teams hooks to achieve this. Although chat messages have a place in operations, they can't be relied upon to wake someone up at 3 a.m. to look into an issue. Services like PagerDuty use a companion app that can bypass silent and do-not-disturb modes. Although this is a horrible way to wake up, reducing time to engagement is critical for reducing MTTR.

Runbooks

An incredible amount of time is lost during an event searching for documentation on how the issue was addressed last time or instructions for how a particular action is performed. This information is usually spread across GitHub, Slack, emails, Google Docs, etc. Runbooks are a consolidated knowledge base for your system, ensuring the whole team knows where to find information when it's needed.

Post-mortems and Ops meetings

When outages inevitably happen, their pathology needs to be deeply understood to prevent future occurrences. Post-mortem documents are a great way to achieve this. The team with the most context produces a document outlining the event's timeline, its impact on customers, how it was resolved and what can be done to prevent a recurrence. The document is then reviewed with the broader set of stakeholders so that the learnings can be shared with the rest of the organization.

Operations meetings are a great place for this type of socialization. These are weekly meetings where operational health is the primary focus. The agenda will vary from team to team, but broadly it will address:

Postmortems since the last meeting
Review of outstanding action items
On-call shift review
Ticket queue metrics

A great way to drive participation and engagement in this meeting is to rotate the meeting lead weekly. The outgoing on-call engineer often has the greatest context, so they are well placed to chair the meeting.

Change Management

A disproportionate number of outages are caused by some form of change to the system in question. This includes deployments, migrations and maintenance. Once this is known, steps can be taken to reduce the risk surrounding change.

During my time at AWS Networking, the golden rule regarding change management was “If in doubt, roll back”. This means restoring the system to the last known stable configuration at the first sign of trouble. To achieve this, a plan needs to be prepared prior to the change outlining the steps to take should the worst happen. This is particularly important for code deployments as they will happen most frequently. If possible, bad deployments should be detected and automatically reverted to minimize their impact.
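
A minimal sketch of automated "if in doubt, roll back" might look like the following. It is not a description of any particular deployment tool: `deploy`, `rollback` and `error_rate` are hypothetical callables you would supply, and the threshold, check count and interval are illustrative assumptions.

```python
import time
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
    checks: int = 5,
    interval_seconds: float = 10.0,
) -> bool:
    """Apply a change, watch a health signal, and revert at the first sign of trouble.

    Returns True if the change was kept, False if it was rolled back.
    """
    deploy()
    for _ in range(checks):
        time.sleep(interval_seconds)  # give the change time to take traffic
        if error_rate() > max_error_rate:
            # Restore the last known stable configuration.
            rollback()
            return False
    return True
```

The important design choice is that the rollback path is prepared and exercised before the change goes out, so reverting needs no decision-making during the event.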

If your system has multiple production instances, in multiple regions for example, you should establish a policy for staggering changes to minimize the potential blast radius. If you work from your least utilized instance to your most utilized, a bad change can be halted early, reducing the total number of customers impacted.
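
The staggering policy can be sketched as a simple loop. The `staggered_rollout` function, the `apply_change` callable and the instance names in the usage note are all hypothetical; `apply_change` is assumed to return `False` when the change looks unhealthy on that instance.

```python
from typing import Callable


def staggered_rollout(
    utilization: dict[str, float],
    apply_change: Callable[[str], bool],
) -> list[str]:
    """Apply a change from the least to the most utilized instance.

    Stops at the first instance where the change fails and returns the
    instances that were changed successfully before that point.
    """
    completed: list[str] = []
    for instance in sorted(utilization, key=lambda name: utilization[name]):
        if not apply_change(instance):
            break  # halt here so busier instances are never touched
        completed.append(instance)
    return completed
```

For example, with utilization figures of `{"region-a": 0.9, "region-b": 0.4, "region-c": 0.1}`, the change would reach `region-c` first and `region-a` last, so a failure in `region-b` would leave the busiest instance untouched.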

Most importantly, review high-risk changes with your team before they are executed. Every script or command issued against a production system should be treated like a code change and reviewed accordingly.

Conclusion

Building and operating highly available systems is no mean feat. It requires world-class architecture and robust organizational processes. availabl is a platform that captures many of these best practices in a single place so your team can focus their time where it matters: adding customer value. If you would like to chat about your team's availability, my DMs are always open.
