
On Duty & On Call

This page is all about our On Duty (aka Support) and On Call setups.

On Duty

To let team members focus on project work while still responding to the support needs of our customers, Storio’s Engineering Teams, we operate an On Duty rota. This rota runs Mon-Fri during normal business hours.

Structure

  • Each Team Member takes a week of On Duty work
  • If you’re on holiday, you can just straight swap with another individual
  • PagerDuty is used as a tool for managing who is on the rota
  • While we are normalising, we’ll operate two versions of the rota, the main and an additional buddy rota.
  • When a former UK person is in the main rota, a former FR person should be present in the buddy rota and vice versa.
  • The buddy will help the main person with problems outside of their familiar space by pairing with them.
  • We’ll aim to shift to one rota by end of Peak 2022.
  • Others are welcome to dive in on topics they may be familiar with, but make sure you put :eyes: on the topic so the On Duty person knows.
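
The main/buddy pairing rule above can be sketched in a few lines. This is only an illustration of the rule, not how PagerDuty is configured: the `Engineer` type, its `origin` field and both function names are hypothetical, and the names in the usage example are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Engineer:
    name: str
    origin: str  # hypothetical label for the engineer's former team: "UK" or "FR"


def is_valid_pairing(main: Engineer, buddy: Engineer) -> bool:
    """Return True when main and buddy come from different former teams."""
    return main.origin != buddy.origin


def pick_buddy(main: Engineer, candidates: Iterable[Engineer]) -> Optional[Engineer]:
    """Return the first candidate from the other former team, or None if there is none."""
    for candidate in candidates:
        if is_valid_pairing(main, candidate):
            return candidate
    return None
```

For example, with a former UK engineer in the main rota, `pick_buddy` skips other former UK engineers and returns the first former FR engineer in the candidate list.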

On Duty Interface & Expectations

Being On Duty shouldn’t be a horrendous experience and we’re not here to answer every request immediately; we’re partners with our engineering colleagues and our primary reason for being On Duty is to respond to big problems such as site outages or problems that are blocking engineers from working.

  • Primarily, the #pbx-sre channel should be the point of contact for engineers who need assistance.
  • Additionally, PagerDuty should be monitored by the On Duty engineer for any incidents or alerts.
  • The SLA for first response is 30 minutes, but we should aim to respond faster than that.
  • If you’re investigating something, put :eyes: on the message/alert so your colleagues know you’re looking at it.
  • Feel free to ask non-urgent requests to wait, or if you have many urgent requests, ask your teammates for help.

On Call

Outside Peak

  • SRE provides on-call cover during peak trading times:
    • Saturday & Sunday
    • 14:30 to 21:00 UTC, offset for daylight saving time when appropriate.
  • SLA for first response is 30 minutes.
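
As a rough sketch, the outside-Peak cover window can be expressed as a simple check on a UTC timestamp. The function and constant names are hypothetical. The daylight saving offset mentioned above is deliberately not modelled, and treating 21:00 as the exclusive end of the window is an assumption.

```python
from datetime import datetime, time, timezone

# Window taken from the list above; the daylight saving offset is not applied here.
COVER_START = time(14, 30)
COVER_END = time(21, 0)


def in_outside_peak_cover(moment: datetime) -> bool:
    """Return True if a timezone-aware datetime falls inside weekend on-call cover."""
    utc = moment.astimezone(timezone.utc)
    is_weekend = utc.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    return is_weekend and COVER_START <= utc.time() < COVER_END
```

Pass a timezone-aware `datetime`; a naive one would be interpreted as local time by `astimezone`.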

Peak

  • During Peak we provide cover:
    • From start of business day until end of business day
    • From 18:30 to 00:00 UTC Mon-Fri
    • From 07:30 to 00:00 UTC Saturday and Sunday
  • SLA for first response is 30 minutes during October and 15 minutes from mid-November until last order date.
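
The first-response SLAs above can be sketched as a lookup plus a deadline check. The phase names and function names are hypothetical, and the sketch takes the phase as an input rather than deriving it from a date, because the exact dates for mid-November and the last order date are not stated here.

```python
from datetime import datetime, timedelta

# First-response SLA per phase, using the figures from the On Call sections above.
FIRST_RESPONSE_SLA = {
    "outside_peak": timedelta(minutes=30),
    "peak_october": timedelta(minutes=30),
    "peak_mid_november_to_last_order": timedelta(minutes=15),
}


def first_response_deadline(raised_at: datetime, phase: str) -> datetime:
    """Return the latest time a first response still meets the SLA."""
    return raised_at + FIRST_RESPONSE_SLA[phase]


def met_sla(raised_at: datetime, responded_at: datetime, phase: str) -> bool:
    """Return True if the first response arrived on or before the deadline."""
    return responded_at <= first_response_deadline(raised_at, phase)
```

A response 16 minutes after an alert would meet the October SLA but miss the 15 minute SLA that applies from mid-November.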

On Call Interface & Expectations

  • Primary contact is via PagerDuty, responding to alerts.
  • During Peak, the #pbx-on-call channel is a secondary contact method.
  • The organisation also appreciates summaries that all is well during busy trading days, including things like k8s pod count, Babel instance levels, etc.
  • A telephone directory of everyone’s mobile is maintained during Peak in case of emergencies.

PagerDuty Tips & Tricks

PagerDuty is our scheduling and alert routing tool. We use it to track who is on call and to notify them.

  • Each SRE should access PagerDuty via Okta and set up a profile
  • It’s a good idea to install the PagerDuty app on your phone. If you’re not happy having PD on your personal phone, the company can provide a work mobile for you.
  • Register your telephone number with PagerDuty
  • Configure Notification Rules in your PagerDuty Profile. The defaults result in you being emailed, pushed, texted and phoned immediately when each alert is created. A more practical configuration could be:
    • Leave email notifications set to immediately
    • Leave push notifications set to immediately
    • Set a 10 minute delay on SMS notifications
    • Disable or set a 15 minute delay on telephone calls
  • A WebCal/iCal feed can be created from your On Call shifts in the User Settings section - this can then be plumbed into Google Calendar so you can see when you’re on call.
  • Incidents will automatically be assigned to you when you’re on call. Use PagerDuty to manage the lifecycle of these alerts, such as:
    • Acknowledging incidents when you start investigating them - this will stop the alert from re-firing or escalating to other responders.
    • Reassigning incidents when a colleague takes over looking after them, or when you go off call while an incident is still unresolved.
    • Closing incidents that don’t have automatic resolution once you’ve dealt with them. A great example of this is the Storio Incidents flow, which will create a PD Incident and route it to the SRE on duty/on call so we know an Incident has been raised. Once we know, these types of alerts can be closed.
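
To make the suggested Notification Rules concrete, the sketch below models them as plain data and shows which channels would have notified you after an alert has gone unacknowledged for a given number of minutes. It does not call PagerDuty: the `NotificationRule` type, the channel labels and `channels_fired` are hypothetical, and the phone rule uses the 15 minute option rather than being disabled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NotificationRule:
    channel: str
    delay_minutes: Optional[int]  # None means the channel is disabled


# The "more practical configuration" suggested above.
SUGGESTED_RULES = [
    NotificationRule("email", 0),
    NotificationRule("push", 0),
    NotificationRule("sms", 10),
    NotificationRule("phone", 15),
]


def channels_fired(rules: List[NotificationRule], minutes_unacknowledged: int) -> List[str]:
    """Return the channels whose delay has elapsed for a still-unacknowledged alert."""
    return [
        rule.channel
        for rule in rules
        if rule.delay_minutes is not None
        and rule.delay_minutes <= minutes_unacknowledged
    ]
```

With these rules, an alert acknowledged within the first ten minutes only ever reaches you by email and push; SMS and phone calls are reserved for alerts that have been left waiting.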