Sleeping with one eye open: Experiences with production support
You wake up and two phones, your iPad and your watch are ringing. It’s 2am, there are messages on every device, and eventually you become conscious enough to realise something is wrong. You roll out of bed, answer the work phone, and realise you’ve been called to assist with a live production issue. What now? Now, it’s time to shine.
I’ve been doing production support for apps for over seven years, the last three for a large B2C application. At any given point in the day we could have had 100,000 people using the application live. My intention is to share some of what I’ve learned in a way that benefits others. Whether you’re supporting a shiny new app or a long-running system, I hope some of my experiences will help someone just getting started.
I remember being one of the 30 million customers affected by the O2 certificate issue in December 2018, and I looked on at TSB in absolute horror when a botched IT upgrade led to over 1.9 million customers not having access to their accounts from the 20th of April to the 20th of May 2018. These incidents are not isolated, and it’s my hope that we can take a page out of Monzo’s book, where they actually managed to get good press from an outage by being open, honest and non-confrontational with their customer base.
I am going to divide the discussion into a series of parts:
- Put on the right hat: Assuming the role of a support engineer
- Ask questions: Getting to know the problem
- Have your tools ready: Being ready for analysis
- Trust your team: Code reviews, quality assurance and best practice are your friends
- Take downtime: Ensure you recover
- Reflections: What does the ideal production support structure look like?