Limiting “Software Infant mortality” rate — decoding GO-JEK’s Deployment Checklist
Every time you deploy a piece of software, you’re unknowingly taking an infant on a crazy ride — you expect the deployed software to become an adult as soon as it hits all app servers — it works sometimes and sometimes it doesn’t.
Deployments are a central theme for any organisation. At GO-JEK, our deployments are completely automated. As of today, we can’t deploy a single service manually; the services complexities and inter-service communication is so complicated, it’s better for machines to sort that out for us. But even though we have completely automated deployment and we gear towards extreme automation, we still forget few things. “Deployment at wish” is main-stream, and hence the famous quote “with great power, comes great responsibility”. We modified it a bit ;)
“Great flexibility comes with great caution, then comes great power, and thus followed by great responsibility”
I tweeted about it a few weeks ago, and people asked about my checklist and why? Well, what can this checklist be apart from database changes? And why would you have a deployment checklist when everything is automated? There are two reasons for us to have some sort of sanity in our crazy world of pair programming, TDD and extreme hands on requirements.
- We do trunk based development
- We don’t restrict a deployment trigger to specific people. As soon as you are done, go ahead.
But sometimes multiple commits go to a single deploy. If we do deploy after few builds, we’ll have changes piled up and it’s easy for someone to omit the cache update change or some config change required by the previous build, which hasn’t been deployed yet. And this bundles up the present build. Well, we aren’t crazy all the time, and we try not to deploy during high traffic time. Sometimes within those two or three hours, we might end up with 2, 3 or even 5 builds, because developers are committing their daily work.
Given this, we started experiencing stupid mistakes, and every time we went “oh sh*t, why didn’t we think about that?” — When you have these “Oh sh*t” moments more than 2 times are per week, it was time for us to rethink the whole process.
GO-JEK is a classic product engineering org. We don’t believe in processes and hierarchies — they stay in CEOs offices and don’t ever see the light of the day. Introducing draconian approval processes goes against our very culture, but we also wanted to make sure we are far more disciplined.
“Forced practices are like the food you don’t like, but you are made to eat it, that kind of takes out whole fun of having food”
The same goes with developers/engineers. If they don’t like it, they will start delaying deployments — and I have experienced enough horror stories about monthly deploys.
So we came up with simple rule — here is a checklist, go answer all the questions in it — like a 15 minutes exam, file a deployment ticket, we will go over it and you will have a green signal to push the button. If your deploy fails, promptly follow up with RCA for that failure, and after reading RCA*, I would silently go and add that missing check into our deployment checklist. And once you fail a build, then every team member in your team has to do deployments and go through the deployment checklist.
Before we introduced checklist, we used to have around 80% success on our deployments; not because software was faulty, but because we forgot some small things — usually when two or three builds are combined.
But once the checklist was introduced, the results were astonishing. Across 300+ micro-services and some ~2500+ deploys a month, we have a maximum of 6 to 7 failures, compared to the earlier 75% success rate. We have now moved to have a 99.99% deployment success rate.
I am not going to put the entire checklist here, because this varies from business to business. But I would like to put a rough outline. Hope this helps.
[explain overall impact and feature overview, bug details]
List all committers, Project Lead
Acceptance Signed off
- GO-JEK Core
- GO-PAY Core
List all services which have been impacted and why? What kind of test has been performed for each one of the them?
- Please provide javadoc/rubydoc/godoc api documentation for your service client or service uri.
- Please provide URL for your wiki page for this specific change.
- Please provide datadog dashboard URL for your service
- Please provide newrelic dashboard URL for your service
- What’s your rollback strategy?
- Do you switch rollback with feature flag or does it require complete deployment again?
Service Component — Database
- List down all the database changes, if you added any columns, removed any columns, added or removed any tables.
- What kind of indexes do you have?
- If you added new column does it require index? If yes, why? If no, why not?
- Do you make changes to records? Do you do frequent deletes or updates?
Service Component — Security
- Are you making changes to the public API? If changes are directly impacting the public API end point, then please get approval from the Security Team and a sign off from Sheran — Please also include security ticket link over here.
Service Component — Monitoring
- List all the services you own and list down each server monitoring parameters
- Alerts for service uptime
- Alerts for service performance degradation
- Alerts for service machine disk/cpu/memory — what’s the threshold and how pager duty is triggered
- Please include today’s screenshots for each of them. We need to make sure that you have proper monitoring in place.
These are few of the checks we do, sometimes these don’t take more than 15 minutes, and sometimes they force you to think, and make sure that you are covered. We have many more checks regarding proxies, caches, load balances, queues, application components, customer apps and metrics calculations, data aggregation.
This checklist has certainly helped us. It has put the trust back in the system and automation. It has been an amazing journey from no automation, to complete automation to ruthless deployments to breaking systems everyday to balanced, scaled and automated deployment system.
What a journey!
The title was inspired by Peter Pronovost’s Checklist
- Hospitals in the US – still happy with checklist
- Recently a checklist saved hundreds in Uttar Pradesh, India – helping infant mortality
RSA actions result in 4 things — 1. What happened? 2. What was the impact? 3. Why did it happen and why it took time to remedy? Why not auto-heal? 4. Action items. Usually, starts like this: At around 10:15 PM the deployment progressed on to machines and we started seeing errors on load balancers and proxies