Blue Green Deployment is a deployment pattern that reduces downtime during deployment by running two identical production setups, called Blue and Green.
During a deployment, when we reboot the API servers, incoming requests may fail because the servers are unresponsive for a short period. The release might also contain a major bug, in which case we need a quick rollback.
How can we handle both problems in one shot? The answer is Blue Green Deployment.
Implementation
Blue Green Deployment is implemented by having one fleet of infrastructure for the old version (Blue) and a separate fleet for the new version (Green). The new infrastructure is identical to the old one.
The deployment flow
- the new deployment artifact is tested and kept ready to be deployed
- a parallel infrastructure, identical to the existing one, is set up
- the new version is deployed on the new fleet (Green)
- the correctness of the setup is validated
- the proxy is reconfigured to send 100% of the traffic to the Green (new) setup instead of the Blue (old) setup
- a final sanity test is run on the new fleet
- the Blue fleet is shut down once the new version is confirmed stable
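
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real proxy integration: `Proxy`, `blue_green_deploy`, and the two check callbacks are hypothetical names, and a real setup would update a load balancer or DNS configuration instead of an in-memory object.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proxy:
    """Hypothetical stand-in for a load balancer or reverse proxy config."""
    active: str = "blue"  # fleet currently receiving 100% of the traffic
    history: List[str] = field(default_factory=list)

    def switch_to(self, fleet: str) -> None:
        # Remember the previously active fleet, then flip all traffic.
        self.history.append(self.active)
        self.active = fleet


def blue_green_deploy(
    proxy: Proxy,
    validate_green: Callable[[], bool],
    sanity_check: Callable[[], bool],
) -> str:
    """Run the cutover and return the fleet that ends up serving traffic."""
    old = proxy.active

    # Validate Green before it receives any traffic.
    if not validate_green():
        return old  # Blue keeps serving; nothing was switched

    # The deployment itself is a single config change.
    proxy.switch_to("green")

    # Final sanity test on the new fleet, now under live traffic.
    if not sanity_check():
        proxy.switch_to(old)  # rollback is the same config change in reverse
        return old

    # Blue can be shut down from here on, once Green is trusted.
    return "green"
```

Note that the rollback path only works while the Blue fleet is still running, which is why shutting it down is the last step.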
Pros of Blue Green Deployment
- rollbacks are just a config change and hence quick
- downtime during deployment is minimal
- deployment is just a flip of a switch
- disaster recovery is simple given we already have the automation to build a parallel setup
- deployments can now happen during working hours
- debugging a failed deployment is simple as we have the infrastructure with the debug information handy
Possible challenges
- the infrastructure cost doubles during the deployment
- stateful applications would need to rebuild their state on the new servers
- the database would have to be shared between the fleets
- any schema migration on the database needs to be backward and forward compatible
- the API responses have to be forward and backward compatible
- setting up this deployment strategy for the first time is difficult
When to use Blue Green Deployment?
- when you need zero-downtime deployments
- when your infrastructure can tolerate a 100% traffic switch
- when you can bear the 2x cost of infrastructure during deployment
Points to remember
- have a solid automated test suite to validate correctness
- ensure forward and backward compatibility of API and schema changes
- infra cost will shoot up, so minimize the time for which you are running 2x infra
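
As a rough illustration of why the overlap window matters, the extra spend can be estimated as a fraction of the normal monthly bill. The figures below are made up for the example, and the model assumes that both fleets cost the same per hour and that only one fleet runs outside the overlap window.

```python
def monthly_overhead_fraction(
    deploys_per_month: int,
    overlap_hours: float,
    hours_in_month: float = 720.0,  # assumes a 30-day month
) -> float:
    """Extra infra cost from running two fleets, relative to one fleet all month."""
    return (deploys_per_month * overlap_hours) / hours_in_month


# 20 deployments a month, both fleets up for 1 hour each time: about 2.8% extra.
short_overlap = monthly_overhead_fraction(20, 1)

# Same cadence, but Blue lingers for 6 hours after each switch: about 16.7% extra.
long_overlap = monthly_overhead_fraction(20, 6)
```

Under these assumptions the cost is 2x only while both fleets are up, so the monthly overhead grows with how long the old fleet stays around after each switch.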