Cluster Operations

From time to time, disruptive events will occur in our clusters. These can be voluntary or involuntary, and your workloads must be able to handle both to stay highly available. An example of a voluntary disruption is rebooting nodes for OS upgrades and security patches. Involuntary disruptions include hardware failures, nodes running out of resources, etc. In general, your workloads need to run with a minimum of 2 replicas, so that one replica can keep serving while another is being rescheduled.

Common disruptions

  • Test environment clusters are updated automatically. The schedule is not fixed, but updates only run around midnight. (Follow AKS releases here or in the Slack channel #releases-aks)
    • Read more about it here. Clusters are currently set to the node-image upgrade channel
    • Each update drains nodes, which can cause downtime if the app is not set up to handle disruptive events
    • Occurs frequently (weekly)
    • This feature will be enabled for production environments in the future (an announcement will be made)
  • Manual upgrades or other maintenance work
  • You can read more about disruptions to be aware of here
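
To illustrate what "set up to handle disruptive events" can mean at the process level: when a node is drained, Kubernetes terminates the pods on it by sending SIGTERM to their containers, followed by SIGKILL once the pod's termination grace period (30 seconds by default) has passed. The sketch below is a minimal Python example, not a prescribed implementation, of a worker loop that finishes its current item and exits cleanly on SIGTERM. The names `ShutdownFlag`, `install_handlers`, `run`, and `process_one_item` are hypothetical and exist only for this illustration.

```python
import signal
import time


class ShutdownFlag:
    """Records that the process was asked to stop (e.g. its pod is being evicted)."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler trivial: only record the request.
        self.requested = True


def install_handlers(flag):
    """Route SIGTERM (pod termination) and SIGINT (Ctrl+C) to the flag."""
    signal.signal(signal.SIGTERM, flag.handle)
    signal.signal(signal.SIGINT, flag.handle)


def run(flag, process_one_item, poll_seconds=0.5):
    """Process items until shutdown is requested; return how many completed.

    `process_one_item` is a hypothetical callable standing in for real work.
    The flag is checked only between items, so an item that has started is
    always allowed to finish.
    """
    completed = 0
    while not flag.requested:
        process_one_item()
        completed += 1
        time.sleep(poll_seconds)
    # Release resources here (close connections, flush buffers) before exiting.
    return completed
```

A service would call `install_handlers(flag)` once at startup and then `run(flag, ...)` from its main thread. Each unit of work, plus cleanup, should fit well within the grace period, because the process is killed when that period expires.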

Mitigations

  • Always run your application with a minimum of 2 replicas. You can follow the guide here
  • Consider specifying a Pod Disruption Budget for your application
  • Build your application with disposability in mind
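
As a rough illustration of how the first two mitigations work together, the Python sketch below builds a PodDisruptionBudget manifest and models how many pods a drain may evict at once. The `policy/v1` API version, the `PodDisruptionBudget` kind, and the `minAvailable` and `selector` fields are part of the Kubernetes API. The name `my-app-pdb`, the label `app: my-app`, and both helper functions are assumptions made for this example. `allowed_disruptions` is a simplified model for an integer `minAvailable`, not the controller's actual code.

```python
import json


def pod_disruption_budget(name, app_label, min_available=1):
    """Return a PodDisruptionBudget manifest as a plain dict.

    Assumes the application's pods carry the label `app: <app_label>`.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def allowed_disruptions(healthy_replicas, min_available):
    """Simplified model: pods that may be evicted voluntarily right now."""
    return max(healthy_replicas - min_available, 0)


# Hypothetical application name and label, for illustration only.
manifest = pod_disruption_budget("my-app-pdb", "my-app")
print(json.dumps(manifest, indent=2))

# With 2 replicas, one pod can be evicted while the other keeps serving.
print(allowed_disruptions(healthy_replicas=2, min_available=1))
# With 1 replica, the same budget allows no voluntary eviction at all.
print(allowed_disruptions(healthy_replicas=1, min_available=1))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML. Note the second case: a budget of `minAvailable: 1` on a single-replica application leaves no pod that can be evicted, which blocks node drains, so a budget only makes sense together with at least 2 replicas.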