👍 Win of the day: Open multiple sessions at the same time, then after 5 minutes close the boring one
👎 Loss of the day: Sometimes you close them all :(
👍 Win of the day: the evening talks are not very crowded, so you have a pretty good chance of getting your question answered fast
👎 Loss of the day: Video quality of the talks is too low to watch any code-demo talks
“DevOps Patterns and Antipatterns for Continuous Software Updates”
Speakers: Baruch Sadogursky (JFrog), Kat Cosgrove (JFrog)
The talk title is a bit misleading. I would call it instead “Famous failure stories when someone decided not to update their software or when the update went horribly wrong”
To set the scene
In 2017 a number of hospitals in the UK were brought down by ransomware. Reason: they were still using Windows XP at that time.
Again in 2017: the rollout of a new version of Google OnHub went wrong and reset the Wi-Fi to factory settings, without any chance to fix it remotely.
Two years later: a massive Jaguar I-Pace recall because of a software error in the braking system.
Another small, not very well-known automotive company, Tesla, had a bug with “phantom braking”. Surprisingly, they released a fix for it, but customers didn’t install the update. Why? Because the release highlights of that patch were “We improved the Chess game + minor fixes”. One of those “minor fixes” was the fix for phantom braking.
And in the same year an error in a regular expression was deployed on Cloudflare servers and killed half of the Internet.
Finally, the year is 2012: the infamous Knight Capital deployment costs the company about 440 million dollars in 45 minutes. The cause was an error in the manual deployment procedure. Engineers forgot to update one of the eight servers and had neither monitoring nor a rollback procedure.
The lessons learned from all these cases
The main motivations for updates are security and new features, which means it’s good to update frequently.
Lesson 1: If you can, go for the Auto-update pattern. It works fine for low-risk things (e.g. the Chrome browser). It does not work for something like a phone OS, where you need the user’s confirmation first, and the user will not confirm an update they don’t trust.
So Lesson 1.5: build trust first.
Lesson 2: If you are accumulating tons of data, go for the Testing-in-production pattern, because your test systems will likely not be enough.
Lesson 3: Can the update fail in a way that you can’t recover from, like Google’s OnHub failure? If yes, figure out how to enable local rollbacks. A brilliant example is changing the monitor resolution on Windows: you get a dialog that resets the screen after 15 seconds unless you confirm, on the assumption that something went wrong. Note, however, that reverting an update might also mean reverting state.
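To make the "confirm or roll back" idea concrete, here is a minimal sketch of it. It is not from the talk: the function name, the queue used to deliver the confirmation and the callbacks are all assumptions made for illustration.

```python
import queue
from typing import Callable


def apply_with_auto_rollback(
    apply: Callable[[], None],
    rollback: Callable[[], None],
    confirmations: "queue.Queue[bool]",
    timeout_seconds: float = 15.0,
) -> bool:
    """Apply a change and keep it only if it is confirmed in time.

    Returns True if the change was kept, False if it was rolled back.
    All names here are hypothetical, chosen for this sketch.
    """
    apply()
    try:
        # Block until a confirmation arrives, or give up after the timeout.
        confirmed = confirmations.get(timeout=timeout_seconds)
    except queue.Empty:
        confirmed = False
    if not confirmed:
        # Nobody confirmed: assume the change broke something and undo it.
        rollback()
    return confirmed
```

The important design choice is that silence means rollback. A device that has lost its connection cannot say "I am broken", so the safe default has to be local and automatic.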
Lesson 4: The Tesla phantom braking problem was due to batch updates: the critical fix had to wait together with a non-important feature until the customer decided to install it. You don’t have this problem with Continuous Delivery.
Lesson 5: Lesson learned from the Knight Capital disaster: people suck at repetitive tasks. Automate everything!
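As a small illustration of what "automate everything" could mean for a multi-server rollout, here is a sketch of a release gate that refuses to go live while any server still runs an old version. The server names, version strings and function names are made up for this example; how you collect the deployed versions is left out.

```python
from typing import Dict, List


def find_stragglers(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the servers that are not running the expected version."""
    return sorted(host for host, version in deployed.items() if version != expected)


def release_gate(deployed: Dict[str, str], expected: str) -> None:
    """Raise instead of going live if any server was missed by the rollout."""
    stragglers = find_stragglers(deployed, expected)
    if stragglers:
        raise RuntimeError(f"refusing to go live, outdated servers: {stragglers}")


# Hypothetical inventory: one server was forgotten during the rollout.
versions = {"srv1": "2.0", "srv2": "2.0", "srv3": "1.9"}
print(find_stragglers(versions, expected="2.0"))  # ['srv3']
```

A check like this is trivial for a script and exactly the kind of thing a human skips on the sixth repetition.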
“Autoscaling and Cost Optimization on Kubernetes: From 0 to 100”
Speakers: Guy Templeton (Skyscanner), Jiaxin Shan (Amazon)
Horizontal Pod Autoscaler (HPA)
HPA automatically creates more pods if there is too much load (you define rules for CPU and memory) and removes pods if there’s not enough load.
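The core calculation, as I understand it from the Kubernetes documentation, is a simple ratio: desired replicas = ceil(current replicas * current metric / target metric), skipped when the ratio is within a small tolerance of 1. Below is a simplified sketch of that rule. The function name and the default min/max values are my own assumptions, and the real controller handles many more cases (missing metrics, pods that are not ready, stabilization windows).

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale proportionally to metric / target."""
    ratio = current_metric / target_metric
    # Close enough to the target: do nothing, to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Never leave the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
print(desired_replicas(4, current_metric=20, target_metric=60))  # 2
print(desired_replicas(4, current_metric=63, target_metric=60))  # 4
```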
In k8s 1.18 it got some nice features:
Now it’s possible to scale down to 0 with the HPAScaleToZero alpha feature.
Now you can scale down slowly, e.g. tell it to scale down at most 5% of the pods in 5 minutes.
In versions older than 1.18 it’s possible to simulate the last feature using metricsQuery, but that’s quite complex. There is an example in the slides if you’re interested.
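For the "scale down slowly" feature, here is a sketch of what the 5%-in-5-minutes rule looks like. The dictionary mirrors the behavior section of an autoscaling/v2beta2 HorizontalPodAutoscaler spec, written as a Python dict so it can be dumped to JSON. The helper functions are my own simplification, and rounding down is an assumption of this sketch rather than a statement about the controller.

```python
import json

# Fragment of an HPA spec: remove at most 5% of the pods per 300 seconds.
scale_down_behavior = {
    "behavior": {
        "scaleDown": {
            "policies": [
                {"type": "Percent", "value": 5, "periodSeconds": 300}
            ]
        }
    }
}


def lowest_allowed_replicas(replicas_at_period_start: int, percent: int) -> int:
    """Smallest replica count a Percent policy allows within one period."""
    # Integer arithmetic avoids floating point surprises; rounds down.
    return replicas_at_period_start * (100 - percent) // 100


def limited_scale_down(current: int, wanted: int, percent: int) -> int:
    """Clamp the wanted replica count to what the policy allows right now."""
    return max(wanted, lowest_allowed_replicas(current, percent))


print(json.dumps(scale_down_behavior, indent=2))
print(limited_scale_down(current=100, wanted=20, percent=5))  # 95
```

So even if the load drops off a cliff and the HPA wants 20 pods, it only gets down to 95 in the first 5 minutes and keeps stepping down from there.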
Vertical Pod Autoscaler (VPA)
Re-creates the pod with more memory/CPU if it detects too much load, or with fewer resources if there’s not enough.
If you’re going to use it, keep in mind that:
- It needs to restart the pod on every scale
- It’s incompatible with HPA when both act on the same CPU or memory metrics
- It’s tricky to set up for JVM-based services with regard to memory scaling. As I understand it, that’s because you also need to somehow tell the JVM that there is more memory now, via the -Xmx flag.
Cluster Autoscaler (CA)
Adds more nodes to the cluster or removes idle nodes depending on the load.
The scaling down process looks like this:
Scale-up logic is a bit more tricky and is controlled by expanders.
Flags to Look At:
There are some ongoing projects in this area to make the autoscaling more flexible.
The slides are here: https://sched.co/Zemi
“Deep Dive into Helm”
Speakers: Paul Czarkowski (VMware), Scott Rigby
If you need to migrate from Helm 2 to Helm 3, use the Helm2to3 plugin. It explains what you should do.
Conftest – test helm config against some (e.g. security) rules https://github.com/instrumenta/helm-conftest
“The Common Configuration Scoring System for Kubernetes Security”
Speakers: Julien Sobrier (VMware)
Just an advertisement for the kube-scan tool and its visualizer kccss.
“Still Writing SQL Migrations? You Could Use A (Schema)Hero”
Speakers: Marc Campbell (Replicated)
Another “self-advertising” talk. This time with the idea that you can store database schema objects (Database and Table) as two new Kubernetes Custom Resources and then apply GitOps to manage migrations.
So when you need to change a db table, you just change the Table resource definition in git (e.g. add a new field’s name and type) and then roll it out to k8s, and the SchemaHero controller figures out what to change to reach the target state. So db migration = k8s resource reconciliation.
The idea is pretty cool, but from my point of view it’s unrealistic for real production systems, because there are lots of cases where you need to e.g. change the data together with the db structure, and eventually you’ll hit one of them. In that case Flyway is probably a better way to go.
There are other problems as well, e.g. how do you handle renaming a field?
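To show both the reconciliation idea and the renaming problem, here is a toy sketch of a declarative schema diff. This is not SchemaHero’s actual algorithm or API. The function name is made up, the schema is reduced to a column-name-to-type mapping, and the generated statements use PostgreSQL-style syntax.

```python
from typing import Dict, List


def plan_migration(table: str, current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Diff the current columns against the desired ones and emit ALTER statements."""
    statements = []
    for name, col_type in desired.items():
        if name not in current:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {col_type}")
        elif current[name] != col_type:
            statements.append(f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {col_type}")
    for name in current:
        if name not in desired:
            statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    return statements


# Someone "renames" name to full_name in the desired state.
current = {"id": "integer", "name": "text"}
desired = {"id": "integer", "full_name": "text"}
for statement in plan_migration("users", current, desired):
    print(statement)
# ALTER TABLE users ADD COLUMN full_name text
# ALTER TABLE users DROP COLUMN name
```

A pure state diff cannot tell a rename from "drop one column, add another", so the data in the old column would be lost unless the tool gets an extra hint. That is the kind of case where an explicit migration script is easier to reason about.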
Also keep in mind that for the controller to work you need to keep a Secret with very powerful db credentials in the cluster. Since Secrets are only base64-encoded by default, which is not encryption, this is dangerous from a security standpoint, and you should use something like an external KeyVault to fix that.
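If you have any doubt about how little protection base64 gives, this short sketch shows it. The password is obviously made up; the point is that anyone who can read the Secret value can reverse it without any key.

```python
import base64


def encode_secret_value(plain: str) -> str:
    """Encode a value the way it appears in a Secret's data field."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. No key or password is needed."""
    return base64.b64decode(encoded).decode("utf-8")


stored = encode_secret_value("s3cr3t-db-password")  # hypothetical credential
print(stored)
print(decode_secret_value(stored))
```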