Last week in Santa Clara, CA, SREcon14 happened. I attended. I took notes.
The conference had a keynote and a closing talk, plus 4 other talk slots, each with 2 talks, 1 panel, and 1 free room for discussion. I attended the keynote, a panel on releasing at scale, a talk on cascading failures, a panel on load shedding, a talk on finding and resolving problems, and the closing talk. In general it was a damned good conference and was worth the sleepless night on the airplane back to the East Coast. The general takeaways (the TL;DR) for me were: monitor, gracefully degrade, empower your devs, and beware retries.
I should note that it’s been a week since SREcon14, and part of what follows comes from my notes and the rest from my brain. I apologize for any errors that arise due to me incorrectly remembering a detail.
Keynote by Ben Treynor of Google
- SRE is what happens when you ask Software Engineers to run production.
- Reliability is important. Would you rather go to Gmail as it was in 2010 or Gmail 500 Error? Reliability is measured in the absence of errors. By the time users start to notice things are wrong multiple systems have failed and it will take a long time to make the system reliable again.
- Development wants to launch cool new features (and drive their competition to drinking), and Operations doesn’t want things to blow up. These two goals are in conflict. You end up with releases gated by lengthy processes such as launch reviews, deep-dives, release checklists, etc. Dev typically counters by downplaying the significance of the change: it’s just a feature addition, a flag flip, a UI change, etc. The general problem is that Ops knows the least about the code but has the strongest incentive to stop its release.
- Google solves this with Error Budgets. EB = 1 – SLA. So if your SLA is 0.999, then 1 – SLA is 0.001, which means you can be down for almost 9 hours a year. If you go outside of your EB then a code freeze is issued until you can run within your SLA. This works because most services rarely need a 100% SLA (a pacemaker is a good exception). It also works because it’s self-regulating: devs don’t like it when their awesome feature can’t go out because of another dev’s broken feature.
- Google also has a common staff pool between its SE and SRE teams. You want to keep the devs in rotation because you want them aware of the SRE work. An SRE’s time should be capped at 50% Ops work. They’re SRE, not just Ops, and you want them doing SRE stuff. If Ops work exceeds 50%, the overflow goes to the devs. Nothing builds consensus on bug priority like a couple of sleepless nights. Also, the more time devs spend working on Ops, the less time they spend developing, which self-regulates the number of changes that can make things worse.
- SREs should be portable between projects and into SE because employee happiness is important.
- Outages will happen and that’s OK. When outages happen, minimize the damage and make sure it doesn’t happen again. Minimize the time between the start and end of an outage. A NOC is too slow: the outage has been going on for several minutes by the time a human notices. You need alerting, but alerts also need good diagnostic information. The more information available in the alert, the less time someone has to spend hunting down that information in order to fix the problem.
- Practice outages but make it fun. Make someone the dungeon master of the outage.
- Write honest and blame free (assume everyone is doing their best) postmortems. If you don’t have time for a postmortem then you are agreeing that the issue is going to happen again.
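The error-budget arithmetic from the keynote is simple enough to sketch. This is my own illustration of the math (function names are mine, not Google’s tooling): the budget is 1 – SLA, converted into allowed downtime per year.

```python
# A minimal sketch of the error-budget arithmetic: budget = 1 - SLA,
# expressed as allowed downtime per year. Names are illustrative.

def allowed_downtime_hours(sla: float, hours_per_year: float = 365 * 24) -> float:
    """Return the yearly downtime permitted by an availability SLA."""
    error_budget = 1.0 - sla  # the fraction of the year you may be down
    return error_budget * hours_per_year

print(round(allowed_downtime_hours(0.999), 2))  # three nines: 8.76 hours/year
```

Running this for a 0.999 SLA gives the “almost 9 hours a year” figure from the talk.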
Releasing at Scale Panel
- Panel Members:
- Chuck Rossi of Facebook
- Helena Tian of Google (Google+ I believe)
- Jos Boumans of Krux Digital
- Daniel Schauenberg of Etsy
- Give your devs tools both to perform releases and to know (via monitoring) how those releases are behaving.
- Web is solved, but mobile is an unsolved problem. Chuck Rossi really was not happy with the state of mobile deployment.
- Decouple your code and configs so a config change doesn’t get held up by code.
- There is a tension between quick features and stability.
- Degrade gracefully.
- Facebook wants more open source because they don’t want to solve this by themselves.
- It isn’t a question of if something breaks but when, and you want to know before the client does.
- If something does go wrong then you can pay for it in latency, accuracy, etc. Again, gracefully degrade.
- Measure everything, fail gracefully, run like production in dev, and your deploy tools are production.
Cascading Failures by Mike Ulrich of Google
- Stop positive feedback loops from getting out of control. An example is 1 instance failing causing increased load over other instances leading to additional failures. “Lasering a backend to death.”
- To prevent them: reduce the time to recover, degrade gracefully instead of failing, and load test. Limit the number of in-flight requests, and drop what you can afford to if needed. Triggers include load-pattern shifts, unexpected increases in traffic (a networking hiccup causes a 20-second pause and then everything crashes in at once), and operational failures. Failure of a service is doubly bad because you are both not serving requests and waiting for the service to recover. Know your QPS and your total-failure QPS.
- Retry loops can make things much worse, but randomized, exponential back-offs can mitigate some of that risk. Limit your retries. Remember that you are holding the request in memory while retrying is happening. Push retries to the user side if practical.
- Recognize that your systems have limits. Don’t trust the internal components of your system. Try to limit positive feedback. Test your assumptions.
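The retry discipline described above can be sketched in a few lines. This is my own illustration, not code from the talk; `retry` and its parameters are hypothetical names. The key ideas are a hard cap on attempts and a randomized (“jittered”) exponential back-off, so synchronized clients don’t laser a recovering backend to death.

```python
import random
import time

def retry(call, max_attempts=4, base_delay=0.1, max_delay=5.0):
    """Invoke call(), retrying with capped, jittered exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: fail fast, don't queue forever
            # Full jitter: sleep a random amount up to the exponential cap,
            # so retries from many clients spread out instead of synchronizing.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Holding the request while retrying still costs memory, which is why the talk suggests pushing retries to the user side when practical.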
Load Shedding Panel
- Panel Members:
- Manjot Pahwa of Google
- Jos Boumans of Krux Digital
- Nick Berry of LinkedIn
- Bruce Wong of Netflix
- Identify there is a problem and identify what can be dropped. Understanding the business is critical to understanding what is important and must be kept.
- End users’ happiness is critical.
- A general takeaway from this panel is that everything depends on your business. There are a lot of questions without generalized answers. Do you degrade everyone or stop a % of users? How do you know how much to shed and what?
- Retries can save or kill you.
- Provide tools to empower your Devs.
- Automatic shedding has pros and cons. Start with manual controls; those will progress toward automated systems. Have a button to turn off the automated system.
- Cloud auto-scaling can mask issues, so you have to be disciplined about finding and fixing them. Cloud brings extra issues with disk, network (you can’t run a wire between two VMs), etc. If you’re small, you can spend money to cover peaks; that doesn’t work as well when you’re really large.
- Understand your user’s expectations. Know the layers of your app and where you can shed. Isolate.
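One of the panel’s open questions was whether to degrade everyone or to stop a percentage of users. As my own illustration of the second option (names and approach are mine, not the panel’s), hashing user IDs into buckets sheds a consistent slice of users, rather than randomly failing some fraction of every user’s requests:

```python
import zlib

def should_shed(user_id: str, shed_percent: float) -> bool:
    """Consistently shed roughly shed_percent of users.

    crc32 is stable across runs (unlike Python's salted hash()), so the
    same users are always in the shed bucket while the knob is turned up.
    """
    bucket = zlib.crc32(user_id.encode()) % 100
    return bucket < shed_percent
```

Which users can be shed, and how much, is exactly the business question the panel kept returning to.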
Finding and Resolving Problems at Scale by Ben Maurer of Facebook
- Fail fast.
- Queuing amplifies slowness. Good queuing absorbs a burst of load, whereas bad queuing leaves you in a state where there is always a queue. In a controlled queue, the queue always becomes empty eventually. He mentioned something about Adaptive LIFO and a product they have with it but I can’t seem to find information on that.
- Cap the number of concurrent requests going to your system.
- Examine the tcp_abort_on_overflow option. If the service is overloaded, retries won’t help.
- Seek out problems; don’t wait for monitoring. There are things you don’t know you don’t know. Try to catch issues before they become big problems.
- The TCP stack is a rich source of metrics. Resets, retransmits, open sockets, etc.
- Metric grouping is important. Time, location, type, etc.
- Visualization is important for readability. He recommended Cubism.js by Square, though the audience raised many questions about that tool’s readability.
- During an incident, communication is critical. IRC is nice because it has a log, which allows engineers joining in to read the history and catch up on the problem. Having an on-call incident manager, who knows how to organize and foster communication, is nice. Someone needs to think about the big picture and not just the details.
- Have incident reviews to learn from incidents. DERP: Detection, Escalation, Remediation, Prevention. Was your monitoring good? Were the right people involved? How did you fix it, and could you automate the fix? It will happen again, so how can it be made safer? Think bigger than your own team.
- You can learn a lot about SRE from attending incident reviews.
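The “cap the number of concurrent requests” point above can be sketched with a semaphore. This is my own minimal illustration (class and error names are hypothetical, not Facebook’s code): admit at most a fixed number of in-flight requests and reject the overflow immediately, so the system fails fast instead of building the ever-present queue the talk warned about.

```python
import threading

class OverloadedError(Exception):
    """Raised when the in-flight request cap is reached."""

class ConcurrencyLimiter:
    """Cap concurrent requests; shed the overflow instead of queueing it."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_fn):
        # Fail fast: if no slot is free, reject rather than wait in a queue.
        if not self._slots.acquire(blocking=False):
            raise OverloadedError("too many in-flight requests")
        try:
            return request_fn()
        finally:
            self._slots.release()
```

A caller that catches `OverloadedError` can then degrade gracefully (serve stale data, a simpler page, etc.) instead of piling onto an overloaded backend.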
How Silicon Valley’s SREs Saved Healthcare.gov by Michael Dickerson
He didn’t want anyone to record or take pictures during the presentation, so I’m not going to say much about his talk. It was a very good and amusing talk.
- He discussed the importance of monitoring and communication.
- He mentioned a company named New Relic, which provides automated cloud monitoring for common languages and frameworks.
- Do science until you find the problem.
- To say something “should work like X,” you need a clear model of the system in your mind. Pretty much all systems are too big for that.