Everyone loves JSON. I love JSON. JSON is great at storing complex data structures to be written and read by programs, and it can even be written so that it’s fairly easily human readable. Yet the JSON format is missing comments, the things we mostly agree we need for anything non-trivial. That’s completely fine for anything written by and for programs (I might have named this article, “Don’t use JSON for your config files which humans might have to edit,” but I opted for brevity), but not so great when our meat computers are involved.

Some context is in order. Lately, I have been working on some CloudFormation templates. CloudFormation is a way to textually describe to AWS what you want to build. These files might contain things like a VPC object, a couple of subnet objects, a couple of objects tying the VPC and subnets together, and then you fill that full of security groups, and autoscaling groups or EC2 instances. These files can easily become hundreds of lines long. I would really love to leave some comments for 3-months-from-now me, but I can’t, because it’s JSON.

I was going to spend a few paragraphs discussing the history of comments in an effort to explain why they are important, but there really isn’t much to discuss. If you look at the popular early programming languages, COBOL, LISP, FORTRAN, they have comments. Assembly has comments. If you examine Ada Lovelace’s notes on Babbage’s Analytical Engine, you find Note G (regarded as the first program) accompanied by her extensive commentary. I will take all this as evidence that the value of comments is self-evident.

The question, then, is: if comments are so prevalent, why doesn’t JSON have them? The answer is that JSON is a data encoding format, meant to encode complex data structures for storage or transfer between programs (JavaScript <-> web server). It wasn’t designed for config files. So let’s please stop using a screwdriver to hammer in a nail. Our future selves will appreciate it.

Might I suggest YAML instead?
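For instance, comments make a YAML config self-documenting in a way JSON simply can’t. The resource below is a made-up illustration, not a real template:

```yaml
# Comments! Explain the why, not just the what.
Resources:
  MyVPC:                      # hypothetical resource name
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16  # leaves room to carve out per-AZ subnets
```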

SREcon14 badge.

Last week in Santa Clara, CA, SREcon14 happened. I attended. I took notes.

The conference had a keynote and a closing talk, with 4 other talk slots in between, each offering 2 talks, 1 panel, and 1 free room for discussion. I attended the keynote, a panel on releasing at scale, a talk on cascading failures, a panel on load shedding, a talk on finding and resolving problems, and the closing talk. In general it was a damned good conference and was worth the sleepless night on the airplane back to the East Coast. The main things I took away from the conference (the TL;DR): monitor, gracefully degrade, empower your devs, and beware retries.

I should note that it’s been a week since SREcon14, and part of what follows comes from my notes and the rest from my brain. I apologize for any errors that arise from me incorrectly remembering a detail.

Keynote by Ben Treynor of Google

  • SRE is what happens when you ask Software Engineers to run production.
  • Reliability is important. Would you rather go to Gmail as it was in 2010, or to Gmail returning a 500 error? Reliability is measured by the absence of errors. By the time users start to notice things are wrong, multiple systems have failed, and it will take a long time to make the system reliable again.
  • Development wants to launch cool new features (and drive their competition to drinking), and Operations doesn’t want things to blow up. These two goals are in conflict. You end up with releases gated by lengthy processes such as launch reviews, deep-dives, release checklists, etc. Dev typically counters by downplaying the significance of the change: feature addition, flag flip, UI change, etc. The general problem is that Ops knows the least about the code but has the strongest incentive to stop its release.
  • Google solves this with Error Budgets. EB = 1 - SLA. So if your SLA is 0.999, then your error budget is 0.001, which means you can be down for almost 9 hours a year. If you go outside of your EB, then a code freeze is issued until you can run within your SLA. This works because most companies rarely need a 100% SLA (a pacemaker is a good exception), and because it’s self-regulating: devs don’t like it when their awesome feature can’t go out because of another dev’s broken feature.
  • Google also has a common staff pool between its SE and SRE teams. You want to keep the devs in rotation because you want them aware of the SRE work. An SRE’s time should be capped at 50% Ops work; they’re SREs, not just Ops, and you want them doing SRE stuff. If Ops work grows beyond 50%, the overflow goes to the devs. Nothing builds consensus on bug priority like a couple of sleepless nights. Also, the more time devs spend working on Ops, the less time they spend developing, which self-regulates the number of changes that can make things worse.
  • SREs should be portable between projects and into SE because employee happiness is important.
  • Outages will happen and that’s OK. When outages happen, minimize the damage and make sure they don’t happen again. Minimize the time between the start and end of an outage. A NOC is too slow: an outage has been going on for several minutes by the time a human notices. You need alerting, but alerts also need good diagnostic information; the more information available in the alert, the less time someone has to hunt down that information in order to fix the problem.
  • Practice outages but make it fun. Make someone the dungeon master of the outage.
  • Write honest and blame free (assume everyone is doing their best) postmortems. If you don’t have time for a postmortem then you are agreeing that the issue is going to happen again.
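The error budget arithmetic above is worth making concrete; this is just the keynote’s 0.999 example worked out:

```shell
# Error budget math: EB = 1 - SLA, expressed as hours of allowed
# downtime per year. Numbers below use the 0.999 SLA from the talk.
awk 'BEGIN {
  sla    = 0.999
  budget = 1 - sla              # fraction of the year you may be down
  hours  = budget * 365 * 24    # convert that fraction to hours/year
  printf "budget fraction: %.3f, downtime allowed: %.2f hours/year\n", budget, hours
}'
```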

Releasing at Scale Panel

  • Panel Members:
    • Chuck Rossi of Facebook
    • Helena Tian of Google (Google+ I believe)
    • Jos Boumans of Krux Digital
    • Daniel Schauenberg of Etsy
  • Give your devs the tools to perform releases and the monitoring to know what’s happening with them.
  • Web is solved, but mobile is an unsolved problem. Chuck Rossi really was not happy with the state of mobile deployment.
  • Decouple your code and configs so a config change doesn’t get held up by code.
  • There is a tension between quick features and stability.
  • Degrade gracefully.
  • Facebook wants more open source because they don’t want to solve this by themselves.
  • It isn’t a question of if something breaks but when and you want to know before the client.
  • If something does go wrong then you can pay for it in latency, accuracy, etc. Again, gracefully degrade.
  • Measure everything, fail gracefully, run like production in dev, and your deploy tools are production.

Cascading Failures by Mike Ulrich of Google

  • Stop positive feedback loops from getting out of control. An example is 1 instance failing causing increased load over other instances leading to additional failures. “Lasering a backend to death.”
  • To prevent them: reduce the time to recover, gracefully degrade instead of failing, and load test. Limit the number of in-flight requests, and drop what you can afford to if needed. Triggers include load pattern shifts, unexpected increases in traffic (a networking hiccup causes a 20-second pause and then everything crashes in at once), and operational failures. Failure of a service is doubly bad because you are both not serving requests and waiting for the service to recover. Know your QPS and your total-failure QPS.
  • Loop retries can make things much worse but randomized, exponential back-offs can mitigate some of that risk. Limit your retries. Remember that you are holding the request in memory while retrying is happening. Push that to the user side if practical.
  • Recognize that your systems have limits. Don’t trust the internal components of your system. Try to limit positive feedback. Test your assumptions.
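The randomized, exponential back-off idea can be sketched in a few lines of shell (retry_with_backoff and its parameters are my own naming, not anything from the talk):

```shell
# Retry a command with capped attempts and randomized exponential back-off.
# Randomizing the sleep keeps a fleet of clients from retrying in lockstep.
retry_with_backoff() {
  max_attempts=$1; shift        # remaining arguments are the command to run
  delay=1
  attempt=1
  while true; do
    "$@" && return 0            # command succeeded
    [ "$attempt" -ge "$max_attempts" ] && return 1   # limit your retries
    sleep $(( RANDOM % delay + 1 ))   # jittered wait before the next try
    delay=$(( delay * 2 ))      # exponential growth between attempts
    attempt=$(( attempt + 1 ))
  done
}
```

Remember the request is held somewhere while this loops; as the talk suggests, push retries out to the client side when practical.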

Load Shedding Panel

  • Panel Members:
    • Manjot Pahwa of Google
    • Jos Boumans of Krux Digital
    • Nick Berry of LinkedIn
    • Bruce Wong of Netflix
  • Identify there is a problem and identify what can be dropped. Understanding the business is critical to understanding what is important and must be kept.
  • End users’ happiness is critical.
  • A general takeaway from this panel is that everything depends on your business. There are a lot of questions without generalized answers. Do you degrade everyone or stop a % of users? How do you know how much to shed and what?
  • Retries can save or kill you.
  • Provide tools to empower your Devs.
  • Automatic shedding has pros and cons. Start with manual controls; those will progress toward automated systems. Have a button to turn off the automated system.
  • Cloud auto-scaling can mask issues. You have to be disciplined to find and fix issues. Cloud gives extra issues with disk, network (can’t run a wire between two VMs), etc. If small, you can spend money to cover peaks. That doesn’t work as well when you’re really large.
  • Understand your user’s expectations. Know the layers of your app and where you can shed. Isolate.

Finding and Resolving Problems at Scale by Ben Maurer of Facebook

  • Fail fast.
  • Queuing amplifies slowness. Good queuing absorbs a burst of load; bad queuing leaves you in a state where there is always a queue. In a controlled queue, the queue always drains back to empty. He mentioned something about adaptive LIFO and a product they have using it, but I can’t seem to find information on that.
  • Cap the number of concurrent requests going to your system.
  • Examine the tcp_abort_on_overflow option. If the service is overloaded, retries won’t help.
  • Seek out problems, don’t wait for monitoring. There are things you don’t know you don’t know. Try to catch issues before problems are big.
  • The TCP stack is a rich source of metrics. Resets, retransmits, open sockets, etc.
  • Metric grouping is important. Time, location, type, etc.
  • Visualization is important for readability. Recommended Cubism.js by Square. Many questions raised by the audience about readability of that tool.
  • During an incident, communication is critical. IRC is nice because it has a log, which allows engineers joining in to read the history and catch up on the problem. Having an incident manager on call, who knows how to organize and foster communication, is nice. Someone needs to think about the big picture and not just the details.
  • Have incident reviews to learn from incidents. DERP: Detection, Escalation, Remediation, Prevention. Was your monitoring good? Were the right people involved? How did you fix it, and could you automate that? It will happen again, so how can it be made safer? Think bigger than your own team.
  • You can learn a lot about SRE from attending incident reviews.
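The tcp_abort_on_overflow option he mentioned is a Linux sysctl; as a sketch, enabling it looks like this (read up on the trade-offs for your workload before flipping it):

```
# /etc/sysctl.conf (Linux): when the accept queue overflows, send an RST
# instead of silently dropping the connection attempt, so overloaded
# services fail fast rather than leaving clients waiting and retrying.
net.ipv4.tcp_abort_on_overflow = 1
```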

How Silicon Valley’s SREs Saved by Michael Dickerson

He didn’t want anyone to record or take pictures during the presentation, so I’m not going to say much about his talk. It was a very good and amusing talk.

  • He discussed the importance of monitoring and communication.
  • Mentioned a company named New Relic which provides automated cloud monitoring for common languages and frameworks.
  • Do science until you find the problem.
  • To say something “should work like X,” you need a clear model of the system in your mind. Pretty much all systems are too big for that.

We use elasticsearch as part of a centralized logging system (logstash, elasticsearch, kibana). Unfortunately, we didn’t give the ES machine much disk space, and thus it ran out of space. After cleaning up some space and starting ES, it started writing lots of warnings and stack traces like:

  • sending failed shard for
  • received shard failed for
  • failed to parse
  • failed to start shard
  • failed to recover shard

The disks were filling up again with error logs and the CPU was pegged. Thankfully I found this forum post: https://groups.google.com/forum/#!topic/elasticsearch/HtgNeUJ5uao. A few posts in, Igor Motov suggests deleting the corrupted translog files. The idea is that because the server ran out of disk space it didn’t finish writing to the translogs, and because the translogs were incomplete files, ES couldn’t read them to bring the indices back into a correct state. If you delete those files then you may lose a few operations that had yet to be written into the indices, but at least the indices will work again.

To fix this you need to look in the ES logs, /var/log/elasticsearch/elasticsearch.log on CentOS, and find the error lines above. On those lines you’ll see something like

[<timestamp>][WARN ][cluster.action.shard] [<weird name>] [logstash-2014.05.13][X]

where X is the shard number, likely one of (0,1,2,3,4), and the block before it, logstash-date for me (and for you too, if you’re doing centralized logging like we are), is the index name. You then need to go to the index location, /var/lib/elasticsearch/elasticsearch/nodes/0/indices/ on CentOS. In that directory you’ll find the following structure: logstash-date/X/translog/translog-<really big number>. That’s the file you’ll need to delete, so:

  1. sudo service elasticsearch stop
  2. sudo rm /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-date/X/translog/translog-blalblabla
  3. repeat step 2 for all indices and shards in the error log
  4. sudo service elasticsearch start

Watch the logs and repeat that process as needed until the ES logs stop spitting out stack traces.
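When lots of shards are affected, step 2 gets tedious. Something like the following loop automates it. delete_translogs is my own sketch, not an ES tool; the paths follow the CentOS layout above, deleting translogs throws away any unflushed operations, and you should only run it with elasticsearch stopped:

```shell
# Remove the (possibly truncated) translog files for one index/shard pair.
# Only run this for shards named in the elasticsearch error log.
delete_translogs() {
  indices_dir="$1"   # e.g. /var/lib/elasticsearch/elasticsearch/nodes/0/indices
  index="$2"         # index name from the log line, e.g. logstash-2014.05.13
  shard="$3"         # shard number from the log line, e.g. 0
  for f in "$indices_dir/$index/$shard/translog"/translog-*; do
    [ -e "$f" ] || continue      # glob matched nothing: no translogs here
    echo "deleting $f"
    rm -f "$f"
  done
}
```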

On Friday I ran into a problem that very much looked like a bug in PHP. From what I can tell a function was not called that should have been. Here is an oversimplification of the code:

function foo() {
   ... some lines of code ...
   $this->bar();   // not inside any if or other flow control
   ... couple lines of code ...
   if (some stuff)
      error log that happened
   ... some more lines of code ...
}

function bar() {
   ... some lines of code ...
   error log that didn't happen
   $this->someState = a new state
   ... some lines of code ...
}
I ran this code once; the state of the object didn’t change, and the error log line from bar() was not in the logs, but the error log from foo() was. The error log call in bar() isn’t inside any kind of if or other flow control, nor is there any place to return from bar() before the error log line. In foo(), the call to bar() is likewise not inside any if or similar block. Basically, if foo() was called, which I know it was because of the log line (and I know it was that log line because it’s unique), then bar() must have been called, and if bar() was called, the log from bar() must have been generated.

I saw that, scratched my head for a while, and then ran the code again. This time I saw both log lines, and the object’s state changed as it is supposed to when bar() is run. The code was not changed between executions. The same data was provided for the execution each time. The log line in bar() was new, but I had run the code with that change a few times before with different data, and bar() properly wrote to the error log.

This error (Invalid Apache directory - unable to find httpd.h under /usr/include/httpd/) has been a thorn in my side while building a PHP RPM with the --with-apache PHP configure argument. I’m running CentOS 6 and have the httpd-devel package installed, which places the Apache header files (including httpd.h) in /usr/include/httpd. Perms on the /usr/include/httpd directory are 755 and the files inside are 644. Everything looked good.

It turns out that --with-apache builds the static Apache 1.x module, which isn’t really what you want if you have installed Apache in the last 10 years. What you want instead is --with-apxs2, which builds the Apache 2.x shared module.
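For reference, the configure invocation then looks something like this (the apxs path is where httpd-devel puts it on CentOS; verify with `which apxs`):

```
./configure --with-apxs2=/usr/sbin/apxs
```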

One of the problems I recently needed to solve at work was moving (and transforming) a lot (TBs) of data from an old system to a new one. I solved it, and things were good. Unfortunately, we (my co-workers and I) have noticed that the script just stops working. It doesn’t crash; it just stops doing anything. Through a bit of luck I discovered this was due to network IO (or the lack thereof).

The problem I faced over the last few days is that the script stops transferring data over the network but doesn’t hit a network timeout. The transfer of data grinds to a halt and doesn’t trigger any of our monitoring. I’ve solved this with an inotifywait loop.

While the script is busy copying data, it spits information out into a log file. It writes to this file a few times per second, with occasional pauses of up to 25-ish seconds while it collects a new batch of work to transfer. It turns out that inotifywait has a -t option, a timeout in seconds. If inotifywait gets a notify event it exits with code 0, and if it times out it exits with code 2. With that bit of knowledge in hand, I wrote a wrapper script that launches the aforementioned script in the background and then goes into an infinite loop. It sets up an inotifywait with -t 60 and -e modify on the log file. If the exit code is 2, then it runs a ps aux | grep to get the pid, kills the script, and relaunches it in the background. In pseudo-code:

./myScript.sh &                  # "myScript.sh" stands in for the real script name
while true; do
   inotifywait -t 60 -e modify /path/to/my/logfile
   if [ $? -eq 2 ]; then         # timed out: no log writes for 60 seconds
      pid=$(ps aux | grep '[m]yScript' | awk '{print $2}')  # [m] keeps grep from matching itself
      kill "$pid"
      ./myScript.sh &
   fi
done

With that in place, the script runs until it stops writing to the log file for 60 seconds, which the wrapper interprets as a failure. In that event the wrapper kills the script, restarts it, and we’re back up and running again. Not something I would consider a long-term solution, but this isn’t a long-term problem.

Here are some articles I found interesting from this week.

Automatic Deploys At Etsy by rlerdorf
In which a deployment strategy designed for 0 interruption time is discussed.

Design Patterns after Design is Done by Jim Bird
Frames refactoring and code legibility in terms of design patterns discussing what works and what does not work.

6 Warning Signs That Your Technology is Headed South by Christopher Taylor
Discusses technological and personal costs of using old technology.

The Date Question by George Dinwiddie
Discusses software deadlines trying to get to the root of the question, “When will this software be done?”

Ambient occlusion for Minecraft-like worlds by Mikola Lysenko
Discusses Ambient Occlusion in voxel based games using Minecraft as a specific example.

Erlang at Basho, Five Years Later by Justin Sheehy
Sheehy talks about the challenges that were expected using Erlang and the challenges that were actually encountered.

15 workplace barriers to better code by Peter Wayner
A list of things that annoy programmers and get us out of “the zone.”

Why Javascript by Alex Russell
Russell defends JavaScript on the web and confronts frequent arguments against its use.

How Clutter Affects Your Brain (and What You Can Do About It) by Mikael Cho
A not exactly minimal article on minimalism and what effect clutter can have on your ability to focus and be creative.