Generally, there are two ways to have microservices communicate with each other [1]. One technique uses REST to have one service directly query or act on another. The other has services pass messages over a message bus. There is plenty of information online about what makes a good REST endpoint, but what makes a good message? There are two ways to think about that question: what does a good message look like, and what is a good message about?

A good message doesn’t need much; this is definitely a less-is-more situation. A message needs some identifying information: who sent it, what it’s about, some verb. A message also needs a payload. If you separate the identifying information from the content, you can encrypt the payload itself without impacting a consumer’s ability to determine whether the message is worth paying attention to.
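As a sketch, such an envelope might look like the following (the field names and the `make_message` helper are hypothetical, and plain JSON stands in for an encrypted payload):

```python
import json

# A hypothetical message envelope: the identifying information stays readable
# while the payload could be encrypted without affecting a consumer's ability
# to decide whether the message matters to it.
def make_message(sender, subject, verb, payload):
    return {
        "sender": sender,                # who sent it
        "subject": subject,              # what it's about
        "verb": verb,                    # what happened
        "payload": json.dumps(payload),  # stand-in for an encrypted body
    }

msg = make_message("billing", "invoice", "created", {"invoice_id": 42})
# A consumer filters on the envelope alone, never touching the payload.
interested = msg["subject"] == "invoice" and msg["verb"] == "created"
```

The point is that nothing in the filtering step depends on being able to read the payload.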

What goes in a good message? This is a bit more interesting to me and probably a lot more subjective. I believe this is the true difference between using REST and message passing. In REST, one service is asking something of another service; in message passing, one service is telling all the others that something happened. In REST, service A would ask service B for some information; in message passing, service B would send a message letting everyone know about the new state of the world, which happens to be consumed by service A.

The downside, and hidden upside, to this is that information will be duplicated. When service B sends an event saying a resource was added and service A consumes that event, service A likely needs some place to store that information. This duplication is unfortunate. On the positive side, service A no longer depends on service B running, which increases the stability of your system. Another positive is that service A can store the information in a way that suits its needs (I would go so far as to say that it’s a bad smell if a resource looks exactly the same in two different services).

By keeping all messages in the system informative, instead of action requests, it should be easier to build a system with fast response times. Because each service has local access to all the information it needs, it will not send requests to other services, cutting out extra network time.

What makes a good message? Keep the message itself separate from information about the message. Keep it simple. Stick to messages that inform other services of events that occurred. Avoid messages that ask another service to do something.



Recently, I found myself having some significant issues writing unit tests. I couldn’t quite get the class I was testing into the correct state for some method calls. The class in question, through initializing with request data, was used to determine if a request was authorized or not. At about the time I started googling ways to mock or disable the constructor I realized my issue was that my constructor did too much. This got me thinking, what makes a good constructor?

Wikipedia says that a constructor “prepares the new object for use, often accepting arguments that the constructor uses to set required member variables.” One of the books on my shelf, which I either learned Java from or simply collected over the years, Java Software Solutions by Lewis and Loftus, 3rd edition, says a constructor “is similar to a method that is invoked when an object is instantiated. … we often use a constructor to initialize the variables associated with each object.” Simula 67, the first language with classes, doesn’t so much have a constructor as allow the programmer to place logic in the class body to be executed. Smalltalk, the first OO language, doesn’t have constructors, though there are conventions. A quote from the first edition of The C++ Programming Language, by Bjarne Stroustrup, might add to this narrative, though I don’t have a copy. Regardless, perhaps history is not the best way to solve this issue.

Thankfully, the quotes from the Wikipedia article and from Lewis and Loftus agree: a constructor is for setting up object variables. This sounds straightforward enough, but what about my previous example? Is the authorized status of the request a property that should be set by the constructor? Clearly not, based on my experience. Perhaps authorized isn’t even a property, but merely an opportunistic caching of a calculation that, once determined, isn’t going to change during the lifetime of the object? I’m inclined to say not; authorized is a state of the object, which indicates the existence of an object variable to hold said state.

If that is the case, then what states can my object exist in? For my stripped-down example, my request object has the states of authorized and unauthorized. Which is where I got into trouble: determining whether a request was authorized required calling a healthy proportion of the object’s methods and, therefore, made it hard to unit test. Do I mock half of the class’s methods, nearly all of which are private? Find a way to disable the constructor? No is the answer to both of those questions. So I came up with a third state, a base state where I don’t yet know whether the request is authorized. Now my constructor is nothing but assignment statements and it’s easy to set an object up for unit testing. Unfortunately, I’ve had to push logic down into various other functions to handle situations where the information isn’t known and throw appropriate errors.

Whoops: turns out that the state of not knowing doesn’t obey the class invariants. Requests aren’t quantum; they actually are either authorized or not. And now I’ve got a problem. Do I make a class that creates objects that always obey their invariants but are not testable, or do I allow what amounts to two methods to fully construct the object but allow it to be fully tested? The latter is annoying but the clear winner. An alternate solution I hadn’t considered is passing the authorized status to the constructor as an optional parameter or override.
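A minimal sketch of where I ended up (the class and method names here are hypothetical): the constructor is nothing but assignments, and the authorized state is resolved by a separate call.

```python
class Request:
    """A request whose authorized state starts out unknown."""

    def __init__(self, headers, token):
        # nothing but assignment statements: trivial to set up in a unit test
        self.headers = headers
        self.token = token
        self.authorized = None  # base state: not yet determined

    def authorize(self):
        # the second half of "construction"; real logic would inspect
        # headers, tokens, etc.
        self.authorized = self.token == "valid-token"
        return self.authorized

    def require_authorized(self):
        # downstream methods throw when the state was never resolved
        if self.authorized is None:
            raise RuntimeError("authorize() has not been called")
        return self.authorized
```

The annoyance is exactly the one described above: callers must remember to call authorize(), and the other methods have to guard against the unknown state.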

What does this mean for the role of the constructor? It means that the constructor’s role is to set up a state of the object, meaning setting the object’s variables, in a way that obeys the class invariants. In a perfect world it would, anyway. More practically, it means picking the maximal subset of expert advice and best practices, where maximal is measured by the legibility and maintainability of the class. Which is what Wikipedia said all along, “prepares the new object for use,” emphasis on prepares.

Everyone loves JSON. I love JSON. JSON is great at storing complex data structures to be written and read by programs. It can even be written so that it’s fairly easily human readable. Yet the JSON format is missing comments, those things we mostly agree we need for anything non-trivial. This is completely fine for anything written by and for programs (I might have named this article “Don’t use JSON for your config files which humans might have to edit,” but I opted for brevity), but not so great when our meat computers are involved.

Some context is in order. Lately, I have been working on some CloudFormation templates. CloudFormation is a way to textually describe to AWS what you want to build. These files might contain things like a VPC object, a couple of subnet objects, and a couple of objects bringing the VPC and subnets together; then you fill that full of security groups, and autoscaling groups or EC2 instances, and you end up with a very long file. These files can easily become hundreds of lines long. I would really love to leave some comments for 3-months-from-now me, but I can’t, because it’s JSON.

I was going to spend a few paragraphs discussing the history of comments in an effort to explain why they are important, but there really isn’t much to discuss. If you look at the popular, early programming languages, COBOL, LISP, FORTRAN, they have comments. Assembly has comments. If you examine Ada Lovelace’s notes on Babbage’s Analytical Engine, you find her describing Note G (regarded as the first program). I will take this evidence as indicating the self-evidence of comments.

The question, then, is if comments are so prevalent, why doesn’t JSON have them? The answer is that JSON is a data encoding format meant to encode complex data structures for storage or transfer between programs (javascript <-> web server). It wasn’t designed for scripting. So let’s please stop using a screwdriver to hammer in a nail. Our future selves will appreciate it.

Might I suggest YAML instead?
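For config files you control, YAML gives you the same nesting plus comments. A sketch (the resource names here are made up):

```yaml
Resources:
  DevVPC:                     # hypothetical resource name
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16  # a note for 3-months-from-now me
```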

We, being the company I work for, recently set up a MySQL Galera cluster and HAProxy to load balance connections between the nodes. HAProxy has a MySQL health check, but it only logs into the server, and we wanted a bit more than that (Galera’s rsync option puts the server that is being synced from in read_only mode). What I didn’t want to do was install Apache or similar, because I wanted to leave as much of the system’s resources as possible available to MySQL. I solved the problem with a Perl script.

Before I move on, I should mention that I don’t like Perl. Other languages, such as Go, provide just as easy a solution, but Perl is installed on pretty much all Linux distros and, therefore, required less setup. The backbone of the script is the HTTP::Server::Simple::CGI package. My version of the script weighs in at a whole 26 lines of code. Here it, mostly, is.


use strict;
use warnings;

use File::Pid;

package MyWebServer;

use HTTP::Server::Simple::CGI;
use base qw(HTTP::Server::Simple::CGI);

sub handle_request {
    my ($self, $cgi) = @_;
    my $isFine = 0;
    # do your checking logic here
    if ($isFine) {
        print "HTTP/1.0 200 OK\r\n";
        print $cgi->header;
    } else {
        print "HTTP/1.0 503 Service Unavailable\r\n";
        print $cgi->header;
    }
}

package main;

my $pidfile = File::Pid->new();
if (!$pidfile->running) {
    my $server = MyWebServer->new(12345);
    $server->host('YOUR SERVERS IP GOES HERE');
    my $pid = $server->background();
    $pidfile->pid($pid);
    $pidfile->write;
}

The above code checks to see if a running PID for the script exists and exits if it does (the if block towards the bottom). It then sets up the server to listen on port 12345; use whatever port you want. The next line tells it to listen on a specific IP address. I set that from Chef as part of the .erb that builds this script; you could pass a parameter to the script if you don’t want to do that ($ARGV[0]). It then creates the server in the background and writes the PID file.

Of course, the real action is in the handle_request function in the package. That function gets called every time the script receives an http request. All mine does (and you could do a lot more here) is collect some information about the state of the server, a bit more on that in a second, and return a status of either 200 or 503, which is all HAProxy cares about. If your load balancer checks for actual content in the response, then you would add some prints after the $cgi->header calls.

As I mentioned in the first paragraph, the reason we set this up was to discover if the server happens to be in read_only mode. Thus, all my check does is shell out to mysql with a -e option to show global variables, and then run a regex over the output looking for read_only being set to OFF.

I’ve also set cron up to run the script every minute, which is why the PID stuff is in there. Pretty simple really.

With the reliance most web apps place on databases, making sure they are always available is important for your reliability. I consider MySQL to be finicky (a manual master-to-slave fail-over at 5am is not my idea of fun), but it is what I’m stuck with. Here is my understanding of the options for keeping your app up when MySQL isn’t.

Master – Slave

This is a pretty basic and common pattern. You have one server that gets all of the writes (the master) and another server that is replicating those writes (the slave). The replication is asynchronous and, therefore, can fall behind. Generally, if you keep the load under control on the slave it should keep up. You can send all of your traffic to the master, or you can send reads to the slave, distributing your read load. Writing to the slave will probably break replication (and definitely will if you insert into an auto-incrementing column).

Cleaning up a broken slave can be difficult. You either need to hunt down and undo any changes made to the slave or pull a dump from the master and import it. Care must be taken when importing to make sure that you know what position in the binlog to resume replication from.

If the master fails you can fail over to the slave, though this is a manual process. You’ll need to stop anything being written to the master (if it isn’t completely dead), stop the slave process on the slave, tell your application to write to the slave, get the master back up and running, get the master up to date (you’re probably not going to know the binlog position, which means you’ll likely need to do a full import), set up what was the master to be the slave, and start it replicating.

The benefits here are that it’s easy to set up, MySQL is pretty stable so you’ll rarely have to fix it, and as long as one server can give you enough write throughput, you can be reasonably happy. If you ever need more read throughput you can add additional slaves. The replication overhead on the master is low.

Master – Master

Similar to master-slave, but now both servers are configured to be slave and master for each other. This allows you to read and write to either server, because it will be replicated to the other. Of course, all of the same problems regarding repairing broken slaves, known binlog positions, etc. still apply. Both servers can’t attempt to get the same id from an auto increment, so you’ll need to do something along the lines of configuring one server to only use even numbers and the other to only use odd. If the servers come under load they might start to fall behind in replication. The order in which updates and inserts are applied might be different on each server, which might lead to them having different data.
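The even/odd trick is typically done with two server variables; a sketch for a two-node setup:

```ini
# my.cnf on server 1: hand out odd auto-increment ids
auto_increment_increment = 2
auto_increment_offset    = 1

# my.cnf on server 2: hand out even auto-increment ids
auto_increment_increment = 2
auto_increment_offset    = 2
```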

One way to resolve some of these problems is to only send traffic to one server.

With a VIP

If you are running on Linux then you can use a virtual IP. This requires a bit of network wizardry. What you end up with is a system where the passive server (the not-in-use server) is polling the active server (the in-use server) to make sure it’s alive. If it discovers that the active server is down, it steals the VIP and in doing so promotes itself to active. Your application won’t need to know about any of this, as the IP it connects to never changes, just the destination. You’ll still need to figure out what was wrong with the broken server and get it working again, but in theory there can be no downtime for the user.
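One common way to implement this (not the only one) is keepalived, which speaks VRRP between the two servers; a sketch, with an assumed interface name and VIP:

```
vrrp_instance mysql_vip {
    state MASTER            # the passive node would say BACKUP
    interface eth0          # assumed interface name
    virtual_router_id 51
    priority 100            # give the passive node a lower priority
    virtual_ipaddress {
        192.168.0.100       # the VIP your application connects to
    }
}
```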

With a distributed file system

The idea here is that the file system the servers write to is shared between them (NAS or SAN or the like). This is really more an active-passive solution than master-master, as one of your servers will need to be turned off or you risk corrupting data. You can also combine this with the VIP method, though you’ll need something to mount the drive (depending on the sharing method) and start mysql. What you get with this is the knowledge that the data on the active and passive node will be the same (it’s the same mount), at the expense of a little downtime while the passive mysql starts.


NDB

This is MySQL’s cluster. It’s its own engine, so you can’t use InnoDB or MyISAM; it has many moving parts; and it requires at least 3 servers. But it gives you a system where you can read from and write to any node without any of the data integrity complications inherent in the previous patterns. The system is composed of API nodes (generally mysql), data nodes, and management nodes. These processes can live on distinct machines or all on the same machine. Unlike the previous examples, all of your data does not live on all of your servers but is distributed across the cluster. This has the benefit of increasing your throughput as the number of nodes increases, though individual query performance can be impacted.

Data can be mirrored between the data nodes, meaning the loss of any individual node will not result in the loss of data. Nodes can be inserted and dropped without fuss or harm. For example, NDB updates itself through a rolling update where each node, one at a time, is dropped out of the pool, updated, and entered back into the pool.

You will probably need to make some application changes in order to use NDB. One set of concerns relates to security, as NDB is innately insecure, requiring proper use of DMZs. Please take a look at MySQL’s documentation for more information. Large join, sort, etc. performance can also be bad, as the rows involved will likely be spread across the data nodes.


Galera

Galera is a solution for MariaDB or Percona, which are forks of MySQL. Galera is also a clustered solution, one that replaces MySQL’s asynchronous replication with synchronous replication. Galera combines recent research and Percona’s XtraDB fork of InnoDB (MyISAM support is in beta, I believe, but isn’t production ready) to provide solid performance for synchronous replication. As with NDB, Galera allows you to read and write to any node, and to add or remove nodes with ease. When you add a node, the cluster will automatically get it synced with the rest of the cluster.

Unlike NDB, all data lives on all nodes. This has benefits and drawbacks for performance. Read performance is fast, and joins, sorts, etc. are fast, as everything is on the node getting the request. Insert and update speed will depend on the speed of the slowest node in your cluster. This is important to consider given that you will likely run a Galera cluster on commodity hardware. You can find benchmark data at this webpage.

I favor this solution but suspect that it isn’t suitable for solutions that require a lot of mysql servers to meet throughput demands. Using NDB with a caching layer to speed up frequent reads might be a better solution in that scenario.


Tungsten

Like Galera, Tungsten is a cluster solution that replaces MySQL’s built-in replication. It allows for complex replication strategies and replication between different versions of MySQL or different databases altogether. Replication happens via Tungsten Replicator, which is a separate process from mysql, so the solution is not as simple as Galera but probably makes up for it in its flexibility.


Amazon RDS

Amazon’s RDS (Relational Database Service) is part of its AWS offerings. You define what type of throughput you need and they handle the rest. The only drawback I know of here is that they don’t yet support encryption at rest, so if you have PHI or other data you need to encrypt, you are SOL. If you are in AWS and don’t require encryption at rest, this is probably the right place to start.

I just spent an embarrassing amount of time trying to figure out why some resources in one of my puppet classes were not being executed. All of them were inside an if branch that checked whether another class was defined. It was defined (wouldn’t be much of a blog post if that weren’t the case); there were no errors; it just wasn’t happening. Turns out that defined does not work the way I thought it did. From Puppet’s documentation:

Checking whether a given resource has been declared is, unfortunately, dependent on the parse order of the configuration

Because defined is evaluated during the parse step, the order in which your resources are declared matters. To give you a concrete example, I have two classes, php and imagemagick. The imagemagick class installs the imagick PHP extension for you if php is defined. How nice of it, right? This has always worked perfectly, until my most recent manifest, where I had something like:

class { 'imagemagick': }
class { 'php':
  before => Class['imagemagick'],
}

Which looks good: php has to happen before imagemagick, so it should be defined. But because imagemagick is parsed before php, php isn’t actually defined when imagemagick is evaluated, so nothing inside my if was run during the apply. To make this actually work, it needed to look like:

class { 'php':
  before => Class['imagemagick'],
}
class { 'imagemagick': }

And now it works. Ridiculous. Hope this saves someone some time.

Recently, I needed to add https support to our dev installs of our web app. The app itself needed to know it was using https, to generate proper URLs and the like, so terminating the SSL connection at the proxy was not a viable solution for me. HAProxy added support for SSL in 1.5, but this article isn’t about that, because I’m using CentOS and am therefore stuck with HAProxy 1.4.

First up, how not to solve this problem. My first thought was that if I put HAProxy in tcp mode it shouldn’t know anything about whether the connection was SSL or not. This did not work. Unfortunately my notes don’t say why it didn’t, but I assume either HAProxy was spitting out BADREQ with PR-- in the logs, or the payload was getting messed up and causing errors in negotiation.

Enter stunnel. Stunnel is an SSL tunnel and is what I used to handle the https request. Stunnel can be configured in either a server mode, which terminates SSL connections, or a client mode, which initiates SSL connections. This solution uses both. The general approach, which I found here, is to have the https connection received by a stunnel server, which forwards the now-http connection to HAProxy, which forwards the http connection to a stunnel client, which changes it back to https and forwards it along to the web server.

–https–> stunnel server –http–> haproxy –http–> stunnel client –https–> web server

Not pretty, but effective, and because all of the traffic between stunnel and haproxy is on localhost, it’s relatively fast.

The first thing needed is to get stunnel installed. It’s in yum.

Now to set up stunnel. I made a folder at /etc/stunnel to hold my configs and .pem file, which is the .key and .crt files concatenated together. I placed the .pem file in that folder. Next you will need a config file for each of the stunnel server and client. I named mine server.conf and client.conf. You might be able to do this with one config file; I’m not that familiar with stunnel. In both config files you will need/want 5 global settings defined:

cert=<path to your .pem file>
pid=/var/run/stunnel_(server|client).pid or something similar
output=/var/log/stunnel_(server|client).log or something similar

Basically: define where the pem lives, because we need that for the SSL handshake; define a pid file and a log file, because those are handy; and two lines that are basically gibberish to me but seem important (all of the Google results seemed to have them). I would also suggest you add foreground=yes while you test the config files, so that you can easily see what is happening and kill (ctrl-c) the process to make changes. The next bits define what stunnel is actually going to be doing. For the server:

accept=443
connect=8081
That basically says to listen on port 443 and forward anything received there to port 8081. Port 443 is important, but you could change 8081 to something different. The client will look a lot the same:

client=yes
accept=8082
connect=<server ip address>:443

There we tell stunnel that it’s to operate in client mode (client=no is the default, which is why it wasn’t in the server config), to listen on 8082 (which you could change to something else), and to connect to our web server on 443. If you had multiple web servers, you could put multiple connect lines in and it will round-robin the connections.

The last thing we need to change is our HAProxy config. This, at its most basic, would look something like:

frontend my_ssl_webpage
bind :8081
default_backend my_ssl_webpage_backend

backend my_ssl_webpage_backend
reqadd X-Forwarded-Proto:\ https
server stunnel1

Now, if you restart HAProxy, and start stunnel for both config files (sudo stunnel <path to config file>), you should have https requests arriving on your webserver.


Found a great manual for HAProxy 1.4. Here is the link:

An hour before I was going to leave work for the long holiday weekend I broke replication to a MySQL slave database. Go see fireworks? No thanks, I’ve made my own.

The first thing I’ve learned is that changing the master_host field in a CHANGE MASTER TO command will reset your master log and position values. This will destroy your replication or corrupt your data or both! Probably both. I had recently changed a LAN IP of the master and wanted to point the slave at it, figuring that if I didn’t change the master log and position everything would be fine. Nope. What I should have done, and feel foolish for not doing, was write down the current master log file and position after issuing a STOP SLAVE. That’s a really good idea whenever you do something with slave data, just in case something goes wrong. Specifically here, after seeing that the file and position were different, I could have put the correct numbers back in before restarting the slave. So, there is the first lesson:
Before a CHANGE MASTER TO, run a SHOW SLAVE STATUS\G and record the Master_Log_File and Exec_Master_Log_Pos.
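In practice that looks something like the following (the IP, log file name, and position are made-up values):

```sql
STOP SLAVE;
SHOW SLAVE STATUS\G
-- record Master_Log_File and Exec_Master_Log_Pos from the output, then:
CHANGE MASTER TO
    MASTER_HOST='192.168.0.10',          -- the new LAN IP
    MASTER_LOG_FILE='mysql-bin.000123',  -- from Master_Log_File
    MASTER_LOG_POS=4567;                 -- from Exec_Master_Log_Pos
START SLAVE;
```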

The second thing I’ve learned is that the --master-data option on a mysqldump does not do what I thought it would. It records the master data, which is what it says it does, but that’s the SHOW MASTER STATUS data, not the SHOW SLAVE STATUS data. It’s the numbers you need if you want to scp a snapshot to a potential slave and get it running. It is not the numbers you need if you do a backup on the slave and want to recover in the case of a failure. I figured this out after uncompressing, importing, and attempting to START SLAVE. This did not make me happy. To recover, I ended up running a backup off the master, something I would rather not do for performance reasons, but holiday weekend, and importing that backup. I haven’t looked into a long-term fix for this yet. It can wait for Monday. So, there is the second lesson:

The --master-data option on mysqldump is the SHOW MASTER STATUS equivalent and not the SHOW SLAVE STATUS equivalent.

The third thing I’ve learned, rediscovered really, is that database imports from mysqldump take a long time. A really long time. Seriously. For this I wrote a quick script that makes things a little bit faster. The time required to import a db from mysqldump is the sum of the time required to import each individual table. My script parses the dump file and splits it into a bunch of individual table files, which reduces the time required to that of the longest table import. The script is hacked together, written in Perl (which I’m not the best at), and missing prompts, help, and safety measures, but here it is in case you want it. You’ll need to edit the mysql command in the import function to connect to mysql and use the correct database.


use strict;
use warnings;
use threads;

# requires the path to the gzipped db dump
my $gzipped = $ARGV[0];

`mkdir ./dbbackupimport`;
`cp $gzipped ./dbbackupimport/backup.sql.gz`;
`gzip -d ./dbbackupimport/backup.sql.gz`;

# break the dump file up into one file per table
open(my $dumpfh, "<", "./dbbackupimport/backup.sql");
open(my $currentfh, ">", "./dbbackupimport/backupheader.sql");
my @tables = ();
while (<$dumpfh>) {
    my $line = $_;
    if ($line =~ /^DROP TABLE IF EXISTS \`([\w]+)\`.*/) {
        my $table = $1;
        print "Found table $table\n";
        push @tables, $table;
        close $currentfh if $currentfh;
        # include the header so the imports can disable keys
        `cat ./dbbackupimport/backupheader.sql > ./dbbackupimport/$table.sql`;
        open($currentfh, ">>", "./dbbackupimport/$table.sql");
        print $currentfh "\n";
    }
    print $currentfh $line;
}
close $currentfh;

# each import thread executes this function
sub import_thread {
    my $table = $_[0];
    print "mysql < ./dbbackupimport/$table.sql\n";
    `mysql < ./dbbackupimport/$table.sql`;
    return $table;
}

# spawn the threads
my @threads = ();
for my $table (@tables) {
    push @threads, threads->create('import_thread', $table);
}

# collect the threads
for my $thread (@threads) {
    my $table = $thread->join();
    print "Finished importing $table\n";
}
Hourly Drop

Drops at 14:00 and 15:00.

For a long time, puppet apply lines were added to server crontab files, and we were content with the trade-off between timely updates and server resources used. More recently, while looking at some of our monitoring graphs, we noticed that one of our core business activities took a nose dive every hour, on the hour. When did puppet run? Every hour, on the hour. Time for a change.

My goal was to make it so that there was a big red button to press whenever we wanted to run puppet, which is largely whenever we want to push code. Automatic deploys would be really nice, but FDA regulations pretty much force something similar to waterfall. Anyway, Jenkins’ play button is close enough to the big red button, so I went with that (I also already had it installed). That left the following problems to overcome: running commands in production, getting around the DMZ, running puppet with sudo privileges, and doing it all in a timely manner. Turns out, this actually wasn’t terribly difficult. Here’s what I had to do.

  • Set up Puppet. I’m assuming you’ve already got this part.
  • Set up Jenkins. Again, assuming you’ve done this, but if not, it’s probably in your package manager.
    • Bonus, secure Jenkins so random web traffic can’t access it. I used HAProxy rules to only forward the request to Jenkins if it originated inside of the office.
  • Install the Publish Over SSH plugin. This will allow you to ssh into whatever server you already have punched through the DMZ. I’ll refer to this server as Smuggler here for short.
  • Install pdsh on Smuggler.
  • Run ssh-keygen on Smuggler if you haven’t already.
  • In Jenkins, add Smuggler (Manage Jenkins -> Configure System) with its public key.
  • Distribute Smuggler’s public key to every DMZed server you want to run puppet on. You can do this with ssh-copy-id or just copy the key into .ssh/authorized_keys on the servers you want to log into.
  • On every server you want Smuggler to log into, run visudo (sudo visudo is a weird sort of command) and change the following:
    • Comment out “Defaults requiretty”. This makes it so you can run sudo from ssh in one command.
    • Add a line with, “user    ALL = (ALL)    NOPASSWD: /usr/bin/puppet”. Replace user with the correct username. That gives that user the ability to run puppet as sudo without a password prompt.
  • Create or edit a project in Jenkins and add a new ssh build step.
    • Select your server to connect to.
    • In the execute command part of the ssh step you want to set up a pdsh command. You’ll need two parameters for this,
      • -R exec. This basically tells pdsh to execute a command for every server. If you don’t have any atypical ssh options, you could say -R ssh.
      • -w <targets>. This is where you specify what to log into. You’ll need a comma-separated list of hosts or IPs. Thankfully you can use a range in the form of [01-16]. For example: “-w 192.168.0.[1-5],[01-15]”.
    • The last part of the pdsh command is the command to run. If you used exec, you’ll end up with something like “ssh %h sudo puppet apply myManifest.pp”, or “ssh %h sudo puppet agent --no-daemonize --onetime”, or something like that. The %h substitutes the server’s host into the command.
    • All together, you end up with something in your execute field that looks something like:
      • pdsh -R exec -w 192.168.0.[1-5],[01-15] ssh %h sudo puppet apply myManifest.pp

With that in place, we are no longer wasting cycles having puppet accomplish nothing. As a bonus we can also modify puppet modules, stage code (we build rpms and use yum to deploy our code), or whatever else we need to do without fear of puppet sending it out before we’re ready. Double bonus, if we screw up we don’t have to wait for the next puppet run to deploy our fix.

SREcon14 badge.

Last week in Santa Clara, CA, SREcon14 happened. I attended. I took notes.

The conference had a keynote and a closing talk, plus 4 other talk slots, each with 2 talks, 1 panel, and 1 free room for discussion. I attended the keynote, a panel on releasing at scale, a talk on cascading failures, a panel on load shedding, a talk on finding and resolving problems, and the closing talk. In general it was a damned good conference and was worth the sleepless night on the airplane back to the East Coast. The general things I took away from the conference (the TL;DR) are: monitor, gracefully degrade, empower your devs, and beware retries.

I should note that it’s been a week since SREcon14 and part of what follows comes from my notes and the rest from my brain. I apologize for any errors that arise due to me incorrectly remembering a detail.

Keynote by Ben Treynor of Google

  • SRE is what happens when you ask Software Engineers to run production.
  • Reliability is important. Would you rather go to Gmail as it was in 2010 or Gmail 500 Error? Reliability is measured in the absence of errors. By the time users start to notice things are wrong multiple systems have failed and it will take a long time to make the system reliable again.
  • Development wants to launch cool new features (and drive their competition to drinking), and Operations doesn’t want things to blow up. These two goals are in conflict. You end up with releases gated by lengthy processes such as launch reviews, deep-dives, release checklists, etc. Dev typically counters by downplaying the significance of the change: feature addition, flag flip, UI change, etc. The general problem is that Ops knows the least about the code but has the strongest incentive to stop its release.
  • Google solves this with Error Budgets. EB = 1 – SLA. So if your SLA is 0.999, your error budget is 0.001, which works out to almost 9 hours of downtime a year. If you go outside of your EB then a code freeze is issued until you can run within your SLA. This works because most services rarely need a 100% SLA (a pacemaker is a good exception), and because it’s self-regulating: devs don’t like it when their awesome feature can’t go out because of another dev’s broken feature.
  • Google also has a common staff pool between its SE and SRE teams. You want to keep the devs in rotation because you want them aware of the SRE work. An SRE’s time should be capped at 50% Ops work. They’re SRE, not just Ops, and you want them doing SRE stuff. If work becomes greater than 50% then the overflow goes to devs. Nothing builds consensus on bug priority like a couple of sleepless nights. Also, the more time devs spend working on Ops, the less time they are developing, which self-regulates the number of changes that can make things worse.
  • SREs should be portable between projects and into SE because employee happiness is important.
  • Outages will happen and that’s OK. When outages happen minimize the damage and make sure it doesn’t happen again. Minimize the time between the start and end of an outage. NOC is too slow. Outage has been happening for several minutes by the time you notice. Need alerting but alerts also need good diagnostic information. The more information available in the alert the less time someone has to hunt down that information in order to fix the problem.
  • Practice outages but make it fun. Make someone the dungeon master of the outage.
  • Write honest and blame free (assume everyone is doing their best) postmortems. If you don’t have time for a postmortem then you are agreeing that the issue is going to happen again.

Releasing at Scale Panel

  • Panel Members:
    • Chuck Rossi of Facebook
    • Helena Tian of Google (Google+ I believe)
    • Jos Boumans of Krux Digital
    • Daniel Schauenberg of Etsy
  • Make tools available to your Devs to do and know (monitoring) about releases.
  • Web is solved but mobile is an unsolved problem. Chuck Rossi really was not happy with the state of mobile deployment.
  • Decouple your code and configs so a config change doesn’t get held up by code.
  • There is a tension between quick features and stability.
  • Degrade gracefully.
  • Facebook wants more open source because they don’t want to solve this by themselves.
  • It isn’t a question of if something breaks but when and you want to know before the client.
  • If something does go wrong then you can pay for it in latency, accuracy, etc. Again, gracefully degrade.
  • Measure everything, fail gracefully, run like production in dev, and your deploy tools are production.

Cascading Failures by Mike Ulrich of Google

  • Stop positive feedback loops from getting out of control. An example is 1 instance failing causing increased load over other instances leading to additional failures. “Lasering a backend to death.”
  • To prevent them: reduce the time to recover, gracefully degrade instead of failing, and load test. Limit the number of in-flight requests and drop what you can afford to if needed. Triggers include load pattern shifts, unexpected increases in traffic (a networking hiccup causes a 20-second pause and then everything crashes in at once), and operational failures. Failure of a service is doubly bad because you are both not serving requests and waiting for the service to recover. Know your QPS and your total-failure QPS.
  • Loop retries can make things much worse, but randomized, exponential back-offs can mitigate some of that risk. Limit your retries. Remember that you are holding the request in memory while retrying is happening. Push retries to the user side if practical.
  • Recognize that your systems have limits. Don’t trust the internal components of your system. Try to limit positive feedback. Test your assumptions.
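The retry advice above can be sketched in a few lines. This is a minimal illustration of capped, randomized exponential backoff with a retry limit (the function and parameter names are mine, not from the talk):

```python
import random
import time

def call_with_backoff(fn, max_retries=4, base=0.1, cap=5.0):
    """Call fn, retrying with capped, randomized exponential backoff.
    Jitter spreads out retries so a recovering backend isn't hit by a
    synchronized retry storm; the retry cap bounds how long a request
    is held in memory."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the failure
            # "full jitter": sleep anywhere from 0 up to the exponential ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The key points from the talk are all here: retries are bounded, back-off grows exponentially, and randomization breaks up the positive feedback loop of everyone retrying at once.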

Load Shedding Panel

  • Panel Members:
    • Manjot Pahwa of Google
    • Jos Boumans of Krux Digital
    • Nick Berry of LinkedIn
    • Bruce Wong of Netflix
  • Identify there is a problem and identify what can be dropped. Understanding the business is critical to understanding what is important and must be kept.
  • End users’ happiness is critical.
  • A general takeaway from this panel is that everything depends on your business. There are a lot of questions without generalized answers. Do you degrade everyone or stop a % of users? How do you know how much to shed and what?
  • Retries can save or kill you.
  • Provide tools to empower your Devs.
  • Automatic shedding has pros and cons. Start with manual controls; those will progress toward automated systems. Have a button to turn off the automated system.
  • Cloud auto-scaling can mask issues. You have to be disciplined to find and fix issues. Cloud gives extra issues with disk, network (can’t run a wire between two VMs), etc. If small, you can spend money to cover peaks. That doesn’t work as well when you’re really large.
  • Understand your user’s expectations. Know the layers of your app and where you can shed. Isolate.
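One common way to shed load is to cap in-flight requests and reject the overflow outright. A minimal sketch of the idea (the class and method names are my own invention):

```python
import threading

class LoadShedder:
    """Reject work once too many requests are in flight.
    Shedding early and cheaply beats queueing requests you will
    never manage to serve in time."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, process):
        if not self._slots.acquire(blocking=False):
            return "shed"  # e.g. answer 503 so the client can back off
        try:
            return process(request)
        finally:
            self._slots.release()  # free the slot even if process() raises
```

What counts as sheddable, and whether you degrade everyone a little or cut off a percentage of users entirely, is exactly the business question the panel kept coming back to.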

Finding and Resolving Problems at Scale by Ben Maurer of Facebook

  • Fail fast.
  • Queuing amplifies slowness. Good queuing handles a burst of load, whereas bad queuing leaves you in a state where there is always a queue. In a controlled queue the queue always becomes empty. He mentioned something about Adaptive LIFO and a product they have with it but I can’t seem to find information on that.
  • Cap the number of concurrent requests going to your system.
  • Examine the tcp_abort_on_overflow option. If the service is overloaded, retries won’t help.
  • Seek out problems, don’t wait for monitoring. There are things you don’t know you don’t know. Try to catch issues before problems are big.
  • The TCP stack is a rich source of metrics. Resets, retransmits, open sockets, etc.
  • Metric grouping is important. Time, location, type, etc.
  • Visualization is important for readability. Recommended Cubism.js by Square. Many questions raised by the audience about readability of that tool.
  • During an incident, communication is critical. IRC is nice because it has a log, allowing engineers joining in to read the history and catch up on the problem. Having an incident manager on call, who knows how to organize and foster communication, is nice. Someone needs to think about the big picture and not just the details.
  • Have incident reviews to learn from incidents. DERP: Detection, Escalation, Remediation, Prevention. Was your monitoring good? Were the right people involved? How did you fix it, and could you automate the fix? It will happen again, so how can it be safer? Think bigger than your own team.
  • You can learn a lot about SRE from attending incident reviews.
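Since I couldn’t find details on Adaptive LIFO, here is my own rough guess at the “controlled queue” idea: bound the queue depth, shed when full, and serve newest-first once a backlog builds, since the newest caller is the one most likely still waiting. All names and thresholds here are mine, not Facebook’s:

```python
from collections import deque

class ControlledQueue:
    """Bounded request queue that sheds when full and flips to
    newest-first (LIFO) service under backlog. The healthy state
    is an empty queue."""

    def __init__(self, max_depth, lifo_threshold):
        self._q = deque()
        self.max_depth = max_depth          # shed beyond this depth
        self.lifo_threshold = lifo_threshold  # backlog depth that flips to LIFO

    def offer(self, request):
        if len(self._q) >= self.max_depth:
            return False  # shed rather than build an unbounded backlog
        self._q.append(request)
        return True

    def take(self):
        if not self._q:
            return None
        # behind: serve the newest request; caught up: normal FIFO order
        if len(self._q) > self.lifo_threshold:
            return self._q.pop()
        return self._q.popleft()
```

The shedding cap is what keeps you out of the “there is always a queue” failure mode the talk warned about.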

How Silicon Valley’s SREs Saved Healthcare.gov by Michael Dickerson

He didn’t want anyone to record or take pictures during the presentation, so I’m not going to say much about his talk. It was a very good and amusing talk.

  • He discussed the importance of monitoring and communication.
  • Mentioned a company named New Relic which provides automated cloud monitoring for common languages and frameworks.
  • Do science until you find the problem.
  • To say something “should work like X,” you need a clear model of the system in your mind. Pretty much all systems are too big for that.