An hour before I was going to leave work for the long holiday weekend, I broke replication to a MySQL slave database. Go see fireworks? No thanks, I’ve made my own.

The first thing I’ve learned is that changing the master_host field in a CHANGE MASTER TO command will reset your master log file and position values. This will destroy your replication or corrupt your data, or both! Probably both. I had recently changed a LAN IP on the master and wanted to point the slave at the new address, and figured that if I didn’t change the master log and position everything would be fine. Nope. What I should have done, and feel foolish for not doing, was write down the current master log file and position after issuing a STOP SLAVE. That’s a really good idea whenever you do something with slave data, just in case something goes wrong. Specifically here, after seeing that the file and position were different I could have put the correct numbers back in before restarting the slave. So, there is the first lesson:

Before a CHANGE MASTER TO, run a SHOW SLAVE STATUS\G and record the Master_Log_File and Exec_Master_Log_Pos.
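In case a concrete sketch helps, the routine I wish I’d followed looks roughly like this (the IP, log file name, and position below are made-up examples, not values from my setup):

STOP SLAVE;
SHOW SLAVE STATUS\G
-- write down Master_Log_File and Exec_Master_Log_Pos, e.g. mysql-bin.000123 and 456789
CHANGE MASTER TO MASTER_HOST='192.168.1.20', MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=456789;
START SLAVE;
SHOW SLAVE STATUS\G
-- confirm Slave_IO_Running and Slave_SQL_Running both say Yes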

The second thing I’ve learned is that the --master-data option on a mysqldump does not do what I thought it would. It records the master data, which is what it says it does, but that’s the SHOW MASTER STATUS data, not the SHOW SLAVE STATUS data. Those are the numbers you need if you want to scp a snapshot to a potential slave and get it running. They are not the numbers you need if you do a backup on the slave and want to recover in the case of a failure. I figured this out after uncompressing, importing, and attempting to START SLAVE. This did not make me happy. To recover from this, I ended up running a backup off the master, something I would rather not do for performance reasons but holiday weekend, and importing that backup. I haven’t looked into a long term fix for this yet. It can wait for Monday. So, there is the second lesson:

The --master-data option on mysqldump is the SHOW MASTER STATUS equivalent and not the SHOW SLAVE STATUS equivalent.
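To make that concrete, here is roughly what a backup taken with --master-data looks like (the coordinates are made up; with --master-data=2 the statement is written as a comment, with =1 it is live):

mysqldump --master-data=2 --all-databases | gzip > backup.sql.gz
# near the top of the dump you get a line like:
#   CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000045', MASTER_LOG_POS=1234;
# taken on a slave, those are the slave's own binlog coordinates (SHOW MASTER STATUS),
# not where the slave currently is in its master's binlog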

The third thing I’ve learned, rediscovered really, is that database imports from mysqldump take a long time. A really long time. Seriously. For this I wrote a quick script that makes things a little bit faster. The time required to import a db from mysqldump is the sum of the time required to import each individual table. My script parses the dump file and splits it into a bunch of individual table files, which reduces the time required to roughly that of the longest table import. The script is hacked together, written in perl (which I’m not the best at), and missing prompts, help, or safety measures, but here it is in case you want it. You’ll need to edit the mysql command in the function in order to connect to mysql and use the correct database.

#!/usr/bin/perl

use strict;
use warnings;
use threads;

#requires the path for the db gzipped dump
my $gzipped = $ARGV[0];

`mkdir ./dbbackupimport`;
`cp $gzipped ./dbbackupimport/backup.sql.gz`;
`gzip -d ./dbbackupimport/backup.sql.gz`;

#break the dump file up into files per table
open(my $dumpfh, "<", "./dbbackupimport/backup.sql");
open(my $currentfh, ">", "./dbbackupimport/backupheader.sql");
my @tables = ();
while (<$dumpfh>) {
    my $line = $_;
    if ($line =~ /^DROP TABLE IF EXISTS \`([\w]+)\`.*/) {
        my $table = $1;
        print "Found table $table\n";
        push @tables, $table;
        close $currentfh if $currentfh;
        #include the header so the imports can disable keys
        `cat ./dbbackupimport/backupheader.sql > ./dbbackupimport/$table.sql`;
        open($currentfh, ">>", "./dbbackupimport/$table.sql");
        print $currentfh "\n";
    }
    print $currentfh $line;
}
close $currentfh;

#spawn threads to import the data
#each thread will execute this function
sub import_thread {
    my ($table) = @_;
    print "mysql < ./dbbackupimport/$table.sql\n";
    `mysql < ./dbbackupimport/$table.sql`;
    return $table;
}
#spawn the threads
my @threads = ();
for my $table (@tables) {
    push @threads, threads->create('import_thread', $table);
}
#collect the threads
for my $thread (@threads) {
    my $table = $thread->join();
    print "Finished importing $table\n";
}
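If you want to try it, usage is roughly the following (I’m assuming you saved it as parallel_import.pl, a name I just made up, and that you’re feeding it a gzipped dump):

chmod +x parallel_import.pl
./parallel_import.pl /path/to/backup.sql.gz
# working files land in ./dbbackupimport/ and each table is imported in its own thread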
Hourly Drop

[Graph: drops at 14:00 and 15:00.]

For a long time puppet apply lines were added to server crontab files, and we were content with the trade-off between timely updates and server resources used. More recently, while looking at some of our monitoring graphs, we noticed that one of our core business activities took a nosedive every hour, on the hour. When did puppet run? Every hour, on the hour. Time for a change.

My goal was to make it so that there was a big red button to press whenever we wanted to run puppet, which is largely whenever we want to push code. Automatic deploys would be really nice, but FDA regulations pretty much force something similar to waterfall. Anyway, the Jenkins play button is close enough to a big red button so I went with that – I also already had it installed. That left the following problems to overcome: running commands in production, getting around the DMZ, running puppet with sudo privileges, and doing it all in a timely manner. Turns out, this actually wasn’t terribly difficult. Here’s what I had to do.

  • Set up Puppet. I’m assuming you’ve already got this part.
  • Set up Jenkins. Again, assuming you’ve done this, but if not, it’s probably in your package manager.
    • Bonus, secure Jenkins so random web traffic can’t access it. I used HAProxy rules to only forward the request to Jenkins if it originated inside of the office.
  • Install the Publish Over SSH plugin. This will allow you to ssh into whatever server you already have punched through the DMZ. I’ll refer to this server as Smuggler here for short.
  • Install pdsh on Smuggler.
  • Run ssh-keygen on Smuggler if you haven’t already.
  • In Jenkins, add Smuggler (Manage Jenkins -> Configure System) with its public key.
  • Distribute Smuggler’s public key to every DMZed server you want to run puppet on. You can do this with ssh-copy-id or just copy the key into .ssh/authorized_keys on the servers you want to log into.
  • On every server you want Smuggler to log into, run visudo – sudo visudo is a weird sort of command – and change the following:
    • Comment out “Defaults requiretty”. This makes it so you can run sudo from ssh in one command.
    • Add a line with, “user    ALL = (ALL)    NOPASSWD: /usr/bin/puppet”. Replace user with the correct username. That gives that user the ability to run puppet as sudo without a password prompt.
  • Create or edit a project in Jenkins and add a new ssh build step.
    • Select your server to connect to.
    • In the execute command part of the ssh step you want to set up a pdsh command. You’ll need two parameters for this,
      • -R exec. This basically tells pdsh to execute a command for every server. If you don’t have any atypical ssh options, you could say -R ssh.
      • -w <targets>. This is where you specify what to log into. You’ll need a comma-separated list of hosts or IPs. Thankfully you can use a range in the form of [01-16]. To ssh into IPs 192.168.0.1 to 192.168.0.5 and 192.168.0.101 to 192.168.0.115 you would say, “-w 192.168.0.[1-5],192.168.0.1[01-15]”.
    • The last part of the pdsh command you need is the command to run. If you used exec you’ll end up with something like, “ssh %h sudo puppet apply myManifest.pp”, or “ssh %h sudo puppet agent --no-daemonize --onetime”, or something like that. The %h substitutes the server’s hostname into the command.
    • Altogether, you end up with something in your execute field that looks like:
      • pdsh -R exec -w 192.168.0.[1-5],192.168.0.1[01-15] ssh %h sudo puppet apply myManifest.pp
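Before wiring that into Jenkins, it’s worth shelling into Smuggler and running the same command by hand to shake out any key or sudo problems (the user, host, and IP ranges here are the made-up examples from above):

ssh user@smuggler
pdsh -R exec -w 192.168.0.[1-5],192.168.0.1[01-15] ssh %h sudo puppet apply myManifest.pp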

With that in place, we are no longer wasting cycles having puppet accomplish nothing. As a bonus we can also modify puppet modules, stage code (we build rpms and use yum to deploy our code), or whatever else we need to do without fear of puppet sending it out before we’re ready. Double bonus, if we screw up we don’t have to wait for the next puppet run to deploy our fix.

[Image: SREcon14 badge.]

Last week in Santa Clara, CA, SREcon14 happened. I attended. I took notes.

The conference had a keynote and a closing talk, plus 4 other talk slots, each with 2 talks, 1 panel, and 1 free room for discussion. I attended the keynote, a panel on releasing at scale, a talk on cascading failures, a panel on load shedding, a talk on finding and resolving problems, and the closing talk. In general it was a damned good conference and was worth the sleepless night on the airplane back to the East Coast. The general things I feel like I took away from the conference (the TL;DR) are to monitor, gracefully degrade, empower your devs, and beware retries.

I should note that it’s been a week since SREcon14 and part of what follows comes from my notes and the rest from my brain. I apologize for any errors that arise from me incorrectly remembering a detail.

Keynote by Ben Treynor of Google

  • SRE is what happens when you ask Software Engineers to run production.
  • Reliability is important. Would you rather go to Gmail as it was in 2010 or Gmail 500 Error? Reliability is measured in the absence of errors. By the time users start to notice things are wrong multiple systems have failed and it will take a long time to make the system reliable again.
  • Development wants to launch cool new features (and drive their competition to drinking), and Operations doesn’t want things to blow up. These two goals are in conflict. You end up with releases gated by lengthy processes such as launch reviews, deep-dives, release checklists, etc. Dev typically counters by downplaying the significance of the change: feature addition, flag flip, UI change, etc. The general problem is that Ops knows the least about the code but has the strongest incentive to stop its release.
  • Google solves this with Error Budgets. EB = 1 - SLA. So if your SLA is 0.999, your error budget is 0.001, which works out to almost 9 hours of downtime a year. If you go outside of your EB then a code freeze is issued until you can run within your SLA. This works because most companies rarely need a 100% SLA (a pacemaker is a good exception). It also works because it’s self-regulating. Devs don’t like it when their awesome feature can’t go out because of another dev’s broken feature.
  • Google also has a common staff pool between its SE and SRE teams. You want to keep the devs in rotation because you want them aware of the SRE work. An SRE’s time should be capped at 50% Ops work. They’re SRE, not just Ops, and you want them doing SRE stuff. If Ops work becomes greater than 50% then the overflow goes to the devs. Nothing builds consensus on bug priority like a couple of sleepless nights. Also, the more time devs spend working on Ops, the less time they are developing, which self-regulates the number of changes that can make things worse.
  • SREs should be portable between projects and into SE because employee happiness is important.
  • Outages will happen and that’s OK. When outages happen minimize the damage and make sure it doesn’t happen again. Minimize the time between the start and end of an outage. NOC is too slow. Outage has been happening for several minutes by the time you notice. Need alerting but alerts also need good diagnostic information. The more information available in the alert the less time someone has to hunt down that information in order to fix the problem.
  • Practice outages but make it fun. Make someone the dungeon master of the outage.
  • Write honest and blame free (assume everyone is doing their best) postmortems. If you don’t have time for a postmortem then you are agreeing that the issue is going to happen again.

Releasing at Scale Panel

  • Panel Members:
    • Chuck Rossi of Facebook
    • Helena Tian of Google (Google+ I believe)
    • Jos Boumans of Krux Digital
    • Daniel Schauenberg of Etsy
  • Make tools available to your Devs so they can do releases and know (monitoring) about them.
  • Web is solved but mobile is an unsolved problem. Chuck Rossi really was not happy with the state of mobile deployment.
  • Decouple your code and configs so a config change doesn’t get held up by code.
  • There is a tension between quick features and stability.
  • Degrade gracefully.
  • Facebook wants more open source because they don’t want to solve this by themselves.
  • It isn’t a question of if something breaks but when and you want to know before the client.
  • If something does go wrong then you can pay for it in latency, accuracy, etc. Again, gracefully degrade.
  • Measure everything, fail gracefully, run like production in dev, and your deploy tools are production.

Cascading Failures by Mike Ulrich of Google

  • Stop positive feedback loops from getting out of control. An example is one instance failing, increasing the load on the other instances, and leading to additional failures. “Lasering a backend to death.”
  • To prevent this, reduce the time to recover, gracefully degrade instead of failing, and load test. Limit the number of in-flight requests, and drop what you can afford to if needed. Triggers include load pattern shifts, unexpected increases in traffic (a networking hiccup causes a pause for 20 seconds and then everything crashes in at once), and operational failures. Failure of a service is bad because you are both not serving requests and having to wait for the service to recover. Know your QPS and total failure QPS.
  • Loop retries can make things much worse but randomized, exponential back-offs can mitigate some of that risk. Limit your retries. Remember that you are holding the request in memory while retrying is happening. Push that to the user side if practical.
  • Recognize that your systems have limits. Don’t trust the internal components of your system. Try to limit positive feedback. Test your assumptions.

Load Shedding Panel

  • Panel Members:
    • Manjot Pahwa of Google
    • Jos Boumans of Krux Digital
    • Nick Berry of LinkedIn
    • Bruce Wong of Netflix
  • Identify there is a problem and identify what can be dropped. Understanding the business is critical to understanding what is important and must be kept.
  • End users’ happiness is critical.
  • A general takeaway from this panel is that everything depends on your business. There are a lot of questions without generalized answers. Do you degrade everyone or stop a % of users? How do you know how much to shed and what?
  • Retries can save or kill you.
  • Provide tools to empower your Devs.
  • Automatic shedding has pros and cons. You should start with manual controls as those will progress towards automated systems. Have a button to turn off the automated system.
  • Cloud auto-scaling can mask issues. You have to be disciplined to find and fix issues. Cloud gives extra issues with disk, network (can’t run a wire between two VMs), etc. If small, you can spend money to cover peaks. That doesn’t work as well when you’re really large.
  • Understand your user’s expectations. Know the layers of your app and where you can shed. Isolate.

Finding and Resolving Problems at Scale by Ben Maurer of Facebook

  • Fail fast.
  • Queuing amplifies slowness. Good queuing handles a burst of load, whereas bad queuing leaves you in a state where there is always a queue. In a controlled queue the queue always becomes empty. He mentioned something about Adaptive LIFO and a product they have with it, but I can’t seem to find information on that.
  • Cap the number of concurrent requests going to your system.
  • Examine the tcp_abort_on_overflow option. If the service is overloaded, retries won’t help.
  • Seek out problems, don’t wait for monitoring. There are things you don’t know you don’t know. Try to catch issues before problems are big.
  • The TCP stack is a rich source of metrics. Resets, retransmits, open sockets, etc.
  • Metric grouping is important. Time, location, type, etc.
  • Visualization is important for readability. Recommended Cubism.js by Square. Many questions raised by the audience about readability of that tool.
  • During an incident, communication is critical. IRC is nice because it has a log. It allows engineers joining in to read the history and catch up on the problem. Having an incident manager on call, who knows how to organize and foster communication, is nice. Someone needs to think about the big picture and not just the details.
  • Have incident reviews to learn from incidents. DERP: Detection, Escalation, Remediation, Prevention. Was your monitoring good? Were the right people involved? How did you fix it, and could you automate that? It will happen again, so how can it be made safer? Think bigger than your own team.
  • You can learn a lot about SRE from attending incident reviews.

How Silicon Valley’s SREs Saved Healthcare.gov by Michael Dickerson

He didn’t want anyone to record or take pictures during the presentation, so I’m not going to say much about his talk. It was a very good and amusing talk.

  • He discussed the importance of monitoring and communication.
  • Mentioned a company named New Relic which provides automated cloud monitoring for common languages and frameworks.
  • Do science until you find the problem.
  • To say something “should work like X,” you need a clear model of the system in your mind. Pretty much all systems are too big for that.

We use elasticsearch as part of a centralized logging system (logstash, elasticsearch, kibana). Unfortunately we didn’t give the ES machine much disk space, and thus, it ran out of space. After cleaning up some space and starting ES, it started writing lots of warnings and stack traces like:

  • sending failed shard for
  • received shard failed for
  • failed to parse
  • failed to start shard
  • failed to recover shard

The disks were filling up again with error logs and the CPU was pegged. Thankfully I found this forum post: https://groups.google.com/forum/#!topic/elasticsearch/HtgNeUJ5uao. A few posts in, Igor Motov suggests deleting the corrupted translog files. The idea is that because the server ran out of disk space it didn’t complete writing to the translogs, and because the translogs were incomplete files, ES couldn’t read them to bring the indices back into correct states. If you delete those files then you may lose a few queries that had yet to be written into the indices, but at least the indices will work again.

To fix this you need to look in the ES logs, /var/log/elasticsearch/elasticsearch.log on CentOS, and find the error lines above. On those lines you’ll see something like

[<timestamp>][WARN ][cluster.action.shard] [<weird name>] [logstash-2014.05.13][X]

where X is the shard number, likely 0 through 4, and the block before that (logstash-<date> for me, and for you too if you’re doing centralized logging like we are) is the index name. You then need to go to the index location, /var/lib/elasticsearch/elasticsearch/nodes/0/indices/ on CentOS. In that directory you’ll be able to find the following structure: logstash-<date>/X/translog/translog-<really big number>. That’s the file you’ll need to delete, so:

  1. sudo service elasticsearch stop
  2. sudo rm /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-date/X/translog/translog-blalblabla
  3. repeat step 2 for all indices and shards in the error log
  4. sudo service elasticsearch start

Watch the logs and repeat that process as needed until the ES logs stop spitting out stack traces.
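As a concrete sketch of steps 1 through 4, using the example index name from the log line above and a made-up shard number (adjust the paths for your distro, and double-check what you’re about to delete before deleting it):

sudo service elasticsearch stop
# for each index/shard pair named in the error log, e.g. index logstash-2014.05.13, shard 2:
sudo ls /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2014.05.13/2/translog/
sudo rm /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-2014.05.13/2/translog/translog-*
sudo service elasticsearch start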

On Friday I ran into a problem that very much looked like a bug in PHP. From what I can tell, a function that should have been called was not. Here is an oversimplification of the code:

function foo() {
   ... some lines of code ...
   $this->bar()
   ... couple lines of code ...
   if(some stuff)
      error log that happened
   ... some more lines of code ...
}

function bar() {
   ... some lines of code ...
   error log that didn't happen
   $this->someState = a new state
   ... some lines of code ...
}

I ran this code once and the state of the class didn’t change; the error log line from bar() was not in the logs but the error log from foo() was. The error log in bar() isn’t inside any kind of if or other flow control, nor is there any place to return from bar() before the error log line. In foo() the call to bar() is also not in any if or similar block. Basically, if foo() was called, which I know it was because of the log line, and I know it was that log line because it’s unique, then bar() must have been called, and if bar() was called then the log from bar() must have been generated.

I saw that, scratched my head for awhile, and then ran the code again. This time I saw both log lines and the object’s state changed as it is supposed to when bar() is run. The code was not changed between executions. The same data was provided for the execution each time. The log line in bar() was new, but I had run the code with that change a few times before with different data and bar() properly wrote to the error log.

Yesterday I wrote an article that mostly explained how to install and configure haproxy. Today I want to describe the specific solution I’ve come up with for handling a development environment with multiple services running on multiple servers. My goal is to simplify things, specifically our networking and configuration. Complicating factors include:

  • not wanting to change some domains/urls (though I do want to remove the ports)
  • minimize ip usage
    • A proper dev or qa install includes several VMs load balanced together.
    • DHCP works but messes up load balancing when renews happen.
    • Hand picking IPs is a bit burdensome.
  • Performance would be nice.

The solution I’ve come up with is to create a tree topology out of haproxyed servers. Basically, one server sits at the top and all port 80 traffic gets forwarded to it from the router. We’ll call it Lancelot. Lancelot’s haproxy rules are configured to check hdr_beg for domains like wiki. and jira. and forward those along to the appropriate servers. Say we have two additional servers, Arthur and Galahad, where we set up virtual environments. Lancelot also has hdr_end lines for arthur.domain.com and galahad.domain.com which forward the requests on to those servers. Galahad has virtual environments purity and sword. Arthur has virtual environments excalibur, lwr (large wooden rabbit), and hhg (holy hand grenade). Galahad’s haproxy is configured with hdr_beg lines for purity. and sword. which forward requests onto VirtualBox private networks. Arthur’s haproxy is configured with hdr_beg lines for excalibur., lwr., and hhg. which forward requests onto VirtualBox private networks.

Set up like the above, a request for excalibur.arthur.domain.com would:

  1. Get sent to Lancelot by the router (port 80 forwarding rule)
  2. Trigger Lancelot’s hdr_end rule for arthur.domain.com and get forwarded to Arthur
  3. Trigger Arthur’s hdr_beg rule for excalibur and get forwarded to a 192.168 address that corresponds to excalibur’s load balancer
  4. The request gets handled and winds its way back through the proxies to your web browser.

This satisfies most of my goals. Domains for things like wiki.domain.com remain the same because haproxy forwards the request directly to its appropriate server. Because installs like excalibur and purity use only private VM networking, they don’t consume IPs from the office as a whole, and I have a static IP to load balance with. Performance could be better but screw it, it isn’t production.
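A stripped-down sketch of what Lancelot’s config might look like under that scheme (the IPs here are placeholders, the wiki. and jira. acls would look just like the ones in the haproxy post below, and Arthur and Galahad each run a similar config one level down for their own virtual environments):

frontend lancelot
    bind *:80
    acl arthur  hdr_end(host) -i arthur.domain.com
    acl galahad hdr_end(host) -i galahad.domain.com
    use_backend arthur_back  if arthur
    use_backend galahad_back if galahad

backend arthur_back
    server arthur1 192.168.0.20:80

backend galahad_back
    server galahad1 192.168.0.21:80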

At work we have one IP address and many web services to run (ticket management, a wiki, a small army of dev and qa installs of our application, etc). Our solution has been to use dns names when the thing is running on the same server (wiki.domain.com, dev1.domain.com, qa1.domain.com, etc) and different ports to send requests to different servers. This works but requires us to remember the magic port number for everything.

Enter haproxy. Haproxy is a layer 4 and layer 7 (the important bit) load balancer. It’s really easy to install, really easy to set up, and allows you to load balance or proxy requests based on domain name or url. To install haproxy all that needed to be done was, “yum install haproxy.” I’m sure you can substitute apt-get or your distro’s package manager of choice just as easily. All of the configuration happens in /etc/haproxy/haproxy.cfg. This file came pre-loaded with global and defaults sections, which I kept, and a frontend and 2 backend sections that I deleted. The basic setup is that you create a frontend section that corresponds to requests, and backend sections that correspond to servers which handle the requests. For example:

frontend domain
    bind *:80
    acl wiki hdr_beg(host) -i wiki.
    acl dev1 hdr_beg(host) -i dev1.
    use_backend wiki_back if wiki
    use_backend dev1_back if dev1

backend wiki_back
    server wiki1 192.168.0.50

backend dev1_back
    server dev1-1 192.168.0.51

The above says that a request for wiki.domain.com (assuming domain.com was being forwarded to the server we set this up on) will be sent to 192.168.0.50. Specifically it’s saying:

  • frontend domain
    • I have a frontend named domain.
  • bind *:80
    • the front end is listening to requests on any ip on port 80
  • acl wiki hdr_beg(host) -i wiki.
    • I’m listening for a hostname beginning (hdr_beg(host)) with wiki (-i wiki.)
  • use_backend wiki_back if wiki
    • If the request is for wiki then I’m going to use some backend named wiki_back
  • backend wiki_back
    • I have some backend named wiki_back
  • server wiki1 192.168.0.50
    • The backend has a server with ip 192.168.0.50 that’s named wiki1

This isn’t really using a lot of the power of haproxy. For one, we could set up load balancing on the backends with a line like, “balance roundrobin”, in a backend block. This is also only looking at the beginning of hostnames. We could set up an acl that listens for domain.com with, “acl domainDefault hdr_end(host) -i domain.com”. We could set up an acl that listens based on the url, such as, “acl logs path_beg /kibana”. And that covers my knowledge of haproxy, but haproxy has many more features, such as the ability to deal with and change headers.
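Pulled together, those extras would look roughly like this (a sketch; the second wiki server, the logs and default backends, and all the IPs are made up for illustration):

frontend domain
    bind *:80
    acl logs          path_beg /kibana
    acl wiki          hdr_beg(host) -i wiki.
    acl domainDefault hdr_end(host) -i domain.com
    use_backend logs_back    if logs
    use_backend wiki_back    if wiki
    use_backend default_back if domainDefault

backend wiki_back
    balance roundrobin
    server wiki1 192.168.0.50:80
    server wiki2 192.168.0.52:80

backend logs_back
    server logs1 192.168.0.53:80

backend default_back
    server web1 192.168.0.54:80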