We use Elasticsearch as part of a centralized logging system (Logstash, Elasticsearch, Kibana). Unfortunately, we didn’t give the ES machine much disk space, and it ran out. After freeing up some space and restarting ES, it started writing lots of warnings and stack traces like:
- sending failed shard for
- received shard failed for
- failed to parse
- failed to start shard
- failed to recover shard
The disks were filling up again with error logs and the CPU was pegged. Thankfully I found a forum post (https://groups.google.com/forum/#!topic/elasticsearch/HtgNeUJ5uao). A few posts in, Igor Motov suggests deleting the corrupted translog files. The idea is that because the server ran out of disk space, it never finished writing the translogs, and because those files were incomplete, ES couldn’t read them to bring the indices back to a consistent state. If you delete those files you may lose a few recent operations that had yet to be written into the indices, but at least the indices will work again.
To fix this you need to look in the ES logs (/var/log/elasticsearch/elasticsearch.log on CentOS) and find the error lines above. On those lines you’ll see something like
[<timestamp>][WARN ][cluster.action.shard] [<weird name>] [logstash-2014.05.13][X]
where X is the shard number, likely one of 0, 1, 2, 3 or 4, and the block before it is the index name (logstash-date for me, and for you if you’re doing centralized logging like we are). You then need to go to the index location, /var/lib/elasticsearch/elasticsearch/nodes/0/indices/ on CentOS. In that directory you’ll find the following structure: logstash-date/X/translog/translog-<really big number>. That’s the file you’ll need to delete, so:
- sudo service elasticsearch stop
- sudo rm /var/lib/elasticsearch/elasticsearch/nodes/0/indices/logstash-date/X/translog/translog-<really big number>
- repeat the rm for every index and shard that shows up in the error log
- sudo service elasticsearch start
Watch the logs and repeat that process as needed until the ES logs stop spitting out stack traces.
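If there are a lot of broken shards, picking them out of the log by hand gets tedious. Here is a rough sketch of how the lookup could be scripted. It is not something from the forum post: the function names failed_shards and list_translogs are made up for illustration, and it assumes the log lines look like the example above, with the index name and shard number in adjacent square brackets. It only lists the candidate translog files (a dry run) and deletes nothing.

```shell
#!/usr/bin/env bash
# Dry run: list the translog files for every shard named on a failure line
# in the ES log. Nothing is deleted here.

# CentOS paths from this post; override through the environment if yours differ.
ES_LOG="${ES_LOG:-/var/log/elasticsearch/elasticsearch.log}"
INDEX_ROOT="${INDEX_ROOT:-/var/lib/elasticsearch/elasticsearch/nodes/0/indices}"

# failed_shards (hypothetical helper): print unique "index shard" pairs.
# Assumes the [index][shard] layout shown in the example log line.
failed_shards() {
    grep -E 'sending failed shard for|received shard failed for|failed to parse|failed to start shard|failed to recover shard' "$1" \
        | grep -oE '\[[^][ ]+\]\[[0-9]+\]' \
        | sed -E 's/^\[([^]]+)\]\[([0-9]+)\]$/\1 \2/' \
        | sort -u
}

# list_translogs (hypothetical helper): print the translog files that exist
# for each failed shard. $1 is the log file, $2 is the indices directory.
list_translogs() {
    failed_shards "$1" | while read -r index shard; do
        ls "$2/$index/$shard/translog/"translog-* 2>/dev/null
    done
}

# Only run against the real log if it is there and readable.
if [ -r "$ES_LOG" ]; then
    list_translogs "$ES_LOG" "$INDEX_ROOT"
fi
```

Run it with sudo if the log or data directory is not readable by your user. Look over the list it prints, and with ES stopped, remove only the files you are sure about.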