Posted: Nov. 22, 2009
Posted by: burnside
We had a pretty bad wind & rain storm here last night (in Portland).
The power flickered on and off several times. One of
the flickers/surges seems to have caused issues with
the enclosure for many of the HOME3 drives, causing
enough drives to be dropped from the array that
performance has been cut to less than half normal.
We have been working through the night, monitoring the
rebuild and doing what we can to optimize things, but
the file server is having a very difficult time
rebuilding while under load. Our initial expectation
was for things to be back by 8:00 AM PST, but as the
morning began, the load on the file server rose
sharply. Our current ETA (to 100% file server
performance) is 5:00 PM PST.
As a result of this incident, we'll be engineering a
solution to keep this from happening again. Our gut
feeling is that a transition over to SSDs for the
array is probably going to be our path of choice, but
we haven't ruled out just doubling the array's drive
count. We'll keep everyone posted as things develop.
And then this morning...
Posted: Nov. 23, 2009
Posted by: burnside
The drive resync finished last night around 5:00 PM as
expected, but performance was still horrible. As of
9:00 AM PST this morning, we believe we have things
back to normal on the file server. Some web servers
have already been rebooted to help them catch up;
others are in progress.
We spent last night trying to improve things with
various performance tweaks to the web servers, the
file server software, and the file server RAID
controllers. Nearly everything we tried
improved things, but it wasn't enough.
Finally, after much searching, it was determined that
we could get a ~20% performance improvement by undoing
last week's kernel upgrade. We're not sure why, but
according to what we read, the 2.6.31 kernel is not as
good as the 2.6.29 kernel for NFS services.
After reverting the kernel, and rebooting several web
servers (to help them catch up) the performance
difference has proven to be enough to keep up for now.
As previously mentioned, we will be looking into
improving performance further on this file server via
hardware improvements, either by converting it over to
SSDs or by doubling the drive count. We're not sure
yet, but such an improvement should get us to where we
can eventually go back to the 2.6.31 kernel, which we
really need to be running for various reasons.
We appreciate everyone's patience and understanding
through this issue. As always, we will be doing
everything we can to keep this issue from cropping up
again.
Sorry to everyone for the downtime. We're going to see what we do going forward. Any suggestions about hosting, etc., are welcome.