I don't have anything big to write about, but I've got a lot of small tips, tricks, and bits of info bookmarked in my "To Blog About" list. So, I thought I would get a few of them out of the way.
What I've been doing the past few months is implementing a new SAN and preparing to upgrade all 5 of my VMware ESX 4.1 hosts to version 5.0. Whenever I say that, someone inevitably pipes up with "why don't you just go to the newest 5.1 release?" The reply is that my servers aren't on the HCL for that version. They're not THAT old, though - HP ProLiant DL385 G5s. I'll upgrade in a couple of years when my hosts get replaced as part of my 5-year cycle. Performance-wise, these are doing just fine.
The search for a new SAN has been really frustrating, because there are SO many options. In the end, I just wanted shared iSCSI storage that would meet some space and performance goals (Veeam ONE monitoring helped a great deal here), and that would replicate across a 1Gbps WAN fiber link to our other facility across town. Another consideration was that because I'm no storage expert, we needed something that wasn't too complicated and that was widely used, allowing me to ask questions in forums and get responses from people familiar with that storage platform.
I am no storage specialist, and comparing SAN solutions from different vendors was pretty challenging. I have to wear many hats and really can't specialize in anything in my current role. It was frustrating to learn that there aren't many solid metrics you can use to compare different storage solutions unless you get the hardware in-house and run your own tests. I was going by IOPS and latency numbers until I learned that vendors can publish whatever numbers they want and they can all technically be true; you have to look at how the tests are run (random vs. sequential I/O, data block size, etc.). Here are some really good reads I ran across while making my decision. They also got me up to speed on server storage in general:
Pointing out the IOPS fallacy:
This article outlines the folly of using RAID5 with a hot spare: on today's large drives, the rebuild window is long enough that a second drive failure (or an unrecoverable read error) during the rebuild becomes a real risk. As a result of this article, all of my local storage is RAID10 from now on, as it's the safest for my data.
This TechRepublic article got me up to speed on the different types of drives and their performance differences:
In the end, it came down to EMC vs. Dell. Price and usability were the main concerns. We decided we wanted to fortify our SAN performance with SSDs and auto-tiering, which automatically moves "hot" data blocks up to SSD storage for better performance. At comparable price points, Dell offered over 2 TB of SSD space, while the VNX recommended to us had only 200GB. Another big difference between the two (and this is just my take) is that EMC SANs seem to be designed for use by an actual storage engineer. Sure, EMC will point you to the VNXe line, but we're past that in terms of performance/capacity/options. I want my SAN to be set-it-and-forget-it. In the end, we bought an EqualLogic PS6500ES. It was installed last Friday.
I moved about 10 test VMs onto the storage and looked at performance in SAN HQ, Dell's SAN monitoring software, which is very nice and easy to use. I wish I could dig a little deeper (as far as auto-tiering goes), but it is what it is. What I found were some pretty terrible latency numbers! With my migration less than a week away, I went into panic mode and called my Dell storage reps to find out why my performance was so bad. Here are the IOPS and latency graphs I was seeing (acceptable latency is below 20ms):
The Dell reps talked me down off the ledge: because my SAN wasn't doing anything (note the low IOPS - in production this thing will be humming along at 2,000-3,000), the hard drives were having to fire up and serve my I/O requests from scratch. I fired up a few VMs with IOMeter, which lets you run some pretty neat I/O benchmarking tests, and followed this guide (second paragraph from the end - it's just a quick how-to on IOMeter) to create a boatload of I/O. IOMeter is a really neat app. Not only does it generate I/O, but you can specify the block size, the read/write percentage, and whether requests are random or sequential.
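If you don't have IOMeter handy, the same style of workload can be sketched in a few lines of Python. This is just a toy approximation, not IOMeter: it issues random 64K requests in a 60/40 read/write mix against a scratch file and reports rough IOPS and average latency. (A real IOMeter run hits raw devices with configurable queue depth, which this skips.)

```python
import os
import random
import tempfile
import time

def run_workload(file_size=16 * 1024 * 1024, block=64 * 1024,
                 read_pct=60, ops=2000, seed=1):
    """Random 64K I/O against a scratch file, ~60% reads / ~40% writes."""
    random.seed(seed)
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, os.urandom(file_size), 0)  # pre-fill so reads hit data
        blocks = file_size // block
        latencies = []
        for _ in range(ops):
            offset = random.randrange(blocks) * block  # random, not sequential
            t0 = time.perf_counter()
            if random.randrange(100) < read_pct:       # the 60/40 mix
                os.pread(fd, block, offset)
            else:
                os.pwrite(fd, b"\0" * block, offset)
            latencies.append(time.perf_counter() - t0)
        total = sum(latencies)
        return {"iops": ops / total,
                "avg_latency_ms": total / ops * 1000}
    finally:
        os.close(fd)
        os.remove(path)

print(run_workload())
```

Tweaking `block`, `read_pct`, and the random-vs-sequential offset logic mirrors the knobs IOMeter exposes; numbers against a local file will obviously look nothing like a SAN over iSCSI.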
Sure enough, things changed quite a bit:
At around 6,000 IOPS, I set off warnings that I had saturated my three 1Gb iSCSI links. While you can't really tell because of the scale of the graph, my latency stabilized at around 12ms during heavy read and write activity. In case you're interested, this load was a 60/40 read/write mix with 64K blocks, all random.
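The saturation squares with back-of-the-envelope math (assuming 64 KiB blocks and ignoring iSCSI/TCP overhead, which only makes it worse):

```python
iops = 6000
block_bytes = 64 * 1024  # 64K blocks from the test above

# bytes/s -> bits/s -> gigabits/s
throughput_gbps = iops * block_bytes * 8 / 1e9

print(round(throughput_gbps, 2))  # ~3.15 Gb/s, just over three 1Gb links
```

So roughly 3.15 Gb/s of payload against 3 Gb/s of raw link capacity; no wonder the alarms fired.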
Here's another nice How-To style article on IOMeter.