April 13, 2016
Solid State Drive Caching to Speed Up Your Spinning Drives
In this blog post, I'll take a look at how to use solid state drive (SSD) caching to speed up your spinning drives.
Why would you adopt this hybrid setup?
First of all, for cost and performance reasons. Every IO operation has a "cost" for the system that generates it, called the "IO penalty." This penalty depends on the RAID configuration used, which is discussed further below.
In almost every situation, the performance of a system is limited by its slowest component, which is usually the disk.
If you use a slow 5400 rpm disk in a single disk configuration for a database application (usually 70% read and 30% write IOPS), you will soon see CPU utilization start to rise. An "iostat" analysis will show that this is caused by high wait time (requests are queuing up behind the slow disk). You could slightly improve the situation by using RAID 0, 1, or 10, or by using faster spinners like 7200 rpm, 10k rpm, or 15k rpm.
As a rule of thumb, the smaller the platter diameter, the smaller the response time; the faster it spins, the smaller the response time; the bigger the cache, the more it helps with the queue.
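To put rough numbers on the "faster it spins" part of this rule, here is a minimal sketch. It assumes the usual approximation that average rotational latency is half a revolution. The helper names `rot_latency_ms` and `est_iops` are made up for this illustration, and the seek time passed to `est_iops` is an assumed input, not a measured value.

```shell
#!/bin/sh
# Average rotational latency in milliseconds: half a revolution at the given rpm.
rot_latency_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.2f\n", 0.5 * 60000 / rpm }'
}

# Rough random IOPS ceiling for a single spindle:
# 1000 ms divided by (average seek time + average rotational latency).
# The seek time (ms) is an assumed, illustrative input.
est_iops() {
    awk -v rpm="$1" -v seek="$2" \
        'BEGIN { printf "%.0f\n", 1000 / (seek + 0.5 * 60000 / rpm) }'
}

for rpm in 5400 7200 10000 15000; do
    echo "$rpm rpm: $(rot_latency_ms "$rpm") ms average rotational latency"
done
```

Calling `est_iops 7200 8.5`, for example, gives a ballpark figure for a 7200 rpm drive with an assumed 8.5 ms average seek time; real drives will differ.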
What do we do if none of the previous tricks is enough and the applications are becoming more and more demanding in IOPS?
We can turn to a full SSD setup in a RAID configuration, but SSDs are expensive and can handle only a finite number of writes before "sudden failure." Another drawback of SSDs is capacity: most are limited to 1 TB. Of course, 1.8 TB ones exist as of this year, but their cost is astronomical; there are also PCI-E ones that can reach 3.6 TB at an even higher cost, not to mention that every system has a very limited number of PCI-E slots.
This is not a new subject and it has been debated many times; I still consider a hybrid setup a valid option, even though the price of consumer-grade SSDs has dropped to around $400-500/TB (datacenter-grade ones are more expensive), compared to $60-80/TB for a 2.5" spinning drive.
Summary of reasons to use Hybrid Setup:
- Much lower cost of operation than using 100% SSDs
- More data volume for the same number of disks
- 3 to 20 times more IOPS than a regular spinning drive when a 2.5" SSD is used as the caching device
- 3 to 100 times more IOPS when using a PCI-E high performance caching device
- If the caching device fails, performance is only degraded and no data is lost (with the default "Write Through" cache mode)
2. Explaining the choices made
Using 2.5" drives instead of 3.5" drives
|Advantages||Disadvantages|
|Small footprint (higher disk density/1U)||Less data/drive|
|Smaller power consumption|
|Smaller access time, less latency|
|Using the same 2.5" format as SSDs|
Using hybrid model and not full SSD setup
|Advantages||Disadvantages|
|Much lower price*||Only 3 to 100 times more IOPS vs. 200 times more with full SSD|
|More data density|
*As an example, take a Dell R720 storage node, which can hold 2 x 8 2.5" drives in a 2U enclosure:
- Full enterprise SSD setup: $8800/2U/16TB
- Full consumer SSD setup: $6400/2U/16TB
- Hybrid setup v.1: $3600/2U/32TB (using 2 x PCI-E 400 GB SSDs for caching and 16 x 2 TB WD Blue spinning drives)
- Hybrid setup v.2: $2800/2U/28TB (using 2 x 256GB SSDs for caching and 14 x 2TB WD Blue spinning drives)
- Hybrid setup v.3: $1800/2U/14TB (using 2 x 256GB SSDs for caching and 14 x 1TB WD Red spinning drives)
In the end, for roughly the same capacity, the disks cost about 4.5 times less and you still get more than decent IOPS.
Try scaling this up: 1 rack (42U) is $147k cheaper, 10 racks (42U) are $1.47m cheaper, and so on.
Note: Estimated hardware prices are for January 2016 and do not include the storage node Dell R720.
Second note: There are 2U nodes that can handle 24 x 2.5" (2 TB) or 12 x 3.5" (8TB) drives and density is even higher:
- Hybrid setup v.1: $4200/2U/48TB (using 3 x PCI-E 400 GB SSDs for caching and 24 x 2 TB WD Blue spinning drives)
- Hybrid setup v.2: $4800/2U/96TB (using 2 x PCI-E 400 GB SSDs for caching and 12 x 8 TB Seagate Archive drives)
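The rack-level savings quoted earlier can be reproduced with a few lines of integer arithmetic. This sketch assumes the comparison is between the full enterprise SSD setup and hybrid setup v.3 of the 16-bay node, and that all 42U of a rack are filled with 2U nodes. That pairing is an assumption, chosen because it matches the $147k figure.

```shell
#!/bin/sh
# Disk prices per 2U node, taken from the 16-bay price list (node itself not included).
ENTERPRISE_SSD=8800   # full enterprise SSD setup, 16 TB
HYBRID_V3=1800        # hybrid setup v.3, 14 TB

# Number of 2U nodes in a 42U rack, assuming every U is usable for storage nodes.
NODES_PER_RACK=$((42 / 2))

saving_per_node=$((ENTERPRISE_SSD - HYBRID_V3))
saving_per_rack=$((saving_per_node * NODES_PER_RACK))
saving_per_10_racks=$((saving_per_rack * 10))

echo "Saving per 2U node:  \$$saving_per_node"
echo "Saving per rack:     \$$saving_per_rack"
echo "Saving per 10 racks: \$$saving_per_10_racks"
```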
3. Comparison of different tested hybrid solutions for Linux
Personally, I tested 3 caching solutions: lvm cache (included in kernel since 2015), bcache, and EnhanceIO (derived from Facebook's Flashcache).
Easiest to use: lvm cache
Best performance: EnhanceIO
I will concentrate on the top performer, EnhanceIO, which is a relatively old project derived from Facebook's Flashcache.
First of all, we are using Linux as the OS, so we need to compile the kernel modules.
How was it tested? Using fio (Flexible I/O Tester).
Scenario used: database model
fio --randrepeat=1 --ioengine=libaio --direct=1 \
--gtod_reduce=1 --name=test --filename=test \
--bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
Note: Everything was tested with "Write Through" as the caching mode; for more performance (at the cost of reliability), use "Write Back." If you are using PCI-E SSDs for caching, IOPS and throughput are much higher.
ATTENTION! For bcache on CentOS/RHEL 7.x, the OS reports an unrealistic CPU load, as if bcache were reserving one CPU core for each cached block device in use.
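If you plan to repeat the test on several machines, the same fio options can be kept in a job file instead of a long command line. The sketch below only writes the file; the name `db-randrw.fio` is an arbitrary choice, and the section name `[test]` stands in for `--name=test`.

```shell
#!/bin/sh
# Write a fio job file that mirrors the command-line options used for the test.
# The file name is arbitrary; "[test]" plays the role of --name=test.
cat > db-randrw.fio <<'EOF'
[test]
randrepeat=1
ioengine=libaio
direct=1
gtod_reduce=1
filename=test
bs=4k
iodepth=64
size=1G
readwrite=randrw
rwmixread=75
EOF

echo "Job file written to db-randrw.fio"
```

Running `fio db-randrw.fio` from the directory you want to test should then behave like the command line, creating the same 1G test file.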
Make sure you have kernel-devel installed, along with gcc, make, and git.
Also, if you are using this on CentOS/RHEL 7.x or a newer version of Ubuntu, you might need to tweak EnhanceIO/Driver/enhanceio/Makefile.
Get the sources
git clone https://github.com/stec-inc/EnhanceIO
Compile everything and install
cd EnhanceIO/Driver/enhanceio && make && make install
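Copy eio_cli to /sbin/ (the CLI directory sits two levels above Driver/enhanceio, where the previous step left you)
cp ../../CLI/eio_cli /sbin/ && chmod 700 /sbin/eio_cli
Copy documentation to man database
cp ../../CLI/eio_cli.8 /usr/share/man/man8/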
Load the kernel modules
modprobe enhanceio_fifo && modprobe enhanceio_lru && modprobe enhanceio
Partition the SSD used for caching
How you partition depends on the size of your SSD, the number of spinners you would like to cache, and the purpose of the block devices. I came to the following conclusion, assuming you have already allocated 20GB of your SSD to the OS: reserve at least 20GB for each block device you want to cache (if you are using 4 spinners, use 4 x 20GB for cache).
parted -s /dev/sda mkpart extended 20GB 100GB
parted -s /dev/sda mkpart logical 20GB 40GB
parted -s /dev/sda mkpart logical 40GB 60GB
parted -s /dev/sda mkpart logical 60GB 80GB
parted -s /dev/sda mkpart logical 80GB 100GB
Note: 20 GB of cache for each block device is enough to also use Ceph journaling on the same disks.
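For a different number of spinners or a different cache size, the partition boundaries can be derived rather than typed by hand. The sketch below only prints the parted commands so you can review them before running anything. `gen_cache_partitions` is a hypothetical helper written for this post, and it assumes an msdos partition table, where the cache partitions are logical partitions inside one extended partition.

```shell
#!/bin/sh
# Print (do not run) the parted commands for equally sized cache partitions.
# Arguments: SSD device, GB already used by the OS, GB of cache per spinner,
# number of spinners to cache.
gen_cache_partitions() {
    disk=$1
    os_gb=$2
    cache_gb=$3
    count=$4

    # One extended partition spanning all cache space after the OS.
    end=$((os_gb + cache_gb * count))
    echo "parted -s $disk mkpart extended ${os_gb}GB ${end}GB"

    # One logical partition per spinner inside the extended partition.
    start=$os_gb
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "parted -s $disk mkpart logical ${start}GB $((start + cache_gb))GB"
        start=$((start + cache_gb))
        i=$((i + 1))
    done
}

# The layout used in this post: 20GB for the OS, 4 spinners, 20GB of cache each.
gen_cache_partitions /dev/sda 20 20 4
```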
Create the cached block devices
eio_cli create -d /dev/sdb -s /dev/sda5 -c enhanceio_01
eio_cli create -d /dev/sdc -s /dev/sda6 -c enhanceio_02
eio_cli create -d /dev/sdd -s /dev/sda7 -c enhanceio_03
eio_cli create -d /dev/sde -s /dev/sda8 -c enhanceio_04
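With more than a handful of spinners, the same four commands are easier to generate in a loop. This is a dry-run sketch: it prints the `eio_cli create` lines instead of executing them. `gen_eio_create` is a hypothetical helper, and it assumes the cache partitions are numbered consecutively starting at the given partition number, as in the layout used here.

```shell
#!/bin/sh
# Print one "eio_cli create" command per spinner.
# Arguments: SSD device, number of the first cache partition, then the spinners.
gen_eio_create() {
    ssd=$1
    first_part=$2
    shift 2

    n=0
    for hdd in "$@"; do
        n=$((n + 1))
        part=$((first_part + n - 1))
        # Cache names follow the enhanceio_NN pattern used in this post.
        printf 'eio_cli create -d %s -s %s%d -c enhanceio_%02d\n' \
            "$hdd" "$ssd" "$part" "$n"
    done
}

gen_eio_create /dev/sda 5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Once the printed commands look right for your devices, they can be run as root.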
Getting information about caches
Or if you would like more detailed statistics:
If you are using RHEL/CentOS 7.x, the only available rpm package is the one I created. This can be downloaded from: http://repo.zeding.ro/repo/zedlabs/centos/7/x86_64/EnhanceIO-1.0-1.el7.centos.x86_64.rpm
Tweaking cache mode
The default cache mode is "Write Through," which is the safest to use. However, if you want to squeeze the last drop of performance out of it, you can switch to "Write Back" with:
eio_cli edit -c enhanceio_01 -m wb
More details about cache modes:
- Write-through: This is best in use cases where the data is written and then frequently re-read from the cache, resulting in low read latency. IO writes go into the cache and through to the final storage, which in our case is the spinning disks. This is the safest option because the write is not confirmed to the host until the data is fully written to permanent storage.
- Write-around: This is similar to write-through, except that data is written directly to permanent storage, bypassing the cache. It avoids flooding the cache with writes that will not necessarily be re-read, but it has its own disadvantage: any recently written data is a "cache miss" and has to be read from the slower disk, which results in higher latency.
- Write-back: This is the high performer, where the IO write is directed to the cache and acknowledged to the host right away, before it reaches permanent storage. This is not the safest method because any cache failure might result in data loss, but it delivers very low latency and high throughput.
5. Other methods that can be used in parallel to increase IOPS even more
If you want to boost the IOPS even more, use RAID 0, 1, or 10. The other RAID levels carry too high an IO penalty.
For our example, we will use block devices composed of 4 physical devices. Note: the following example is theoretical.
|RAID||Write Penalty||Read Gain||Write Gain||Overhead|
|0||1||up to 4x||up to 4x||very low|
|1||2||up to 4x||1x||low|
|10||2||up to 4x||up to 2x||low|
|5||4||up to 3x||1x||high|
|6||6||up to 2x||1x||high|
This is done in memory block devices used for caching.
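To see what the write penalty column means in practice, here is a minimal sketch using the common rule-of-thumb formula: functional IOPS = raw IOPS x read share + raw IOPS x write share / write penalty. The 100 IOPS per disk is an assumed round number, and the 70/30 read/write mix is the database profile mentioned at the start of this post. The formula ignores the read gain column, so treat the output as a rough comparison only.

```shell
#!/bin/sh
# Functional IOPS of an array under a mixed workload.
# Arguments: number of disks, IOPS per disk, read percentage, write penalty.
functional_iops() {
    awk -v disks="$1" -v per_disk="$2" -v read_pct="$3" -v penalty="$4" 'BEGIN {
        raw = disks * per_disk
        write_pct = 100 - read_pct
        # Reads cost one IO each; every write costs "penalty" IOs.
        printf "%.0f\n", raw * read_pct / 100 + raw * write_pct / 100 / penalty
    }'
}

# RAID level and write penalty pairs, taken from the table.
for entry in 0:1 1:2 10:2 5:4 6:6; do
    level=${entry%%:*}
    penalty=${entry##*:}
    # 4 disks, an assumed 100 IOPS per disk, 70% reads.
    echo "RAID $level: $(functional_iops 4 100 70 "$penalty") functional IOPS"
done
```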
6. Use Cases
Let's say we want a fast, reliable, and cheap AWS instance with a lot of storage. What will we use?
1 x 20 GB SSD for OS
4 x 20 GB SSD for caching
4 x 1 TB Magnetic for storage
RAID 10 is considered a good balance between reliability and performance gains.
We do not take into account the instance type, but we do take into account the price for EBS.
Amazon EBS General Purpose (SSD) volumes: $0.10 per GB-month of provisioned storage
Amazon EBS Magnetic volumes: $0.05 per GB-month of provisioned storage and $0.05 per 1 million I/O requests
You might say that with a full SSD setup you could go with 3 x 1 TB in RAID 5 and the situation would be a little different:
|Full SSD RAID 5||Hybrid|
And yes, you are right, it is not a huge difference, but if you scale:
|Full SSD RAID 5||Hybrid||Monthly saving|
|Price per machine||$307.20||$212.95||$94.25|
|Price per 10x||$3072||$2129.50||$942.50|
|Price per 100x||$30720||$21295||$9425|
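The scaling in the table is plain multiplication, which makes it easy to redo for your own fleet size. This sketch takes the two per-machine monthly prices from the table as given and does not try to recompute them from the EBS rates. `monthly_saving` is a hypothetical helper name.

```shell
#!/bin/sh
# Monthly EBS price per machine, copied from the table.
FULL_SSD=307.20
HYBRID=212.95

# Monthly saving for a given number of machines.
monthly_saving() {
    awk -v ssd="$FULL_SSD" -v hybrid="$HYBRID" -v n="$1" \
        'BEGIN { printf "%.2f\n", (ssd - hybrid) * n }'
}

for n in 1 10 100; do
    echo "$n machine(s): \$$(monthly_saving "$n") saved per month"
done
```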
If you do not need constant, extremely high IOPS and want to save on operational costs, you should implement caching. In the previous AWS use case, $9425 per month across 100 machines is a decent amount of savings in operations.
This article has also been published on GitHub, and can be accessed here: https://github.com/vlaza/zedlab_docs/blob/master/ssd_caching.txt