In this blog post, I’ll take a look at how to use solid state drive (SSD) caching to speed up your spinning drives.
Why bother? First of all, for cost and performance reasons. Every IO operation has a “cost” for the system that generates it, called the “IO penalty.” This penalty depends on the RAID configuration used, which I will discuss further below.
In almost every situation, the performance of a system is limited by its slowest component – the disk.
If you use a slow 5400 rpm disk in a single-disk configuration for a database application (usually 70% read and 30% write IOPS), in a short time you will see CPU utilization start to rise. An “iostat” analysis will show that it is caused by high wait time (a queue builds up because of the slow disk). You could slightly improve the situation by using RAID 0, 1, or 10, or by using faster spinners like 7200 rpm, 10k rpm, or 15k rpm.
As a rule of thumb, the smaller the platter diameter, the smaller the response time; the faster it spins, the smaller the response time; the bigger the cache, the more it helps with the queue.
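The “faster it spins” part of that rule of thumb is easy to sanity-check: average rotational latency is the time for half a revolution, so it follows directly from the spindle speed. A back-of-the-envelope sketch, not a benchmark:

```shell
# Average rotational latency = time for half a revolution:
# (60,000 ms per minute / rpm) / 2 = 30,000 / rpm milliseconds.
for rpm in 5400 7200 10000 15000; do
  awk -v r="$rpm" 'BEGIN { printf "%5d rpm: %.2f ms\n", r, 30000 / r }'
done
```

So a 15k rpm drive shaves rotational latency from ~5.56 ms (5400 rpm) down to 2.00 ms, before seek time is even considered.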
We could move to a full-SSD setup in a RAID configuration, but SSDs are expensive and can handle only a fixed number of writes before “sudden failure.” Another drawback is capacity: most SSDs top out around 1 TB. Of course, 1.8 TB models exist as of this year, but their cost is astronomical; there are also PCI-E models that reach 3.6 TB at an even higher cost, not to mention that every system has a very limited number of PCI-E slots.
This is not a new subject and it has been debated many times; I still consider a hybrid setup a valid option, even though the price of commercial SSDs has dropped to around $400-500/TB (datacenter-grade ones are more expensive), compared to $60-80/TB for a 2.5″ spinning drive.
| 2.5″ spinning drives: pros | Cons |
|---|---|
| Small footprint (higher disk density per 1U) | Less data per drive |
| Lower power consumption | Only 3 to 100 times the baseline IOPS, vs. 200 times (and more) for SSDs |
| Lower access time, less latency | |
| Same 2.5″ format as SSDs | |
| Much lower price* | |
| Higher data density | |
*We will use the example of a storage node, a Dell R720, which can hold 16 (2 x 8) 2.5″ drives in a 2U enclosure:
In the end, for roughly the same capacity, you spend about 4.5 times less on disks and still get more than decent IOPS.
Try scaling this up: for one rack (42U), about $147k cheaper; for 10 racks, about $1.47M cheaper; and so on.
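Those rack figures are straightforward multiplication. A sketch assuming roughly $7,000 saved per 2U node – a figure I am inferring from the $147k/rack claim, since 21 of these 2U nodes fill a 42U rack:

```shell
# Hypothetical per-node saving inferred from $147k per 42U rack:
# 42U / 2U = 21 nodes per rack, so ~$7,000 saved per node.
awk 'BEGIN {
  per_node = 7000
  nodes_per_rack = 42 / 2
  rack = per_node * nodes_per_rack
  printf "1 rack:   $%d\n", rack
  printf "10 racks: $%d\n", rack * 10
}'
```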
Note: Estimated hardware prices are for January 2016 and do not include the storage node Dell R720.
Second note: There are 2U nodes that can handle 24 x 2.5″ (2 TB) or 12 x 3.5″ (8TB) drives and density is even higher:
Personally, I tested three caching solutions: lvm cache (included in the kernel since 2015), bcache, and EnhanceIO (derived from Facebook’s Flashcache).
Easiest to use: lvm cache
Best performance: EnhanceIO
I will concentrate on the high performer: EnhanceIO.
First of all, we are using Linux as the OS, and we need to compile the kernel modules.
How was it tested? Using fio (Flexible I/O Tester).
Scenario used: database model
fio --randrepeat=1 --ioengine=libaio --direct=1 \
--gtod_reduce=1 --name=test --filename=test \
--bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
Note: Everything was tested with “Write Through” as the caching mode; for more performance (but less reliability) use “Write Back”. If you are using PCI-E SSDs for caching, IOPS and throughput are much higher.
ATTENTION! For bcache on CentOS/RHEL 7.x, an unrealistic CPU load is reported by the OS, as if bcache were reserving one CPU core for each cached block device in use.
Make sure you have kernel-devel installed, along with gcc, make, and git.
Also if you are using this on CentOS/RHEL 7.x or a newer version of Ubuntu, you might need to tweak EnhanceIO/Driver/enhanceio/Makefile
git clone https://github.com/stec-inc/EnhanceIO
cd EnhanceIO/Driver/enhanceio && make && make install
cp CLI/eio_cli /sbin/ && chmod 700 /sbin/eio_cli
cp CLI/eio_cli.8 /usr/share/man/man8/
modprobe enhanceio_fifo && modprobe enhanceio_lru && modprobe enhanceio
How much SSD should you dedicate to caching? It depends on the size of your SSD, the number of spinners you would like to cache, and the purpose of the block devices, but I came to the following conclusion: assuming you have already allocated 20 GB of your SSD to the OS, allocate at least 20 GB for each block device you want to cache (if you are using 4 spinners, use 4 x 20 GB of cache).
parted -s /dev/sda mkpart extended 20GB 100GB
parted -s /dev/sda mkpart logical 20GB 40GB
parted -s /dev/sda mkpart logical 40GB 60GB
parted -s /dev/sda mkpart logical 60GB 80GB
parted -s /dev/sda mkpart logical 80GB 100GB
Note: 20 GB of cache for each block device is enough to also use Ceph journaling on the same disks.
eio_cli create -d /dev/sdb -s /dev/sda5 -c enhanceio_01
eio_cli create -d /dev/sdc -s /dev/sda6 -c enhanceio_02
eio_cli create -d /dev/sdd -s /dev/sda7 -c enhanceio_03
eio_cli create -d /dev/sde -s /dev/sda8 -c enhanceio_04
Or, if you would like more detailed statistics, each cache exposes counters under /proc/enhanceio/<cache_name>/.
If you are using RHEL/CentOS 7.x, the only available rpm package is the one I created. This can be downloaded from: http://repo.zeding.ro/repo/zedlabs/centos/7/x86_64/EnhanceIO-1.0-1.el7.centos.x86_64.rpm
The default cache mode is “Write Through,” which is the safest to use. However, if you want to squeeze the last drop out of it, you can switch to “Write Back”:
eio_cli edit -c enhanceio_01 -m wb
More details about cache modes:
If you want to boost IOPS even more, use RAID 0, 1, or 10. The other RAID levels carry too much IO penalty.
For our example, we will use block devices composed of 4 physical devices. Note: the following example is theoretical.
| RAID | Write Penalty | Read Gain | Write Gain | Overhead |
|---|---|---|---|---|
| 0 | 1 | up to 4x | up to 4x | very low |
| 1 | 2 | up to 4x | 1x | low |
| 10 | 2 | up to 4x | up to 2x | low |
| 5 | 4 | up to 3x | 1x | high |
| 6 | 6 | up to 2x | 1x | high |
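The write-penalty column translates directly into effective write IOPS: total raw IOPS divided by the penalty. A sketch for our theoretical four-drive array, assuming a hypothetical ~150 IOPS per spinner:

```shell
# Effective write IOPS = (drives * per-drive IOPS) / write penalty.
# 150 IOPS per drive is an assumed figure for a fast spinner.
awk 'BEGIN {
  drives = 4; iops = 150
  n = split("0 1 10 5 6", level)
  split("1 2 2 4 6", penalty)
  for (i = 1; i <= n; i++)
    printf "RAID %-2s: ~%d effective write IOPS\n", level[i], drives * iops / penalty[i]
}'
```

This is why RAID 5 and 6 hurt so much on a write-heavy workload: four spinners in RAID 0 deliver ~600 effective write IOPS, but the same drives in RAID 6 deliver only ~100.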
The same applies to the block devices used for caching.
Let’s say we want a fast, reliable, and cheap AWS instance with a lot of storage. What will we use?
1 x 20 GB SSD for OS
4 x 20 GB SSD for caching
4 x 1 TB Magnetic for storage
RAID 10 is considered a good balance between reliability and performance gains.
We do not take into account the instance type, but we do take into account the price for EBS.
Amazon EBS General Purpose (SSD) volumes: $0.10 per GB-month of provisioned storage
Amazon EBS Magnetic volumes: $0.05 per GB-month of provisioned storage and $0.05 per 1 million I/O requests
You might say that with full SSDs you could go with 3 x 1 TB in RAID 5, and the situation would be a little different:
| Full SSD RAID 5 | Hybrid |
|---|---|
| $307.20/month | $212.95/month |
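As a back-of-the-envelope check of the storage part of the bill (assuming 1 TB = 1024 GB; the small difference from the hybrid figure above presumably comes down to how the volumes were sized and the per-million-I/O charge on magnetic volumes):

```shell
# EBS monthly storage cost: gp2 SSD $0.10/GB, magnetic $0.05/GB (Jan 2016).
awk 'BEGIN {
  gb = 1024
  full_ssd = 3 * gb * 0.10                          # 3 x 1 TB SSD in RAID 5
  hybrid   = (20 + 4 * 20) * 0.10 + 4 * gb * 0.05   # SSD OS + cache, 4 x 1 TB magnetic
  printf "Full SSD RAID 5: $%.2f/month\n", full_ssd
  printf "Hybrid:          $%.2f/month\n", hybrid
}'
```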
And yes, you are right, it is not a huge difference, but if you scale:
| | Full SSD RAID 5 | Hybrid | Monthly saving |
|---|---|---|---|
| Price per machine | $307.20 | $212.95 | $94.25 |
| Price per 10x | $3,072 | $2,129.50 | $942.50 |
| Price per 100x | $30,720 | $21,295 | $9,425 |
If you do not need constant, extremely high IOPS and want to save on operational costs, you should implement caching. Considering the previous AWS use case, $9,425 per month is a decent amount of savings in operations.
This article has also been published on GitHub, and can be accessed here : https://github.com/vlaza/zedlab_docs/blob/master/ssd_caching.txt