Solid State Drive Caching to Speed Up Your Spinning Drives

In this blog post, I’ll take a look at how to use solid state drive (SSD) caching to speed up your spinning drives.

1. Introduction

Why would you adopt this hybrid setup?

First of all, for cost and performance reasons. Every IO operation has a “cost” for the system that generates it, called the “IO penalty.” This penalty depends on the RAID configuration used, which will be discussed further later.

In almost every situation, the performance of a system is limited by its slowest component – the disk.

If you use a slow 5400 rpm disk in a single disk configuration for a database application (usually 70% read and 30% write IOPS), in a short time you will see CPU utilization start to rise. After an “iostat” analysis, you will find that it is caused by high I/O wait time (a queue builds up because of the slow disk). You could slightly improve the situation by using RAID 0, 1, or 10, or by using faster spinners like 7200 rpm, 10k rpm, or 15k rpm.
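For reference, a quick way to confirm this on a running system (assuming the sysstat package, which provides iostat, is installed) is:

iostat -x 1 5

Watch the %iowait value and the per-device await/%util columns (named r_await/w_await in newer sysstat versions); a device sitting near 100% utilization with a growing await is the bottleneck.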

As a rule of thumb, the smaller the platter diameter, the smaller the response time; the faster it spins, the smaller the response time; the bigger the cache, the more it helps with the queue.

What do we do if any of the previous tricks are not enough and the applications are becoming more and more demanding in IOPS?

We can turn to a full SSD setup in a RAID configuration, but SSDs are expensive and can handle only a fixed number of writes before “sudden failure.” Another drawback of SSDs is capacity: most are limited to 1 TB. 1.8 TB models do exist as of this year, but their cost is astronomical; there are also PCI-E ones that can reach 3.6 TB at an even higher cost, not to mention that every system has a very limited number of PCI-E slots.

This is not a new subject and it has been debated many times; I still consider a hybrid setup a valid option, even though the price of commercial SSDs has dropped to around $400-500/TB (datacenter-grade drives are more expensive), compared to $60-80/TB for a 2.5″ spinning drive.

Summary of reasons to use Hybrid Setup:

  • Much lower cost of operation than using 100% SSDs
  • More data volume for the same number of disks
  • 3 to 20 times more IOPS than a regular spinning drive when using a 2.5″ SSD as the caching device
  • 3 to 100 times more IOPS when using a PCI-E high performance caching device
  • If the caching device fails, the performance is just degraded and no data is lost

2. Explaining the choices made

Using 2.5″ drives instead of 3.5″ drives

PROS:
  • Small footprint (higher disk density per 1U)
  • Smaller power consumption
  • Smaller access time, less latency
  • Same 2.5″ format as the SSDs
  • Less vibration

CONS:
  • Less data per drive

Using a hybrid model instead of a full SSD setup

PROS:
  • Much lower price*
  • More data density
  • More reliable

CONS:
  • Only 3 to 100 times more IOPS, versus the roughly 200 times more of a full SSD setup

*We will use an example of a storage node, Dell R720, that can hold 2 x 8 2.5″ drives in a 2U enclosure:

  • Full enterprise SSD setup: $8800/2U/16TB
  • Full consumer SSD setup: $6400/2U/16TB
  • Hybrid setup v.1: $3600/2U/32TB (using 2 x PCI-E 400 GB SSDs for caching and 16 x 2 TB WD Blue spinning drives)
  • Hybrid setup v.2: $2800/2U/28TB (using 2 x 256GB SSDs for caching and 14 x 2TB WD Blue spinning drives)
  • Hybrid setup v.3: $1800/2U/14TB (using 2 x 256GB SSDs for caching and 14 x 1TB WD Red spinning drives)

In the end, for roughly the same capacity, you end up spending about 4.5 times less on disks while still getting more than decent IOPS.

Try scaling this up to 1 rack (42U) and you save roughly $147k (21 x 2U nodes x $7,000 saved per node); at 10 racks that becomes roughly $1.47M, and so on.

Note: Estimated hardware prices are for January 2016 and do not include the storage node Dell R720.

Second note: there are 2U nodes that can handle 24 x 2.5″ (2 TB) or 12 x 3.5″ (8 TB) drives, so the density is even higher:

  • Hybrid setup v.1: $4200/2U/48TB (using 3 x PCI-E 400 GB SSDs for caching and 24 x 2 TB WD Blue spinning drives)
  • Hybrid setup v.2: $4800/2U/96TB (using 2 x PCI-E 400 GB SSDs for caching and 12 x 8 TB Seagate Archive drives)

3. Comparison of different tested hybrid solutions for Linux

Personally, I tested 3 caching solutions: lvm cache (included in the kernel since 2015), bcache, and EnhanceIO (derived from Facebook’s Flashcache).

Easiest to use: lvm cache

Best performance: EnhanceIO

I will concentrate on the best performer, EnhanceIO, a somewhat older project derived from Facebook’s Flashcache.

First of all, we are using Linux as the OS, so we need to compile the kernel modules.

How was it tested? Using fio (Flexible I/O Tester).

Scenario used: database model

fio --randrepeat=1 --ioengine=libaio --direct=1 \
    --gtod_reduce=1 --name=test --filename=test \
    --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75

             SSD     Spinner   lvm cache   bcache   EnhanceIO
Read IOPS    9773    143       287         N/A      2180
Write IOPS   2935    45        155         N/A      737

Note: Everything was tested with “Write Through” as the caching mode; for more performance (but less reliability), use “Write Back”. If you are using PCI-E SSDs for caching, IOPS and throughput are much higher.

ATTENTION! For bcache on CentOS/RHEL 7.x, an unrealistic CPU load is reported by the OS, as if bcache were reserving one CPU core for each cached block device in use.

4. Installing

Prerequisites

Make sure you have kernel-devel installed, along with gcc, make, and git.
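On CentOS/RHEL this could look like the following (package names are the usual ones; on Ubuntu the equivalents are build-essential, git, and linux-headers-$(uname -r)):

yum install -y kernel-devel-$(uname -r) gcc make git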

Also, if you are using this on CentOS/RHEL 7.x or a newer version of Ubuntu, you might need to tweak EnhanceIO/Driver/enhanceio/Makefile.

Get the sources

git clone https://github.com/stec-inc/EnhanceIO

Compile everything and install

cd EnhanceIO/Driver/enhanceio && make && make install

Copy eio_cli to /sbin/

cp CLI/eio_cli /sbin/ && chmod 700 /sbin/eio_cli

Copy documentation to man database

cp CLI/eio_cli.8 /usr/share/man/man8/

Load modules

modprobe enhanceio_fifo && modprobe enhanceio_lru && modprobe enhanceio
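The modprobe calls above do not survive a reboot. On a systemd-based distribution (CentOS/RHEL 7.x, recent Ubuntu) one way to load the modules automatically at boot is a modules-load.d drop-in; a minimal sketch, assuming the standard /etc/modules-load.d path:

cat > /etc/modules-load.d/enhanceio.conf <<EOF
enhanceio_fifo
enhanceio_lru
enhanceio
EOF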

Partition the SSD used for caching

Depending on the size of your SSD, the number of spinners you would like to cache, and the purpose of the block devices, I came to the following conclusion: assuming you have already allocated 20 GB of your SSD to the OS, reserve at least 20 GB of cache for each block device you want to cache (if you are using 4 spinners, use 4 x 20 GB for cache). Assuming an msdos partition table, create one extended partition and then the logical partitions that will become /dev/sda5 through /dev/sda8:

parted -s /dev/sda mkpart extended 20GB 100GB
parted -s /dev/sda mkpart logical 20GB 40GB
parted -s /dev/sda mkpart logical 40GB 60GB
parted -s /dev/sda mkpart logical 60GB 80GB
parted -s /dev/sda mkpart logical 80GB 100GB
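To double-check the resulting layout before creating the caches, you can print the partition table (the exact partition numbers may differ on your system):

parted -s /dev/sda unit GB print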

Note: 20 GB of cache for each block device is enough to also use Ceph journaling on the same disks.

Create the cached block devices

eio_cli create -d /dev/sdb -s /dev/sda5 -c enhanceio_01

eio_cli create -d /dev/sdc -s /dev/sda6 -c enhanceio_02

eio_cli create -d /dev/sdd -s /dev/sda7 -c enhanceio_03

eio_cli create -d /dev/sde -s /dev/sda8 -c enhanceio_04
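EnhanceIO is transparent, so the spinners keep their original device names and you use them exactly as before. A minimal sketch of putting one of them into service (filesystem type and mount point are only examples):

mkfs.xfs /dev/sdb
mkdir -p /data01 && mount /dev/sdb /data01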

Getting information about caches

eio_cli info

Or if you would like more detailed statistics:

cat /proc/enhanceio/enhanceio_0{1..4}/stats
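If you want to keep an eye on the hit ratio while a workload is running, something as simple as watch will do (here refreshing the first cache’s statistics every 5 seconds):

watch -n 5 cat /proc/enhanceio/enhanceio_01/stats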

If you are using RHEL/CentOS 7.x, the only available rpm package is the one I created. This can be downloaded from: http://repo.zeding.ro/repo/zedlabs/centos/7/x86_64/EnhanceIO-1.0-1.el7.centos.x86_64.rpm

Tweaking cache mode

The default cache mode is “Write Through,” which is the safest to use. However, if you want to squeeze the last drop of performance out of it, you can switch to “Write Back”:

eio_cli edit -c enhanceio_01 -m wb

More details about cache modes:

  • Write-through: This is best in use cases where data is written and then re-read frequently from the cache, resulting in low latency for those reads. Write operations are directed to the cache and through to the final storage, which in our case is the spinning disks. This is the safest option because the write is not acknowledged to the host until the data is fully written to permanent storage.
  • Write-around: This is similar to write-through in that the write is only acknowledged once it reaches permanent storage, but the data is written directly to that storage without being cached. It avoids flooding the cache with writes that will not necessarily be re-read, but it has other disadvantages, like a “cache miss” on any recent write: the data then has to be read from the slower disk, which results in higher latency.
  • Write-back: This is the high performer, where the write is directed to the cache and immediately acknowledged to the host. It is not the safest method, because a cache failure can result in data loss, but it is used where very low latency and high throughput are needed.
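If you later want to return to the safer default, for example before doing maintenance on the caching SSD, the same edit command can be used; a sketch assuming “wt” is the write-through mode code accepted by eio_cli:

eio_cli edit -c enhanceio_01 -m wt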

5. Other methods that can be used in parallel to increase IOPS even more

RAID

If you want to boost IOPS even more, use RAID 0, 1, or 10. The other RAID levels have too much IO penalty.

For our example, we will use block devices composed of 4 physical devices (see the mdadm sketch after the table). Note: the following example is theoretical.

RAID   Write Penalty   Read Gain   Write Gain   Overhead
0      1               up to 4x    up to 4x     very low
1      2               up to 4x    1x           low
10     2               up to 4x    up to 2x     low
5      4               up to 3x    1x           high
6      6               up to 2x    1x           high

This can also be done on the block devices used for caching.
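As an illustration of the RAID 10 row above, a software RAID 10 over the four data disks from the earlier examples could be assembled with mdadm; a minimal sketch (the device names are the ones used above and purely illustrative, and whether you build the array from the raw spinners and then cache /dev/md0, or stripe the already-cached devices, is a design choice):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde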

6. Use Cases

Let’s say we want a fast, reliable, and cheap AWS instance with a lot of storage. What will we use?

1 x 20 GB SSD for OS

4 x 20 GB SSD for caching

4 x 1 TB Magnetic for storage

RAID 10 is chosen as a balance between reliability and performance gains.

We do not take into account the instance type, but we do take into account the price for EBS.

Amazon EBS General Purpose (SSD) volumes: $0.10 per GB-month of provisioned storage

Amazon EBS Magnetic volumes: $0.05 per GB-month of provisioned storage and $0.05 per 1 million I/O requests

           Full SSD   Hybrid
Price      $409.60    $212.95

You might say that with a full SSD, you can go with 3 x 1 TB in RAID 5 and the situation would be a little different:

           Full SSD RAID 5   Hybrid
Price      $307.20           $212.95

And yes, you are right, it is not a huge difference, but if you scale:

                    Full SSD RAID 5   Hybrid      Monthly saving
Price per machine   $307.20           $212.95     $94.25
Price per 10x       $3072             $2129.50    $942.50
Price per 100x      $30720            $21295      $9425

7. Conclusions

If you do not need constant, extremely high IOPS and want to save on operational costs, you should implement caching. Considering the previous AWS use case, $9425 a month (at 100 machines) is a decent amount of savings in operations.

This article has also been published on GitHub, and can be accessed here : https://github.com/vlaza/zedlab_docs/blob/master/ssd_caching.txt

Victor Laza

Senior DevOps Engineer

Victor Laza is a Senior DevOps Engineer for 3Pillar Global in the Timisoara offices. He has over 8 years of experience providing support and commitment for various projects including architecting, design, planning, management, and maintenance. He is skilled in RedHat, CentOS, RedHat Cluster Suite, Ubuntu, MaaS, Cloud Computing, OpenStack, CloudStack, Docker, Cisco, MySQL, Scrum, agile, VMWare, and others.
