2 Jan 2014

A Better RAID

A More Flexible RAID

RAID has many limitations:

  • You need to choose a tradeoff between usable disk space, redundancy and performance ahead of time. Once you make your choice, it’s baked in, and you can’t change it without deleting the entire array.
  • RAID doesn’t take advantage of free space to improve redundancy. Free space should be used opportunistically to increase the replication level.
  • Losing too many disks in a RAID array means losing all your data. This should not be the case — as long as you have at least one working disk, you should have that disk’s worth of data. For example, if you have a RAID-0 array consisting of two 4TB disks, and one fails, you should still have 4TB of data, in an ideal world. As another example, RAID-5 lets you lose one disk without losing data. But if you have a RAID-5 array consisting of three 4TB disks, and two fail, you should still have 4TB of data.
  • You can’t create a RAID volume out of existing disks, with data on them — you need to format them.
  • If your RAID device, like a NAS device, fails, you can’t remove your disk, plug it into another computer, and copy the data out. You need to rush and buy another RAID device, maybe even from the same vendor.
  • You can’t use disks of different sizes with RAID.

I think most of these flaws come from the fact that RAID is implemented below the filesystem layer.

One solution is ZFS, which integrates volume management into the filesystem. But combining what could be two separate layers is not ideal. For one, you may want to use a different filesystem for your disk array. ZFS doesn't ship with OS X, Windows or Linux. Even if it did, you may want to use a different filesystem for whatever reason: features, performance for your use case, experience with that filesystem in production, and so on.

Instead, we could run a separate instance of the filesystem on each disk, keeping track of the files on that disk, with a thin layer on top unifying the constituent filesystems into one [1]. Files would be replicated between these disks. Note that the unit of replication is the file, not the block or a range of blocks. Rather than setting a fixed replication level, you'd leave it to the system to take advantage of free space to opportunistically replicate files [2]. These opportunistic replicas would be deleted as needed, when the filesystem runs low on space.

No matter the nominal replication level, the system would replicate files if there's free space. Imagine creating a RAID-0-like pool out of two 1TB hard disks. You'd have 2TB of usable space. And if you fill up the 2TB, you'd have no replication. But if you have only 1TB of data, you should have full redundancy — a disk failure should not result in any data loss. And as you keep adding data, the redundancy should go down. If you have 1.5TB of data on the 2TB pool, 0.5TB of that should be mirrored. Again, when you reach 2TB of data, there will be no redundancy.
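To make this arithmetic concrete, here's a toy model (my own sketch, assuming two equal-sized disks and whole-file mirroring) of how redundancy degrades as the pool fills:

```python
def mirrored_fraction(data_tb, disk_tb=1.0, disks=2):
    """Fraction of stored data that can be kept in two copies,
    given opportunistic use of free space for mirrors."""
    capacity = disk_tb * disks
    if data_tb > capacity:
        raise ValueError("pool is full")
    free = capacity - data_tb
    # Every mirrored TB consumes 1 TB of free space for its second copy.
    mirrored = min(data_tb, free)
    return mirrored / data_tb if data_tb else 1.0

print(mirrored_fraction(1.0))   # 1.0 -> fully redundant
print(mirrored_fraction(1.5))   # ~0.33 -> only 0.5 TB of the 1.5 TB mirrored
print(mirrored_fraction(2.0))   # 0.0 -> no redundancy left
```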

Actually, we should use Reed-Solomon encoding rather than naively replicating the files. Reed-Solomon gives you the same reliability with less disk space used. That is, if you have a 1GB file, and you want to guarantee that N disk failures don't result in data loss, with replication, you'd have to use (N + 1) GB. Reed-Solomon will give you protection against N disk failures with a disk space usage of less than (N + 1) GB. But, just to keep this post simple and easy to understand, I'll continue to frame this discussion in terms of replication. Please keep in mind that it needn't be just a byte-for-byte copy. So, to get back to the discussion:
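To see the savings concretely, here's a back-of-the-envelope comparison (a sketch; the shard counts are illustrative, not prescriptive):

```python
def replication_cost(file_gb, n_failures):
    # Surviving n lost disks with plain replication needs n+1 full copies.
    return file_gb * (n_failures + 1)

def reed_solomon_cost(file_gb, k_data, n_parity):
    # k data shards plus n parity shards; any k of the k+n shards
    # suffice to reconstruct the file, so n disks can fail.
    return file_gb * (k_data + n_parity) / k_data

# A 1 GB file that must survive 2 disk failures:
print(replication_cost(1.0, 2))      # 3.0 GB with replication
print(reed_solomon_cost(1.0, 4, 2))  # 1.5 GB with 4 data + 2 parity shards
```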

When I say that 0.5TB of data should be mirrored, I really mean that 0.5TB of FILES should be mirrored. Mirroring only part of a file doesn’t help: rather than losing 50% of the data of two files, you’d much rather lose one file entirely and preserve the other [3] [4]. So, we shouldn’t store just part of a file on a given disk [5], as is the case with striping (RAID-0).

When I said that if you have two 1TB hard disks in a pool, with only 1TB of data stored on it, a single disk failure should result in no data loss, it follows that if you have three disks, with the same 1TB of data stored on the pool, then losing two disks should not result in data loss, either.

At the other extreme, when you have two 1TB disks and 2TB of data, there’s no redundancy. But this is still better than RAID-0: if one disk crashes, only the files that happen to be stored on that disk are lost, as opposed to RAID-0’s striping, where every file is stored partially on every disk, so losing any ONE disk dooms ALL your files [6].

This requires that the filesystem data structures needed to interpret the contents on a disk be stored on that disk itself. Otherwise, when you lose a disk, you’ll no longer be able to make sense of the data on the OTHER disk. This is why we need separate filesystems, with a meta-filesystem or a replication layer on top, as opposed to the traditional RAID setup of running a single filesystem and doing the aggregation at the block device layer.

The individual filesystems used on each disk could just be whatever filesystem you already use: HFS+ or NTFS [7]. There’s probably no need for a new filesystem, any more than RAID requires a new filesystem. If the sync layer requires metadata, it can just be stored as hidden files on top of any underlying filesystem.

We can probably do away with the need for metadata completely. This significantly simplifies the implementation, and eliminates at one stroke all problems caused by metadata getting corrupt in various interesting ways.

Here’s one way we can implement a sync layer without metadata:
- When an application wants to read a file, say /foo/bar.txt, just look for /foo/bar.txt on all the disks. If there’s only one, use it. If there are multiple, use the one that was modified most recently [8].
- To delete a file or directory, we should delete it from all disks, so that future reads don’t find it. The OS has its Trash anyway and shouldn’t delete files unnecessarily [9].
- When an app wants to create a file or directory, we create it on the disk that has the most free space [10].
- When an app wants to check if a file exists, check all disks and say that it exists if it exists on at least one of them.
- Requests to read attributes should again be serviced from the most recent version of the file. Or perhaps requests to write attributes can be propagated to all copies, so that reads can be served from any of the replicas.
- To list the contents of a directory, we just list the contents of that directory on all disks, and take a union of them, taking care to eliminate older versions of a file.
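The operations above can be sketched in a few dozen lines. This `Pool` class is my own invention for illustration, assuming each disk is mounted as a plain directory tree; a real implementation would sit at the VFS or FUSE layer:

```python
import os

class Pool:
    """Metadata-free union of per-disk filesystems.
    Duplicate resolution: the most recently modified replica wins."""

    def __init__(self, disk_roots):
        self.disks = disk_roots  # e.g. ["/mnt/disk0", "/mnt/disk1"]

    def _candidates(self, relpath):
        paths = [os.path.join(d, relpath) for d in self.disks]
        return [p for p in paths if os.path.exists(p)]

    def resolve(self, relpath):
        """Pick the newest replica of a file, or None if absent everywhere."""
        found = self._candidates(relpath)
        return max(found, key=os.path.getmtime) if found else None

    def exists(self, relpath):
        """A file exists if it exists on at least one disk."""
        return bool(self._candidates(relpath))

    def listdir(self, relpath=""):
        """Union of directory listings across all disks."""
        names = set()
        for d in self.disks:
            p = os.path.join(d, relpath)
            if os.path.isdir(p):
                names.update(os.listdir(p))
        return sorted(names)

    def create_target(self):
        """New files go on the disk with the most free space (POSIX only)."""
        def free(d):
            st = os.statvfs(d)
            return st.f_bavail * st.f_frsize
        return max(self.disks, key=free)
```

Deletes and attribute writes would fan out to every replica, exactly as the list above describes.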

When a file is modified and saved, it should be written to a different disk each time [11], unless the underlying filesystem supports copy-on-write. That way, if you have a hard disk crash that takes out the latest version of the file, you’ll at least have an older version.

A similar case is copying a file or directory. Unless the underlying filesystem supports copy-on-write, make a copy on a different disk from the ones that contain the directory. If we’re going to waste space making copies of the same data, we might as well get some reliability in the bargain.

Combining this with the earlier idea of replicating files given sufficient free disk space, let’s save every version of every file on every disk, space permitting.

What are the downsides of this system?

For one, you can't have a single file that's bigger than the free space on any one disk. If I have a RAID-0 array consisting of two disks, each of which has 300GB free space, I can create a single 600GB file. Whereas this system requires that a file be stored in its entirety on one disk. Given hard disk sizes today, this shouldn't be a problem for most users.

RAID-0 also guarantees a high read or write speed (in MB/s) for a single file, because you're making use of the bandwidth of both disks. Whereas our system is limited to the read or write speed of a single disk [12]. On the other hand, the number of IOPS our system can sustain is the sum of the IOPS of all the constituent disks, whereas RAID's striping means that we're consuming the IOPS budget of multiple disks for a single write.

But reliability is absolutely better with our system — it makes use of free space to opportunistically replicate files, increasing reliability beyond RAID. Even with no free space, having separate filesystems means that no matter how many disk crashes you have, you'll still be able to access the data on the remaining disks, unlike RAID, where having more than a pre-decided number of disk crashes means that you'll lose ALL data.

The bottom line is that this is a system that does not force you into making an upfront choice between reliability, disk space and performance. Instead, this system is flexible, giving you as much performance and reliability as possible given the disk space you’ve used, and letting you vary these parameters as your needs change.

It's simple, flexible, and more reliable. And more performant in some cases.

[1] This is not a new idea. After writing this post, I discovered a product that implements this idea.

[2] With limits on the minimum and maximum number of replicas, which the administrator can adjust. If you want a guarantee that N disk failures result in no data loss, you could set a policy that ensures that every file has at least N + 1 replicas. Alternatively, if you have an array of 10 disks, making 10 replicas of a file may not serve any purpose, even if there was free space, so you could limit the number of replicas to 5, say.
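This policy is just a clamp, sketched here (function and parameter names are illustrative):

```python
def target_replicas(n_failures_tolerated, max_replicas, space_allows):
    """How many replicas a file should get.
    space_allows: how many replicas opportunistic free space could hold."""
    minimum = n_failures_tolerated + 1  # floor: survive N disk failures
    return max(minimum, min(space_allows, max_replicas))

print(target_replicas(1, 5, 10))  # 5: plenty of space, capped by the admin maximum
print(target_replicas(1, 5, 1))   # 2: the floor guarantees surviving 1 failure
```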

[3] This holds for directories as well: I’d rather lose 10 complete directories than half the files in 20 directories. In the former case, I can easily restore those directories from backup. And you anyway need an offsite backup, because no RAID-like system can protect against theft, fire, floods, power surges destroying your equipment, etc.

A second example is packages in OS X — they appear to the user as files, but they are really a bunch of files in a directory. They are used as the document format in some apps, and apps themselves are packages. In these cases, losing some files in a package can be just as bad as losing the entire package.

As a third example, many apps hide the filesystem from you, and have you manage your data through the app itself. Examples of such apps are Simplenote, iTunes and Lightroom. These apps may use a database or flat files, but ultimately it boils down to a number of flat files. In such cases, losing one of the files may be as bad as losing the entire directory, because the app can’t use a damaged library. This is different from a directory you manage manually: if you put 10 Word documents in a directory, and half of them are lost, you still have the other 5.

[4] The system should also be smart enough to decide which files are important to replicate. There are many ways to do this — by directory (anything in /Users on a Mac is more important than files in /lib or /tmp), or by file type (PDF and DOCX are probably more important than .app or .exe), by when the file was last accessed, whether it was downloaded or created locally, etc. Whichever way we choose, the point is that if have the disk space to replicate only some files, we should replicate the files that matter most to the user.
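A toy scoring function along these lines (the weights, paths and extensions are invented for illustration; a real system would tune them):

```python
import os
import time

IMPORTANT_DIRS = ("/Users", "/home")          # user data beats system files
IMPORTANT_EXTS = {".pdf", ".docx", ".jpg"}    # documents beat binaries
UNIMPORTANT_EXTS = {".app", ".exe", ".tmp"}   # reinstallable or disposable

def replication_priority(path, mtime):
    """Higher score = replicate first when space is tight."""
    score = 0
    if path.startswith(IMPORTANT_DIRS):
        score += 10
    ext = os.path.splitext(path)[1].lower()
    if ext in IMPORTANT_EXTS:
        score += 5
    if ext in UNIMPORTANT_EXTS:
        score -= 5
    # Recently touched files get a small boost.
    if time.time() - mtime < 30 * 24 * 3600:
        score += 2
    return score
```

Given a list of files, you'd sort by this score descending and replicate down the list until free space runs out.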

[5] This does mean that a single file can’t be any bigger than a disk. Or, more precisely, bigger than the free space on any of your disks. If you have two 1TB drives in a pool, with 400GB free space on each, you can’t create a file bigger than 400GB. Whereas with RAID-0, you can create a file that’s as big as the combined free space on all your disks, which is 800GB in this example. This shouldn’t be a problem for most users, given hard disk sizes available today, and that multi-terabyte files are uncommon. But it does rule out some high-end use cases, like perhaps working with uncompressed video at high resolutions, like 4K.

[6] Which is a reason to fill all disks evenly. If you have 300GB of data on one disk and 100GB on the other, losing one disk could mean losing 75% of your data.

[7] This means that you can pool existing disks with data on them together to create the pool, rather than having to format each disk and lose existing data.

[8] Some OSs give apps control over the last modified timestamp, so we’ll have to deal with this by having our own timestamp, which we don’t let apps modify, or detecting a request to set the timestamp to an earlier date and first deleting all but the most recent instance of the file.

[9] Broken command-line tools like rm notwithstanding. Or, to a lesser degree, OS X’s Trash, which doesn’t auto-empty the way Windows’s Recycle Bin does. This forces you to empty the entire Trash, which means that you can no longer recover those files. Whereas Windows lets you configure the Recycle Bin to a certain percentage of your hard disk, in which case that much data can be recovered at all times.

[10] You want to fill disks evenly, because otherwise a disk crash can take out a disproportionate amount of data. If you have two disks, then you want to guarantee that losing one disk causes no more than 50% data loss. If you have 1TB of data on one disk and 2TB on the second one, then losing the second one causes 67% data loss, which is bad.

If you have disks of different capacities, you want to fill them up evenly in absolute terms (GB), not as a percentage, for the same reason.
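A sketch of that placement rule (illustrative; `used_bytes` and `free_bytes` would come from the underlying filesystems):

```python
def pick_disk(used_bytes, free_bytes, file_size):
    """Choose a disk for a new file: fill evenly in absolute terms,
    so no single disk ever holds a disproportionate share of the data."""
    candidates = [i for i, free in enumerate(free_bytes) if free >= file_size]
    if not candidates:
        raise OSError("no disk has room for this file")
    # Least used bytes wins, regardless of each disk's total capacity.
    return min(candidates, key=lambda i: used_bytes[i])

# 1 TB on disk 0 and 2 TB on disk 1 (in GB): new files go to disk 0.
print(pick_disk([1000, 2000], [500, 500], 100))  # 0
```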

[11] With some limit on the file size beyond which you'll edit them in place. You don't want to copy a 100GB VM image to another disk to update one byte.

[12] Replicating incoming writes to all disks means that the write bandwidth of the pool doesn’t exceed that of a single disk (the slowest disk, actually). This is undesirable, so one solution is to always designate a primary disk for each file being written, and let the other disks catch up later. In other words, use asynchronous replication instead of synchronous replication. In addition to giving higher bandwidth, this also gives you lower latency, because waiting for two disks to acknowledge a write will always be slower than waiting for one. So, asynchronous replication both decreases latency and increases bandwidth. It does introduce a small window of unreliability. Perhaps there can be a knob for administrators to disable asynchronous replication.

For example, given two disks that can write at 100MB/s and two files being written at 100MB/s each, do each write on a separate disk, and replicate later. Alternatively, if one disk is capable of 200MB/s writes, it can absorb both writes simultaneously while the slower disk absorbs one write, and catches up with the other write later. As a third example, if you have two writes at 100 and 200MB/s, and two disks capable of 120MB/s and 200MB/s writes, do the faster write on the faster disk, and the slower write on the slower disk, and let them exchange files later.
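These examples fall out of a simple greedy assignment, sketched here (illustrative; a real scheduler would also account for seeks and bursty workloads):

```python
def assign_writes(stream_rates, disk_rates):
    """Greedily assign each write stream to a primary disk.
    stream_rates, disk_rates in MB/s. Returns {stream index: disk index}.
    Replication to the other disks happens asynchronously, later."""
    remaining = list(disk_rates)  # bandwidth left on each disk
    assignment = {}
    # Place the most demanding streams first, on the fastest disks.
    for s in sorted(range(len(stream_rates)), key=lambda i: -stream_rates[i]):
        disk = max(range(len(remaining)), key=lambda d: remaining[d])
        if remaining[disk] < stream_rates[s]:
            raise ValueError("no disk can absorb this stream in real time")
        remaining[disk] -= stream_rates[s]
        assignment[s] = disk
    return assignment

# Two 100 MB/s streams, two 100 MB/s disks: one stream per disk.
print(assign_writes([100, 100], [100, 100]))  # {0: 0, 1: 1}
```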

This means that our system can behave like Apple’s Fusion Drive, which gives you the performance of an SSD and the disk space of a hard drive.

This falls short in cases where a single file is being written at a higher data rate than the fastest disk. For example, with two 100MB/s disks, and a single file being written at 200MB/s. This can happen for demanding applications like editing 4K or uncompressed video. A RAID-like system should be able to handle such demands. This can be done by splitting the incoming stream into 100MB chunks, say, and writing each chunk to whichever disk is free. This requires sparse file support from the underlying filesystems, so that we can efficiently fill in the holes later without copying all the already-written data.

Replicating later will also be needed because you can create a pool from disks with existing data, without having to format them, unlike RAID, which is again inflexible and demands a clean slate. When you join pre-existing disks to form a pool, those disks are not replicated. So we’ll have to have a background replication process, anyway. And if we do, we might as well reuse it in cases where disk bandwidth prevents real-time replication.
