Minio, the ZFS of cloud storage
ZFS is best known for abstracting away physical storage device boundaries by pooling them together. ZFS removed the need to manage physical devices manually or to worry about their individual capacities. ZFS was also a pioneer in its ability to detect data corruption and recover when data redundancy is available.
However, as we already discussed, traditional filesystems can’t handle modern application requirements. Applications now need to handle unstructured data programmatically, while tolerating very little in the way of data loss or security lapses.
And clearly this is difficult to achieve using current filesystems. But what if ZFS evolved into a cloud storage system? What would it look like? I am quite sure it would look something like what Minio is today. What makes me say this? Quite a few things, and I will cover them all in this post.
Protection against silent data corruption, aka bit rot
Silent data corruption, or bit rot, is a serious problem for disk drives: data can become corrupted without the user ever knowing about it. Causes include (but are not limited to) current spikes, bugs in disk firmware, phantom writes, misdirected reads/writes, DMA parity errors between the array and server memory, driver errors, and accidental overwrites.
ZFS was probably the first open source file system to provide protection against these data corruption issues. ZFS uses a Fletcher-based checksum or a SHA-256 hash throughout the file system tree, with each block of data checksummed and the value saved in the pointer to that block — rather than at the actual block itself. This checksumming continues all the way up the file system’s data hierarchy to the root node.
Now, since ZFS stores the checksum of each block in its parent block pointer, the entire pool self-validates as blocks are accessed, whether they hold data or metadata: on every read, the checksum of the fetched block is computed and compared with the stored value that says what it should be.
Along similar lines, Minio provides comprehensive protection against data corruption and bit rot, albeit with a faster hashing algorithm and an extensive data-recovery mechanism based on erasure coding.
To protect against bit rot, Minio verifies the integrity of every data chunk read back from disk by computing its hash and comparing it with the hash computed when the data was first stored. Only if the two hashes are identical is the data guaranteed to be unaltered. If they differ, the chunk is discarded and has to be reconstructed from the other data and parity chunks. We’ll come back to reconstruction in a bit; let’s first understand hashing in Minio.
Minio uses an optimized implementation of the Highway hash algorithm, highwayhash, written entirely in Go. It can achieve hashing speeds of over 10 GB/sec on a single core on Intel CPUs.
As you can imagine, hashing has to run on almost every data read (to verify) and every data write (to save the initial hash), so it is important that hashing is both fast and accurate. That is why we invested time and energy in highwayhash, a perfect fit for such use cases. You can read more about it in this blog post.
An erasure code is an error correction code, which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols.
Minio uses Reed-Solomon code to shard objects into n/2 data and n/2 parity blocks. In a 12-drive setup, for example, an object is sharded into 6 data and 6 parity blocks. Even if you lose as many as 5 ((n/2)–1) drives, be they parity or data, you can still reconstruct the data reliably from the remaining drives.
How does this compare to RAID?
Erasure code protects data against multiple drive failures, unlike RAID or replication. For example, RAID 6 can protect against two drive failures, whereas with Minio erasure code you can lose as many as half of the drives and still recover your data.
Further, Minio’s erasure code works at the object level: each object is individually encoded with a high parity count, so healing can be done one object at a time. With RAID, healing can only be performed at the volume level, which translates into huge downtime.
Additionally, Minio’s erasure coded backend is designed for operational efficiency and takes full advantage of hardware acceleration whenever available.
Another key aspect of ZFS is its ability to scale with your needs: considering only software limits, ZFS is potentially scalable to zettabytes of storage. Minio, on the other hand, scales in a multi-tenant manner, with each tenant’s data stored on a separate Minio instance. This decouples scale from the physical limits of the software. Whether you have hundreds, thousands, or millions of tenants, as long as a single Minio instance caters to each tenant, the deployment is simple and can be scaled easily.
Not only does such a setup allow for easy maintenance (only a few tenants are down at any given time), it also keeps the deployment simple and easy to scale up and down.
However, if you prefer large deployments, Minio recently introduced large bucket support, which lets you launch petabyte-scale deployments from the get-go. Read more here.