Follow the recommendation
We have seen many different EC profiles over the last 10+ years, but few that follow the official ceph.io recommendations.
"We generally recommend min_size be K+2 or more to prevent loss of writes and data." (https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-coded-pool-recovery)
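That recommendation is simple arithmetic; a minimal sketch, assuming only the K+2 rule quoted above (the function name and the m >= 2 guard are our own illustration, not Ceph API):

```python
# Sketch: choosing min_size for a k+m erasure-coded pool following the
# "min_size = k + 2 or more" recommendation. Illustrative only.

def recommended_min_size(k: int, m: int) -> int:
    """Return min_size = k + 2, capped at the total chunk count k + m.

    With fewer than two coding chunks (m < 2) the recommendation
    cannot be met, since min_size may never exceed k + m.
    """
    if m < 2:
        raise ValueError("m < 2 leaves no room for min_size = k + 2")
    return min(k + 2, k + m)

# Example: the 8+3 profile discussed below -> min_size 10
print(recommended_min_size(8, 3))  # -> 10
```

With min_size = 10, an 8+3 pool keeps accepting writes with one chunk missing, but pauses writes once a second chunk is lost, rather than writing with no remaining safety margin.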
Erasure Coding vs RAID in production
Erasure coding is often compared to RAID (e.g. RAID 5, RAID 6, …) because both architectures split data into data chunks and coding (parity) chunks.
In production use, however, they differ considerably.
Global rule vs. per-data-set rule
In software or hardware RAID, the set of hard disks that stores all the data, including hot spares, is fixed.
In Ceph, however, e.g. with an EC profile of 8 + 3 and Failure Domain HOST, a total of 11 servers with one hard disk each are involved in storing a data set.
For the next data set, other servers or other hard disks are used.
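The per-data-set placement can be illustrated with a small sketch. This is not CRUSH (Ceph's actual placement algorithm); it uses simple rendezvous hashing, and all host names are made up, but it shows the key property: each object gets its own selection of k+m hosts instead of RAID's one fixed disk set.

```python
# Illustrative sketch (NOT CRUSH): pick k+m distinct hosts per object
# by rendezvous hashing, so different objects land on different host sets.

import hashlib

def place(object_name: str, hosts: list, k: int, m: int) -> list:
    """Deterministically pick k+m distinct hosts for one object."""
    ranked = sorted(
        hosts,
        key=lambda h: hashlib.sha256((object_name + h).encode()).hexdigest(),
    )
    return ranked[: k + m]

# Hypothetical 16-host cluster, EC 8+3 -> 11 hosts per object
hosts = [f"host{i:02d}" for i in range(16)]
print(place("backup-2024.tar", hosts, 8, 3))
print(place("video-0001.mp4", hosts, 8, 3))  # usually a different set of 11
```

The same object always maps to the same 11 hosts, but two different objects will generally use different host subsets, which is exactly the "per-data-set rule" described above.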
Key facts for Ceph recovery
If a hard disk fails, another hard disk is immediately allocated as the storage location.
The decisive factor for data security is how long it takes to restore the data and how high the probability is that other hard disks will fail during the recovery period.
Further failures extend the recovery period; and if more than 3 hard disks fail (i.e. more than the m=3 coding chunks of the 8+3 example), the part of the data stored on the affected placement groups is lost.
The recovery time depends on physical components such as the number of available hard disks, their fill level and the throughput.
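A back-of-envelope sketch of that dependency, with all figures (disk size, fill level, OSD count, per-OSD rebuild throughput) as illustrative assumptions rather than measurements:

```python
# Back-of-envelope sketch: recovery time scales with the data on the
# failed disk and the aggregate rebuild throughput. All numbers are
# illustrative assumptions, not measurements.

def recovery_hours(disk_tb: float, fill_level: float,
                   recovering_osds: int, mb_s_per_osd: float) -> float:
    """Estimate hours to re-create a failed disk's data elsewhere."""
    data_mb = disk_tb * fill_level * 1_000_000   # TB -> MB (decimal)
    throughput_mb_s = recovering_osds * mb_s_per_osd
    return data_mb / throughput_mb_s / 3600

# e.g. an 18 TB disk 60 % full, 50 OSDs each contributing ~40 MB/s
print(round(recovery_hours(18, 0.6, 50, 40), 1))  # -> 1.5 hours
```

The sketch makes the scaling visible: halving the per-OSD throughput or doubling the fill level doubles the window during which additional failures can occur.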
The recovery behaviour is also significantly influenced by the correct choice of Ceph configuration parameters.
Care should always be taken to ensure that the parameters are matched to the physical hardware, e.g.:
- priority of the recovery in relation to the response to client requests during operation
- optimal choice of PGs for the distribution of data
- distinction between SLAs for read access and write access
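For the PG choice, the widely used rule of thumb can be sketched as follows. The target of ~100 PGs per OSD and rounding up to a power of two are common guidance, but the right value for a given cluster is a judgment call; treat this as a starting point, not Ceph's authoritative algorithm.

```python
# Sketch of the common PG-count rule of thumb:
# pg_num ~ (OSDs * target PGs per OSD) / pool size, rounded up to a
# power of two. target_per_osd = 100 is the usual starting point.

def suggest_pg_num(num_osds: int, pool_size: int,
                   target_per_osd: int = 100) -> int:
    raw = num_osds * target_per_osd / pool_size
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

# e.g. 66 OSDs, EC 8+3 pool (size 11): 66*100/11 = 600 -> 1024
print(suggest_pg_num(66, 11))  # -> 1024
```

Note that "pool size" for an EC pool is k+m (11 for 8+3), not the replica count, which is why EC pools often end up with fewer PGs per pool than replicated ones on the same hardware.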
Our general opinion on EC is:
Originally I didn’t like it much and preferred to avoid it whenever possible, mainly because it is much more complicated (more bugs), much harder to restore (a “partial” restore is not possible) and performance is usually worse. But the “saved space” sounds too tempting at first glance.
With that said, it is inevitable in the future and there are actually cases where it is fine and can even work better than a replicated pool, e.g. when storing large data such as backup tarballs or videos, or when the writes are aligned to the stripe width (i.e. the application needs to know how to write effectively).
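What "aligned to the stripe width" means can be sketched in a few lines. The stripe unit of 4 KiB here is an assumption for illustration (in Ceph it is configurable per EC profile); writes that are a multiple of the full stripe width avoid expensive read-modify-write cycles.

```python
# Sketch: a full EC stripe is k data chunks of stripe_unit bytes each.
# stripe_unit = 4096 is an assumed value, not a Ceph default claim.

def stripe_width(k: int, stripe_unit: int) -> int:
    """Full stripe size in bytes: k data chunks of stripe_unit each."""
    return k * stripe_unit

def aligned(write_size: int, k: int, stripe_unit: int) -> bool:
    """True if a write covers whole stripes, avoiding read-modify-write."""
    return write_size % stripe_width(k, stripe_unit) == 0

sw = stripe_width(8, 4096)          # 8 data chunks * 4 KiB = 32 KiB
print(sw)                           # -> 32768
print(aligned(1_048_576, 8, 4096))  # 1 MiB writes: True (32 full stripes)
print(aligned(10_000, 8, 4096))     # odd-sized small write: False
```

This is why large sequential data (backup tarballs, videos) is a good fit, while small random writes tend to hit partial stripes and perform poorly.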