9

Is a resilver as good as a scrub? If not, why?

Use case: during a scrub checksum errors come up. Instead of continuing with the scrub, stop it, replace drive and resilver. Did the resilver do some/all of the checking that a scrub would have done?

4 Answers

7

A scrub reads all the data in the zpool, including redundant copies and parity, and verifies it against its checksums.

A resilver re-copies all the data onto one device using the data and parity information on the other devices in the vdev. For a mirror it simply copies the data from the other device in the mirror; for a raidz vdev it reads data and parity from the remaining drives to reconstruct the missing data.

They are not the same, and in my interpretation they are not equivalent. If a resilver encounters an error when trying to reconstruct a copy of the data, this may well be a permanent error (since the data can no longer be correctly reconstructed). Conversely, if a scrub detects corruption, it can usually be fixed from the remaining data and parity (this also happens silently at times in normal use).
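To make the difference concrete, here is a toy sketch in Python, not ZFS code. It assumes a mirror is just a list of per-disk block lists and that a SHA-256 digest stands in for the block checksum; the names `checksum` and `scrub_mirror` are made up for illustration. With two copies a bad block is repaired from the good one; with a single surviving copy the same bad block has nothing to be rebuilt from.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in for a ZFS block checksum (assumption: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def scrub_mirror(disks, checksums):
    """Verify every block on every disk and repair bad copies from a good one.

    disks: list of per-disk block lists, all the same length.
    checksums: expected checksum for each block index.
    Returns (repaired, unrecoverable) as lists of block indexes.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(checksums):
        # Look for any copy of block i that still matches its checksum.
        good = next((d[i] for d in disks if checksum(d[i]) == expected), None)
        if good is None:
            unrecoverable.append(i)  # no valid copy left anywhere
            continue
        for d in disks:
            if checksum(d[i]) != expected:
                d[i] = good  # overwrite the bad copy with the good one
                if i not in repaired:
                    repaired.append(i)
    return repaired, unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
sums = [checksum(b) for b in blocks]

# Healthy two-way mirror with one silently corrupted block on disk_b.
disk_a, disk_b = list(blocks), list(blocks)
disk_b[1] = b"bXta"
result_scrub = scrub_mirror([disk_a, disk_b], sums)

# Degraded mirror: the only surviving disk has a corrupted block.
survivor = list(blocks)
survivor[2] = b"gXmma"
result_degraded = scrub_mirror([survivor], sums)

print(result_scrub, result_degraded)
```

The second call is the situation a resilver is in when redundancy is already gone: the error is detected, but there is no other copy to repair it from.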

Kurankat
  • 171
2

If you are replacing a drive that hasn't completely failed, it is beneficial to keep the old drive present during the resilver, as it provides additional redundancy. If you have no redundancy left, any further errors will result in some data loss in the affected files.

A resilver operation will read the minimum amount of data required to restore redundancy onto the replacement disk. A scrub operation will read ALL data, both primary and parity data.

So if you are resilvering a two-way mirror or raidz1, they are equivalent, as the resilver has to read all the surviving data. If you are resilvering a 3-way mirror, raidz2 or raidz3, the resilver will not read all of the surviving data, so in those cases scrub and resilver are not equivalent.
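The counting argument above can be written down as a small Python sketch. It only models the claim in this answer (how many surviving disks must be read to rebuild one missing disk) and says nothing about how ZFS actually schedules its reads; the function names are hypothetical.

```python
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def min_reads_to_rebuild(layout: str, width: int) -> int:
    """Fewest surviving disks that must be read to rebuild one missing disk."""
    if layout == "mirror":
        return 1  # any single good copy is enough
    return width - PARITY[layout]  # as many columns as there are data disks


def resilver_reads_all_survivors(layout: str, width: int) -> bool:
    """True when rebuilding one disk forces a read of every surviving disk."""
    survivors = width - 1
    return min_reads_to_rebuild(layout, width) >= survivors


examples = [("mirror", 2), ("mirror", 3), ("raidz1", 4), ("raidz2", 6), ("raidz3", 8)]
for layout, width in examples:
    verdict = resilver_reads_all_survivors(layout, width)
    print(f"{layout} with {width} disks: reads all survivors = {verdict}")
```

Only the two-way mirror and raidz1 cases come out as reading every surviving disk, which is the condition under which this answer treats a resilver as equivalent to a scrub.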

0

Thank you very much for sharing the issue and for those that commented with details and suggestions. I am trying to take the concept one step back, I am searching for the most straightforward explanation in ZFS for:

  1. Scrub
  2. Resilvering

I understand that a scrub checks for errors and faults on the HDD, SSD, or NVMe disk. Resilvering copies the data again onto a device that is faulty or has been replaced.

I look forward to your corrections and feedback.

In my case, it looks like I will need to buy a new SSD:

# zpool status -v pvedata1
  pool: pvedata1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat May 13 18:28:30 2023
    1.20T scanned at 137M/s, 609G issued at 68.2M/s, 2.39T total
    389G resilvered, 24.93% done, 07:39:31 to go
config:
NAME                                  STATE     READ WRITE CKSUM
pvedata1                              DEGRADED     0     0     0
  mirror-0                            DEGRADED     0     0     0
    ata-CT4000MX500SSD1_2243E67E4CA0  FAULTED      3     8     0  too many errors
    ata-CT4000MX500SSD1_2243E67E4C9F  ONLINE       0     0     0

errors: No known data errors
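As a sanity check on the scan lines above, the progress and time remaining can be recomputed from the printed figures. This is a rough Python sketch that assumes the size suffixes are powers of 1024 and that progress is measured against the "issued" figure; the helper name `to_bytes` is made up for illustration.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text: str) -> float:
    """Convert a size such as '609G' to bytes (assumes binary suffixes)."""
    return float(text[:-1]) * UNITS[text[-1]]


issued = to_bytes("609G")   # data issued so far
total = to_bytes("2.39T")   # total data to process
rate = to_bytes("68.2M")    # issue rate in bytes per second

percent_done = 100 * issued / total
seconds_left = (total - issued) / rate
hours, remainder = divmod(int(seconds_left), 3600)
minutes = remainder // 60

print(f"{percent_done:.2f}% done, about {hours}h {minutes:02d}m to go")
```

The result lands close to the reported 24.93% and 07:39:31; the small gap is expected, since the figures in the status output are already rounded.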

Sincerely,

0

I know this is a very late answer, but I feel like this question could use a broader answer covering the subject.


The main scrubbing and resilvering processes in ZFS are essentially identical – in both cases records are being read and verified, and if necessary written out to any disk(s) with invalid (or missing) data.

Since ZFS is aware of which records a disk should have, it won't bother trying to read records that shouldn't exist. This means that during resilvering, new disks will see little or no read activity as there's nothing to read (or at least ZFS doesn't believe there is).

This also means that if a disk becomes unavailable and then available again, ZFS will resilver only the new records created since the disk went unavailable. Resilvering happens automatically in this way, whereas scrubs typically have to be initiated (either manually, or via a scheduled command).
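A simplified Python sketch of that idea: assume each record carries the transaction group (txg) in which it was written, and that the pool remembers the txg range a disk missed. ZFS keeps track of missed transaction groups per device; the `Record` class and `records_to_resilver` function here are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    birth_txg: int  # transaction group in which the record was written


def records_to_resilver(records, missed_from, missed_to):
    """Select only the records written while the disk was unavailable."""
    return [r for r in records if missed_from <= r.birth_txg <= missed_to]


pool = [Record("a", 10), Record("b", 20), Record("c", 30), Record("d", 40)]

# Disk dropped out for txgs 25..35, then came back: only "c" is copied.
after_outage = [r.name for r in records_to_resilver(pool, 25, 35)]

# Brand-new replacement disk: it has missed everything, so all records are copied.
new_disk = [r.name for r in records_to_resilver(pool, 0, 40)]

print(after_outage, new_disk)
```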

There is also a special "sequential resilver" option that can be triggered using zpool attach -s or zpool replace -s – this performs a faster copy of all data without any checking, and initiates a deferred scrub to verify integrity later. This is good for quickly restoring redundancy, but should only be used if you're confident that the existing data is correct.

Finally, there are some small differences in settings for scrub and resilver. In general a resilver is given a higher priority than a scrub, since a scrub is less intensive and usually less urgent. That does not necessarily make a resilver faster than a scrub, though: it depends on write speed, the number of record copies available, and so on.


Specifically with regards to the original question:

If you know a disk has failed and want to replace it, then a sequential resilver + scrub (zpool replace -s) will be faster in terms of restoring redundancy and performance, but it'll take longer overall before you know for sure that the data was fully restored (without any errors). A regular resilver will take longer to finish copying data, but is verified the moment it finishes.

However, if you're talking about repairing data on a disk you still believe to be okay, then a scrub is the fastest option, as it will only copy data which fails verification. Otherwise the process is entirely reading and checking, so it's almost always going to be faster.

In theory a resilver can be just as fast as a scrub, or even faster (since it's higher priority), assuming you are copying onto a suitably fast new drive that's optimised for continuous writing. In practice though that's usually not going to be the case.

Haravikk
  • 267