SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       00%     22598         306666512

Welp, I guess I'm getting a new 8TB HDD for Christmas. Luckily the failing one is still under its 2-year warranty!

@delroth Isn't this "just" a sector that failed and wasn't reallocated yet, because it wasn't written to?

@delroth Or even s/failed/is unreadable due to crc mismatch/

@robryk No clue. The disk is telling me it's broken, and I'm not particularly interested in finding out how badly it's broken before it eats my data. I'm guessing that an extended offline test would also take care of reallocating if it could.

@delroth No, it won't.

The sector won't be reallocated until it's written to. The reasoning behind that is that maybe the next read will actually succeed, and we should never trash that possibility without explicit instructions to do so.

@delroth Take a look at Reallocated_Sector_Ct (and Offline_Uncorrectable and Current_Pending_Sector) counters. If there are few remaining spare sectors, then the disk is really close to failure. This is indicated by Reallocated_Sector_Ct being marked as dangerously high.

Other than that, sectors that cannot be corrected with the error correction code happen at some rate. This rate can be increased by various issues that make the drive arguably broken, but it's nonzero even with a totally operational drive.
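For anyone wanting to check the three counters mentioned above without eyeballing the whole report, here is a minimal Python sketch. It assumes the default attribute table layout printed by `smartctl -A` (ten columns, with a hexadecimal FLAG column); the function names and the `SAMPLE` rows are made up for illustration and are not taken from the drive in this thread.

```python
import subprocess

# Counters named in the thread.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Made-up sample rows in the usual `smartctl -A` layout (NOT from the thread's drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

def parse_smart_attributes(text):
    """Turn the attribute table into {name: {value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows: numeric ID, name, 0x-prefixed flag, then seven more columns.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs

def summarize(attrs):
    """Report raw counts and whether the normalized value has hit its threshold."""
    report = {}
    for name in WATCHED:
        a = attrs.get(name)
        if a is not None:
            report[name] = {
                "raw": a["raw"],
                "at_or_below_threshold": a["value"] <= a["thresh"],
            }
    return report

def read_device(device):
    """Run smartctl on a device (needs root). Its exit status is a bitmask, so it is not checked."""
    out = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return summarize(parse_smart_attributes(out.stdout))
```

Calling `summarize(parse_smart_attributes(SAMPLE))` exercises the parser without touching a disk; `read_device("/dev/sdh")` would do the same against a real drive.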

@robryk [nix-shell:~]# dd if=/dev/zero of=/dev/sdh bs=512 seek=8896601104
dd: error writing '/dev/sdh': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 3.38237 s, 0.0 kB/s

Can't be written to at all, from what I can tell.
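The `seek=` value in the dd command and the `LBA_of_first_error` in the self-test log look unrelated at first glance, but a little arithmetic relates them. The sketch below assumes 512-byte logical sectors (as `bs=512` suggests). That the log field holds only the low 32 bits of the LBA is an inference from the numbers, not something stated in the thread.

```python
SECTOR_SIZE = 512      # bytes; assumption taken from bs=512 in the dd command
DD_SEEK = 8896601104   # seek= value from the dd command
LOG_LBA = 306666512    # LBA_of_first_error from the self-test log

def byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte position on the device where the given sector starts."""
    return lba * sector_size

def low_32_bits(lba):
    """What an LBA would look like if stored in a 32-bit field."""
    return lba & 0xFFFFFFFF

print(byte_offset(DD_SEEK))             # 4555059765248, about 4.56 TB into the 8TB drive
print(low_32_bits(DD_SEEK) == LOG_LBA)  # True: the two values differ by exactly 2 * 2**32
```

If that inference holds, the dd command is aimed at the same sector the extended self-test tripped over.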

@delroth Huh. That's really surprising (the read error was not immediate, so it's not _totally_ borked, but then why does it seem totally borked for writes? Is that the read-errored sector that you're trying to write to?). Would you mind pasting the output of `smartctl -a /dev/sdh` and how this error shows up in dmesg, for my curiosity?

@delroth

I think this sector would work if you wrote to it. Sadly, you end up reading from it first (probably due to some readahead/caching/other bullshit) -- see `failed command: READ FPDMA QUEUED`.

I remember having a similar problem with a PATA drive >10yrs ago, which I fixed by rebuilding a kernel that just never issued reads to the HDD. I expect that there's some way to make the block IO layer actually issue only a write, with some flag to open() (O_DIRECT?).

@robryk hmm indeed, oflag=direct seems to have cleared the failure. Nice, thanks.
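For reference, dd's `oflag=direct` opens the output with `O_DIRECT`, which bypasses the page cache so the kernel has no reason to read the surrounding data first. A rough Python equivalent for a single sector might look like the sketch below. The function name and parameters are hypothetical, it assumes Linux and a 512-byte logical sector size, and it destroys whatever was in the target sector.

```python
import mmap
import os

def overwrite_sector(path, lba, sector_size=512, direct=True):
    """Overwrite one sector with zeros and return the number of bytes written.

    DESTRUCTIVE: the previous contents of that sector are lost.
    """
    flags = os.O_WRONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only: bypass the page cache
    # An anonymous mmap is page-aligned and zero-filled, which satisfies
    # the buffer alignment that O_DIRECT requires.
    buf = mmap.mmap(-1, sector_size)
    fd = os.open(path, flags)
    try:
        # pwrite puts the buffer at an absolute offset without a separate seek.
        return os.pwrite(fd, buf, lba * sector_size)
    finally:
        os.close(fd)
        buf.close()
```

Something like `overwrite_sector("/dev/sdh", 8896601104)` would target the sector from the dd command above. Passing `direct=False` makes it easy to try the function on an ordinary file first, since some filesystems (tmpfs, for instance) reject `O_DIRECT`.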

Not sure how much I still trust this drive, and now I don't even have an excuse to get a warranty replacement :P

@delroth I see ~no reason to count this against that drive, given that Reallocated_Event_Count was 0 (so, IIUC, no sectors have been found to be unusable yet).

@robryk FWIW Reallocated_Event_Count is still 0 now so uh... SMART being very accurate as usual I guess.

@delroth It might truly be terribly inaccurate. However, there might have been no reallocation: we only know that there was an error when reading that ECC could not correct. It's possible (and likely) that the sector was physically OK and was still writeable (i.e. after writing you'd read the same thing back). In that case we just keep using the same sector (I don't know how the detection of that works exactly; I'd imagine that writing to a known-uncorrectable sector would involve an immediate readback, but does the drive know that? we surely can't read back everything).

@delroth s/never issued/redirected all reads to one sector :P/
