failed command: WRITE FPDMA QUEUED - cause of server running slow?

Question

A headless Linux NFS fileserver has been running slowly for the last couple of days (based on subjective reports from users). I checked journalctl and did not see any relevant errors.

However, when I connected a monitor, I was greeted by a screen full of these errors:

failed command: WRITE FPDMA QUEUED

Here's a photo:

What are some suggested next steps? Should I just replace the disk?

Could you run lspci and find out which controller is driving ata5.00? Some Marvell SATA controllers have issues with Linux. – Bryan isthebest, Jun 07 '20 at 23:36

score 13 · Accepted Answer · answered Feb 03 '19 at 21:22

Only one disk is having issues, according to what is visible here. But it's not obvious whether it's the drive, or the cable, or the controller.

I would first reboot, to reset the hardware. It could just be that the controller was temporarily confused and literally turning it off and on again would help.

If the error comes back after rebooting, then I would plug that drive into a different SATA port, and if necessary plug some other drive into that port. If the error is still on the same port, then the problem is with the controller.

But if the error "moves" to the new port, then it's either the cable or the drive. At this point I would replace the SATA cable. If the error goes away, then it was a bad cable. If you still have the error, then it's the drive.
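Before swapping ports, it helps to know which /dev/sdX device sits on the failing ATA port (ata5 in the question). The following is a minimal sketch, assuming a kernel whose resolved sysfs path for each disk contains an "ataN" component; the function names are made up for illustration.

```shell
#!/bin/sh
# Extract the libata port name (e.g. "ata5") from a resolved sysfs path.
# Assumption: the path contains a ".../ataN/..." component, which is the
# usual layout for SATA disks; USB or NVMe disks will print nothing.
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's#.*/\(ata[0-9][0-9]*\)/.*#\1#p'
}

# Print each sdX disk followed by the ATA port it is attached to.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue          # no sd* disks present
        path=$(readlink -f "$dev")         # follow the symlink into /sys/devices
        printf '%s %s\n' "${dev##*/}" "$(ata_port_of_path "$path")"
    done
}

list_ata_ports
```

The same resolved path normally also contains the PCI address of the controller (the component just before ataN), which can be passed to lspci -s to see which controller drives that port, as the comment above asks.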

score 5 · Answer 2 · answered May 23 '20 at 03:12

Another thing you can try (which works remotely) is to run smartctl -a to see if the drive is reporting any errors, and perhaps smartctl -t short to run a self-test on it.

In my case it revealed that the WRITE FPDMA QUEUED failures were due to an ICRC (interface CRC) error, which means the data was corrupted between the drive and the controller. So the disk itself is OK, and the fault is in either the SATA cable or the circuitry at either end that the cable plugs into.

While I'm no expert, presumably in this case the command was retried and eventually got through the SATA cable without being corrupted, resulting in a functioning system but with a very slow disk due to all the retries.

Malvineous

Could you elaborate on what part indicated a CRC error? It's not clear from the output. Also, was it an HDD or SSD? – jcollum Sep 06 '21 at 19:21
@jcollum I'm afraid I don't recall, but the smartctl -a output came back with the word "ICRC" in many of the error messages, which I looked up to get my answer. – Malvineous Sep 08 '21 at 02:31
@jcollum See "UltraDMA CRC Error Count" in the Wikipedia article "Self-Monitoring, Analysis and Reporting Technology". It showed up under UDMA_CRC_Error_Count for me. – davidvandebunte Jan 06 '23 at 21:05
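The smartctl check discussed in this answer and its comments could be scripted roughly as follows. This is only a sketch: it assumes the usual ATA attribute table printed by smartctl -A, where the raw value is the tenth column, and it uses the attribute name reported in the comment above (some drives may label it differently). The function name and /dev/sdX are placeholders.

```shell
#!/bin/sh
# Print the raw value of UDMA_CRC_Error_Count from `smartctl -A` text on stdin.
# Assumption: standard attribute table layout, raw value in field 10.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Typical use (needs root; /dev/sdX is a placeholder for the affected disk):
#   smartctl -A /dev/sdX | crc_error_count
#   smartctl -t short /dev/sdX      # start a short self-test
#   smartctl -l selftest /dev/sdX   # read the result a few minutes later
```

The counter is cumulative, so a non-zero value alone may be old history. Noting the value, replacing or reseating the cable, and checking whether it keeps rising is a more telling test.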

Attila Lendvai · Answer 3 · 2023-10-16T22:02:02.513

I'm having the same issue (failed command: WRITE FPDMA QUEUED) on an Apacer AS350 1TB drive with an older i7 laptop (i.e. Intel chipset).

I found this on a kernel bug discussion:

https://bugzilla.kernel.org/show_bug.cgi?id=203475#c15

I believe if you see "WRITE FPDMA QUEUED" messages, the issue is with NCQ in general, and yes, you should try disabling it for the device. But if you see "SEND FPDMA QUEUED" as in the initial post, then you might've gotten away with disabling just the queued TRIM.

Adding libata.force=1:3.0G,1:noncq to the kernel command line "solved" the problem for me, but of course it slowed down the drive. I think the 1:3.0G part is not important.
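To confirm the parameter actually took effect after a reboot, something like the sketch below may help. It assumes the option was passed on the kernel command line; the function name and sdX are placeholders. In the libata.force syntax the number before the colon is the ATA port, so 1:noncq targets ata1, and the asker's ata5 would presumably need 5:noncq instead.

```shell
#!/bin/sh
# Print the value of libata.force from a kernel command line string, if present.
libata_force_of() {
    # Rely on word splitting to walk the space-separated parameters.
    for word in $1; do
        case $word in
            libata.force=*) printf '%s\n' "${word#libata.force=}" ;;
        esac
    done
}

# On a running system (sdX is a placeholder):
#   libata_force_of "$(cat /proc/cmdline)"
#   cat /sys/block/sdX/device/queue_depth   # 1 generally means NCQ is not in use
```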

A firmware update on the SSD may solve such issues, but Apacer only provides Windows-based firmware updaters, which I cannot easily apply and test.
