Everything was fine at first. However, I experienced random crashes during heavy (and sometimes normal) activities involving disk I/O. During a crash, the disk LED indicator, which normally flickers with disk activity, blinks about three times or more before turning off. After that, the desktop either freezes or keeps running but cannot load anything from the disk when I open an application or terminal.
As the standard troubleshooting procedure, I investigated the system logs for clues on why the drive crashed but didn't find anything at all. Most modern disk drives have SMART capabilities, so I installed smartmontools using pacman to access it and print a report for /dev/nvme0n1. To my chagrin, the report also shows no error logs.
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.77-1-MANJARO] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: KINGSTON SA2000M8500G
Serial Number: 50026B72826D0B6C
Firmware Version: S5Z42105
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Utilization: 60,623,048,704 [60.6 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 2826d0b6c5
Local Time is: Sun Jan 3 01:12:39 2021 PST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 75 Celsius
Critical Comp. Temp. Threshold: 80 Celsius
Supported Power States
St Op      Max  Active  Idle  RL RT WL WT  Ent_Lat  Ex_Lat
 0  +    9.00W       -     -   0  0  0  0        0       0
 1  +    4.60W       -     -   1  1  1  1        0       0
 2  +    3.80W       -     -   2  2  2  2        0       0
 3  -  0.0450W       -     -   3  3  3  3     2000    2000
 4  -  0.0040W       -     -   4  4  4  4    15000   15000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0  +    512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 30 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,318,500 [675 GB]
Data Units Written: 1,782,547 [912 GB]
Host Read Commands: 11,972,697
Host Write Commands: 16,505,085
Controller Busy Time: 190
Power Cycles: 108
Power On Hours: 242
Unsafe Shutdowns: 14
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
The drive has 14 unsafe shutdowns to date due to the forced power-off needed to restart the machine.
With no errors logged, the SSD didn't appear to be broken. Still, I needed to find out why or how the drive was failing, because I couldn't use my new NUC for any serious work in that condition. Returning the item seemed unwise since I had no proof that the hardware was at fault.
I got a hunch that Linux may have compatibility issues with NVMe devices. So, I visited an Arch Wiki page about NVMe SSD and discovered that NVMe SSDs can save power through APST (Autonomous Power State Transition).
In a patch for Linux kernel 4.11, amluto fixed APST (power saving) for NVMe devices. NVMe devices can save power by entering a low-power state when idle:
NVMe devices can advertise multiple power states. These states can be either "operational" (the device is fully functional but possibly slow) or "non-operational" (the device is asleep until woken up). Some devices can automatically enter a non-operational state when idle for a specified amount of time and then automatically wake back up when needed.
However, his Samsung 950 had issues with APST detection:
In theory, the device can expose "default" APST table, but this doesn't seem to function correctly on my device (Samsung 950), nor does it seem particularly useful. There is also an optional mechanism by which a configuration can be "saved" so it will be automatically loaded on reset. This can be configured from userspace, but it doesn't seem useful to support in the driver.
When I checked my SSD using nvme-cli, APST was enabled. However, state 4 (non-operational) has a total latency (ent_lat + ex_lat) of 30,000 µs (30 ms), which is higher than the default limit of 25 ms.
Supported Power States
St Op      Max  Active  Idle  RL RT WL WT  Ent_Lat  Ex_Lat
 0  +    9.00W       -     -   0  0  0  0        0       0
 1  +    4.60W       -     -   1  1  1  1        0       0
 2  +    3.80W       -     -   2  2  2  2        0       0
 3  -  0.0450W       -     -   3  3  3  3     2000    2000
 4  -  0.0040W       -     -   4  4  4  4    15000   15000
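To double-check the arithmetic, here is a minimal Python sketch that parses the table rows and sums the entry and exit latencies of each non-operational state. The function name `total_latencies` and the `ROWS` constant are my own, and the parser assumes the 11-column layout shown in this post; other smartctl or nvme-cli versions may print the table differently.

```python
# Rows copied from the "Supported Power States" table in this post.
# Columns: St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat (latencies in µs).
ROWS = """\
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W - - 0 0 0 0 0 0
1 + 4.60W - - 1 1 1 1 0 0
2 + 3.80W - - 2 2 2 2 0 0
3 - 0.0450W - - 3 3 3 3 2000 2000
4 - 0.0040W - - 4 4 4 4 15000 15000
"""


def total_latencies(table_text):
    """Map each non-operational state to its entry + exit latency in µs."""
    totals = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an 11-column data row.
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        state = int(fields[0])
        operational = fields[1] == "+"  # "+" marks an operational state
        if not operational:
            totals[state] = int(fields[9]) + int(fields[10])
    return totals


print(total_latencies(ROWS))  # {3: 4000, 4: 30000}
```

State 3 totals 4,000 µs and state 4 totals 30,000 µs, so only state 4 exceeds the 25,000 µs default.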
Based on the Arch Wiki:
If the total latency of any state (enlat + xlat) is greater than 25000 (25ms) you must pass a value at least that high as parameter
But if APST works correctly, that state wouldn't be used at all:
The maximum acceptable latency is controlled using dev_pm_qos (e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational states with total latency greater than this value will not be used. As a special case, setting the latency tolerance to 0 will disable APST entirely. On hardware without APST support, the sysfs file will not be exposed.
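The rule in that quote can be modeled in a few lines of Python. This is a simplified sketch of the rule as quoted, not the kernel's actual code, and `apst_states_allowed` is a name I made up. The latency totals are the ones for my drive's two non-operational states.

```python
# Total (entry + exit) latency in µs of my drive's non-operational states,
# taken from the power-state table in this post.
NON_OPERATIONAL_TOTAL_US = {3: 4000, 4: 30000}


def apst_states_allowed(max_latency_us, totals=NON_OPERATIONAL_TOTAL_US):
    """Return the non-operational states usable under a latency tolerance.

    Simplified model of the quoted rule: a tolerance of 0 disables APST,
    otherwise states whose total latency exceeds the tolerance are skipped.
    """
    if max_latency_us == 0:
        return []
    return sorted(s for s, total in totals.items() if total <= max_latency_us)


for tolerance in (0, 25000, 30000):
    print(tolerance, apst_states_allowed(tolerance))
# 0 []
# 25000 [3]
# 30000 [3, 4]
```

Under this model, the 25,000 µs default already excludes state 4, a tolerance of 30,000 µs allows it, and 0 turns APST off.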
Nevertheless, I still set the default_ps_max_latency_us for the nvme_core kernel module to 30000 (30 ms) to match the total latency of my SSD's 4th state.
My SSD hasn't failed six days after the fix, even when building Docker images (a rather disk I/O intensive task). It seems that this Kingston A2000 NVMe SSD also has issues exposing its APST table similar to the Samsung 950, making the system unstable due to a power-saving state that the drive cannot use.
The problem resurfaced when I transferred my Nuxt.js project files (uncompressed) from my work MacBook Pro to the NUC's Linux partition via scp. It happened twice, and on the second attempt I was able to capture the error logs in a separate terminal session using dmesg -w:
nvme nvme0: I/O 10 QID 2 timeout, aborting
nvme nvme0: I/O 11 QID 2 timeout, aborting
nvme nvme0: I/O 12 QID 2 timeout, aborting
nvme nvme0: I/O 13 QID 2 timeout, aborting
nvme nvme0: I/O 14 QID 2 timeout, aborting
nvme nvme0: I/O 10 QID 2 timeout, reset controller
nvme nvme0: I/O 24 QID 0 timeout, reset controller
nvme nvme0: Device not ready; aborting reset
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
nvme nvme0: Abort status: 0x371
nvme nvme0: Device not ready; aborting reset
nvme nvme0: Removing after probe failure status: -19
nvme nvme0: Device not ready; aborting reset
nvme nvme0: failed to set APST feature (-19)
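For anyone who wants to spot this pattern in a saved dmesg capture, here is a rough Python sketch that counts the timeout messages per queue. The regular expression is my own guess based only on the lines in this log, so other kernel versions may word these messages differently, and `summarize_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

# The timeout lines from my dmesg capture in this post.
LOG = """\
nvme nvme0: I/O 10 QID 2 timeout, aborting
nvme nvme0: I/O 11 QID 2 timeout, aborting
nvme nvme0: I/O 12 QID 2 timeout, aborting
nvme nvme0: I/O 13 QID 2 timeout, aborting
nvme nvme0: I/O 14 QID 2 timeout, aborting
nvme nvme0: I/O 10 QID 2 timeout, reset controller
nvme nvme0: I/O 24 QID 0 timeout, reset controller
nvme nvme0: Device not ready; aborting reset
"""

# Pattern derived from the lines above; it is an assumption, not a kernel API.
TIMEOUT_RE = re.compile(
    r"nvme (\w+): I/O (\d+) QID (\d+) timeout, (aborting|reset controller)"
)


def summarize_timeouts(log_text):
    """Count timeout messages per (queue ID, action) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(int(match.group(3)), match.group(4))] += 1
    return dict(counts)


print(summarize_timeouts(LOG))
```

In my capture, this gives five aborted commands on queue 2, followed by one controller reset each on queue 2 and on queue 0.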
Coincidentally, the ArchWiki SSD NVMe page troubleshooting section was updated on Jan 8th. It states that Kingston A2000 drives with firmware S5Z42105 exhibit controller issues related to power saving.
As a workaround, add the kernel parameter nvme_core.default_ps_max_latency_us=0 to completely disable APST, or set a custom threshold to disable specific states.
I ended up disabling APST altogether for the drive. I restarted the machine, and the file transfer succeeded on the third attempt.
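After rebooting, it is worth confirming that the kernel actually picked up the parameter. The sketch below assumes the value is exposed at /sys/module/nvme_core/parameters/default_ps_max_latency_us, which is where module parameters normally appear in sysfs. I have not verified this path on every kernel, and `read_max_latency_us` is my own helper name.

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core module parameter.
PARAM_PATH = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def read_max_latency_us(path=PARAM_PATH):
    """Return the configured latency limit in µs, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


value = read_max_latency_us()
if value is None:
    print("Parameter not readable (is nvme_core loaded?)")
elif value == 0:
    print("APST is disabled")
else:
    print(f"APST latency limit: {value} us")
```

A value of 0 means APST is disabled; any other number is the latency limit in microseconds.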
Hoping this issue gets fixed in future kernel updates.
Instead of completely disabling the APST, I only disabled the 4th power state. Nonetheless, my NUC doesn't crash anymore.