Unraid Server High CPU + crash (nvme?)

Unraid Version : 4.143.0

Hello, my PMS is configured as follows:

  1. Intel NUC 11 Pro:
  • 1 NVMe cache for all shares (Samsung 990 Pro 2 TB)
  • Array: 1 empty USB stick
  2. SMB shares from my Synology server for media

The only reason I use Unraid is that I love the interface; 90% of its use is Plex Media Server.

My problem:

Sometimes my NUC hits full CPU usage and I can’t do anything; I have to use the power button to force it off.

When I run the “top” command, I see 4 processes with high CPU usage:
z_wr_int_1
z_wr_int_0
z_wr_iss
kthreadd

Here are the logs:

Jan  8 10:21:54 TowerNew kernel: nvme nvme0: I/O 189 (I/O Cmd) QID 5 timeout, aborting
Jan  8 10:22:24 TowerNew kernel: nvme nvme0: I/O 189 QID 5 timeout, reset controller
Jan  8 10:22:33 TowerNew kernel: nvme nvme0: I/O 17 QID 0 timeout, reset controller
Jan  8 10:23:47 TowerNew kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan  8 10:23:47 TowerNew kernel: nvme nvme0: Abort status: 0x371
Jan  8 10:24:07 TowerNew kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan  8 10:24:07 TowerNew kernel: nvme nvme0: Removing after probe failure status: -19
Jan  8 10:24:28 TowerNew kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan  8 10:24:28 TowerNew kernel: nvme0n1: detected capacity change from 3907029168 to 0
Jan  8 10:24:28 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=1 offset=399035248640 size=4096 flags=180880
Jan  8 10:24:28 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=2 offset=584140775424 size=12288 flags=180880
Jan  8 10:24:28 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.
Jan  8 10:24:28 TowerNew kernel: 
Jan  8 10:24:29 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.
Jan  8 10:24:29 TowerNew kernel: 
Jan  8 10:35:12 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.
Jan  8 10:35:12 TowerNew kernel: 
Jan  8 11:36:45 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.

Thank you for your help.

A common cause of NVMe failure is the firmware.
Did you update the firmware before putting the drive into service?
If not, the device will fail at about the 2 TBW mark.

Thank you for the reply.
My firmware is already up to date.

Do you have a way to get the TBW for the SSD?

If not, here is a script that does a reasonable job in most cases. Sometimes it doesn’t work at all (it depends on the SSD type).

[chuck@glockner bin.2005]$ cat GetTBW 
#!/bin/bash

#######################################
# Variables                           #
#######################################

SSD_DEVICE="/dev/sda"

ON_TIME_TAG="Power_On_Hours"
WEAR_COUNT_TAG="Wear_Leveling_Count"
LBAS_WRITTEN_TAG="Total_LBAs_Written"
LBA_SIZE=512 # Value in bytes

BYTES_PER_MB=1048576
BYTES_PER_GB=1073741824
BYTES_PER_TB=1099511627776

#######################################
# Get total data written...           #
#######################################

# use given SSD
[ "$1" != "" ] && SSD_DEVICE="$1"

# Get SMART attributes
SMART_INFO=$(sudo /usr/sbin/smartctl -A "$SSD_DEVICE")

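# Extract required attributes
ON_TIME=$(echo "$SMART_INFO" | grep "$ON_TIME_TAG" | awk '{print $10}')
WEAR_COUNT=$(echo "$SMART_INFO" | grep "$WEAR_COUNT_TAG" | awk '{print $4}' | sed 's/^0*//')
LBAS_WRITTEN=$(echo "$SMART_INFO" | grep "$LBAS_WRITTEN_TAG" | awk '{print $10}')

# Stop early if the drive does not report these attributes (depends on the SSD type)
if [ -z "$ON_TIME" ] || [ -z "$LBAS_WRITTEN" ]; then
    echo "Required SMART attributes not found on $SSD_DEVICE" >&2
    exit 1
fi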

# Convert LBAs -> bytes
BYTES_WRITTEN=$(echo "$LBAS_WRITTEN * $LBA_SIZE" | bc)
MB_WRITTEN=$(echo "scale=3; $BYTES_WRITTEN / $BYTES_PER_MB" | bc)
GB_WRITTEN=$(echo "scale=3; $BYTES_WRITTEN / $BYTES_PER_GB" | bc)
TB_WRITTEN=$(echo "scale=3; $BYTES_WRITTEN / $BYTES_PER_TB" | bc)

# Output results...
echo "------------------------------"
echo " SSD Status:   $SSD_DEVICE"
echo "------------------------------"
echo " On time:      $(echo $ON_TIME | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta') hr"
echo "------------------------------"
echo " Data written:"
echo "           MB: $(echo $MB_WRITTEN | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"
echo "           GB: $(echo $GB_WRITTEN | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"
echo "           TB: $(echo $TB_WRITTEN | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"
echo "------------------------------"
echo " Mean write rate:"
echo "        MB/hr: $(echo "scale=3; $MB_WRITTEN / $ON_TIME" | bc | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"
echo "------------------------------"
echo " Drive health: ${WEAR_COUNT} %"
echo "------------------------------"
[chuck@glockner bin.2006]$

Looks like this:

[chuck@glockner ~.2001]$ GetTBW /dev/sdh
------------------------------
 SSD Status:   /dev/sdh
------------------------------
 On time:      45,713 hr
------------------------------
 Data written:
           MB: 21,274,444.369
           GB: 20,775.824
           TB: 20.288
------------------------------
 Mean write rate:
        MB/hr: 465.391
------------------------------
 Drive health:  %
------------------------------
[chuck@glockner ~.2002]$ 
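
The script above reads SATA-style attributes (Total_LBAs_Written), which NVMe drives such as the 990 Pro in this thread normally do not report. For NVMe, smartctl prints a "Data Units Written" line instead, where one data unit is 1000 × 512 bytes. Below is a rough sketch of the same calculation for that case. The function name nvme_tb_written and the device name in the usage comment are made up for illustration, and it reports decimal TB (10^12 bytes) rather than the 1024-based units used above, so treat it as a starting point only.

```shell
#!/bin/bash
# Sketch: convert the NVMe "Data Units Written" counter to terabytes.
# Assumption: smartctl prints a line such as
#   Data Units Written:                 19,531,250 [10.0 TB]
# One NVMe data unit is 1000 * 512 bytes = 512,000 bytes.

BYTES_PER_DATA_UNIT=512000
BYTES_PER_TB=1000000000000   # decimal TB, as used for drive TBW ratings

# Reads "smartctl -A" output on stdin, prints TB written with 3 decimals.
# Returns non-zero if no "Data Units Written" line was found.
nvme_tb_written() {
    awk -F: -v unit="$BYTES_PER_DATA_UNIT" -v tb="$BYTES_PER_TB" '
        /^Data Units Written/ {
            value = $2
            sub(/\[.*/, "", value)      # drop the bracketed human-readable size
            gsub(/[^0-9]/, "", value)   # drop spaces and thousands separators
            printf "%.3f\n", value * unit / tb
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage (needs root and smartmontools; the device name is an example):
#   smartctl -A /dev/nvme0 | nvme_tb_written
```

The non-zero return code makes it easy to tell "attribute not reported" apart from "zero bytes written".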

Hello, no, my SSD is at about 10 TBW.

Another crash last night…

Jan 15 03:43:55 TowerNew kernel: nvme nvme0: I/O 873 (I/O Cmd) QID 1 timeout, aborting
Jan 15 03:43:55 TowerNew kernel: nvme nvme0: I/O 429 (I/O Cmd) QID 7 timeout, aborting
Jan 15 03:44:26 TowerNew kernel: nvme nvme0: I/O 873 QID 1 timeout, reset controller
Jan 15 03:45:49 TowerNew kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan 15 03:45:49 TowerNew kernel: nvme nvme0: Abort status: 0x371
Jan 15 03:45:49 TowerNew kernel: nvme nvme0: Abort status: 0x371
Jan 15 03:46:09 TowerNew kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jan 15 03:46:09 TowerNew kernel: nvme nvme0: Disabling device after reset failure: -19
Jan 15 03:46:09 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=2 offset=366586630144 size=131072 flags=1589376
Jan 15 03:46:09 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=2 offset=1735682654208 size=4096 flags=1589376
Jan 15 03:46:09 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=2 offset=1742517813248 size=131072 flags=1589376
Jan 15 03:46:09 TowerNew kernel: zio pool=cache vdev=/dev/nvme0n1p1 error=5 type=2 offset=584134623232 size=12288 flags=1572992
Jan 15 04:15:04 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.
Jan 15 04:15:04 TowerNew kernel: 
Jan 15 04:17:17 TowerNew kernel: WARNING: Pool 'cache' has encountered an uncorrectable I/O failure and has been suspended.
Jan 15 04:17:17 TowerNew kernel: 
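
To see how often the controller drops out, the two log excerpts in this thread can be summarised per day. This is only a sketch: the pattern is built from the messages quoted above, the function name nvme_events_per_day is invented here, and the syslog path in the usage comment is an assumption.

```shell
#!/bin/bash
# Sketch: count NVMe controller trouble per day in a syslog.
# Assumption: log lines look like the excerpts in this thread
# ("<Mon> <day> <time> <host> kernel: nvme nvme0: ...").

NVME_PATTERN='kernel: nvme nvme[0-9]+: .*(timeout|not ready|probe failure|reset failure)'

# Reads syslog text on stdin, prints "<Mon> <day> <count>" per day,
# in the order the days first appear.
nvme_events_per_day() {
    grep -E "$NVME_PATTERN" |
        awk '{ key = $1 " " $2
               if (!(key in count)) order[++n] = key
               count[key]++ }
             END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Usage (the path is an assumption; adjust to where your syslog lives):
#   nvme_events_per_day < /var/log/syslog
```

"Abort status" lines are deliberately not counted, since they only follow a timeout that has already been counted.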

That is either a hardware failure or a software failure to drive the hardware correctly (i.e. a bug or hardware that is not correctly supported).

I think I solved the problem (2+ weeks without an issue):

On the Main GUI page, click on the flash drive, scroll down to “Syslinux Configuration”, make sure it’s set to “menu view” (top right), and add this to your default boot option, after “append initrd=/bzroot”:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Source: Seems one of my NVMe drives threw up on itself overnight. Help? (Diagnostics attached) - General Support - Unraid
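
For anyone applying the same fix: nvme_core.default_ps_max_latency_us=0 turns off the NVMe drive's autonomous power state transitions, and pcie_aspm=off turns off PCIe Active State Power Management. After a reboot it is worth confirming that both parameters actually reached the kernel. The helper below is a small sketch for that; the name has_boot_params is made up here, and it only checks the command line string, not whether the drive behaves.

```shell
#!/bin/bash
# Sketch: check whether the two kernel parameters are on the command line.
# has_boot_params takes a kernel command line string as its argument;
# on a running system that string comes from /proc/cmdline.

REQUIRED_PARAMS="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

has_boot_params() {
    local cmdline="$1" param missing=0
    for param in $REQUIRED_PARAMS; do
        # Pad with spaces so only whole parameters match
        case " $cmdline " in
            *" $param "*) echo "ok:      $param" ;;
            *)            echo "missing: $param"; missing=1 ;;
        esac
    done
    return "$missing"
}

# Usage on the running server:
#   has_boot_params "$(cat /proc/cmdline)"
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should read 0
```

If either parameter shows as missing, the edit probably went into a boot entry other than the default one.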
