====== Troubleshooting RAID ======

If you've installed RAID1 arrays, you will have done so in anticipation of hard drives failing. I encountered problems with the [[|Logical Volumes]] that sit on top of my two RAID1 arrays not being detected. The specific error messages were about a physical volume (PV) not being found, and looked like this when running ''/etc/init.d/lvm restart''...

```bash
# /etc/init.d/lvm restart
 * Setting up the Logical Volume Manager ...
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
Refusing activation of partial LV vg/pics. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/video. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/music. Use '--activationmode partial' to override.
 * Failed to setup the LVM                                        [ !! ]
 * ERROR: lvm failed to start
# /etc/init.d/mdadm restart
 * Stopping mdadm monitor ...                                     [ ok ]
 * Setting up the Logical Volume Manager ...
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
Refusing activation of partial LV vg/pics. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/video. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/music. Use '--activationmode partial' to override.
 * Failed to setup the LVM                                        [ !! ]
 * ERROR: lvm failed to start
 * Starting mdadm monitor ...                                     [ ok ]
```

===== Software =====

There's a host of programs you can use to help troubleshoot and fix hard drives; make sure they're installed first. Chances are ''sys-fs/mdadm'' will already be on your system, as you will have needed it to set up your RAID array in the first place, but you may not have ''sys-apps/smartmontools'', a suite of programs for monitoring the state of hard drives.

```bash
emerge -av sys-apps/smartmontools sys-fs/mdadm
```

===== RAID - Checking =====
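The basic health checks, as a minimal sketch (''/dev/md0'' is a placeholder here; substitute whichever array names ''/proc/mdstat'' reports on your system)...

```bash
# Overview of all arrays: member devices and whether any array is
# degraded ([UU] = both mirrors present, [U_] = one missing)
cat /proc/mdstat

# Fuller detail on a single array, including the state of each member
mdadm --detail /dev/md0

# Ask the kernel to read and compare all mirrors; progress shows up
# in /proc/mdstat
echo check > /sys/block/md0/md/sync_action
```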
===== Hard Drives - Checking =====

''smartctl'' from ''sys-apps/smartmontools'' is the tool for checking individual drives; see the SMART and Errors sections below for worked examples.

===== Catching Errors Earlier =====

I was rightly advised that I should configure ''sys-fs/mdadm'' and ''sys-apps/smartmontools'' to send emails when errors occur so that they can be addressed as early as possible. A quick search led to [[http://www.novell.com/support/kb/doc.php?id=7001034|Using mdadm to send e-mail alerts for RAID failures]].

==== RAID ====

The first step is to identify the UUIDs of the RAID arrays you are using, which is pretty straightforward...

```bash
# mdadm --examine /dev/sd* | grep -i uuid
     UUID : 1c2d7311:c6b39a77:188edc34:4aa1c004
     UUID : 1c2d7311:c6b39a77:188edc34:4aa1c004
     UUID : 1072c19c:6a50e9c5:188edc34:4aa1c004
     UUID : 1072c19c:6a50e9c5:188edc34:4aa1c004
```

Now that you know the UUIDs you can modify ''/etc/mdadm.conf''...

```bash
# When used in --follow (aka --monitor) mode, mdadm needs a
# mail address and/or a program. This can be given with "mailaddr"
# and "program" lines so that monitoring can be started using
MAILADDR your.email@ddress.com
```

That should suffice, since the ''/etc/init.d/mdadm'' init script starts ''mdadm'' with the ''--monitor'' flag.
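Before trusting this in anger, it's worth confirming that an alert mail actually arrives. ''mdadm'' can generate a test alert for every array it finds; a quick sanity check along these lines (assuming your arrays are assembled and listed via ''--scan'')...

```bash
# Send a TestMessage alert for each array to MAILADDR, check the
# arrays once, then exit rather than staying resident as a monitor
mdadm --monitor --scan --oneshot --test
```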
==== SMART ====

''sys-apps/smartmontools'' is a suite of tools that utilise the [[wp>S.M.A.R.T.|Self-Monitoring, Analysis and Reporting Technology (SMART)]] features of hard disk drives and solid-state drives to monitor the health of the drives. As with ''mdadm'' you can add an email address to the configuration file ''/etc/smartd.conf'' to ensure that you get emails when errors and problems are detected.

```bash
## Your email address; it should precede 'DEVICESCAN', since that halts reading of subsequent lines
## and tells smartd to start scanning all devices
DEFAULT -H -m your.email@ddress.com
DEVICESCAN
```

===== Errors =====

I encountered some errors on one of my RAID drives after running [[https://github.com/linuxhw/hw-probe/|hw-probe]], which [[https://linux-hardware.org/?probe=e69d1396bf|reported]]...

```bash
Error 58 [9] occurred at disk power-on lifetime: 36301 hours (1512 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 05 45 21 80 40 00  Error: UNC at LBA = 0x05452180 = 88416640
```

...for various sectors (five in total). Searching around, this turned out to be a [[https://serverfault.com/a/381027|read error]], so I initiated a full scan as the post advises.

```bash
smartctl -t long /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 402 minutes for test to complete.
Test will complete after Sun Jun  7 04:07:14 2020 BST
Use smartctl -X to abort test.
```

Having completed the long test I ran the checks again and...

```bash
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.2-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0D6M4E0
LU WWN Device Id: 5 0014ee 003dfdf3f
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 17 18:03:46 2020 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40080) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 402) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       58
  3 Spin_Up_Time            0x0027   179   176   021    Pre-fail  Always       -       6025
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1372
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37764
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1348
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       53786
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     37564         85685760

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
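The self-test log above confirms the extended test hit a read failure at LBA 85685760, and attribute 197 (Current_Pending_Sector) shows one sector waiting to be reallocated. For keeping an eye on this without wading through the full dump, ''smartctl'' can print just the relevant sections; a brief sketch (''/dev/sdc'' being the suspect drive here)...

```bash
# Overall health verdict only
smartctl -H /dev/sdc

# Vendor attributes (watch Reallocated_Sector_Ct and
# Current_Pending_Sector) plus the self-test log with failing LBAs
smartctl -A -l selftest /dev/sdc
```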
==== NVMe Drive ====

I also had some errors with my NVMe drive...

```
/dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 250GB
Serial Number:                      --
Firmware Version:                   1B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            114,235,158,528 [114 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 ...
Local Time is:                      Wed Jun  3 08:04:18 2020 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     84 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    99%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    117,513,999 [60.1 TB]
Data Units Written:                 10,229,263 [5.23 TB]
Host Read Commands:                 1,019,745,877
Host Write Commands:                262,947,061
Controller Busy Time:               3,266
Power Cycles:                       284
Power On Hours:                     7,705
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    159
Error Information Log Entries:      159
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               48 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
```

There are unsafe shutdowns and some media/data integrity errors, and I've noticed problems with this drive in the past, with the file system being repaired on rebooting. Something to keep an eye on, and a drive to replace in the not too distant future, I think.

===== Links =====

* [[https://forums.gentoo.org/viewtopic-p-7630012.html|Gentoo Forums - [SOLVED] RAID/LVM no longer detected... (2014)]]
* [[https://forums.gentoo.org/viewtopic-t-922084.html|Gentoo Forums - [SOLVED] RAID not autodetected... (2012)]]
* [[http://www.novell.com/support/kb/doc.php?id=7001034|Using mdadm to send e-mail alerts for RAID failures]]
* [[http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/|mdadm cheat sheet]]