====== Troubleshooting RAID ======

If you've installed RAID1 arrays, you will have done so in anticipation of hard drives failing. I encountered problems with the [[|Logical Volumes]] that sit on top of my two RAID1 arrays not being detected. The specific error messages were about a physical volume (PV) not being found, and looked like this when running ''/etc/init.d/lvm restart''...

```bash
# /etc/init.d/lvm restart
 * Setting up the Logical Volume Manager ...
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
Refusing activation of partial LV vg/pics. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/video. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/music. Use '--activationmode partial' to override.
 * Failed to setup the LVM                                        [ !! ]
 * ERROR: lvm failed to start
# /etc/init.d/mdadm restart
 * Stopping mdadm monitor ...                                     [ ok ]
 * Setting up the Logical Volume Manager ...
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
No device found for PV JqPNBk-noWD-H6HZ-foaW-RrbJ-92Iu-GeEvPn.
Refusing activation of partial LV vg/pics. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/video. Use '--activationmode partial' to override.
Refusing activation of partial LV vg/music. Use '--activationmode partial' to override.
 * Failed to setup the LVM                                        [ !! ]
 * ERROR: lvm failed to start
 * Starting mdadm monitor ...                                     [ ok ]
```

===== Software =====

There's a host of programs you can use to help troubleshoot and fix hard drives; make sure they're installed first. Chances are ''sys-fs/mdadm'' will already be on your system, as you will have needed it to set up your RAID array in the first place, but you may not have ''sys-apps/smartmontools'', a suite of programs for monitoring the state of hard drives.

```bash
emerge -av sys-apps/smartmontools sys-fs/mdadm
```

===== RAID - Checking =====
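The basic health checks, as a minimal sketch (''/dev/md0'' is a placeholder here; substitute whichever array names ''/proc/mdstat'' reports on your system)...

```bash
# Overview of all arrays: member devices and whether any array is
# degraded ([UU] = both mirrors present, [U_] = one missing)
cat /proc/mdstat

# Fuller detail on a single array, including the state of each member
mdadm --detail /dev/md0

# Ask the kernel to read and compare all mirrors; progress shows up
# in /proc/mdstat
echo check > /sys/block/md0/md/sync_action
```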
===== Hard Drives - Checking =====

''smartctl'' from ''sys-apps/smartmontools'' is the tool for checking individual drives; see the SMART and Errors sections below for worked examples.

===== Catching Errors Earlier =====

I was rightly advised that I should configure ''sys-fs/mdadm'' and ''sys-apps/smartmontools'' to send emails when errors occur so that they can be addressed as early as possible. A quick search led to [[http://www.novell.com/support/kb/doc.php?id=7001034|Using mdadm to send e-mail alerts for RAID failures]].

==== RAID ====

The first step is to identify the UUIDs of the RAID arrays you are using, which is pretty straightforward...

```bash
# mdadm --examine /dev/sd* | grep -i uuid
     UUID : 1c2d7311:c6b39a77:188edc34:4aa1c004
     UUID : 1c2d7311:c6b39a77:188edc34:4aa1c004
     UUID : 1072c19c:6a50e9c5:188edc34:4aa1c004
     UUID : 1072c19c:6a50e9c5:188edc34:4aa1c004
```

Now that you know the UUIDs you can modify ''/etc/mdadm.conf''...

```bash
# When used in --follow (aka --monitor) mode, mdadm needs a
# mail address and/or a program. This can be given with "mailaddr"
# and "program" lines so that monitoring can be started using
MAILADDR your.email@ddress.com
```

That should suffice, since the ''/etc/init.d/mdadm'' init script starts ''mdadm'' with the ''--monitor'' flag.
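Before trusting this in anger, it's worth confirming that an alert mail actually arrives. ''mdadm'' can generate a test alert for every array it finds; a quick sanity check along these lines (assuming your arrays are assembled and listed via ''--scan'')...

```bash
# Send a TestMessage alert for each array to MAILADDR, check the
# arrays once, then exit rather than staying resident as a monitor
mdadm --monitor --scan --oneshot --test
```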
==== SMART ====

''sys-apps/smartmontools'' is a suite of tools that utilise the [[wp>S.M.A.R.T.|Self-Monitoring, Analysis and Reporting Technology (SMART)]] features of hard disk drives and solid-state drives to monitor the health of the drives. As with ''mdadm'' you can add an email address to the configuration file ''/etc/smartd.conf'' to ensure that you get emails when errors and problems are detected.

```bash
## Your email address; it should precede 'DEVICESCAN', since that halts reading of subsequent lines
## and tells smartd to start scanning all devices
DEFAULT -H -m your.email@ddress.com
DEVICESCAN
```

===== Errors =====

I encountered some errors on one of my RAID drives after running [[https://github.com/linuxhw/hw-probe/|hw-probe]], which [[https://linux-hardware.org/?probe=e69d1396bf|reported]]...

```bash
Error 58 [9] occurred at disk power-on lifetime: 36301 hours (1512 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 05 45 21 80 40 00  Error: UNC at LBA = 0x05452180 = 88416640
```

...for various sectors (five in total). Searching around, this turned out to be a [[https://serverfault.com/a/381027|read error]], so I initiated a full scan as the post advises.

```bash
smartctl -t long /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 402 minutes for test to complete.
Test will complete after Sun Jun  7 04:07:14 2020 BST
Use smartctl -X to abort test.
```

Having completed the long test I ran the checks again and...

```bash
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.2-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0D6M4E0
LU WWN Device Id: 5 0014ee 003dfdf3f
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 17 18:03:46 2020 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40080) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 402) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       58
  3 Spin_Up_Time            0x0027   179   176   021    Pre-fail  Always       -       6025
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1372
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37764
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1348
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       53786
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     37564         85685760

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
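The self-test log above confirms the extended test hit a read failure at LBA 85685760, and attribute 197 (Current_Pending_Sector) shows one sector waiting to be reallocated. For keeping an eye on this without wading through the full dump, ''smartctl'' can print just the relevant sections; a brief sketch (''/dev/sdc'' being the suspect drive here)...

```bash
# Overall health verdict only
smartctl -H /dev/sdc

# Vendor attributes (watch Reallocated_Sector_Ct and
# Current_Pending_Sector) plus the self-test log with failing LBAs
smartctl -A -l selftest /dev/sdc
```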
==== NVMe Drive ====

I also had some errors with my NVMe drive...

```
/dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-gentoo] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 250GB
Serial Number:                      --
Firmware Version:                   1B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            114,235,158,528 [114 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 ...
Local Time is:                      Wed Jun  3 08:04:18 2020 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     84 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    99%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    117,513,999 [60.1 TB]
Data Units Written:                 10,229,263 [5.23 TB]
Host Read Commands:                 1,019,745,877
Host Write Commands:                262,947,061
Controller Busy Time:               3,266
Power Cycles:                       284
Power On Hours:                     7,705
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    159
Error Information Log Entries:      159
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               48 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
```

There are unsafe shutdowns and some media/data integrity errors, and I've noticed problems with this drive in the past, with the file system being repaired on rebooting. Something to keep an eye on, and a drive to replace in the not too distant future, I think.

===== Links =====

* [[https://forums.gentoo.org/viewtopic-p-7630012.html|Gentoo Forums - [SOLVED] RAID/LVM no longer detected... (2014)]]
* [[https://forums.gentoo.org/viewtopic-t-922084.html|Gentoo Forums - [SOLVED] RAID not autodetected... (2012)]]
* [[http://www.novell.com/support/kb/doc.php?id=7001034|Using mdadm to send e-mail alerts for RAID failures]]
* [[http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/|mdadm cheat sheet]]