Compare commits

7 Commits

Author SHA1 Message Date
packagrio-bot 145c819fc1 (v0.4.13) Automated packaging of release by Packagr 2022-06-14 14:42:54 +00:00
Jason Kulatunga a9ea231de0 Merge pull request #301 from AnalogJ/disable_seek_read_error_rates 2022-06-14 07:33:45 -07:00
Jason Kulatunga c2488af1c3 Disable Seek & Read error rate attribute analysis. Causes issues with Seagate Ironwolf drives.
Added documentation.
2022-06-14 07:32:33 -07:00
Jason Kulatunga ecf7a447a7 Disable Seek & Read error rate attribute analysis. Causes issues with Seagate Ironwolf drives.
Added documentation.
2022-06-14 07:29:23 -07:00
Jason Kulatunga f8e61af2f9 adding docs. 2022-06-13 08:55:27 -07:00
Jason Kulatunga ee61d986d8 Update docker-nightly.yaml 2022-06-12 10:13:10 -07:00
Jason Kulatunga 8fe8cec09a Update TROUBLESHOOTING_DOCKER.md 2022-06-12 10:09:25 -07:00
5 changed files with 72 additions and 112 deletions
+2 -2
@@ -1,4 +1,4 @@
name: Docker
name: Docker - Nightly
on:
schedule:
- cron: '36 12 * * *'
@@ -66,4 +66,4 @@ jobs:
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
# cache-from: type=gha
# cache-to: type=gha,mode=max
# cache-to: type=gha,mode=max
+66
@@ -180,6 +180,72 @@ If Scrutiny detects that an attribute corresponds with a high rate of failure us
This can cause some confusion when comparing Scrutiny's dashboard against other SMART analysis tools.
If you hover over the "failed" label beside an attribute, Scrutiny will tell you if the failure was due to SMART or Scrutiny/BackBlaze data.
### Device failed but SMART & Scrutiny passed
Device SMART results are the source of truth for Scrutiny. However, Scrutiny takes into account not only the current SMART results but also the historical analysis of a disk.
This means that if a device is marked as failed at any point in its history, it will continue to be stored in the database as failed until the device is removed (or status is reset -- see below).
In some cases, this historical failure may have been due to attribute analysis/thresholds that have since been relaxed:
- NVME - Number of Error Log Entries (v0.4.7)
- ATA - Power Cycle Count (v0.4.7)
- ATA - Read Error Rate (v0.4.13)
- ATA - Seek Error Rate (v0.4.13)
If you'd like to reset the status of a disk (to healthy) and allow the next run of the collector to determine the actual status, you can run the following commands:
```bash
# connect to scrutiny docker container
docker exec -it scrutiny bash
# install sqlite CLI tools (inside container)
apt update && apt install -y sqlite3
# connect to the scrutiny database
sqlite3 /opt/scrutiny/config/scrutiny.db
# reset the failure status of every row in the devices table (the trailing semicolon is required)
UPDATE devices SET device_status = null;
# exit sqlite CLI
.exit
```
### Seagate Drives Failing
As thoroughly discussed in [#255](https://github.com/AnalogJ/scrutiny/issues/255), Seagate (Ironwolf & others) drives are almost always marked as failed by Scrutiny.
> The `Seek Error Rate` & `Read Error Rate` attribute raw values are typically very high, and the
> normalised values (Current / Worst / Threshold) are usually quite low. Despite this, the numbers in most cases are perfectly OK.
>
> The anxiety arises because we intuitively expect that the normalised values should reflect a "health" score, with
> 100 being the ideal value. Similarly, we would expect that the raw values should reflect an error count, in
> which case a value of 0 would be most desirable. However, Seagate calculates and applies these attribute values
> in a counterintuitive way.
>
> http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
The analysis linked above shows that Seagate drives break common SMART conventions, which in turn causes Scrutiny's
comparison against BackBlaze data to flag these drives as failed.
**So what's the solution?**
After taking a look at the BackBlaze data for the relevant attributes (`Seek Error Rate` & `Read Error Rate`), I've decided
to disable Scrutiny analysis for them. Both are non-critical and have a low correlation with failure.
> Please note: SMART failures for these attributes will still cause the drive to be marked as failed. Only BackBlaze analysis has been disabled.
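
To illustrate why removing the observed thresholds stops these attributes from being flagged, here is a minimal Go sketch. The `ObservedThreshold` field names and the two bucket values come from the Read Error Rate metadata removed in v0.4.13. The `lookupFailureRate` and `failedByObservedData` helpers, the inclusive/exclusive bucket bounds, the 10% cutoff and the example normalized value of 83 are assumptions made for illustration, not Scrutiny's actual implementation.

```go
package main

import "fmt"

// ObservedThreshold mirrors the field names seen in the removed metadata.
// The ErrorInterval field is omitted for brevity.
type ObservedThreshold struct {
	Low               int64
	High              int64
	AnnualFailureRate float64
}

// lookupFailureRate is a hypothetical helper. It returns the annual failure
// rate of the bucket that contains value (Low inclusive, High exclusive, which
// is an assumption) and false when no bucket matches.
func lookupFailureRate(buckets []ObservedThreshold, value int64) (float64, bool) {
	for _, b := range buckets {
		if value >= b.Low && value < b.High {
			return b.AnnualFailureRate, true
		}
	}
	return 0, false
}

// failedByObservedData is a hypothetical helper. It reports whether the
// bucket-based analysis alone would flag the value, given an assumed cutoff.
func failedByObservedData(buckets []ObservedThreshold, value int64, cutoff float64) bool {
	rate, ok := lookupFailureRate(buckets, value)
	return ok && rate >= cutoff
}

func main() {
	// Two of the Read Error Rate buckets that were removed in v0.4.13.
	before := []ObservedThreshold{
		{Low: 80, High: 95, AnnualFailureRate: 0.8879749768303985},
		{Low: 95, High: 110, AnnualFailureRate: 0.034155719633986996},
	}
	// After the change the attribute has no buckets at all.
	var after []ObservedThreshold

	const cutoff = 0.10   // assumed failure cutoff, for illustration only
	const normalized = 83 // assumed example of a normalized attribute value

	fmt.Println(failedByObservedData(before, normalized, cutoff)) // true
	fmt.Println(failedByObservedData(after, normalized, cutoff))  // false
}
```

With no buckets to match, the BackBlaze-based check in this sketch never fires, so only a failure reported by SMART itself can mark the attribute as failed.
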
If this is affecting your drives, you'll need to do the following:
1. Upgrade to v0.4.13+
2. Reset your drive status using the SQLite script in [#device-failed-but-smart--scrutiny-passed](https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#device-failed-but-smart--scrutiny-passed)
3. Wait for (or manually start) the collector.
If you'd like to learn more about how the Seagate Ironwolf SMART attributes work under the hood, and how they differ from
other drives, please read the following:
- http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
- https://www.truenas.com/community/threads/seagate-ironwolf-smart-test-raw_read_error_rate-seek_error_rate.68634/
## Hub & Spoke model, with multiple Hosts.
+3 -9
@@ -14,12 +14,6 @@ is almost immediately created (and tagged with `latest`)
So changing from `master-omnibus -> latest` will be the same thing for all intents and purposes.
Having said that -- the one key difference is the `automated cron builds` that run on the `master` and `beta` branches.
They trigger a `nightly` build, even if nothing has changed on the branch. This has a couple of benefits, but one is to
ensure that there's no broken external dependencies in our (unchanged) code.
However, as everyone unfortunately found out recently, I had an error in my CI script, which caused failures to be
ignored -- https://github.com/AnalogJ/scrutiny/issues/287. That has since been fixed.
Hope that gives you an understanding for how everything is wired up.
> NOTE: Previously, there was an `automated cron build` that ran on the `master` and `beta` branches.
It used to trigger a `nightly` build, even if nothing had changed on the branch. This had a couple of benefits, one of which was to
ensure that there were no broken external dependencies in our (unchanged) code. This `nightly` build no longer updates the `master-omnibus` tag.
@@ -36,56 +36,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
Ideal: ObservedThresholdIdealLow,
Critical: false,
Description: "(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
ObservedThresholds: []ObservedThreshold{
{
Low: 80,
High: 95,
AnnualFailureRate: 0.8879749768303985,
ErrorInterval: []float64{0.682344353388663, 1.136105732920724},
},
{
Low: 95,
High: 110,
AnnualFailureRate: 0.034155719633986996,
ErrorInterval: []float64{0.030188482024981093, 0.038499386872354435},
},
{
Low: 110,
High: 125,
AnnualFailureRate: 0.06390002135229157,
ErrorInterval: []float64{0.05852004676110847, 0.06964160930553712},
},
{
Low: 125,
High: 140,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 140,
High: 155,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 155,
High: 170,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 170,
High: 185,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 185,
High: 200,
AnnualFailureRate: 0.044823775021490854,
ErrorInterval: []float64{0.032022762038723306, 0.06103725943096589},
},
},
},
2: {
ID: 2,
@@ -290,56 +240,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
Ideal: "",
Critical: false,
Description: "(Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
ObservedThresholds: []ObservedThreshold{
{
Low: 58,
High: 76,
AnnualFailureRate: 0.2040131025936549,
ErrorInterval: []float64{0.17032852883286412, 0.2424096283327138},
},
{
Low: 76,
High: 94,
AnnualFailureRate: 0.08725919610118257,
ErrorInterval: []float64{0.08077138510999876, 0.09412943212007528},
},
{
Low: 94,
High: 112,
AnnualFailureRate: 0.01087335627722523,
ErrorInterval: []float64{0.008732197944943352, 0.013380600544561905},
},
{
Low: 112,
High: 130,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 130,
High: 148,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 148,
High: 166,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 166,
High: 184,
AnnualFailureRate: 0,
ErrorInterval: []float64{0, 0},
},
{
Low: 184,
High: 202,
AnnualFailureRate: 0.05316285755900475,
ErrorInterval: []float64{0.03370069132942804, 0.07977038905848267},
},
},
},
8: {
ID: 8,
+1 -1
@@ -2,4 +2,4 @@ package version
// VERSION is the app-global version string, which will be replaced with a
// new value during packaging
const VERSION = "0.4.12"
const VERSION = "0.4.13"