Compare commits
7 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 145c819fc1 | |
| | a9ea231de0 | |
| | c2488af1c3 | |
| | ecf7a447a7 | |
| | f8e61af2f9 | |
| | ee61d986d8 | |
| | 8fe8cec09a | |
@@ -1,4 +1,4 @@
-name: Docker
+name: Docker - Nightly
 on:
   schedule:
     - cron: '36 12 * * *'
@@ -66,4 +66,4 @@ jobs:
 tags: ${{ steps.meta.outputs.tags }}
 labels: ${{ steps.meta.outputs.labels }}
 # cache-from: type=gha
-# cache-to: type=gha,mode=max
+# cache-to: type=gha,mode=max
@@ -180,6 +180,72 @@ If Scrutiny detects that an attribute corresponds with a high rate of failure us
 This can cause some confusion when comparing Scrutiny's dashboard against other SMART analysis tools.
 If you hover over the "failed" label beside an attribute, Scrutiny will tell you if the failure was due to SMART or Scrutiny/BackBlaze data.
 
+### Device failed but SMART & Scrutiny passed
+
+Device SMART results are the source of truth for Scrutiny. However, we don't only consider the current SMART results; we also take the historical analysis of a disk into account.
+This means that if a device is marked as failed at any point in its history, it will continue to be stored in the database as failed until the device is removed (or its status is reset -- see below).
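
Review note: the "sticky" behaviour described here can be pictured as a status that only ever accumulates failure flags until it is explicitly reset. The sketch below is illustrative and is not Scrutiny's actual code; the `DeviceStatus` type, the flag names, and the `Device` struct are all hypothetical.

```go
package main

import "fmt"

// DeviceStatus is a hypothetical bit-flag status; names are illustrative only.
type DeviceStatus uint8

const (
	StatusPassed         DeviceStatus = 0
	StatusFailedSmart    DeviceStatus = 1 << 0
	StatusFailedScrutiny DeviceStatus = 1 << 1
)

// Device is a hypothetical stand-in for a row in the devices table.
type Device struct {
	Status DeviceStatus
}

// Record merges the result of one collector run into the stored status.
// Flags are only ever added, so one historical failure keeps the device failed.
func (d *Device) Record(run DeviceStatus) {
	d.Status |= run
}

// Reset clears the stored status, mirroring a manual reset of device_status.
func (d *Device) Reset() {
	d.Status = StatusPassed
}

func main() {
	d := &Device{}
	d.Record(StatusFailedScrutiny)        // an older run flagged the drive
	d.Record(StatusPassed)                // later runs pass, but the flag remains
	fmt.Println(d.Status != StatusPassed) // true

	d.Reset()
	d.Record(StatusPassed)
	fmt.Println(d.Status == StatusPassed) // true
}
```

Under this model, relaxing a threshold in a new release does not clear an old flag by itself, which is why a manual reset is needed.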
+
+In some cases, this historical failure may have been due to attribute analysis/thresholds that have since been relaxed:
+
+- NVME - Numb Error Log Entries (v0.4.7)
+- ATA - Power Cycle Count (v0.4.7)
+- ATA - Read Error Rate (v0.4.13)
+- ATA - Seek Error Rate (v0.4.13)
+
+If you'd like to reset the status of a disk (to healthy) and allow the next run of the collector to determine the actual status, you can run the following commands:
+
+```bash
+# connect to scrutiny docker container
+docker exec -it scrutiny bash
+
+# install sqlite CLI tools (inside container)
+apt update && apt install -y sqlite3
+
+# connect to the scrutiny database
+sqlite3 /opt/scrutiny/config/scrutiny.db
+
+# reset/update the devices table, unset the failure status.
+# (run at the sqlite> prompt; there is no WHERE clause, so this applies to every device)
+UPDATE devices SET device_status = null;
+
+# exit sqlite CLI
+.exit
+```
+
+### Seagate Drives Failing
+
+As thoroughly discussed in [#255](https://github.com/AnalogJ/scrutiny/issues/255), Seagate (Ironwolf & others) drives are almost always marked as failed by Scrutiny.
+
+> The `Seek Error Rate` & `Read Error Rate` attribute raw values are typically very high, and the
+> normalised values (Current / Worst / Threshold) are usually quite low. Despite this, the numbers in most cases are perfectly OK.
+>
+> The anxiety arises because we intuitively expect that the normalised values should reflect a "health" score, with
+> 100 being the ideal value. Similarly, we would expect that the raw values should reflect an error count, in
+> which case a value of 0 would be most desirable. However, Seagate calculates and applies these attribute values
+> in a counterintuitive way.
+>
+> http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
+
+Analysis shows that Seagate drives break common SMART conventions, which in turn causes Scrutiny's
+comparison against BackBlaze data to flag these drives as failed.
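
Review note: as I understand the linked article, Seagate packs two counters into the 48-bit raw value of these attributes, with an error count in the upper 16 bits and an operation count in the lower 32 bits. That layout is an assumption taken from the article rather than anything this changeset documents, and `SeagateRaw` and `DecodeSeagateRaw` are hypothetical names. A minimal sketch of why a huge raw number can still mean zero errors:

```go
package main

import "fmt"

// SeagateRaw is a hypothetical view of a Seagate 48-bit raw attribute value.
type SeagateRaw struct {
	Errors     uint16 // assumed layout: upper 16 bits of the 48-bit value
	Operations uint32 // assumed layout: lower 32 bits of the 48-bit value
}

// DecodeSeagateRaw splits a raw value according to the assumed layout above.
func DecodeSeagateRaw(raw uint64) SeagateRaw {
	return SeagateRaw{
		Errors:     uint16((raw >> 32) & 0xFFFF),
		Operations: uint32(raw & 0xFFFFFFFF),
	}
}

func main() {
	// A large raw value that fits entirely in the lower 32 bits: no errors.
	a := DecodeSeagateRaw(123456789)
	fmt.Printf("errors=%d operations=%d\n", a.Errors, a.Operations)

	// Two errors over one thousand operations.
	b := DecodeSeagateRaw(2<<32 | 1000)
	fmt.Printf("errors=%d operations=%d\n", b.Errors, b.Operations)
}
```

Read as a plain decimal error count, the first value looks alarming, which matches the confusion the quote describes.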
+
+**So what's the Solution?**
+
+After taking a look at the BackBlaze data for the relevant attributes (`Seek Error Rate` & `Read Error Rate`), I've decided
+to disable Scrutiny analysis for them. Both are non-critical and have low correlation with failure.
+
+> Please note: SMART failures for these attributes will still cause the drive to be marked as failed. Only BackBlaze analysis has been disabled.
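
Review note: the distinction in this note can be sketched as two independent checks. The drive's own SMART threshold always applies, while the BackBlaze-derived check only runs when an attribute still has observed-threshold buckets. The struct below copies a subset of the `ObservedThreshold` fields from the metadata in this changeset; `Evaluate`, the `maxAFR` cut-off, the result strings, and the half-open bucket bounds are hypothetical choices for illustration, not Scrutiny's implementation.

```go
package main

import "fmt"

// ObservedThreshold holds a subset of the fields seen in the metadata diff.
type ObservedThreshold struct {
	Low, High         int64
	AnnualFailureRate float64
}

// Evaluate is a hypothetical helper that classifies one normalized attribute value.
func Evaluate(value, smartThreshold int64, buckets []ObservedThreshold, maxAFR float64) string {
	// Manufacturer check: a normalized value at or below the threshold fails.
	if value <= smartThreshold {
		return "failed (SMART)"
	}
	// BackBlaze-derived check: only applies if buckets are still defined.
	for _, b := range buckets {
		if value >= b.Low && value < b.High && b.AnnualFailureRate >= maxAFR {
			return "failed (Scrutiny)"
		}
	}
	return "passed"
}

func main() {
	// One of the Read Error Rate buckets removed in this changeset.
	old := []ObservedThreshold{{Low: 80, High: 95, AnnualFailureRate: 0.8879749768303985}}

	fmt.Println(Evaluate(83, 6, old, 0.10)) // failed (Scrutiny)
	fmt.Println(Evaluate(83, 6, nil, 0.10)) // passed
	fmt.Println(Evaluate(5, 6, nil, 0.10))  // failed (SMART)
}
```

The values 83 and 6 are illustrative numbers only. With the buckets removed, the same reading passes, but a genuine SMART threshold breach is still reported.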
+
+If this is affecting your drives, you'll need to do the following:
+
+1. Upgrade to v0.4.13+
+2. Reset your drive status using the SQLite script in [#device-failed-but-smart--scrutiny-passed](https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#device-failed-but-smart--scrutiny-passed)
+3. Wait for (or manually start) the collector.
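
Review note: for step 1, a version string has to be compared numerically, since a plain string comparison would rank `0.4.9` above `0.4.13`. The helper below is a hypothetical sketch that assumes a three-part `MAJOR.MINOR.PATCH` version with an optional `v` prefix; it is not part of Scrutiny.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a "v0.4.13"-style string into three numeric parts.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// AtLeast reports whether version have is greater than or equal to want.
func AtLeast(have, want string) (bool, error) {
	h, err := parseVersion(have)
	if err != nil {
		return false, err
	}
	w, err := parseVersion(want)
	if err != nil {
		return false, err
	}
	for i := range h {
		if h[i] != w[i] {
			return h[i] > w[i], nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"0.4.12", "0.4.13", "0.4.9"} {
		ok, err := AtLeast(v, "0.4.13")
		fmt.Println(v, ok, err)
	}
}
```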
+
+If you'd like to learn more about how the Seagate Ironwolf SMART attributes work under the hood, and how they differ from
+other drives, please read the following:
+
+- http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
+- https://www.truenas.com/community/threads/seagate-ironwolf-smart-test-raw_read_error_rate-seek_error_rate.68634/
 
 ## Hub & Spoke model, with multiple Hosts.
 
@@ -14,12 +14,6 @@ is almost immediately created (and tagged with `latest`)
 
 So changing from `master-omnibus -> latest` will be the same thing for all intents and purposes.
 
-Having said that -- the one key difference is the `automated cron builds` that run on the `master` and `beta` branches.
-They trigger a `nightly` build, even if nothing has changed on the branch. This has a couple of benefits, but one is to
-ensure that there's no broken external dependencies in our (unchanged) code.
-
-However, as everyone unfortunately found out recently, I had an error in my CI script, which caused failures to be
-ignored -- https://github.com/AnalogJ/scrutiny/issues/287. That has since been fixed.
-
-Hope that gives you an understanding for how everything is wired up.
-
+> NOTE: Previously, there was an `automated cron build` that ran on the `master` and `beta` branches.
+It used to trigger a `nightly` build, even if nothing had changed on the branch. This had a couple of benefits, one of which was to
+ensure that there were no broken external dependencies in our (unchanged) code. This `nightly` build no longer updates the `master-omnibus` tag.
@@ -36,56 +36,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
 		Ideal:       ObservedThresholdIdealLow,
 		Critical:    false,
 		Description: "(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
-		ObservedThresholds: []ObservedThreshold{
-			{
-				Low:               80,
-				High:              95,
-				AnnualFailureRate: 0.8879749768303985,
-				ErrorInterval:     []float64{0.682344353388663, 1.136105732920724},
-			},
-			{
-				Low:               95,
-				High:              110,
-				AnnualFailureRate: 0.034155719633986996,
-				ErrorInterval:     []float64{0.030188482024981093, 0.038499386872354435},
-			},
-			{
-				Low:               110,
-				High:              125,
-				AnnualFailureRate: 0.06390002135229157,
-				ErrorInterval:     []float64{0.05852004676110847, 0.06964160930553712},
-			},
-			{
-				Low:               125,
-				High:              140,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               140,
-				High:              155,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               155,
-				High:              170,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               170,
-				High:              185,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               185,
-				High:              200,
-				AnnualFailureRate: 0.044823775021490854,
-				ErrorInterval:     []float64{0.032022762038723306, 0.06103725943096589},
-			},
-		},
 	},
 	2: {
 		ID:          2,
@@ -290,56 +240,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
 		Ideal:       "",
 		Critical:    false,
 		Description: "(Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
-		ObservedThresholds: []ObservedThreshold{
-			{
-				Low:               58,
-				High:              76,
-				AnnualFailureRate: 0.2040131025936549,
-				ErrorInterval:     []float64{0.17032852883286412, 0.2424096283327138},
-			},
-			{
-				Low:               76,
-				High:              94,
-				AnnualFailureRate: 0.08725919610118257,
-				ErrorInterval:     []float64{0.08077138510999876, 0.09412943212007528},
-			},
-			{
-				Low:               94,
-				High:              112,
-				AnnualFailureRate: 0.01087335627722523,
-				ErrorInterval:     []float64{0.008732197944943352, 0.013380600544561905},
-			},
-			{
-				Low:               112,
-				High:              130,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               130,
-				High:              148,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               148,
-				High:              166,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               166,
-				High:              184,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               184,
-				High:              202,
-				AnnualFailureRate: 0.05316285755900475,
-				ErrorInterval:     []float64{0.03370069132942804, 0.07977038905848267},
-			},
-		},
 	},
 	8: {
 		ID:          8,
@@ -2,4 +2,4 @@ package version
 
 // VERSION is the app-global version string, which will be replaced with a
 // new value during packaging
-const VERSION = "0.4.12"
+const VERSION = "0.4.13"