How do I check system health?

I want a report on my system health, so that I know that all my hardware components (CPU, memory, disks...) are functioning as expected. It would be easiest to read if the report listed only the problems found (if any). Is there a system tool that does this?



Related notes:



  • I know that the disk utility can report SMART results for my disk. I'd like something similar for all my other components.


  • Raw diagnostic tools and benchmarks aren't suitable. Diagnostic tools list component details, but not their health. Benchmarks only sometimes highlight health issues. I am only interested in direct health reports.

  • I am aware of an equivalent tool that performs this function in Windows (reports if a hardware component is failing), but I've forgotten the name :P I'd basically like an equivalent of this.






      asked May 21 at 12:59









      d3vid

          1 Answer
          Electronics generally either work fully or not at all. Mechanical devices such as hard drives do give indications of impending failure, through the SMART reporting you already know about.



          Fans



          Fans do show signs of impending failure, but detecting them relies on your hearing: listen for oscillating speeds, squealing bearings, etc.



          CPU



          Another potential indicator of a degrading fan is CPU heat level. On a laptop, a high CPU temperature can mean the fan exhaust vents are clogged or the fan RPM is too low. It could also mean the CPU / motherboard needs a dust cleaning with compressed air (don't use your breath, which contains moisture). It could also mean your CPU heat sink needs to be reseated with new thermal paste.
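One way to put a number on "CPU heat level" is to read the kernel's thermal zones. The following is a minimal sketch, not a definitive tool. It assumes the kernel exposes readings under /sys/class/thermal/thermal_zone*/temp in millidegrees Celsius. The function names and the 85 C default limit are illustrative assumptions; check your CPU's specification for a sensible limit.

```shell
#!/bin/sh
# Sketch: flag thermal zones at or above a limit.
# Assumption: readings are in millidegrees Celsius (sysfs thermal interface).
# Assumption: 85 C is only an illustrative default, not a vendor limit.

# Convert a millidegree reading to whole degrees Celsius.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print "HOT <n>C" or "ok <n>C" for one reading and an optional limit.
temp_status() {
    status_reading_c=$(millideg_to_c "$1")
    status_limit_c=${2:-85}
    if [ "$status_reading_c" -ge "$status_limit_c" ]; then
        echo "HOT ${status_reading_c}C"
    else
        echo "ok ${status_reading_c}C"
    fi
}

# Walk every thermal zone that exists and is readable on this machine.
check_thermal_zones() {
    limit_c=${1:-85}
    for zone_file in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$zone_file" ] || continue
        reading=$(cat "$zone_file" 2>/dev/null) || continue
        printf '%s: %s\n' "${zone_file%/temp}" "$(temp_status "$reading" "$limit_c")"
    done
}
```

Calling check_thermal_zones 80 prints one line per readable zone; piping that through grep HOT leaves only the zones over the limit.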



          RAM



          If your machine locks up and displays a bad memory error, you can test your RAM by following these instructions: How to check for errors in RAM via linux?



          If the RAM checker finds a bad memory block, you can blacklist it using these instructions: Is there a way of limiting the Kernel's memory manager to use only 75% of memory?



          NVMe PCIe M.2 Gen 3.0 x 4 (or 2) SSD



          If you have an SSD, its life span is typically rated in terabytes written (TBW). Your SMART utility already measures SSD life, but not for NVMe SSDs. For that you need nvme-cli. To install it use:
          sudo apt install nvme-cli



          Next, gather the information available from the SSD:



          $ sudo nvme smart-log /dev/nvme0
          Smart Log for NVME device:nvme0 namespace-id:ffffffff
          critical_warning : 0
          temperature : 35 C
          available_spare : 100%
          available_spare_threshold : 10%
          percentage_used : 0%
          data_units_read : 9,328,609
          data_units_written : 5,383,685
          host_read_commands : 169,669,400
          host_write_commands : 51,959,850
          controller_busy_time : 387
          power_cycles : 568
          power_on_hours : 401
          unsafe_shutdowns : 77
          media_errors : 0
          num_err_log_entries : 216
          Warning Temperature Time : 0
          Critical Composite Temperature Time : 0
          Temperature Sensor 1 : 35 C
          Temperature Sensor 2 : 41 C
          Temperature Sensor 3 : 0 C
          Temperature Sensor 4 : 0 C
          Temperature Sensor 5 : 0 C
          Temperature Sensor 6 : 0 C
          Temperature Sensor 7 : 0 C
          Temperature Sensor 8 : 0 C
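Since the question asks for a report that lists only problems, the log above can be filtered down to the fields that suggest trouble. The following is a minimal sketch, assuming the "name : value" layout shown above. The function name nvme_problems and the choice of which fields count as problems are assumptions made for illustration, not part of nvme-cli.

```shell
#!/bin/sh
# Sketch: read `nvme smart-log` text on stdin, print only suspicious fields.
# Assumption: each line looks like "field_name : value", as in the log above.
# Assumption: the four checks below are an illustrative selection.
nvme_problems() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the name
            gsub(/[^0-9]/, "", val)            # drop spaces, "%", commas, units
            if (val == "") next
            field[key] = val + 0
        }
        END {
            if (field["critical_warning"] != 0)
                print "critical_warning is " field["critical_warning"]
            if (field["media_errors"] != 0)
                print "media_errors is " field["media_errors"]
            if ("available_spare" in field &&
                "available_spare_threshold" in field &&
                field["available_spare"] <= field["available_spare_threshold"])
                print "available_spare at or below threshold"
            if (field["percentage_used"] >= 100)
                print "percentage_used is " field["percentage_used"] "%"
        }'
}
```

Used as sudo nvme smart-log /dev/nvme0 | nvme_problems, it prints nothing when none of these checks trip, which is the "only the problems found" behaviour the question asks for.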


          The most important field is percentage_used, which shows as 0%. This isn't the percentage of disk space used but the percentage of the drive's rated life used. I've had this drive since October 2017 and now it's May 2018. As soon as percentage_used hits 1%, I can multiply the number of months I've owned it by 100 to estimate when it will die. Reportedly, though, drives typically live longer than that.
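The rule of thumb above generalises to any reading: projected total life is the months owned so far, times 100, divided by the percentage used. The following is a minimal sketch of that arithmetic. The function name projected_life_months is made up for illustration, and the result is only as precise as the whole-number percentage the drive reports.

```shell
#!/bin/sh
# Sketch: linear extrapolation of drive life from the life-used percentage.
# Arguments: months owned so far, percentage used (whole number).
# Assumption: wear continues at the same average rate as so far.
projected_life_months() {
    months_owned=$1
    percentage_used=$2
    if [ "$percentage_used" -le 0 ]; then
        echo "unknown"   # nothing to extrapolate from while the reading is 0%
        return 0
    fi
    echo $(( months_owned * 100 / percentage_used ))
}
```

For the drive above, October 2017 to May 2018 is 7 months, so projected_life_months 7 1 would give 700 months if the reading were 1% today. While it still reads 0%, the function reports unknown.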



          System Monitor on desktop with conky



          Many people like to show their system status (and health) on a portion of their desktop. I like to keep my Conky running on the right 20% of my primary monitor:



          Conky all.gif



          Note: The 97% CPU usage on a single CPU is caused by the screen recorder itself.



          To learn more about conky and CPU usage see: How do I stress test CPU and RAM (at the same time)?






          • this is a comprehensive answer, thanks! how do I know that the CPU is running too hot? I'm also aware of the potential for RAM and CPU failures, which might manifest as intermittent crashing, how would I check for these?
            – d3vid
            May 21 at 13:50










          • @d3vid I added sections for RAM checking, CPU stress testing and overall system monitoring using conky.
            – WinEunuuchs2Unix
            May 21 at 14:09











          edited May 21 at 14:08

























          answered May 21 at 13:37









          WinEunuuchs2Unix
