Server Crashing Intermittently


#1

My Hetzner server has been crashing intermittently for the last two or three months and I can’t trace the reason for it. After every crash I check /var/log/syslog and the apache2 logs and find nothing, just gaps in the logs for the time when it was down. Luckily, Hetzner has a great automatic server reset function which works quite well, so the server is up and running within 5 mins of doing the reset.

The most recent crash was in the early hours of this morning. I rebuilt couchpotato last night, which caused the server to be flooded with a ton of movie downloads from the backlog it had. This meant that rutorrent was super busy DL’ing movies, CP was renaming heavily and ACD_CLI was working extra hard to get them uploaded. Right before I finished up with the server, I checked the load and everything was running smoothly except the hard drives were taking some strain.

What can I do to diagnose the crashes? It could be anything, hard drives might be going or some app might be causing the crash. What do you guys think?

These are my stats:

OS: Ubuntu 16.04
QuickBox Version: 2.48


#2

I would certainly look into hardware at this point. If it was an application on your server causing the issue then you would almost certainly see some trace of the cause in the error logs. I would suggest running some SMART tests on your drives to see if you possibly have some issues there. Additionally, this could be something happening behind the server… that would be best left for Hetzner to check into.

Outside of that, not a whole lot to suggest as it could be anything. I would start with hardware and move out from there.
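
For anyone following along, here is a rough sketch of kicking off short SMART self-tests. It assumes smartmontools is installed and that the drives are /dev/sda and /dev/sdb. The start_short_tests helper and the RUN variable are made up for this example and are not part of smartmontools. By default it only prints the commands so you can check them first.

```shell
# Hypothetical helper: start a SMART short self-test on each given device.
# RUN is an assumed switch for this sketch: when it is unset the commands
# are only echoed; with RUN=sudo they are executed as root.
start_short_tests() {
    for dev in "$@"; do
        ${RUN:-echo} smartctl -t short "$dev"
    done
}
```

Running `start_short_tests /dev/sda /dev/sdb` prints the two smartctl commands, and `RUN=sudo start_short_tests /dev/sda /dev/sdb` actually starts the tests. smartctl reports an estimated completion time, after which `sudo smartctl -l selftest /dev/sda` shows the result.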


#3

I'd think power/heat
maybe malware…
but most of the time malware will keep you running… so unlikely


#4

Thanks for the suggestions. I’ll try that.


#5

I did some short SMART tests on the two drives, this is the result:

[[email protected]]:(33.4Mb)~$ sudo smartctl -l selftest /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-59-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17341         -
# 2  Extended offline    Completed without error       00%     11681         -
# 3  Extended offline    Completed without error       00%     11664         -
# 4  Extended offline    Completed without error       00%     11562         -
# 5  Extended offline    Completed without error       00%     11544         -
# 6  Extended offline    Completed without error       00%     11144         -
# 7  Extended offline    Completed without error       00%     11127         -
# 8  Extended offline    Completed without error       00%     11119         -
# 9  Short offline       Completed without error       00%     11112         -

[[email protected]]:(33.4Mb)~$ sudo smartctl -l selftest /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-59-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     29120         3992821280
# 2  Short offline       Completed: read failure       90%     29119         3992821280
# 3  Extended offline    Completed without error       00%     23463         -
# 4  Extended offline    Completed without error       00%     23445         -
# 5  Extended offline    Completed without error       00%     23403         -
# 6  Short offline       Completed without error       00%     23396         -
# 7  Extended offline    Completed without error       00%         5         -

The result of drive /dev/sdb is a bit worrisome. What do you guys think?
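
In case it helps anyone checking several drives, below is a small sketch that counts the entries in a self-test log that did not finish cleanly. It only assumes the log layout shown above, where each entry starts with "# <number>" and a clean run says "Completed without error". The function name count_unclean_selftests is invented for this example, and note that aborted or interrupted tests are counted too, not just read failures.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and
# print how many log entries are NOT marked "Completed without error".
# This also counts aborted or interrupted tests, not only read failures.
count_unclean_selftests() {
    awk '/^# *[0-9]+/ && !/Completed without error/ { n++ }
         END { print n + 0 }'
}
```

Usage would be something like `sudo smartctl -l selftest /dev/sdb | count_unclean_selftests`.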


#6

With the amount of read errors and the life on that disk… I too would be a bit worried.
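
The SMART attribute table is probably worth a look as well. The sketch below pulls out the three sector-related attributes that usually matter for a drive with read failures (IDs 5, 197 and 198). It assumes the usual ten-column `smartctl -A` layout with the raw value in the last column; attribute names and raw value formats vary between vendors, so treat the output as a hint only. The function name sector_attrs is made up for this example.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# NAME=RAW_VALUE for the reallocated (5), pending (197) and offline
# uncorrectable (198) sector attributes, if the drive reports them.
# Assumes the raw value is the 10th whitespace-separated column.
sector_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}
```

Usage would be something like `sudo smartctl -A /dev/sdb | sector_attrs`. Non-zero raw values here generally suggest the drive is remapping sectors or has sectors it cannot read.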


#7

Sigh, I guess I’ll have to get Hetzner to replace the drive :unamused:
I’ve started backing up all the data on the server. This is going to take a while but I guess it’s a good thing because I would not have done the backup if not for the dodgy hard drive.