How to auto-reboot if CPU load too high?

PlutoniumAcid@lemmy.world · edit-2 2 years ago

How to auto-reboot if CPU load too high?

Possibly linux@lemmy.zip · 2 years ago

Here’s a better suggestion. Why don’t you see if you can find out what’s causing the issue? It sounds like a problem occurring in userspace. Try running htop.

PlutoniumAcid@lemmy.world · 2 years ago

You know you are right, and I’ve tried. I can manually monitor but it doesn’t happen just then. I don’t know yet what causes it, I can only assume it’s one of the Docker containers because the machine is doing nothing else.

I am doing this to find out how often it happens, how quickly it happens, and what’s at the top when it happens.
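A rough sketch of that kind of logging, in case it helps anyone following along. It assumes a Linux box with the procps `ps` available; the log path is whatever you choose to pass in (hypothetical here). Note that `ps` reports %CPU averaged over each process's lifetime, so treat the ranking as a hint rather than an instantaneous reading.

```python
import os
import subprocess
from datetime import datetime


def parse_top_processes(ps_output, limit=5):
    """Return the `limit` busiest (cpu_percent, pid, command) rows from ps output."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # first line is the header
        parts = line.strip().split(None, 2)
        if len(parts) == 3:
            rows.append((float(parts[0]), int(parts[1]), parts[2]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


def format_snapshot(timestamp, load1, top):
    """Build one log line: timestamp, 1-minute load, then the busiest processes."""
    procs = ", ".join(f"{cmd}[{pid}]={cpu:.1f}%" for cpu, pid, cmd in top)
    return f"{timestamp} load1={load1:.2f} top: {procs}"


def log_snapshot(path):
    """Append one snapshot to `path` (hypothetical log location)."""
    ps = subprocess.run(["ps", "-eo", "pcpu,pid,comm"],
                        capture_output=True, text=True, check=True)
    line = format_snapshot(datetime.now().isoformat(timespec="seconds"),
                           os.getloadavg()[0],
                           parse_top_processes(ps.stdout))
    with open(path, "a") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())  # push to disk in case the box locks up next
```

Calling `log_snapshot(...)` every few seconds from a loop with `time.sleep` gives a history you can read after the next reboot. The `fsync` is there because the last few lines before a lockup are the ones that matter.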

vegetaaaaaaa@lemmy.world · 2 years ago

“I can manually monitor but it doesn’t happen just then”

Set up proper monitoring with history. That way you don’t have to babysit the server; you can just look at the charts after a crash. I usually go with netdata.

Possibly linux@lemmy.zip · 2 years ago

Maybe try capping the resource usage of each container. At least then the machine won’t completely lock up.

PlutoniumAcid@lemmy.world · 2 years ago

That’s a good idea, didn’t know Docker had such capability. I will read up on that - could you give me some keywords to start on?

Midnight Wolf@lemmy.world · 2 years ago

I very recently spun up a VPS and wanted to limit resources. I use docker-compose, so this was the info I needed: https://docs.docker.com/compose/compose-file/compose-file-v3/#resources
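For containers started with plain `docker run` rather than Compose, the matching keywords are the `--cpus` and `--memory` flags. A small sketch that builds such a command; the image name, container name and limit values are made-up examples, not recommendations:

```python
import shlex


def limited_run_command(image, cpus, memory, name=None):
    """Build a `docker run` command that caps CPU and memory for one container."""
    cmd = ["docker", "run", "-d", f"--cpus={cpus}", f"--memory={memory}"]
    if name:
        cmd += ["--name", name]
    cmd.append(image)
    return cmd


# Example values only: 1.5 CPUs and 512 MiB for a hypothetical "web" container.
print(shlex.join(limited_run_command("nginx", 1.5, "512m", name="web")))
```

With a memory cap in place, a runaway container should get killed by the kernel's OOM handling inside its own limit instead of dragging the whole host into swap.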

GlitzyArmrest@lemmy.world · 2 years ago

Crontab to just auto reboot daily is probably better - if your PC becomes unresponsive I doubt it would be able to execute another script on top of everything. Ideally though, you’d do some log diving and figure out the cause.

PlutoniumAcid@lemmy.world · 2 years ago

This issue doesn’t happen very often, maybe every few weeks. That’s why I think a nightly reboot is overkill, and weekly might be missing the mark? But you are right in any case: regardless of what the cron says, the machine might never get around to executing it.

agent_flounder@lemmy.world · edit-2 2 years ago

Load average of 400???

You could install sysstat (or similar) and use output from sar to watch for thresholds and reboot if exceeded.

The upside of doing this is you may also be able to narrow down what is going on, exactly, when this happens, since sar records stats for CPU, memory, disk etc. So you can go back after the fact and you might be able to see if it is just a CPU thing or more than that. (Unless the problem happens instantly rather than gradually increasing).

PS: rather than using cron, you could run a script as a daemon that runs sar at 1 sec intervals.
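A minimal sketch of such a daemon loop. To keep it short it reads the 1-minute load average via `os.getloadavg()` instead of parsing sar output, which is a swap on my part, not what sar itself does. The threshold, the number of consecutive samples required, and the trigger action are all assumptions you would tune yourself.

```python
import os
import time
from collections import deque


def breach_streak(samples, threshold):
    """Count how many of the most recent samples in a row exceed `threshold`."""
    streak = 0
    for value in reversed(samples):
        if value <= threshold:
            break
        streak += 1
    return streak


def should_reboot(samples, threshold, required):
    """True once the last `required` samples were all above `threshold`."""
    return breach_streak(samples, threshold) >= required


def watch(threshold, required, on_trigger,
          read_load=lambda: os.getloadavg()[0], sleep=time.sleep, interval=1.0):
    """Sample the load every `interval` seconds; call `on_trigger` and stop
    once the load has stayed above `threshold` for `required` samples."""
    samples = deque(maxlen=required)
    while True:
        samples.append(read_load())
        if should_reboot(samples, threshold, required):
            on_trigger(list(samples))
            return
        sleep(interval)
```

Something like `watch(40.0, 30, lambda s: subprocess.run(["systemctl", "reboot"]))` (run as root, with `subprocess` imported) would reboot after 30 seconds of sustained load above 40. Requiring a streak rather than a single sample avoids rebooting on a short spike.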

Another thought is some kind of external watchdog. Curl webpage on server, if delay too long power cycle with smart home outlet? Idk. Just throwing crazy ideas out there.
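The external watchdog idea could look roughly like this, run from a second machine. The URL, the failure limit, and especially `power_cycle` are hypothetical: the last one stands in for whatever API your smart outlet actually exposes.

```python
import urllib.request


def is_responsive(url, timeout=10.0, opener=urllib.request.urlopen):
    """True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError, timeouts and refused connections
        return False


def check_once(url, failures, max_failures, power_cycle,
               opener=urllib.request.urlopen):
    """Run one probe; return the updated count of consecutive failures."""
    if is_responsive(url, opener=opener):
        return 0
    failures += 1
    if failures >= max_failures:
        power_cycle()  # hypothetical hook, e.g. a smart outlet's API
        return 0
    return failures
```

Call `check_once` once a minute or so and carry the returned count into the next call. Because the probe runs on another box, it still works when the server is too loaded to run anything itself.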

PlutoniumAcid@lemmy.world · 2 years ago

Thank you for these ideas, I will read up on sysstat+sar and give it a go.

Also smart to have the script always running, sleeping, rather than launching it at intervals.

I know all of this is a poor hack, and I must address the cause - but so far I have no clues what’s causing it. I’m running a bunch of Docker containers so it is very likely one of them painting itself into a corner, but after a reboot there’s nothing to see, so I am now starting with logging the top process. Your ideas might work better.

marcos@lemmy.world · 2 years ago

Have you tried turning your swap off?

PlutoniumAcid@lemmy.world · 2 years ago

Nope, haven’t. It says I have 2 GB of swap on a 16 GB RAM system, and that seems reasonable.

Why would you recommend turning swap off?

marcos@lemmy.world · 2 years ago

To check if your problem is caused by excessive memory usage requiring constant swapping. If it is, turning swap off will cause some process to be killed instead of slowing the computer down.

Max@lemmy.world · 2 years ago

The symptoms you describe are exactly what happens to my machine when it runs out of memory and then starts swapping really hard. This is easy to check by seeing if disk I/O also spikes when it happens, and if memory usage is high.
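One way to capture the memory side of that check is to record figures from `/proc/meminfo` alongside the load. A small sketch, assuming a Linux kernel recent enough to report `MemAvailable`; it only covers memory and swap, not disk I/O:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Return (fraction of RAM in use, fraction of swap in use)."""
    mem_used = 1 - info["MemAvailable"] / info["MemTotal"]
    swap_total = info["SwapTotal"]
    swap_used = (1 - info["SwapFree"] / swap_total) if swap_total else 0.0
    return mem_used, swap_used


def read_memory_pressure(path="/proc/meminfo"):
    """Read the live values from the kernel."""
    with open(path) as handle:
        return memory_pressure(parse_meminfo(handle.read()))
```

If both fractions are close to 1.0 in the minutes before a lockup, heavy swapping is a likely culprit.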

Politically Incorrect@lemmy.world · edit-2 2 years ago

If your board supports it, use a watchdog.

tenchiken@lemmy.dbzer0.com · 2 years ago

Run SMART short tests on your drives. Any “pending sectors” at all are a sign of failure.

If the test has any problems, especially pending sectors, replace the drive.
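To pull the pending-sector count out programmatically, here is a sketch that parses the attribute table printed by `smartctl -A` (smartmontools). It assumes the usual ten-column ATA layout with the raw value last; NVMe drives report health in a different format, so this would not apply there. The device path is an example.

```python
import subprocess


def pending_sectors(attributes_output):
    """Return the raw Current_Pending_Sector value, or None if not listed."""
    for line in attributes_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            return int(fields[9])
    return None


def check_drive(device):
    """Run `smartctl -A` on `device` (needs root) and return the pending count."""
    # smartctl uses non-zero exit codes as status bits, so no check=True here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return pending_sectors(result.stdout)
```

For example, `check_drive("/dev/sda")` returning anything above 0 matches the "replace the drive" case described above. The short test itself is started separately with `smartctl -t short /dev/sda`.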

lemmyingly@lemm.ee · 2 years ago

You could disable most of the services running, reintroduce one, and see how it performs. Once satisfied, reintroduce another, and so on until you’ve figured out which one is causing the issue.

PlutoniumAcid@lemmy.world · 2 years ago

Yes, but given the fact that there can be weeks between incidents, that is going to be a long time to be without my services.

lemmyingly@lemm.ee · 2 years ago

Could you use an alternative machine as a temporary machine until you get it resolved?

And do you actually need all of them running 24/7, or are at least some of them nice-to-haves?