
This is clearly a hardware problem. GPU #8 is stuck and I cannot kill the apps using it. Each app timed out and reported the problem to the control program (boinc), but the control program apparently could not terminate the app and even went on to assign more tasks to the defective device, all of which also timed out. The apps all appear to be still running: their %CPU changes and the SHM value occasionally changes, though I could be misreading what is going on. The following did not work:

jstateson@h110btc:/usr/bin$ boinccmd --quit
can't connect to local host

root@h110btc:/var/lib/boinc/projects# sudo killall -v boinc
boinc: no process found

sudo kill -9 12374

htop shows activity, as the CPU% changes, but nvidia-smi shows 0.


1 Answer


From poking around, I read that processes waiting for I/O are in limbo and unresponsive, and if the driver actually loses contact with the GPU, the process has moved on to hell.
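For anyone wanting to check which state a stuck process is really in, here is a minimal sketch, assuming Linux with /proc mounted. The helper names `state_from_stat` and `proc_state` are my own, not part of boinc or any NVIDIA tool. A process in state "D" (uninterruptible sleep) is blocked inside the kernel, and a `kill -9` sent to it is not acted on until the blocking call returns.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post) for inspecting a stuck process.

# Extract the state letter from the contents of /proc/<pid>/stat.
# The command name (field 2) is wrapped in parentheses and may itself contain
# spaces or ')', so strip everything up to the LAST ") " before reading the state.
state_from_stat() {
    rest=${1##*') '}
    printf '%s\n' "${rest%% *}"
}

# Print the state letter of a live process:
#   R = running, S = sleeping, D = uninterruptible sleep, Z = zombie.
proc_state() {
    state_from_stat "$(cat "/proc/$1/stat")"
}

# Example (PID taken from the question): proc_state 12374
```

Much the same information is available from `ps -o pid,stat,wchan,comm -p 12374`, where the `wchan` column hints at which kernel function the process is waiting in.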

I thought there was some hope, as there was an "R" in the stats column, but if nvidia-smi says "cant find device please reboot" then not much can be done. OTOH, in Windows I occasionally see a glitch on the screen, and if I look in the event log I see nvkernreset or some such message, so different OSes handle the problem differently.
