Question: How can I resolve executables disappearing sometimes during batch processing scripts?
Background: I have a series of scripts designed to parse xyz elevation data, partition into grids, merge and interpolate this data using tools like mbgrid and gmt surface. These can be called on manually, but generally run as scheduled tasks using configuration files specifying things like grid size, cell resolution and directory structures.
These scripts are written in python, using multiprocessing and making system calls with subprocess.
Problem: Sometimes, I get situations where commands are not found (e.g. cat 180E48S_mbgrid_output.log /bin/sh: 1: mbgrid: not found)
Since these scripts are called via cron with a configuration file, I would expect consistent results. Most of the time (and it seems always when I'm watching, even if I just run the script cron initiates), the code runs successfully to completion. But sometimes not.
Below is a snippet of a log file showing mbgrid not found (bolded), followed by a successful run. And then another instance of mbgrid not found before another successful run.
grep '/bin/sh: 1: mbgrid: not found' 22E59N_mbgrid_output.log -B 10 -A 10
/bin/sh: 1: mbgrid: not found
Zgrid starting iterations Zgrid iteration 10 convergence test: -0.001000 last:0.000000 Zgrid iteration 20 convergence test: -0.001000 last:-0.001000 Zgrid iteration 30 convergence test: -0.001000 last:-0.001000 Zgrid iteration 40 convergence test: -0.001000 last:-0.001000 Zgrid iteration 50 convergence test: -0.001000 last:-0.001000 Zgrid iteration 60 convergence test: -0.001000 last:-0.001000 Zgrid iteration 70 convergence test: -0.001000 last:-0.001000 Zgrid iteration 80 convergence test: -0.001000 last:-0.001000 Zgrid iteration 90 convergence test: -0.001000 last:-0.001000
... snipped log ...
mbm_grdplot -I22E59Nweightedgrid.grd -G1 -C -V -L"File 22E59Nweightedgrid.grd - Topography Grid:Topography (m)"
executing mbm_grdplot... mbm_grdplot -I22E59Nweightedgrid_num.grd -G1 -W1/2 -V -L"File 22E59Nweightedgrid_num.grd - Topography Grid:Number of Topography Data Points"
executing mbm_grdplot... mbm_grdplot -I22E59Nweightedgrid_sd.grd -G1 -W1/2 -V -L"File 22E59Nweightedgrid_sd.grd - Topography Grid:Topography Standard Deviation (m)"
Done.
/bin/sh: 1: mbgrid: not found
Zgrid starting iterations Zgrid iteration 10 convergence test: -0.001000 last:0.000000 Zgrid iteration 20 convergence test: -0.001000 last:-0.001000 Zgrid iteration 30 convergence test: -0.001000 last:-0.001000 Zgrid iteration 40 convergence test: -0.001000 last:-0.001000 Zgrid iteration 50 convergence test: -0.001000 last:-0.001000 Zgrid iteration 60 convergence test: -0.001000 last:-0.001000 Zgrid iteration 70 convergence test: -0.001000 last:-0.001000 Zgrid iteration 80 convergence test: -0.001000 last:-0.001000 Zgrid iteration 90 convergence test: -0.001000 last:-0.001000
...snipped log...
Steps taken: When I first encountered this, I associated it with some particularly large and dense grids. I quickly saw that I had 10 processes, each of which wanted 20% of ram. Turning down the number of processes seemed to solve this issue.
However, I've been seeing this occur even when it doesn't look like memory pressure has been a concern, but I may be missing something in my system monitoring.
I have set up sar monitoring, to grab system stats every few minutes. I have also set up a cron to grab some of the top stats every 5 minutes. I don't really see anything in these that would cause linux to be so resource starved that it would fail to find executables. I'll put those at the bottom as this is already rather long.
I've tried adding some logging diagnostics to spot the problem.
Here is a snippet of code that checks for mbgrid's existence and then, if found, logs the exact command that is being run:
if (os.path.exists(r'/usr/local/bin/mbgrid') and os.path.isfile(r'/usr/local/bin/mbgrid')):
try:
logging.info("mbgrid -I"+gridcellname+"mbdatalist.mbf -O"+ \
weightedname+" -R"+str(wbound)+"/"+str(ebound)+"/"+str(sbound)+"/"+str(nbound)+ \
" -A2 -P1 -V -F1 -C1 -N -M -E"+str(cellsize)+"/0.0/degrees! >> " + gridcellname + \
"_mbgrid_output.log 2>&1")
subprocess.call("mbgrid -I"+gridcellname+"mbdatalist.mbf -O"+ \
weightedname+" -R"+str(wbound)+"/"+str(ebound)+"/"+str(sbound)+"/"+str(nbound)+ \
" -A2 -P1 -V -F1 -C1 -N -M -E"+str(cellsize)+"/0.0/degrees! >> " + gridcellname + \
"_mbgrid_output.log 2>&1",cwd=griddir,shell=True)
logging.info("mbgrid complete")
except:
logging.error("Error creating weighted grid for "+gridcellname)
sys.exit("Error creating weighted grid for "+gridcellname)
else:
logging.error("Unable to find the mbgrid executable at /usr/local/bin/mbgrid for " + gridcellname)
From the main log for a given configuration file:
otto@Tendua:~/automated/NZ_TopoBathy_30m$ grep '180E48S_mbgrid' NZ_TopoBathy_30m_grids.log -A 3 -B 3
INFO:09/08/2017 21:35:06:clipping and reformatting grid for 162E50S
INFO:09/08/2017 21:35:06:simplifying 165E34S_ngdc.xyz by converting to grid and back to xyz
INFO:09/08/2017 21:35:06:run mbgrid using the datalist to create weighted grid
INFO:09/08/2017 21:35:06:mbgrid -I180E48Smbdatalist.mbf -O180E48Sweightedgrid -R179.7/181.3/-48.3/-46.7 -A2 -P1 -V -F1 -C1 -N -M -E0.000277777777778/0.0/degrees! >> 180E48S_mbgrid_output.log 2>&1
INFO:09/08/2017 21:35:06:grid converted to netcdf3 format
INFO:09/08/2017 21:35:06:run mbgrid using the datalist to create weighted grid
INFO:09/08/2017 21:35:06:mbgrid -I175E30Smbdatalist.mbf -O175E30Sweightedgrid -R174.7/176.3/-30.3/-28.7 -A2 -P1 -V -F1 -C1 -N -M -E0.000277777777778/0.0/degrees! >> 175E30S_mbgrid_output.log 2>&1
And in the log file for that exact grid:
cat 180E48S_mbgrid_output.log
/bin/sh: 1: mbgrid: not found
And when I issue that exact command in the correct directory, it runs as intended.
What else should I be looking for to identify and fix this issue?
Your time is much appreciated.
=====================================================
Top ten processes, run every 5 minutes (which also tracks disk space on the data drives):
otto@Tendua:~$ cat top_ten_process_monitor.sh
#!/bin/sh
date
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'
df -h | awk '$0 ~ "tendua" {printf "Disk Usage: %1.1f/%1.1fTB (%s)\n", $3,$2,$5}'
top -bn1 | grep load | awk '{printf "CPU Load: %.2f\n", $(NF-2)}'
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head
(section of log follows from the time of the above incident)
Fri Sep 8 21:26:01 MDT 2017
Memory Usage: 63865/64239MB (99.42%)
Disk Usage: 2.7/3.6TB (74%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 1.57
PID PPID CMD %MEM %CPU
3973 3886 dbus-daemon --fork --sessio 0.6 0.0
4085 3886 /usr/lib/unity/unity-panel- 0.6 0.0
4253 4079 compiz 0.3 0.4
11350 3886 /opt/google/chrome/chrome 0.3 2.1
18562 11366 /opt/google/chrome/chrome - 0.3 8.8
17402 1109 smbd -F 0.1 0.0
3213 1 /opt/google/chrome/chrome - 0.1 0.1
1587 1450 /usr/bin/X -core :0 -seat s 0.1 0.3
4019 3886 nautilus --new-window 0.1 0.0
Fri Sep 8 21:31:01 MDT 2017
Memory Usage: 63852/64239MB (99.40%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 1.57
PID PPID CMD %MEM %CPU
3973 3886 dbus-daemon --fork --sessio 0.6 0.0
4085 3886 /usr/lib/unity/unity-panel- 0.6 0.0
4253 4079 compiz 0.3 0.4
11350 3886 /opt/google/chrome/chrome 0.3 2.1
18562 11366 /opt/google/chrome/chrome - 0.3 8.8
17402 1109 smbd -F 0.1 0.0
3213 1 /opt/google/chrome/chrome - 0.1 0.1
1587 1450 /usr/bin/X -core :0 -seat s 0.1 0.3
4019 3886 nautilus --new-window 0.1 0.0
Fri Sep 8 21:36:01 MDT 2017
Memory Usage: 63796/64239MB (99.31%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 4.59
PID PPID CMD %MEM %CPU
3973 3886 dbus-daemon --fork --sessio 0.6 0.0
4085 3886 /usr/lib/unity/unity-panel- 0.6 0.0
4253 4079 compiz 0.3 0.4
11350 3886 /opt/google/chrome/chrome 0.3 2.1
18562 11366 /opt/google/chrome/chrome - 0.3 8.8
17402 1109 smbd -F 0.1 0.0
3213 1 /opt/google/chrome/chrome - 0.1 0.1
1587 1450 /usr/bin/X -core :0 -seat s 0.1 0.3
1873 1871 gdal_translate -ot Float32 0.1 0.1
Fri Sep 8 21:41:01 MDT 2017
Memory Usage: 63878/64239MB (99.44%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 4.32
PID PPID CMD %MEM %CPU
3973 3886 dbus-daemon --fork --sessio 0.6 0.0
4085 3886 /usr/lib/unity/unity-panel- 0.6 0.0
4253 4079 compiz 0.3 0.4
11350 3886 /opt/google/chrome/chrome 0.3 2.1
18562 11366 /opt/google/chrome/chrome - 0.3 8.8
5636 5634 /usr/bin/python /usr/bin/gd 0.1 25.6
17402 1109 smbd -F 0.1 0.0
3213 1 /opt/google/chrome/chrome - 0.1 0.1
1587 1450 /usr/bin/X -core :0 -seat s 0.1 0.3
System Stats follow:
sar -f /var/log/sysstat/sa08|head -4 && sar -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10
Linux 3.19.0-80-generic (Tendua) 09/08/2017 _x86_64_ (12 CPU)
12:00:02 AM CPU %user %nice %system %iowait %steal %idle
12:05:03 AM all 51.97 0.00 1.32 0.89 0.00 45.82
07:55:01 PM all 9.44 0.00 0.59 3.51 0.00 86.45
08:05:01 PM all 1.50 0.00 1.12 9.39 0.00 87.99
08:15:01 PM all 1.48 0.00 1.12 9.16 0.00 88.24
08:25:01 PM all 1.46 0.00 1.13 9.29 0.00 88.12
08:35:01 PM all 1.50 0.00 1.24 8.65 0.00 88.61
08:45:01 PM all 1.52 0.00 1.57 7.87 0.00 89.04
08:55:01 PM all 1.55 0.00 1.64 7.96 0.00 88.85
09:05:01 PM all 1.53 0.00 1.70 8.08 0.00 88.69
09:15:01 PM all 1.54 0.00 1.74 8.04 0.00 88.68
09:25:01 PM all 1.50 0.00 1.69 8.16 0.00 88.65
09:35:01 PM all 1.55 0.00 1.69 7.98 0.00 88.78
09:45:01 PM all 2.17 0.00 1.96 11.67 0.00 84.20
09:55:01 PM all 1.71 0.00 1.87 18.06 0.00 78.36
10:05:01 PM all 1.70 0.00 2.13 17.57 0.00 78.60
10:15:01 PM all 1.73 0.00 1.93 18.64 0.00 77.70
10:25:01 PM all 1.75 0.00 1.96 18.36 0.00 77.93
10:35:01 PM all 3.91 0.00 1.80 15.40 0.00 78.89
10:45:01 PM all 1.66 0.00 1.90 19.23 0.00 77.20
10:55:01 PM all 1.68 0.00 1.82 19.05 0.00 77.45
11:05:01 PM all 1.69 0.00 1.77 19.31 0.00 77.23
11:15:01 PM all 7.69 0.00 1.03 8.68 0.00 82.60
sar -r -f /var/log/sysstat/sa08|head -4 && sar -r -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10
Linux 3.19.0-80-generic (Tendua) 09/08/2017 _x86_64_ (12 CPU)
12:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:05:03 AM 3859704 61921864 94.13 50184 52583736 17699508 25.29 29962700 29837068 611932
07:55:01 PM 392740 65388828 99.40 407920 60884772 10851876 15.51 2703424 60625128 2853088
08:05:01 PM 402104 65379464 99.39 410904 61032736 10845076 15.50 3520252 59959668 1336572
08:15:01 PM 371832 65409736 99.43 413876 61069912 10859576 15.52 3403804 60108248 1220840
08:25:01 PM 397468 65384100 99.40 416592 61053400 10820532 15.46 1277516 62219572 3004488
08:35:01 PM 426124 65355444 99.35 419376 61037072 10808244 15.45 1069828 62416140 306880
08:45:01 PM 453688 65327880 99.31 423508 60966356 10864072 15.53 1070752 62351248 929620
08:55:01 PM 367620 65413948 99.44 427112 61029896 10804544 15.44 1078392 62411352 197616
09:05:01 PM 377924 65403644 99.43 430608 60967232 10794432 15.43 1082116 62345104 234916
09:15:01 PM 415260 65366308 99.37 433856 60842704 10822332 15.47 1099380 62217480 1687536
09:25:01 PM 452492 65329076 99.31 436928 60832668 10842884 15.50 1097624 62204328 1594256
09:35:01 PM 445220 65336348 99.32 460300 60819540 10855512 15.51 1137008 62187660 528
09:45:01 PM 470044 65311524 99.29 463480 60443220 10960588 15.66 24385076 38668700 1765864
09:55:01 PM 472608 65308960 99.28 464708 60422328 11067312 15.82 28572380 34443408 1358136
10:05:01 PM 433176 65348392 99.34 465852 60438212 10957232 15.66 29127204 33924120 4382320
10:15:01 PM 450192 65331376 99.32 467016 60405816 11024924 15.76 30376748 32652588 2063424
10:25:01 PM 376448 65405120 99.43 468388 60579636 10831104 15.48 31660420 31440712 704
10:35:01 PM 560628 65220940 99.15 469856 60329228 10979960 15.69 38914816 23997528 4796836
10:45:01 PM 521660 65259908 99.21 471004 60329408 11008184 15.73 41755472 21194232 3680356
10:55:01 PM 556596 65224972 99.15 472160 60329584 10921484 15.61 44714800 18191336 2351468
11:05:01 PM 543664 65237904 99.17 473288 60329768 11025832 15.76 46713556 16209272 3482536
11:15:01 PM 408040 65373528 99.38 475420 60500764 10935492 15.63 41843776 21182704 168384
sar -b -f /var/log/sysstat/sa08|head -4 && sar -b -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10
Linux 3.19.0-80-generic (Tendua) 09/08/2017 _x86_64_ (12 CPU)
12:00:02 AM tps rtps wtps bread/s bwrtn/s
12:05:03 AM 135.06 125.31 9.75 32088.68 15436.97
07:55:01 PM 214.60 187.68 26.92 47229.75 31671.99
08:05:01 PM 407.57 334.90 72.67 85024.47 91730.47
08:15:01 PM 411.46 341.30 70.16 86763.79 87368.85
08:25:01 PM 420.48 353.86 66.62 90036.21 84149.34
08:35:01 PM 480.24 394.76 85.48 100597.74 109374.29
08:45:01 PM 634.61 530.99 103.62 134987.46 132954.73
08:55:01 PM 683.04 567.68 115.36 144695.57 147487.06
09:05:01 PM 712.35 594.86 117.50 151571.92 151528.64
09:15:01 PM 732.52 613.41 119.10 156291.37 151223.34
09:25:01 PM 716.90 598.66 118.24 152659.05 152950.26
09:35:01 PM 719.66 594.02 125.65 147751.96 153455.47
09:45:01 PM 449.31 379.28 70.03 96426.44 106934.35
09:55:01 PM 249.01 48.08 200.93 12311.36 304792.29
10:05:01 PM 225.01 32.41 192.60 8286.13 310338.90
10:15:01 PM 220.01 15.86 204.15 4051.83 312400.63
10:25:01 PM 210.24 8.80 201.44 1863.78 308831.78
10:35:01 PM 199.95 23.89 176.05 5375.87 249054.33
10:45:01 PM 204.32 0.00 204.32 0.00 322379.98
10:55:01 PM 201.33 0.00 201.33 0.00 314086.77
11:05:01 PM 206.65 0.00 206.64 0.04 317881.94
11:15:01 PM 168.05 32.33 135.72 2041.40 151844.90
sar -w -f /var/log/sysstat/sa08|head -4 && sar -w -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10
Linux 3.19.0-80-generic (Tendua) 09/08/2017 _x86_64_ (12 CPU)
12:00:02 AM proc/s cswch/s
12:05:03 AM 1.45 800.71
07:55:01 PM 0.82 1116.96
08:05:01 PM 0.86 1584.54
08:15:01 PM 0.82 1530.11
08:25:01 PM 0.70 1543.24
08:35:01 PM 0.71 1630.97
08:45:01 PM 0.72 1942.89
08:55:01 PM 0.78 1988.69
09:05:01 PM 0.77 2030.87
09:15:01 PM 1.03 2061.84
09:25:01 PM 0.86 2027.47
09:35:01 PM 0.82 2062.62
09:45:01 PM 19.65 1789.45
09:55:01 PM 0.73 945.56
10:05:01 PM 0.53 901.46
10:15:01 PM 0.68 873.46
10:25:01 PM 0.76 793.85
10:35:01 PM 7.85 950.96
10:45:01 PM 0.74 869.87
10:55:01 PM 0.79 913.40
11:05:01 PM 0.78 894.85
11:15:01 PM 0.80 1225.52