0

Question: How can I resolve executables disappearing sometimes during batch processing scripts?

Background: I have a series of scripts designed to parse xyz elevation data, partition into grids, merge and interpolate this data using tools like mbgrid and gmt surface. These can be called on manually, but generally run as scheduled tasks using configuration files specifying things like grid size, cell resolution and directory structures.

These scripts are written in python, using multiprocessing and making system calls with subprocess.

Problem: Sometimes, I get situations where commands are not found (e.g. cat 180E48S_mbgrid_output.log /bin/sh: 1: mbgrid: not found)

Since these scripts are called via cron with a configuration file, I would expect consistent results. Most of the time (and it seems always when I'm watching, even if I just run the script cron initiates), the code runs successfully to completion. But sometimes not.

Below is a snippet of a log file showing mbgrid not found (bolded), followed by a successful run. And then another instance of mbgrid not found before another successful run.

grep '/bin/sh: 1: mbgrid: not found' 22E59N_mbgrid_output.log -B 10 -A 10

/bin/sh: 1: mbgrid: not found

Zgrid starting iterations Zgrid iteration 10 convergence test: -0.001000 last:0.000000 Zgrid iteration 20 convergence test: -0.001000 last:-0.001000 Zgrid iteration 30 convergence test: -0.001000 last:-0.001000 Zgrid iteration 40 convergence test: -0.001000 last:-0.001000 Zgrid iteration 50 convergence test: -0.001000 last:-0.001000 Zgrid iteration 60 convergence test: -0.001000 last:-0.001000 Zgrid iteration 70 convergence test: -0.001000 last:-0.001000 Zgrid iteration 80 convergence test: -0.001000 last:-0.001000 Zgrid iteration 90 convergence test: -0.001000 last:-0.001000

... snipped log ...

mbm_grdplot -I22E59Nweightedgrid.grd -G1 -C -V -L"File 22E59Nweightedgrid.grd - Topography Grid:Topography (m)"

executing mbm_grdplot... mbm_grdplot -I22E59Nweightedgrid_num.grd -G1 -W1/2 -V -L"File 22E59Nweightedgrid_num.grd - Topography Grid:Number of Topography Data Points"

executing mbm_grdplot... mbm_grdplot -I22E59Nweightedgrid_sd.grd -G1 -W1/2 -V -L"File 22E59Nweightedgrid_sd.grd - Topography Grid:Topography Standard Deviation (m)"

Done.

/bin/sh: 1: mbgrid: not found

Zgrid starting iterations Zgrid iteration 10 convergence test: -0.001000 last:0.000000 Zgrid iteration 20 convergence test: -0.001000 last:-0.001000 Zgrid iteration 30 convergence test: -0.001000 last:-0.001000 Zgrid iteration 40 convergence test: -0.001000 last:-0.001000 Zgrid iteration 50 convergence test: -0.001000 last:-0.001000 Zgrid iteration 60 convergence test: -0.001000 last:-0.001000 Zgrid iteration 70 convergence test: -0.001000 last:-0.001000 Zgrid iteration 80 convergence test: -0.001000 last:-0.001000 Zgrid iteration 90 convergence test: -0.001000 last:-0.001000

...snipped log...

Steps taken: When I first encountered this, I associated it with some particularly large and dense grids. I quickly saw that I had 10 processes, each of which wanted 20% of ram. Turning down the number of processes seemed to solve this issue.

However, I've been seeing this occur even when it doesn't look like memory pressure has been a concern, but I may be missing something in my system monitoring.

I have set up sar monitoring, to grab system stats every few minutes. I have also set up a cron to grab some of the top stats every 5 minutes. I don't really see anything in these that would cause linux to be so resource starved that it would fail to find executables. I'll put those at the bottom as this is already rather long.

I've tried adding some logging diagnostics to spot the problem.

Here is a snippet of code that checks for mbgrid's existence and then, if found, logs the exact command that is being run:

 if (os.path.exists(r'/usr/local/bin/mbgrid') and os.path.isfile(r'/usr/local/bin/mbgrid')):
        try:
            logging.info("mbgrid -I"+gridcellname+"mbdatalist.mbf -O"+ \
                            weightedname+" -R"+str(wbound)+"/"+str(ebound)+"/"+str(sbound)+"/"+str(nbound)+ \
                            " -A2 -P1 -V -F1 -C1 -N -M -E"+str(cellsize)+"/0.0/degrees! >> " + gridcellname + \
                            "_mbgrid_output.log 2>&1")
            subprocess.call("mbgrid -I"+gridcellname+"mbdatalist.mbf -O"+ \
                            weightedname+" -R"+str(wbound)+"/"+str(ebound)+"/"+str(sbound)+"/"+str(nbound)+ \
                            " -A2 -P1 -V -F1 -C1 -N -M -E"+str(cellsize)+"/0.0/degrees! >> " + gridcellname + \
                            "_mbgrid_output.log 2>&1",cwd=griddir,shell=True)
            logging.info("mbgrid complete")
        except:
            logging.error("Error creating weighted grid for "+gridcellname)
            sys.exit("Error creating weighted grid for "+gridcellname)
    else:
        logging.error("Unable to find the mbgrid executable at /usr/local/bin/mbgrid for " + gridcellname)

From the main log for a given configuration file: otto@Tendua:~/automated/NZ_TopoBathy_30m$ grep '180E48S_mbgrid' NZ_TopoBathy_30m_grids.log -A 3 -B 3

INFO:09/08/2017 21:35:06:clipping and reformatting grid for 162E50S

INFO:09/08/2017 21:35:06:simplifying 165E34S_ngdc.xyz by converting to grid and back to xyz

INFO:09/08/2017 21:35:06:run mbgrid using the datalist to create weighted grid

INFO:09/08/2017 21:35:06:mbgrid -I180E48Smbdatalist.mbf -O180E48Sweightedgrid -R179.7/181.3/-48.3/-46.7 -A2 -P1 -V -F1 -C1 -N -M -E0.000277777777778/0.0/degrees! >> 180E48S_mbgrid_output.log 2>&1

INFO:09/08/2017 21:35:06:grid converted to netcdf3 format

INFO:09/08/2017 21:35:06:run mbgrid using the datalist to create weighted grid

INFO:09/08/2017 21:35:06:mbgrid -I175E30Smbdatalist.mbf -O175E30Sweightedgrid -R174.7/176.3/-30.3/-28.7 -A2 -P1 -V -F1 -C1 -N -M -E0.000277777777778/0.0/degrees! >> 175E30S_mbgrid_output.log 2>&1

And in the log file for that exact grid:

cat 180E48S_mbgrid_output.log

/bin/sh: 1: mbgrid: not found

And when I issue that exact command in the correct directory, it runs as intended.

What else should I be looking for to identify and fix this issue?

Your time is much appreciated.

=====================================================

Top ten processes, run every 5 minutes (which also tracks disk space on the data drives): otto@Tendua:~$ cat top_ten_process_monitor.sh

#!/bin/sh
date
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'
df -h | awk '$0 ~ "tendua" {printf "Disk Usage: %1.1f/%1.1fTB (%s)\n", $3,$2,$5}'
top -bn1 | grep load | awk '{printf "CPU Load: %.2f\n", $(NF-2)}'
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head

(section of log follows from the time of the above incident)

Fri Sep  8 21:26:01 MDT 2017
Memory Usage: 63865/64239MB (99.42%)
Disk Usage: 2.7/3.6TB (74%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 1.57
  PID  PPID CMD                         %MEM %CPU
 3973  3886 dbus-daemon --fork --sessio  0.6  0.0
 4085  3886 /usr/lib/unity/unity-panel-  0.6  0.0
 4253  4079 compiz                       0.3  0.4
11350  3886 /opt/google/chrome/chrome    0.3  2.1
18562 11366 /opt/google/chrome/chrome -  0.3  8.8
17402  1109 smbd -F                      0.1  0.0
 3213     1 /opt/google/chrome/chrome -  0.1  0.1
 1587  1450 /usr/bin/X -core :0 -seat s  0.1  0.3
 4019  3886 nautilus --new-window        0.1  0.0
Fri Sep  8 21:31:01 MDT 2017
Memory Usage: 63852/64239MB (99.40%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 1.57
  PID  PPID CMD                         %MEM %CPU
 3973  3886 dbus-daemon --fork --sessio  0.6  0.0
 4085  3886 /usr/lib/unity/unity-panel-  0.6  0.0
 4253  4079 compiz                       0.3  0.4
11350  3886 /opt/google/chrome/chrome    0.3  2.1
18562 11366 /opt/google/chrome/chrome -  0.3  8.8
17402  1109 smbd -F                      0.1  0.0
 3213     1 /opt/google/chrome/chrome -  0.1  0.1
 1587  1450 /usr/bin/X -core :0 -seat s  0.1  0.3
 4019  3886 nautilus --new-window        0.1  0.0
Fri Sep  8 21:36:01 MDT 2017
Memory Usage: 63796/64239MB (99.31%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 4.59
  PID  PPID CMD                         %MEM %CPU
 3973  3886 dbus-daemon --fork --sessio  0.6  0.0
 4085  3886 /usr/lib/unity/unity-panel-  0.6  0.0
 4253  4079 compiz                       0.3  0.4
11350  3886 /opt/google/chrome/chrome    0.3  2.1
18562 11366 /opt/google/chrome/chrome -  0.3  8.8
17402  1109 smbd -F                      0.1  0.0
 3213     1 /opt/google/chrome/chrome -  0.1  0.1
 1587  1450 /usr/bin/X -core :0 -seat s  0.1  0.3
 1873  1871 gdal_translate -ot Float32   0.1  0.1
Fri Sep  8 21:41:01 MDT 2017
Memory Usage: 63878/64239MB (99.44%)
Disk Usage: 2.7/3.6TB (75%)
Disk Usage: 2.5/3.6TB (68%)
Disk Usage: 2.3/3.6TB (64%)
Disk Usage: 4.3/5.5TB (83%)
CPU Load: 4.32
  PID  PPID CMD                         %MEM %CPU
 3973  3886 dbus-daemon --fork --sessio  0.6  0.0
 4085  3886 /usr/lib/unity/unity-panel-  0.6  0.0
 4253  4079 compiz                       0.3  0.4
11350  3886 /opt/google/chrome/chrome    0.3  2.1
18562 11366 /opt/google/chrome/chrome -  0.3  8.8
 5636  5634 /usr/bin/python /usr/bin/gd  0.1 25.6
17402  1109 smbd -F                      0.1  0.0
 3213     1 /opt/google/chrome/chrome -  0.1  0.1
 1587  1450 /usr/bin/X -core :0 -seat s  0.1  0.3

System Stats follow:

sar -f /var/log/sysstat/sa08|head -4 && sar -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10

Linux 3.19.0-80-generic (Tendua)        09/08/2017      _x86_64_        (12 CPU)

12:00:02 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:05:03 AM     all     51.97      0.00      1.32      0.89      0.00     45.82
07:55:01 PM     all      9.44      0.00      0.59      3.51      0.00     86.45
08:05:01 PM     all      1.50      0.00      1.12      9.39      0.00     87.99
08:15:01 PM     all      1.48      0.00      1.12      9.16      0.00     88.24
08:25:01 PM     all      1.46      0.00      1.13      9.29      0.00     88.12
08:35:01 PM     all      1.50      0.00      1.24      8.65      0.00     88.61
08:45:01 PM     all      1.52      0.00      1.57      7.87      0.00     89.04
08:55:01 PM     all      1.55      0.00      1.64      7.96      0.00     88.85
09:05:01 PM     all      1.53      0.00      1.70      8.08      0.00     88.69
09:15:01 PM     all      1.54      0.00      1.74      8.04      0.00     88.68
09:25:01 PM     all      1.50      0.00      1.69      8.16      0.00     88.65
09:35:01 PM     all      1.55      0.00      1.69      7.98      0.00     88.78
09:45:01 PM     all      2.17      0.00      1.96     11.67      0.00     84.20
09:55:01 PM     all      1.71      0.00      1.87     18.06      0.00     78.36
10:05:01 PM     all      1.70      0.00      2.13     17.57      0.00     78.60
10:15:01 PM     all      1.73      0.00      1.93     18.64      0.00     77.70
10:25:01 PM     all      1.75      0.00      1.96     18.36      0.00     77.93
10:35:01 PM     all      3.91      0.00      1.80     15.40      0.00     78.89
10:45:01 PM     all      1.66      0.00      1.90     19.23      0.00     77.20
10:55:01 PM     all      1.68      0.00      1.82     19.05      0.00     77.45
11:05:01 PM     all      1.69      0.00      1.77     19.31      0.00     77.23
11:15:01 PM     all      7.69      0.00      1.03      8.68      0.00     82.60

sar -r -f /var/log/sysstat/sa08|head -4 && sar -r -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10

Linux 3.19.0-80-generic (Tendua)        09/08/2017      _x86_64_        (12 CPU)

12:00:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
12:05:03 AM   3859704  61921864     94.13     50184  52583736  17699508     25.29  29962700  29837068    611932
07:55:01 PM    392740  65388828     99.40    407920  60884772  10851876     15.51   2703424  60625128   2853088
08:05:01 PM    402104  65379464     99.39    410904  61032736  10845076     15.50   3520252  59959668   1336572
08:15:01 PM    371832  65409736     99.43    413876  61069912  10859576     15.52   3403804  60108248   1220840
08:25:01 PM    397468  65384100     99.40    416592  61053400  10820532     15.46   1277516  62219572   3004488
08:35:01 PM    426124  65355444     99.35    419376  61037072  10808244     15.45   1069828  62416140    306880
08:45:01 PM    453688  65327880     99.31    423508  60966356  10864072     15.53   1070752  62351248    929620
08:55:01 PM    367620  65413948     99.44    427112  61029896  10804544     15.44   1078392  62411352    197616
09:05:01 PM    377924  65403644     99.43    430608  60967232  10794432     15.43   1082116  62345104    234916
09:15:01 PM    415260  65366308     99.37    433856  60842704  10822332     15.47   1099380  62217480   1687536
09:25:01 PM    452492  65329076     99.31    436928  60832668  10842884     15.50   1097624  62204328   1594256
09:35:01 PM    445220  65336348     99.32    460300  60819540  10855512     15.51   1137008  62187660       528
09:45:01 PM    470044  65311524     99.29    463480  60443220  10960588     15.66  24385076  38668700   1765864
09:55:01 PM    472608  65308960     99.28    464708  60422328  11067312     15.82  28572380  34443408   1358136
10:05:01 PM    433176  65348392     99.34    465852  60438212  10957232     15.66  29127204  33924120   4382320
10:15:01 PM    450192  65331376     99.32    467016  60405816  11024924     15.76  30376748  32652588   2063424
10:25:01 PM    376448  65405120     99.43    468388  60579636  10831104     15.48  31660420  31440712       704
10:35:01 PM    560628  65220940     99.15    469856  60329228  10979960     15.69  38914816  23997528   4796836
10:45:01 PM    521660  65259908     99.21    471004  60329408  11008184     15.73  41755472  21194232   3680356
10:55:01 PM    556596  65224972     99.15    472160  60329584  10921484     15.61  44714800  18191336   2351468
11:05:01 PM    543664  65237904     99.17    473288  60329768  11025832     15.76  46713556  16209272   3482536
11:15:01 PM    408040  65373528     99.38    475420  60500764  10935492     15.63  41843776  21182704    168384

sar -b -f /var/log/sysstat/sa08|head -4 && sar -b -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10

Linux 3.19.0-80-generic (Tendua)        09/08/2017      _x86_64_        (12 CPU)

12:00:02 AM       tps      rtps      wtps   bread/s   bwrtn/s
12:05:03 AM    135.06    125.31      9.75  32088.68  15436.97
07:55:01 PM    214.60    187.68     26.92  47229.75  31671.99
08:05:01 PM    407.57    334.90     72.67  85024.47  91730.47
08:15:01 PM    411.46    341.30     70.16  86763.79  87368.85
08:25:01 PM    420.48    353.86     66.62  90036.21  84149.34
08:35:01 PM    480.24    394.76     85.48 100597.74 109374.29
08:45:01 PM    634.61    530.99    103.62 134987.46 132954.73
08:55:01 PM    683.04    567.68    115.36 144695.57 147487.06
09:05:01 PM    712.35    594.86    117.50 151571.92 151528.64
09:15:01 PM    732.52    613.41    119.10 156291.37 151223.34
09:25:01 PM    716.90    598.66    118.24 152659.05 152950.26
09:35:01 PM    719.66    594.02    125.65 147751.96 153455.47
09:45:01 PM    449.31    379.28     70.03  96426.44 106934.35
09:55:01 PM    249.01     48.08    200.93  12311.36 304792.29
10:05:01 PM    225.01     32.41    192.60   8286.13 310338.90
10:15:01 PM    220.01     15.86    204.15   4051.83 312400.63
10:25:01 PM    210.24      8.80    201.44   1863.78 308831.78
10:35:01 PM    199.95     23.89    176.05   5375.87 249054.33
10:45:01 PM    204.32      0.00    204.32      0.00 322379.98
10:55:01 PM    201.33      0.00    201.33      0.00 314086.77
11:05:01 PM    206.65      0.00    206.64      0.04 317881.94
11:15:01 PM    168.05     32.33    135.72   2041.40 151844.90

sar -w -f /var/log/sysstat/sa08|head -4 && sar -w -f /var/log/sysstat/sa08|grep '09:35:01 PM' -A 10 -B 10

Linux 3.19.0-80-generic (Tendua)        09/08/2017      _x86_64_        (12 CPU)

12:00:02 AM    proc/s   cswch/s
12:05:03 AM      1.45    800.71
07:55:01 PM      0.82   1116.96
08:05:01 PM      0.86   1584.54
08:15:01 PM      0.82   1530.11
08:25:01 PM      0.70   1543.24
08:35:01 PM      0.71   1630.97
08:45:01 PM      0.72   1942.89
08:55:01 PM      0.78   1988.69
09:05:01 PM      0.77   2030.87
09:15:01 PM      1.03   2061.84
09:25:01 PM      0.86   2027.47
09:35:01 PM      0.82   2062.62
09:45:01 PM     19.65   1789.45
09:55:01 PM      0.73    945.56
10:05:01 PM      0.53    901.46
10:15:01 PM      0.68    873.46
10:25:01 PM      0.76    793.85
10:35:01 PM      7.85    950.96
10:45:01 PM      0.74    869.87
10:55:01 PM      0.79    913.40
11:05:01 PM      0.78    894.85
11:15:01 PM      0.80   1225.52
5
  • Yaroslav, I'll add some logging for myself, but if PATH isn't set up correctly for cron, I would expect it would never work. I use full paths for all my python and script calls, but admittedly not for the actual gridding commands that I farm out with subprocess. Could it be that some of the multiprocess threads would get a different path than others? Commented Sep 11, 2017 at 9:09
  • From the log (triggered by cron): INFO:09/11/2017 03:23:22:/usr/bin:/bin So I'll set path and see if that helps things out. Strange that it sometimes works, but I'll be happy if I can get it to run all the time. Commented Sep 11, 2017 at 9:25
  • After setting path before spawning each of the processing threads, I see a path which does include the gdal/gmt commands. And it appears to be running correctly. INFO:09/11/2017 03:33:02:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Commented Sep 11, 2017 at 9:34
  • I assume your problem is resolved? Commented Sep 11, 2017 at 9:39
  • I'm hoping so. Since it's not a binary 'works or doesn't work' problem, I can't say for sure. But this gives me a good method to address it. Thanks! Commented Sep 11, 2017 at 9:56

1 Answer 1

1

Since these scripts are called via cron with a configuration file, I would expect consistent results. Most of the time (and it seems always when I'm watching, even if I just run the script cron initiates), the code runs successfully to completion. But sometimes not.

Looks like that's envrionment problem. Try logging os.environ with manual run and with cron job.

The variable to look for is PATH.

PATH is an environmental variable in Linux and other Unix-like operating systems that tells the shell which directories to search for executable files (i.e., ready-to-run programs) in response to commands issued by a user.

If my conjecture is correct - you can resolve this issue by manualy setuping os.environ['PATH'] = "path to binaries dir" and then adding os.environ to cwd param of your subprocess call

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.