Bug or Help requested: "Cgroup mem limit exceeded: Cgroup memsw limit exceeded"
Hi,
I am still struggling to process my S1 tiles on HAL (develop branch, pulled yesterday morning).
I cannot get my S1Tiling job to run to completion.
My job uses 8 parallel processes with 4 OTB threads each, so I asked the scheduler for 32 CPUs (am I right here? 8 is the number of workers, and 4 matches ITK_DEFAULT_MAXIMUM_NUMBER_THREADS?).
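To make my reasoning explicit, here is the arithmetic behind my CPU request (just a sketch of how I read the two config keys; the variable names mirror my .cfg below):

```python
# CPU request for the PBS job, derived from the S1Tiling config values.
nb_parallel_processes = 8   # [Processing] nb_parallel_processes -> number of Dask workers
nb_otb_threads = 4          # [Processing] nb_otb_threads -> threads per OTB application

# Each worker may run one OTB pipeline using nb_otb_threads threads,
# so I request the product from the scheduler.
total_cpus = nb_parallel_processes * nb_otb_threads
print(total_cpus)  # 32, the value I asked PBS for
```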
- I first tested with ram_per_process: 4096 and requested 40 GB in my PBS job. Problem: the job processes a dozen tiles, then starts throwing a lot of memory-usage warnings like:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 5.01 GB -- Worker memory limit: 5.37 GB
(this is the last one; the reported process memory kept increasing over the previous warnings). At some point, the program throws the following error and exits:
Traceback (most recent call last):
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/bin/S1Processor", line 33, in <module>
sys.exit(load_entry_point('S1Tiling', 'console_scripts', 'S1Processor')())
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/work/scratch/cressor/S1Tiling/sources/s1tiling/S1Processor.py", line 327, in main
debug_otb=debug_otb, dryrun=dryrun, debug_tasks=debug_tasks)
File "/work/scratch/cressor/S1Tiling/sources/s1tiling/S1Processor.py", line 238, in process_one_tile
results = client.get(dsk, required_products)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 2725, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 1992, in gather
asynchronous=asynchronous,
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 833, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 1851, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('s1a_31TCK_vv_DES_110_20171220txxxxxx', <Worker 'tcp://127.0.0.1:42411', name: 7, memory: 0, processing: 2>)
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
File "/softs/rh7/Python/3.7.2/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/softs/rh7/Python/3.7.2/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/process.py", line 235, in _watch_process
assert exitcode is not None
AssertionError
- I then decided to test with more RAM headroom: I set ram_per_process: 1024 and allocated 60 GB of RAM for the job (I know, the margin is a bit excessive!). While the previously described warnings disappeared, the job again processed a dozen tiles and was then killed, the last message being: Cgroup mem limit exceeded: Cgroup memsw limit exceeded
Here is my .cfg file:
[Paths]
output : /home/ad/cressor/work/SENTINEL1/
s1_images : /work/OT/restopt/SENTINEL1_TMP/raw
srtm : /work/datalake/static_aux/MNT/SRTM_30_hgt
geoid_file : /home/ad/cressor/work/egm96.grd
tmp : %(TMPDIR)s/s1tiling
[DataSource]
eodagConfig : /home/ad/cressor/peps.txt
download : True
roi_by_tiles : ALL
polarisation : VV-VH
first_date : 2017-01-01
last_date : 2018-01-01
[Mask]
generate_border_mask: True
[Processing]
calibration: sigma
remove_thermal_noise: True
output_spatial_resolution : 10.
orthorectification_gridspacing : 40
tiles_list_in_file : /work/OT/restopt/jobs/s1_tiling_jobs/tilelist.txt
tile_to_product_overlap_ratio : 0.5
mode : debug logging
nb_parallel_processes : 8
ram_per_process : 1024
nb_otb_threads: 4
[Filtering]
filtering_activated : False
reset_outcore : True
window_radius : 2
I don't know whether this is due to:
- me configuring the tool incorrectly
- a bug in S1Tiling
- PBS jobs needing some extra options to pin processes to CPUs/memory
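In case it helps, here is a sketch of the kind of PBS resource request I am using (assuming PBS Pro "select" syntax; the walltime and the final command line are placeholders, not my exact job file):

```shell
#!/bin/bash
# Hypothetical PBS resource request matching the second test
# (32 CPUs, 60 GB RAM on a single node).
#PBS -l select=1:ncpus=32:mem=60gb
#PBS -l walltime=12:00:00

# TMPDIR is provided by PBS and is referenced by the [Paths] tmp entry above.
S1Processor my_config.cfg
```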