Bug or Help requested: "Cgroup mem limit exceeded: Cgroup memsw limit exceeded"
Hi,
I am still struggling to process my S1 tiles on HAL (develop branch, pulled yesterday morning).
I cannot get my S1Tiling job to run to completion.
My job uses 8 parallel processes with 4 OTB threads each, so I asked the scheduler for 32 CPUs (am I right here? 8 is the number of workers, and 4 matches ITK_DEFAULT_MAXIMUM_NUMBER_THREADS?).
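To make my reasoning explicit, here is the arithmetic behind my CPU request (just a sketch of how I read the two config keys; the variable names mirror my .cfg below):

```python
# CPU request for the PBS job, derived from the S1Tiling config values.
nb_parallel_processes = 8   # [Processing] nb_parallel_processes -> number of Dask workers
nb_otb_threads = 4          # [Processing] nb_otb_threads -> threads per OTB application

# Each worker may run one OTB pipeline using nb_otb_threads threads,
# so I request the product from the scheduler.
total_cpus = nb_parallel_processes * nb_otb_threads
print(total_cpus)  # 32, the value I asked PBS for
```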
- I first tested with ram_per_process: 4096 and requested 40 GB in my PBS job. Problem: the job processes a dozen tiles, then starts throwing a lot of memory-usage warnings like:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 5.01 GB -- Worker memory limit: 5.37 GB
(this is the last one; the reported process memory kept increasing over the previous warnings). At some point, the program throws the following error and exits:
Traceback (most recent call last):
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/bin/S1Processor", line 33, in <module>
sys.exit(load_entry_point('S1Tiling', 'console_scripts', 'S1Processor')())
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/work/scratch/cressor/S1Tiling/sources/s1tiling/S1Processor.py", line 327, in main
debug_otb=debug_otb, dryrun=dryrun, debug_tasks=debug_tasks)
File "/work/scratch/cressor/S1Tiling/sources/s1tiling/S1Processor.py", line 238, in process_one_tile
results = client.get(dsk, required_products)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 2725, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 1992, in gather
asynchronous=asynchronous,
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 833, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/client.py", line 1851, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('s1a_31TCK_vv_DES_110_20171220txxxxxx', <Worker 'tcp://127.0.0.1:42411', name: 7, memory: 0, processing: 2>)
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
File "/softs/rh7/Python/3.7.2/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/softs/rh7/Python/3.7.2/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/work/scratch/cressor/S1Tiling/install_with_otb_module/lib/python3.7/site-packages/distributed/process.py", line 235, in _watch_process
assert exitcode is not None
AssertionError
- I then decided to test with more RAM headroom: I set ram_per_process: 1024 and allocated 60 GB of RAM for the job (I know, the margin is a bit excessive!). While the previously described warnings disappeared, the job again processed a dozen tiles and was then killed, the last message being: Cgroup mem limit exceeded: Cgroup memsw limit exceeded
Here is my .cfg file:
[Paths]
output : /home/ad/cressor/work/SENTINEL1/
s1_images : /work/OT/restopt/SENTINEL1_TMP/raw
srtm : /work/datalake/static_aux/MNT/SRTM_30_hgt
geoid_file : /home/ad/cressor/work/egm96.grd
tmp : %(TMPDIR)s/s1tiling
[DataSource]
eodagConfig : /home/ad/cressor/peps.txt
download : True
roi_by_tiles : ALL
polarisation : VV-VH
first_date : 2017-01-01
last_date : 2018-01-01
[Mask]
generate_border_mask: True
[Processing]
calibration: sigma
remove_thermal_noise: True
output_spatial_resolution : 10.
orthorectification_gridspacing : 40
tiles_list_in_file : /work/OT/restopt/jobs/s1_tiling_jobs/tilelist.txt
tile_to_product_overlap_ratio : 0.5
mode : debug logging
nb_parallel_processes : 8
ram_per_process : 1024
nb_otb_threads: 4
[Filtering]
filtering_activated : False
reset_outcore : True
window_radius : 2
I don't know whether this is due to:
- me configuring the tool incorrectly
- a bug in S1Tiling
- PBS jobs needing some extra options to pin processes to CPUs/memory
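In case it helps, here is a sketch of the kind of PBS resource request I am using (assuming PBS Pro "select" syntax; the walltime and the final command line are placeholders, not my exact job file):

```shell
#!/bin/bash
# Hypothetical PBS resource request matching the second test
# (32 CPUs, 60 GB RAM on a single node).
#PBS -l select=1:ncpus=32:mem=60gb
#PBS -l walltime=12:00:00

# TMPDIR is provided by PBS and is referenced by the [Paths] tmp entry above.
S1Processor my_config.cfg
```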