Drastic improvement of the performances of
OTB 8 new
DEMHandler class has very slow access times in multithreaded environment. The problem mainly comes from the fact that every time a height is requested at a given position, a mutex will be locked then unlocked.
In https://gitlab.orfeo-toolbox.org/s1-tiling/normlim_sigma0 XYZToImage application (inspired from OTB DEMToImage POC application), execution takes almost 8 hours, and vtune reports in short executions that 50% of the time is spent in mutexes from
First: thread local DEM handlers
The new design uses a
GDALDataSet per thread. All
GDALDataSet instances point to the same file.
As per https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe, this new design is sound with GDAL 2.0.
In a direct DEM to XYZ transformation, computation time goes from ~8h to ~1h15.
The current change is half proof of concept, half quick and dirty workaround. In a next iteration, datasets should not be exclusive to one thread but kept in a datapool for reuse between OTB streams.
Then: cached WGS84 coordinate transformations
After that first correction. We see that searching for whether we are in GWS84 can be costly (10% of the time in the DEM to XYZ application). It originates from calls to
dynamic_cast (~4% on total CPU time), mutexes (~9% of the remaining mutexes), and a few allocations that can be avoided.
OGRCoordinateTransformation object creation is factorized when the TLS
DEMHandler is created -- as there is no reason for changes once the
GDALDataset is instantiated.
This MR indirectly relates to:
#2093 regarding "Caching and multi-threading improvements"
#2287: a new
GetGeoidHeight()function is provided
#2265: Orthorectofication should be faster
With this MR + !934, on a 8 cores i5 8th gen w/ SSD HD, Orthorectification executes in ~4min20 instead of ~6min.
OrthoRectification -map utm -map.utm.zone 33 -map.utm.northhem 1 -outputs.mode autosize -outputs.spacingx 10. -outputs.spacingy -10. -opt.gridspacing 40 -io.in S1A_IW_GRDH_1SDV_20200108T044150_20200108T044215_030704_038506_C7F5.SAFE/measurement/s1a-iw-grd-vh-20200108t044150-20200108t044215-030704-038506-002.tiff -io.out s1a_bench_opti.tiff -elev.dem SRTM_30_hgt/ -elev.geoid ~Geoid/egm96.grd
As described in the source code a minor improvement is still possible. Instead of creating and destroying instances of
DEMHandlerTLS every time a new thread is spawned or terminated, we could use a pool a
DEMHandlerTLS instances. It should make a small difference with ITK 4 default threading model that starts and stops threads every time data are generated, IOW for every OTB stream.
The new figures returned by the profiler, show that now most of the time is spent in
Something that should make a much bigger difference would be to not call
RasterIO() once for every pixel coordinates we are interested in, but instead, once per tile of pixels.
Lastly, before merging, I highly recommend an in-depth code review, and may be the implementation of
DEMHandlerTLS data pools. The
DEMHandlerTLS instantiation sequence could and should be simplified as well (as pools are implemented).
The copyright owner is CNES and has signed the ORFEO ToolBox Contributor License Agreement.
Check before merging:
- All discussions are resolved
- At least 2
👍votes from core developers, no 👎vote.
- The feature branch is (reasonably) up-to-date with the base branch
- Dashboard is green
- Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
- Optionally, run
git diff develop... -U0 --no-color | clang-format-diff.py -p1 -ion latest changes and commit