Skip to content

PERF: Improve DEMHandler performances

Luc Hermitte requested to merge lhermitte/otb:OptimDEMHandler into develop

Summary

Drastic improvement of the performances of DEMHandler::GetHeight* functions

Rationale

OTB 8 new DEMHandler class has very slow access times in multithreaded environment. The problem mainly comes from the fact that every time a height is requested at a given position, a mutex will be locked then unlocked.

In https://gitlab.orfeo-toolbox.org/s1-tiling/normlim_sigma0 XYZToImage application (inspired from OTB DEMToImage POC application), execution takes almost 8 hours, and vtune reports in short executions that 50% of the time is spent in mutexes from DEMDetails::DEMValue

Implementation Details

First: thread local DEM handlers

The new design uses a GDALDataSet per thread. All GDALDataSet instances point to the same file.

As per https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe, this new design is sound with GDAL 2.0.

In a direct DEM to XYZ transformation, computation time goes from ~8h to ~1h15.

The current change is half proof of concept, half quick and dirty workaround. In a next iteration, datasets should not be exclusive to one thread but kept in a datapool for reuse between OTB streams.

Then: cached WGS84 coordinate transformations

After that first correction. We see that searching for whether we are in GWS84 can be costly (10% of the time in the DEM to XYZ application). It originates from calls to dynamic_cast (~4% on total CPU time), mutexes (~9% of the remaining mutexes), and a few allocations that can be avoided.

Now the OGRCoordinateTransformation object creation is factorized when the TLS DEMHandler is created -- as there is no reason for changes once the GDALDataset is instantiated.

Notes

This MR indirectly relates to:

  • #2093 regarding "Caching and multi-threading improvements"

  • #2287: a new GetGeoidHeight() function is provided

  • #2265 (closed): Orthorectofication should be faster

    With this MR + !934 (closed), on a 8 cores i5 8th gen w/ SSD HD, Orthorectification executes in ~4min20 instead of ~6min.

    OrthoRectification -map utm -map.utm.zone 33 -map.utm.northhem 1 -outputs.mode autosize -outputs.spacingx 10. -outputs.spacingy -10.  -opt.gridspacing 40  -io.in S1A_IW_GRDH_1SDV_20200108T044150_20200108T044215_030704_038506_C7F5.SAFE/measurement/s1a-iw-grd-vh-20200108t044150-20200108t044215-030704-038506-002.tiff -io.out s1a_bench_opti.tiff -elev.dem SRTM_30_hgt/ -elev.geoid ~Geoid/egm96.grd

Perspectives

As described in the source code a minor improvement is still possible. Instead of creating and destroying instances of DEMHandlerTLS every time a new thread is spawned or terminated, we could use a pool a DEMHandlerTLS instances. It should make a small difference with ITK 4 default threading model that starts and stops threads every time data are generated, IOW for every OTB stream.

The new figures returned by the profiler, show that now most of the time is spent in (gdal)VRTSimpleSource::GetSrcDstWindow() and IRasterIO. Something that should make a much bigger difference would be to not call RasterIO() once for every pixel coordinates we are interested in, but instead, once per tile of pixels.

Lastly, before merging, I highly recommend an in-depth code review, and may be the implementation of DEMHandlerTLS data pools. The DEMHandlerTLS instantiation sequence could and should be simplified as well (as pools are implemented).

Copyright

The copyright owner is CNES and has signed the ORFEO ToolBox Contributor License Agreement.


Check before merging:

  • All discussions are resolved
  • At least 2 👍 votes from core developers, no 👎 vote.
  • The feature branch is (reasonably) up-to-date with the base branch
  • Dashboard is green
  • Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
  • Optionally, run git diff develop... -U0 --no-color | clang-format-diff.py -p1 -i on latest changes and commit
Edited by Luc Hermitte

Merge request reports