PERF: Improve DEMHandler performances
Summary
Drastic improvement of the performances of DEMHandler::GetHeight*
functions
Rationale
OTB 8 new DEMHandler
class has very slow access times in multithreaded environment. The problem mainly comes from the fact that every time a height is requested at a given position, a mutex will be locked then unlocked.
In https://gitlab.orfeo-toolbox.org/s1-tiling/normlim_sigma0 XYZToImage application (inspired from OTB DEMToImage POC application), execution takes almost 8 hours, and vtune reports in short executions that 50% of the time is spent in mutexes from DEMDetails::DEMValue
Implementation Details
First: thread local DEM handlers
The new design uses a GDALDataSet
per thread. All GDALDataSet
instances point to the same file.
As per https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe, this new design is sound with GDAL 2.0.
In a direct DEM to XYZ transformation, computation time goes from ~8h to ~1h15.
The current change is half proof of concept, half quick and dirty workaround. In a next iteration, datasets should not be exclusive to one thread but kept in a datapool for reuse between OTB streams.
Then: cached WGS84 coordinate transformations
After that first correction. We see that searching for whether we are in GWS84 can be costly (10% of the time in the DEM to XYZ application). It originates from calls to dynamic_cast
(~4% on total CPU time), mutexes (~9% of the remaining mutexes), and a few allocations that can be avoided.
Now the OGRCoordinateTransformation
object creation is factorized when the TLS DEMHandler
is created -- as there is no reason for changes once the
GDALDataset
is instantiated.
Notes
This MR indirectly relates to:
-
#2093 (closed) regarding "Caching and multi-threading improvements"
-
#2287: a new
GetGeoidHeight()
function is provided -
#2265 (closed): Orthorectofication should be faster
With this MR + !934 (closed), on a 8 cores i5 8th gen w/ SSD HD, Orthorectification executes in ~4min20 instead of ~6min.
OrthoRectification -map utm -map.utm.zone 33 -map.utm.northhem 1 -outputs.mode autosize -outputs.spacingx 10. -outputs.spacingy -10. -opt.gridspacing 40 -io.in S1A_IW_GRDH_1SDV_20200108T044150_20200108T044215_030704_038506_C7F5.SAFE/measurement/s1a-iw-grd-vh-20200108t044150-20200108t044215-030704-038506-002.tiff -io.out s1a_bench_opti.tiff -elev.dem SRTM_30_hgt/ -elev.geoid ~Geoid/egm96.grd
Perspectives
As described in the source code a minor improvement is still possible. Instead of creating and destroying instances of DEMHandlerTLS
every time a new thread is spawned or terminated, we could use a pool a DEMHandlerTLS
instances. It should make a small difference with ITK 4 default threading model that starts and stops threads every time data are generated, IOW for every OTB stream.
The new figures returned by the profiler, show that now most of the time is spent in (gdal)VRTSimpleSource::GetSrcDstWindow()
and IRasterIO
.
Something that should make a much bigger difference would be to not call RasterIO()
once for every pixel coordinates we are interested in, but instead, once per tile of pixels.
Lastly, before merging, I highly recommend an in-depth code review, and may be the implementation of DEMHandlerTLS
data pools. The DEMHandlerTLS
instantiation sequence could and should be simplified as well (as pools are implemented).
Copyright
The copyright owner is CNES and has signed the ORFEO ToolBox Contributor License Agreement.
Check before merging:
- All discussions are resolved
- At least 2
👍 votes from core developers, no👎 vote. - The feature branch is (reasonably) up-to-date with the base branch
- Dashboard is green
- Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
- Optionally, run
git diff develop... -U0 --no-color | clang-format-diff.py -p1 -i
on latest changes and commit