PERF: Use Boost.SmallVec in BCO interpolator
Summary
Use a small vector container for the output pixel and interpolator coefficients in BCOInterpolateImageFunction
.
Rationale
This avoids two (for scalar images) or three (for vector ones) for small interpolation radii and number of bands.
before:
286.76user 5.17system 0:36.00elapsed 810%CPU (0avgtext+0avgdata 1013876maxresident)k
2288inputs+15071432outputs (5major+1442780minor)pagefaults 0swaps
290.62user 5.28system 0:34.09elapsed 867%CPU (0avgtext+0avgdata 1013312maxresident)k
35776inputs+15071432outputs (207major+1443014minor)pagefaults 0swaps
291.82user 5.01system 0:34.05elapsed 871%CPU (0avgtext+0avgdata 1013196maxresident)k
18960inputs+15071432outputs (133major+1443078minor)pagefaults 0swaps
after:
260.86user 5.44system 0:37.12elapsed 717%CPU (0avgtext+0avgdata 1013652maxresident)k
4272inputs+15071432outputs (3major+1443217minor)pagefaults 0swaps
257.90user 5.85system 0:38.22elapsed 690%CPU (0avgtext+0avgdata 1014272maxresident)k
8inputs+15071432outputs (1major+1443207minor)pagefaults 0swaps
261.12user 5.58system 0:36.90elapsed 722%CPU (0avgtext+0avgdata 1012968maxresident)k
0inputs+15071432outputs (0major+1443212minor)pagefaults 0swaps
Tested on Ubuntu 18.04. I expect it might make a larger difference on older systems, because the GLIBC memory allocator used to be very slow (it got a per-thread cache in 2.26).
Implementation Details
I'm not sure what's the minimum Boost version we support, so I left in a path for pre-1.58, where small_vector
is not available.
Additional notes
This is technically a breaking change:
- the
BCOInterpolateImageFunctionBase::CoefContainer
type is public and its type has changed -
BCOInterpolateImageFunctionBase::EvaluateCoef
is no longer virtual
Copyright
The copyright owner is Laurențiu Nicola / CS ROMANIA and has signed the ORFEO ToolBox Contributor License Agreement.
Check before merging:
- All discussions are resolved
- At least 2
votes from core developers, no vote. - The feature branch is (reasonably) up-to-date with the base branch
- Dashboard is green
- Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
- Optionally, run
git diff develop... -U0 --no-color | clang-format-diff.py -p1 -i
on latest changes and commit
Merge request reports
Activity
added feature label
added 1 commit
- fa393363 - PERF: Use Boost.SmallVec in BCO interpolator
added breaking label
- Automatically resolved by Laurențiu Nicola
- Automatically resolved by Laurențiu Nicola
- Resolved by Laurențiu Nicola
- Resolved by Laurențiu Nicola
This looks great. What config is used in your benchmark?
After making
EvaluateCoef
non-virtual, the CPU times are longer, but the wall times are shorter:264.82user 5.31system 0:30.86elapsed 875%CPU (0avgtext+0avgdata 1016596maxresident)k 0inputs+15071432outputs (0major+1443220minor)pagefaults 0swaps 267.91user 4.96system 0:31.26elapsed 872%CPU (0avgtext+0avgdata 1016548maxresident)k 0inputs+15071432outputs (0major+1443217minor)pagefaults 0swaps 264.21user 5.65system 0:36.12elapsed 746%CPU (0avgtext+0avgdata 1016396maxresident)k 0inputs+15071432outputs (0major+1443217minor)pagefaults 0swaps
I feel these numbers are too noisy to be reliable, but it's probably fine to keep that change.
And directly returning the vector:
247.04user 5.30system 0:31.45elapsed 802%CPU (0avgtext+0avgdata 1016364maxresident)k 0inputs+15071432outputs (0major+1443212minor)pagefaults 0swaps 247.62user 5.44system 0:33.31elapsed 759%CPU (0avgtext+0avgdata 1016368maxresident)k 0inputs+15071432outputs (0major+1443205minor)pagefaults 0swaps 252.58user 5.18system 0:30.52elapsed 844%CPU (0avgtext+0avgdata 1016552maxresident)k 0inputs+15071432outputs (0major+1443215minor)pagefaults 0swaps
added 1 commit
- 1b99e1f2 - PERF: Use Boost.SmallVec in BCO interpolator
When using out parameters, as the vector is somehow seen as shared, the accumulation would be best done into a local variable instead of
vect[i]
.Also, copying
m_Alpha
into a local variable, to tell the compiler that it cannot change while the function is being executed, should slightly improve the performances.added 1 commit
- e20c2bb2 - PERF: Use Boost.SmallVec in BCO interpolator
- Resolved by Laurențiu Nicola
Updated. I'm pretty sure the compiler can see through that, but it does make the code nicer. I also removed some
abs
calls there weren't doing anything.
Actually, this seems to have regressed back the performance:
268.92user 5.08system 0:31.17elapsed 879%CPU (0avgtext+0avgdata 1016668maxresident)k 112inputs+15071432outputs (1major+1443221minor)pagefaults 0swaps 270.39user 5.11system 0:33.26elapsed 828%CPU (0avgtext+0avgdata 1016628maxresident)k 0inputs+15071432outputs (0major+1443209minor)pagefaults 0swaps 269.68user 5.32system 0:34.94elapsed 787%CPU (0avgtext+0avgdata 1016480maxresident)k 0inputs+15071432outputs (0major+1443214minor)pagefaults 0swaps
Or maybe there's no actual difference.
I don't think they're particularly reliable. It's a desktop system, but I don't seem to be able to control the frequency scaling governor:
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: Invalid argument
and judging by how easily the CPU reaches 80-90 C then goes back, I don't think it's adequately cooled.
Edited by Laurențiu Nicola@lnicola do you confirm that there is an actual speedup? (I had no time to benchmark but I can give a try in coming days)
Edited by Rémi CressonOn my side I measured 12% gain in processing time. Great job @lnicola
@remicress Thanks. I wanted to test on a different computer (a laptop, though), but didn't get a chance. 12% sounds like my best results here, in !697 (comment 84492).
mentioned in issue #1999 (closed)
@remicress see the "additional notes" section in the MR description.
mentioned in commit 33480ce3
changed milestone to %8.0.0
mentioned in issue #2075 (closed)
mentioned in commit 54d2985b
mentioned in issue #1666 (closed)