Skip to content
Snippets Groups Projects

PERF: Use Boost.SmallVec in BCO interpolator

Merged Laurențiu Nicola requested to merge bco-small-vec into develop
All threads resolved!

Summary

Use a small vector container for the output pixel and interpolator coefficients in BCOInterpolateImageFunction.

Rationale

This avoids two (for scalar images) or three (for vector ones) for small interpolation radii and number of bands.

before:
286.76user 5.17system 0:36.00elapsed 810%CPU (0avgtext+0avgdata 1013876maxresident)k
2288inputs+15071432outputs (5major+1442780minor)pagefaults 0swaps
290.62user 5.28system 0:34.09elapsed 867%CPU (0avgtext+0avgdata 1013312maxresident)k
35776inputs+15071432outputs (207major+1443014minor)pagefaults 0swaps
291.82user 5.01system 0:34.05elapsed 871%CPU (0avgtext+0avgdata 1013196maxresident)k
18960inputs+15071432outputs (133major+1443078minor)pagefaults 0swaps

after:
260.86user 5.44system 0:37.12elapsed 717%CPU (0avgtext+0avgdata 1013652maxresident)k
4272inputs+15071432outputs (3major+1443217minor)pagefaults 0swaps
257.90user 5.85system 0:38.22elapsed 690%CPU (0avgtext+0avgdata 1014272maxresident)k
8inputs+15071432outputs (1major+1443207minor)pagefaults 0swaps
261.12user 5.58system 0:36.90elapsed 722%CPU (0avgtext+0avgdata 1012968maxresident)k
0inputs+15071432outputs (0major+1443212minor)pagefaults 0swaps

Tested on Ubuntu 18.04. I expect it might make a larger difference on older systems, because the GLIBC memory allocator used to be very slow (it got a per-thread cache in 2.26).

Implementation Details

I'm not sure what's the minimum Boost version we support, so I left in a path for pre-1.58, where small_vector is not available.

Additional notes

This is technically a breaking change:

  • the BCOInterpolateImageFunctionBase::CoefContainer type is public and its type has changed
  • BCOInterpolateImageFunctionBase::EvaluateCoef is no longer virtual

Copyright

The copyright owner is Laurențiu Nicola / CS ROMANIA and has signed the ORFEO ToolBox Contributor License Agreement.


Check before merging:

  • All discussions are resolved
  • At least 2 :thumbsup: votes from core developers, no :thumbsdown: vote.
  • The feature branch is (reasonably) up-to-date with the base branch
  • Dashboard is green
  • Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
  • Optionally, run git diff develop... -U0 --no-color | clang-format-diff.py -p1 -i on latest changes and commit
Edited by Laurențiu Nicola

Merge request reports

Merge request pipeline #3954 passed

Merge request pipeline passed for e20c2bb2

Merged by Laurențiu NicolaLaurențiu Nicola 5 years ago (Apr 6, 2020 10:26am UTC)

Loading

Pipeline #4127 passed

Pipeline passed for 33480ce3 on develop

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Luc Hermitte
  • Luc Hermitte
  • After making EvaluateCoef non-virtual, the CPU times are longer, but the wall times are shorter:

    264.82user 5.31system 0:30.86elapsed 875%CPU (0avgtext+0avgdata 1016596maxresident)k
    0inputs+15071432outputs (0major+1443220minor)pagefaults 0swaps
    267.91user 4.96system 0:31.26elapsed 872%CPU (0avgtext+0avgdata 1016548maxresident)k
    0inputs+15071432outputs (0major+1443217minor)pagefaults 0swaps
    264.21user 5.65system 0:36.12elapsed 746%CPU (0avgtext+0avgdata 1016396maxresident)k
    0inputs+15071432outputs (0major+1443217minor)pagefaults 0swaps

    I feel these numbers are too noisy to be reliable, but it's probably fine to keep that change.

  • And directly returning the vector:

    247.04user 5.30system 0:31.45elapsed 802%CPU (0avgtext+0avgdata 1016364maxresident)k
    0inputs+15071432outputs (0major+1443212minor)pagefaults 0swaps
    247.62user 5.44system 0:33.31elapsed 759%CPU (0avgtext+0avgdata 1016368maxresident)k
    0inputs+15071432outputs (0major+1443205minor)pagefaults 0swaps
    252.58user 5.18system 0:30.52elapsed 844%CPU (0avgtext+0avgdata 1016552maxresident)k
    0inputs+15071432outputs (0major+1443215minor)pagefaults 0swaps
  • added 1 commit

    • 1b99e1f2 - PERF: Use Boost.SmallVec in BCO interpolator

    Compare with previous version

  • Laurențiu Nicola changed the description

    changed the description

  • When using out parameters, as the vector is somehow seen as shared, the accumulation would be best done into a local variable instead of vect[i].

    Also, copying m_Alpha into a local variable, to tell the compiler that it cannot change while the function is being executed, should slightly improve the performances.

  • added 1 commit

    • e20c2bb2 - PERF: Use Boost.SmallVec in BCO interpolator

    Compare with previous version

  • Actually, this seems to have regressed back the performance:

    268.92user 5.08system 0:31.17elapsed 879%CPU (0avgtext+0avgdata 1016668maxresident)k
    112inputs+15071432outputs (1major+1443221minor)pagefaults 0swaps
    270.39user 5.11system 0:33.26elapsed 828%CPU (0avgtext+0avgdata 1016628maxresident)k
    0inputs+15071432outputs (0major+1443209minor)pagefaults 0swaps
    269.68user 5.32system 0:34.94elapsed 787%CPU (0avgtext+0avgdata 1016480maxresident)k
    0inputs+15071432outputs (0major+1443214minor)pagefaults 0swaps

    Or maybe there's no actual difference.

  • Do you think the user/system/elapse metrics are reliable? To be sure you can use timeprobes and average runtimes over something like 10 runs, measure mean and stdev

  • I don't think they're particularly reliable. It's a desktop system, but I don't seem to be able to control the frequency scaling governor:

    cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: Invalid argument

    and judging by how easily the CPU reaches 80-90 C then goes back, I don't think it's adequately cooled.

    Edited by Laurențiu Nicola
  • I assume the owTvQtWidgetShow failure on centos-xdk-build has nothing to do with these changes.

  • @lnicola do you confirm that there is an actual speedup? (I had no time to benchmark but I can give a try in coming days)

    Edited by Rémi Cresson
  • On my side I measured 12% gain in processing time. Great job @lnicola :thumbsup:

  • @remicress Thanks. I wanted to test on a different computer (a laptop, though), but didn't get a chance. 12% sounds like my best results here, in !697 (comment 84492).

  • mentioned in issue #1999 (closed)

  • This has votes, but I would postpone merging it until at least after the release.

    Will the next version be 7.2 or 8.0? This is technically semver-breaking, so it should go in 8.0.

  • Why did you add the breaking label? I don't see

  • @remicress see the "additional notes" section in the MR description.

  • Laurențiu Nicola resolved all threads

    resolved all threads

  • mentioned in commit 33480ce3

  • note : next version will probably be 8.0 !

  • Laurențiu Nicola changed milestone to %8.0.0

    changed milestone to %8.0.0

  • mentioned in issue #2075 (closed)

  • mentioned in commit 54d2985b

  • mentioned in issue #1666 (closed)

  • Please register or sign in to reply
    Loading