Decreased performance of OTB applications has been observed on the CNES cluster between RH6 and RH7: if the output type is not float, RH7 takes more time than RH6.
Can you record a pair of profiles using perf record otbApplicationLauncherCommandLine . UpdateSynthesis, then look at the output using perf report? It should be available in the perf package.
And does this only happen when the output image type is not float?
For information, these differences were found in the CNES MAJA processing chain.
It extensively uses OTB applications.
After a lot of investigation we found out that the issue comes from an abnormal use of CastImageFilters during the reading/writing of images when using SWIGPointer within a Python application.
In Valgrind we can see that the number of instructions spent on memory allocation is significantly higher on RH7 than on RH6.
Since this issue is not Python-specific, we think it could happen whenever applications are chained together (pipelines, ...).
We also monitored the execution time of each part of the applications (image reading, writing, processing), and the writing time on RH7 was almost twice that on RH6.
Even with a patch for the OTB issues (#2042, #2043) the problem is still there, as the MAJA applications work on DoubleVectorImage while native OTB applications work on FloatVectorImage:
MAJA (DoubleVector) -> Cast -> OTB (FloatVector) -> Cast -> MAJA (DoubleVector)
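For illustration, here is roughly what each hand-over boils down to in types (an illustrative sketch; the exact cast filter inserted by OTB's application wrappers may differ from a plain itk::CastImageFilter): the data are narrowed to float and widened back to double, and each cast is an extra full pass over the image with a per-pixel conversion of VariableLengthVector values.

#include "otbVectorImage.h"
#include "itkCastImageFilter.h"

using DoubleVectorImage = otb::VectorImage<double>;   // MAJA side
using FloatVectorImage  = otb::VectorImage<float>;    // native OTB application side

using MajaToOtbCast = itk::CastImageFilter<DoubleVectorImage, FloatVectorImage>;
using OtbToMajaCast = itk::CastImageFilter<FloatVectorImage, DoubleVectorImage>;

int main()
{
  auto toFloat  = MajaToOtbCast::New();   // inserted when MAJA hands a double image to an OTB app
  auto toDouble = OtbToMajaCast::New();   // inserted when the OTB app's float output goes back to MAJA
}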
We also work with other image types such as Int16, UInt8, ...
Here is an example of logging on the cast filter ("Casted Bad" means that there was a type mismatch):
I started to create forked OTB applications for double, on BandMath for example, and the performance improvements were huge. We also no longer get the "Casted Bad" logs, as the types match.
However, I think this issue needs a change at the OTB level, as it impacts many chains (see WASP, ...).
One solution that I have already tried on my side is to have several DoExecute methods to handle the various input image types.
As the input pointers might have different types (vector or not, UInt, Float, ...), their processing can differ and use filters more appropriate for these types; more importantly, it will not do any cast.
Not all applications have to be modified for now, only the ones most commonly used with types other than Float (the default for OTB applications).
For example, in the BandMath application there is no need to have an ExtractROI for each band if the input image is a single Image (not a VectorImage).
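Here is a minimal, standalone sketch of that idea (the names are hypothetical and this is not the real otb::Wrapper::Application API): the input pixel type is detected once at run time and execution branches into a DoExecute templated on that type, so the pipeline is built natively on double/int16/float and no CastImageFilter is needed.

#include <iostream>
#include <typeinfo>

enum class PixelType { Float, Double, Int16 };   // stand-in for the wrapper's pixel-type enum

template <typename TPixel>
void DoExecuteTyped()
{
  // Build the whole pipeline on images with TPixel components here: a double
  // input stays double from reader to writer, with no cast in between.
  std::cout << "Pipeline instantiated for " << typeid(TPixel).name() << '\n';
}

void DoExecute(PixelType inputType)              // single run-time dispatch point
{
  switch (inputType)
  {
    case PixelType::Double: DoExecuteTyped<double>(); break;
    case PixelType::Int16:  DoExecuteTyped<short>();  break;
    default:                DoExecuteTyped<float>();  break;   // current OTB default
  }
}

int main()
{
  DoExecute(PixelType::Double);                  // e.g. a MAJA DoubleVectorImage input
}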
Also, please note that adding CastImageFilters to the pipeline has a performance impact of its own, as it artificially enlarges the pipeline.
We should make a list of the applications that would benefit the most from having different DoExecute methods. We could start by refactoring image utility applications like Concatenate, BandMath, etc. Do you have a list of the OTB applications used in MAJA where the refactoring would make a difference?
If the application class starts supporting several DoExecute methods, e.g. a DoExecute templated on the input type, we need to think about how this would be integrated into the class. It should not be too complex or hard to maintain, and the API should not be broken.
Also, how should the case where the application has several inputs be handled? The number of inputs is only known at run time (DoInit).
We are above 100% on RH7 because of the way the nodes are allocated on RH7. We only take 8 threads but the load average covers all cores (12), so when I compute loadavg()/nbcores it is more than 100% since nbcores=8. On RH6 we can allocate the correct number of cores.
These are benchmarks provided by CNES to ensure MAJA performance is not degraded from one version to another. They also ensure that MAJA is always launched with the same configuration everywhere.
Also, it is not a problem of binaries being incompatible with RH7 (we have already ruled that out).
To fix/improve this, we would need support for mono-parameter expression templates, and/or use them to provide a CastInto<> that resolves into an expression template.
Modules/Core/Common/include/itkPixelTraits.h
namespace itk
{
/** \ingroup ITKCommon
 * Generic function to cast a pixel into another pixel type.
 * This function is aimed at writing efficient algorithms that cast pixel
 * values. It is particularly important to not degrade performances on
 * \c itk::VariableLengthVector based pixels.
 *
 * This overload is the default case. For numeric pixels, a simple
 * \c static_cast<> should be enough.
 * @tparam TDest Pixel type of the destination pixel
 * @tparam TOrig Pixel type of the origin pixel
 * @param[out] d Destination pixel that shall receive the value of \c o
 * @param[in]  o Origin pixel
 *
 * @throw Whatever a <tt>static_cast<TDest>(o)</tt> may throw.
 *
 * @post A copy is expected to be made. Hence, use this function to copy a
 * pixel value that could be a proxy into another pixel variable. In other
 * cases, \c itk::MoveInto() is to be preferred.
 * @code
 * RealType & value = m_cachedPixel[threadId];
 * CastInto(value, inputImagePtr->GetPixel(index));
 * @endcode
 *
 * \see <tt>itk::CastInto(VariableLengthVector<TDest>, VariableLengthVector<TOrig>)</tt>
 * \todo Handle the case of \c itk::FixedArray<>
 */
template <typename TDest, typename TOrig>
inline void CastInto(TDest & d, TOrig const& o)
{
  d = static_cast<TDest>(o);
}

/** \ingroup ITKCommon
 * Generic function to convert a pixel into another pixel type.
 * This function is aimed at writing efficient algorithms that move pixel
 * values around. It is particularly important to not degrade performances on
 * \c itk::VariableLengthVector based pixels.
 *
 * This overload is the default case. For numeric pixels, a simple
 * \c static_cast<> should be enough.
 * @tparam TDest Pixel type of the destination pixel
 * @tparam TOrig Pixel type of the origin pixel
 * @param[out] d Destination pixel that shall receive the value of \c o
 * @param[in]  o Origin pixel
 *
 * @throw Whatever a <tt>static_cast<TDest>(o)</tt> may throw.
 * \see <tt>itk::MoveInto(VariableLengthVector<TDest>, VariableLengthVector<TOrig>)</tt>
 * \see <tt>itk::MoveInto(VariableLengthVector<T>, VariableLengthVector<T>)</tt>
 */
template <typename TDest, typename TOrig>
inline void MoveInto(TDest & d, TOrig const& o)
{
  d = static_cast<TDest>(o);
}

// aim: avoid a static cast that generates a new VLV
template <typename TDest, typename TOrig>
inline void CastInto(VariableLengthVector<TDest> & d, VariableLengthVector<TOrig> const& o)
{
  d = o;
}

/**
 * Specialized function to convert a vector pixel into another vector pixel type.
 * This function is aimed at writing efficient algorithms that move pixel
 * values around. It is particularly important to not degrade performances on
 * \c itk::VariableLengthVector based pixels.
 *
 * While this overload does not really move (the internals of) \c o into \c d
 * (as C++11 rvalue-references would permit), it will nonetheless avoid
 * allocating memory for a new \c VariableLengthVector, something that would
 * have been a consequence of a \c static_cast<>.
 *
 * Instead, this overload will use the conversion assignment operator from
 * \c VariableLengthVector to copy (and convert on-the-fly) the value from \c o.
 *
 * \tparam TDest Internal pixel type of the destination pixel
 * \tparam TOrig Internal pixel type of the origin pixel
 * \param[out] d Destination pixel that shall receive the value of \c o
 * \param[in]  o Origin pixel
 * \pre \c d shall not be a \c VariableLengthVector that acts as a proxy.
 *
 * \throw None
 * \see <tt>itk::MoveInto(TDest, TOrig)</tt>
 * \see <tt>itk::MoveInto(VariableLengthVector<T>, VariableLengthVector<T>)</tt>
 * \ingroup DataRepresentation
 * \ingroup ITKCommon
 */
template <typename TDest, typename TOrig>
inline void MoveInto(VariableLengthVector<TDest> & d, VariableLengthVector<TOrig> const& o)
{
  d = o;
}
} // namespace itk
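For illustration, a hypothetical usage sketch (inputImage, index and nbComponents are assumptions for this example):

itk::VariableLengthVector<float> dest(nbComponents);   // allocated once, outside the pixel loop
CastInto(dest, inputImage->GetPixel(index));           // the VLV overload assigns into dest instead of
                                                       // building a new VLV through a static_cast<>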
PS: this is old code. I'm not sure the distinction between MoveInto and CastInto is implemented correctly.
On the subject, I've observed a lot of cycles spent converting pixel types in the s1tiling project while SARCalibration is executed alone -- I guess the same issue can be found in other scenarios.
Here is an excerpt of what callgrind can report.
After some analysis, two things could reduce the conversion cost:
- Don't convert InternalPixelType -> VariableLengthVector<double> -> OutputPixelType (actually, the less we use VariableLengthVector, the less we waste on memory allocations and releases; IMO the use of VLV should be restricted to applications that really manipulate multi-channel images).
- For the cases where we need to convert between vector images and complex images (or other vector images), a pixel component iterator could be used -- I've written the component iterators for array and scalar images; I'll share them once I'm done with complex (number) images. A rough sketch of the idea follows this list.
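As a standalone illustration of that second point (PixelComponents and ConvertPixel are names made up for this sketch, not the announced ITK/OTB iterators), components are accessed uniformly so a conversion can copy them directly, without materialising a VariableLengthVector<double> in between:

#include <complex>
#include <cstddef>

// Uniform access to the components of a pixel; scalar pixels expose one component.
template <typename T>
struct PixelComponents
{
  static std::size_t Size(const T&) { return 1; }
  static T    Get(const T& p, std::size_t) { return p; }
  static void Set(T& p, std::size_t, T v) { p = v; }
};

// std::complex pixels expose two components: the real and imaginary parts.
template <typename T>
struct PixelComponents<std::complex<T>>
{
  static std::size_t Size(const std::complex<T>&) { return 2; }
  static T    Get(const std::complex<T>& p, std::size_t i) { return i == 0 ? p.real() : p.imag(); }
  static void Set(std::complex<T>& p, std::size_t i, T v) { if (i == 0) p.real(v); else p.imag(v); }
};

// Component-wise conversion: no intermediate vector pixel is allocated.
template <typename TDest, typename TSrc>
void ConvertPixel(TDest& d, const TSrc& s)
{
  for (std::size_t i = 0; i < PixelComponents<TSrc>::Size(s); ++i)
    PixelComponents<TDest>::Set(d, i, PixelComponents<TSrc>::Get(s, i));
}

int main()
{
  std::complex<double> in{1.0, -2.0};
  std::complex<float>  out{};
  ConvertPixel(out, in);   // copies both components, converting double -> float on the fly
}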
Another solution (or workaround) to your problem would be to define the pixel type of OTB applications at compile time (e.g. with a CMake variable). OTB applications could be compiled to use float or double as their main type (for IO and for defining filter types).
This would reduce the number of casts in processing chains with custom applications using double for precision.
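A minimal sketch of what that switch could look like (the OTB_APP_USE_DOUBLE macro and the alias names are assumptions for this example, not an existing OTB option):

#include "otbVectorImage.h"

#ifdef OTB_APP_USE_DOUBLE
using AppRealType = double;   // chains like MAJA would build OTB with this, removing the casts
#else
using AppRealType = float;    // today's default behaviour
#endif

using AppVectorImageType = otb::VectorImage<AppRealType>;

int main()
{
  AppVectorImageType::Pointer image = AppVectorImageType::New();  // every filter in the application would use this type
}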
- one with that implementation (specialized applications (BandMath, ConcatenateImages, etc.) for int16, float and double pixel types); this implementation gives satisfying performance and could be merged in OTB 7.2;
- a "feature" issue, to keep in mind that we would like to keep both genericity (one single app that handles all image types efficiently) and performance. For that issue there seems to be a lot of work to do: it may be planned for the OTB 8.x releases.
Bear in mind that the 'MAJA' modified OTB applications are not simply applications specialized for double: they are extensively modified to bypass most of the ApplicationEngine input machinery.
So I don't think they can be merged into OTB as is without further work.
OK, could you describe your implementation and its impacts more precisely? When we discussed with @msavinaud, it seemed quite straightforward to merge your modifications into OTB. But if it has too many side effects, we should discuss the best solution again, one that improves performance and does not delay OTB 7.2.
Is there any way to get these applications merged into a public OTB development branch? As of now, these changes only affect Maja, but it would be good to share them with the OTB community as a whole. @msavinaud mentioned that this might be possible, but since then I haven't heard of any technical details.
I understand that you want Maja to use a standard (or publicly available) version of OTB.
However, for now my solution is just a 'crappy' workaround that doesn't meet OTB's expectations.
We can put it in a branch, but it can't be merged into develop as it bypasses most of the OTBApplication API.
We can't have applications that do this, because it would mean that there is something missing in OTB: