KMeans input centroids
Summary
Add the option to provide user-defined centroids to initialize the k-means algorithm in KMeansClassification and TrainVectorClassifier.
See feature request #1820 (closed)
Rationale
The result of the KMeans algorithm depends on the input centroids, but it is currently not possible to set them (the first k points of the training sample are used as initialization). This MR adds the possibility to provide the centroids in a text file.
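To illustrate why the initialization matters, here is a minimal pure-Python sketch of Lloyd-style k-means with an optional list of starting centroids. It is not the Shark or OTB implementation: the function name `kmeans` and its `init_centroids` argument are hypothetical, and only the default (first k samples) mirrors the behavior described above.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(samples, k, init_centroids=None, max_iter=100):
    """Plain Lloyd iterations; returns the final centroids as lists of floats."""
    # Default initialization described above: the first k training samples.
    if init_centroids is None:
        centroids = [list(s) for s in samples[:k]]
    else:
        centroids = [list(c) for c in init_centroids]
    for _ in range(max_iter):
        # Assignment step: group each sample with its nearest centroid.
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(range(len(centroids)), key=lambda i: dist(s, centroids[i]))
            clusters[nearest].append(s)
        # Update step: move each centroid to the mean of its samples.
        # An empty cluster keeps its previous centroid in this sketch.
        updated = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Four corners of a 4 x 1 rectangle, two classes.
samples = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
default_result = kmeans(samples, k=2)
seeded_result = kmeans(samples, k=2, init_centroids=[(0.0, 0.5), (4.0, 0.5)])
```

On this toy data the default initialization settles on a top/bottom split, while the user-provided centroids give the more natural left/right split, so the two runs end with different centroids.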
In the TrainVectorClassifier application, the following parameters have been added:
- classifier.sharkkm.centroids : input centroid text file
- classifier.sharkkm.centroidstats : a file containing stats to normalize the input centroids (optional)
- classifier.sharkkm.outcentroids : a text file containing the output centroids
In the KMeansClassification composite application, the following parameters have been added:
- incentroids.in : input centroid text file
- incentroids.normalize : flag for centroid normalization (the stats are already computed in the app for data normalization)
Implementation Details
Input centroid file reading is done using the Shark API (importCSV).
In SharkKMeansMachineLearningModel
, the normalization option has been removed. Normalization was possible during training (Train()
), using the Shark API to train a normalizer on the input list sample. This option was not used anywhere in OTB, and I removed it because the normalizer cannot be used afterward during classification... Instead the data normalization should be done prior to the training (as it is done in the applications).
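The point about normalizing before training can be sketched as follows: the training samples and the input centroids have to go through the same statistics, otherwise the centroids live in a different feature space than the data. This is only an illustration; the mean/standard-deviation formula and the `normalize` helper are assumptions, not the code used by the applications.

```python
def normalize(rows, mean, stddev):
    """Center and scale each feature with precomputed statistics (assumed formula)."""
    return [
        [(x - m) / s for x, m, s in zip(row, mean, stddev)]
        for row in rows
    ]


# Hypothetical per-feature statistics, e.g. read from a stats file.
mean = [10.0, 100.0]
stddev = [2.0, 50.0]

training = [[8.0, 50.0], [12.0, 150.0]]
centroids = [[10.0, 100.0], [12.0, 150.0]]

# The same statistics are applied to both the data and the centroids.
normalized_training = normalize(training, mean, stddev)
normalized_centroids = normalize(centroids, mean, stddev)
```

Centroids that come from a previous run on normalized data are already in that space and would not need this step again.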
In SharkKMeansMachineLearningModel, I added a method to export the centroids as a text file (using Shark's exportCSV method). This can be used to obtain a human-readable version of the centroids (the serialized model file can be hard to read). The centroids can now be exported in the TrainVectorClassifier application, and KMeansClassification uses this method instead of creating the output centroid file from the serialized file (this was not working anyway: the output centroids were wrong).
This means that the output centroids text file from one k-means application can be used as input to another k-means application!
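The round trip described above (export centroids from one run, feed them to the next) could look like the sketch below. The file layout is an assumption made for illustration: one centroid per line with space-separated values. The exact separator written by Shark's exportCSV is not stated here, and `write_centroids` / `read_centroids` are hypothetical helpers.

```python
import os
import tempfile


def write_centroids(path, centroids, sep=" "):
    """Write one centroid per line (assumed layout, not confirmed by the MR)."""
    with open(path, "w") as f:
        for c in centroids:
            f.write(sep.join(repr(v) for v in c) + "\n")


def read_centroids(path, sep=" "):
    """Read the file back into a list of float lists, skipping blank lines."""
    with open(path) as f:
        return [
            [float(v) for v in line.strip().split(sep)]
            for line in f
            if line.strip()
        ]


centroids = [[0.5, 1.25], [3.0, 4.75]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "centroids.txt")
    write_centroids(path, centroids)   # output of a first k-means run
    reloaded = read_centroids(path)    # input of a second k-means run
```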
Copyright
The copyright owner is CNES and has signed the ORFEO ToolBox Contributor License Agreement.
Check before merging:
- All discussions are resolved
- At least 2 👍 votes from core developers, no 👎 vote.
- The feature branch is (reasonably) up-to-date with the base branch
- Dashboard is green
- Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
Activity
changed milestone to %7.0.0
added CNES backlog app feature labels
Looks good :)
- Just a few points about parameter names for UX:
  In the TrainVectorClassifier application, the following parameters have been added:
  - classifier.sharkkm.centroids : input centroid text file
  - classifier.sharkkm.centroidstats : a file containing stats to normalize the input centroids (non mandatory)
  - classifier.sharkkm.outcentroids : a text file containing the output centroids
  In the KMeansClassification composite application, the following parameters have been added:
  - incentroids.in : input centroid text file
  - incentroids.normalize : flag for centroid normalization (the stats are already computed in the app for data normalization)
  Can we make it so that "input centroid text file" has the same key in both applications? Something like: classifier.sharkkm.centroids and in.centroids?
  Is the parameter incentroids.normalize really useful? Are there cases where it makes sense to have it false?
- Can you add a test for the new parameters?
If the input centroids file is the output of another k-means run, the centroids will already be normalized. For images that are close (same sensor, same kind of scenes...), the output of the k-means algorithm on one image can be a good starting point for k-means on other images. But maybe nobody will do that in practice, I don't know... Maybe this adds unnecessary complexity to the app. We could remove it from the KMeansClassification application and leave it in TrainVectorClassifier (normalize if there is a stat file).
Edited by Cédric Traizet
added 138 commits
- 69f175a3...44005a23 - 129 commits from branch develop
- 944792ee - ENH: remove the centroid normalization option in KMeansClassification
- 9e1f8174 - DOC: rename centroid parameters
- 63376264 - DOC: rename parameters in TrainVectorClassifier
- 1df9f3f7 - COMP: move include to ignore warnings from shark
- ae7bd659 - Merge branch 'develop' into kmean_centroids
- 3025bf9d - ENH: added a test for KMeans with input centroids
- f82e3682 - ENH: added a warning if the number of input of centroid is not the same as the number of classes
- 44e00450 - DOC: more kmeans doc
- a5a1c34a - STY: clang format
Done.
The new parameter names are centroids.in and centroids.out in KMeansClassification, and classifier.sharkkm.centroids.in, classifier.sharkkm.centroids.stats and classifier.sharkkm.centroids.out in TrainVectorClassifier.
I created a new test (+baselines) for KMeansClassification with input centroids. I don't think we need an additional test for TrainVectorClassifier, as this app is part of the composite application KMeansClassification.
I also removed the Shark KMeans algorithm from the unsupervised classifiers available for regression in LearningApplicationBase. It was available, but the underlying MachineLearningModel does not support regression (before this fix, calling TrainRegression -classifier sharkkm ... resulted in 2019-04-23 17:55:21 (FATAL) TrainRegression: itk::ERROR: Regression mode not implemented.)
added 63 commits
- faa4a364...eb92becf - 61 commits from branch develop
- e71da72b - ENH: update baselines
- c5495bf9 - Merge branch 'develop' into kmean_centroids
In the Shark k-means algorithm, to prevent the centroids from converging to the same points, a centroid that has no associated points in the input training data at the end of an iteration is reinitialized to a random point from the training data. This randomness makes the output of the algorithm platform dependent, and I don't know how we can control the random seed used (if it is possible).
With the input centroid file I used in the test, all the points of the training data set were associated with a single one of the five centroids, so the other centroids were reinitialized after this iteration. Changing the baseline to something more coherent seems to fix the test.
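The reinitialization behavior described above can be sketched in a few lines. This is a stand-in written for illustration, not Shark's code: `kmeans_step` is a hypothetical helper, and the explicit `random.Random` seed only shows why a controlled seed would make the result reproducible. Whether Shark exposes such a seed is left open, as in the comment above.

```python
import random
from math import dist


def kmeans_step(samples, centroids, rng):
    """One Lloyd iteration; empty clusters restart from a random training sample."""
    clusters = [[] for _ in centroids]
    for s in samples:
        nearest = min(range(len(centroids)), key=lambda i: dist(s, centroids[i]))
        clusters[nearest].append(s)
    updated = []
    for cl in clusters:
        if cl:
            updated.append([sum(col) / len(cl) for col in zip(*cl)])
        else:
            # No sample was assigned: reinitialize from the training data.
            updated.append(list(rng.choice(samples)))
    return updated


samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Badly placed input centroids: every sample is closest to the first one.
start = [(0.5, 0.5), (50.0, 50.0), (60.0, 60.0)]

# Two runs with the same seed pick the same replacement points.
run_a = kmeans_step(samples, start, random.Random(42))
run_b = kmeans_step(samples, start, random.Random(42))
```

With an uncontrolled generator the replacement points can differ from one run or platform to the next, which is why input centroids that leave clusters empty make a poor test baseline.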
- Automatically resolved by Cédric Traizet
mentioned in commit 98a40235