KMeans input centroids

Merged Cédric Traizet requested to merge kmean_centroids into develop
All threads resolved!

Summary

Add the option to provide user-defined centroids as the initialization of the KMeans algorithm in KMeansClassification and TrainVectorClassifier.

See feature request #1820 (closed)

Rationale

The result of the KMeans algorithm depends on the input centroids, but it is currently not possible to set them (the first k points of the training sample are used as initialization). This MR adds the possibility to provide the centroids in a text file.
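
To illustrate why the initialization matters, here is a minimal pure-Python sketch of Lloyd-style k-means iterations. It is not the Shark implementation used by OTB; the function names and the toy data are made up for the example. Starting from the first k samples, the run settles on a poor split, while user-provided centroids lead to the two obvious groups.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids (illustrative only)."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each sample to its nearest centroid.
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(s, centroids[i]))
            clusters[nearest].append(s)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # this sketch leaves an empty cluster where it was
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Two groups of points, around x = 0 and x = 4.
samples = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]

# Initialization with the first k samples: both seeds sit in the same group.
from_first_k = kmeans(samples, samples[:2])

# Initialization with user-provided centroids, one per group.
from_user = kmeans(samples, [[0.0, 0.5], [4.0, 0.5]])
```

On this toy data the first run ends with centroids at (2, 0) and (2, 1), which split the samples along the wrong axis, whereas the second run keeps (0, 0.5) and (4, 0.5).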

In the TrainVectorClassifier application, the following parameters have been added:

  • classifier.sharkkm.centroids : input centroid text file
  • classifier.sharkkm.centroidstats : a file containing stats to normalize the input centroids (non mandatory)
  • classifier.sharkkm.outcentroids : a text file containing the output centroids

In the KMeansClassification composite application, the following parameters have been added:

  • incentroids.in : input centroid text file
  • incentroids.normalize : flag for centroid normalization (the stats are already computed in the app for data normalization)

Implementation Details

Input centroid file reading is done using the Shark API (importCSV).

In SharkKMeansMachineLearningModel, the normalization option has been removed. Normalization was possible during training (Train()), using the Shark API to train a normalizer on the input list sample. This option was not used anywhere in OTB, and I removed it because the normalizer cannot be reused afterward during classification. Instead, data normalization should be done prior to training (as is done in the applications).
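
As a sketch of what "normalization prior to training" implies for input centroids, the snippet below applies the same per-feature statistics to the training samples and to the centroids. It assumes the usual (x - mean) / stddev scaling; the function name and all values are hypothetical, and reading the stats file is left out.

```python
def normalize(vectors, means, stddevs):
    """Center and scale each feature: (x - mean) / stddev."""
    return [[(x - m) / s for x, m, s in zip(v, means, stddevs)] for v in vectors]


# Hypothetical per-feature statistics, as they could be read from a stats file.
means = [10.0, 100.0]
stddevs = [2.0, 50.0]

samples = [[12.0, 150.0], [8.0, 50.0]]
centroids = [[10.0, 100.0]]

# Both must go through the same transform so that they live in the same space.
normalized_samples = normalize(samples, means, stddevs)
normalized_centroids = normalize(centroids, means, stddevs)
```

Centroids that were produced by a previous run on normalized data are already in that space and should not be normalized a second time.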

In SharkKMeansMachineLearningModel, I added a method to export the centroids as a text file (using Shark's exportCSV method). This can be used to obtain a human-readable version of the centroids (the serialized model file can be hard to read). The centroids can now be exported in the TrainVectorClassifier application, and KMeansClassification uses this method instead of creating the output centroid file from the serialized file (this was not working anyway, the output centroids were wrong).

This means that the output centroids text file from one KMeans application can be used as the input of another KMeans application!
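
The round trip can be pictured with the following sketch, which assumes one centroid per line with whitespace-separated values. The exact separator written by exportCSV in OTB is not stated here, so treat the layout and both helper names as assumptions.

```python
def format_centroids(centroids):
    """Serialize centroids as text, one centroid per line (assumed layout)."""
    return "\n".join(" ".join(repr(x) for x in c) for c in centroids) + "\n"


def parse_centroids(text):
    """Parse the assumed layout back into a list of float vectors."""
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


# Centroids written by a first run can be read back unchanged by a second run.
first_run_output = [[0.25, 1.5, -3.0], [10.0, 0.125, 7.75]]
text = format_centroids(first_run_output)
second_run_input = parse_centroids(text)
```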

Copyright

The copyright owner is CNES and has signed the ORFEO ToolBox Contributor License Agreement.


Check before merging:

  • All discussions are resolved
  • At least 2 :thumbsup: votes from core developers, no :thumbsdown: vote.
  • The feature branch is (reasonably) up-to-date with the base branch
  • Dashboard is green
  • Copyright owner has signed the ORFEO ToolBox Contributor License Agreement
Edited by Cédric Traizet

Merge request reports

Pipeline #1252 canceled

Pipeline canceled for 97d8b279 on kmean_centroids

Merged by Cédric Traizet 5 years ago (Apr 29, 2019 7:30am UTC)

Pipeline #1261 passed

Pipeline passed for 98a40235 on develop

Activity

  • Cédric Traizet changed milestone to %7.0.0

  • Cédric Traizet added 1 deleted label

  • Cédric Traizet changed title from Kmean input centroids to KMeans input centroids

  • Cédric Traizet changed the description

  • Looks good :)

    1. Just a few points about parameter names for UX:

    In the TrainVectorClassifier application, the following parameters have been added:

    • classifier.sharkkm.centroids : input centroid text file
    • classifier.sharkkm.centroidstats : a file containing stats to normalize the input centroids (non mandatory)
    • classifier.sharkkm.outcentroids : a text file containing the output centroids

    In the KMeansClassification composite application, the following parameters have been added:

    • incentroids.in : input centroid text file
    • incentroids.normalize : flag for centroid normalization (the stats are already computed in the app for data normalization)

    Can we make it so that "input centroid text file" has the same key in both applications? Something like: classifier.sharkkm.centroids and in.centroids?

    Is the parameter incentroids.normalize really useful? Are there cases where it makes sense to have it false?

    2. Can you add a test for the new parameters?
  • If the input centroids file is the output of another KMeans run, the centroids will already be normalized. For images that are close (same sensor, same kind of scenes...), the output of the KMeans algorithm on one image can be a good starting point for KMeans on other images. But maybe nobody will do that in practice, I don't know. Maybe this adds unnecessary complexity to the app. We could remove the option from the KMeansClassification application and leave it in TrainVectorClassifier (normalize if there is a stat file).

    Edited by Cédric Traizet
  • Yes, I agree that would be simpler! Let's do that?

  • Cédric Traizet added 138 commits

    • 69f175a3...44005a23 - 129 commits from branch develop
    • 944792ee - ENH: remove the centroid normalization option in KMeansClassification
    • 9e1f8174 - DOC: rename centroid parameters
    • 63376264 - DOC: rename parameters in TrainVectorClassifier
    • 1df9f3f7 - COMP: move include to ignore warnings from shark
    • ae7bd659 - Merge branch 'develop' into kmean_centroids
    • 3025bf9d - ENH: added a test for KMeans with input centroids
    • f82e3682 - ENH: added a warning if the number of input of centroid is not the same as the number of classes
    • 44e00450 - DOC: more kmeans doc
    • a5a1c34a - STY: clang format

  • Done.

    The new parameter names are centroids.in and centroids.out in KMeansClassification, and classifier.sharkkm.centroids.in, classifier.sharkkm.centroids.stats and classifier.sharkkm.centroids.out in TrainVectorClassifier.
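
As a rough illustration of chaining two runs with the renamed parameters, the sketch below only builds the argument lists and does not run anything. The launcher name otbcli_KMeansClassification and the -in, -out and -nc flags are assumed from the usual OTB command line and are not stated in this thread; the helper name and all file names are hypothetical.

```python
def kmeans_classification_command(image, labels, n_classes,
                                  centroids_in=None, centroids_out=None):
    """Build an argument list for KMeansClassification (illustrative helper)."""
    cmd = ["otbcli_KMeansClassification",
           "-in", image, "-out", labels, "-nc", str(n_classes)]
    if centroids_in is not None:
        cmd += ["-centroids.in", centroids_in]    # parameter named in this MR
    if centroids_out is not None:
        cmd += ["-centroids.out", centroids_out]  # parameter named in this MR
    return cmd


# First image: export the centroids found by the run.
first = kmeans_classification_command(
    "image_a.tif", "labels_a.tif", 5, centroids_out="centroids_a.txt")

# Second image: start from the centroids of the first run.
second = kmeans_classification_command(
    "image_b.tif", "labels_b.tif", 5, centroids_in="centroids_a.txt")
```

Each list could then be passed to subprocess.run on a machine where OTB is installed.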

    I created a new test (+baselines) for KMeansClassification with input centroids. I don't think we need an additional test for TrainVectorClassifier, as this app is a part of the composite application KMeansClassification.

  • Did you compress the tif file with gdal?

  • hmm no ...

    What command should I use? gdal_translate -co "COMPRESS=lzw" src_dataset dst_dataset?

  • Cédric Traizet added 2 commits

    • 91430a7f - ENH: compression baseline for the kmean with input centroid test
    • e89a91b2 - BUG: remove kmeans from regression algorithms

  • I also removed the Shark KMeans algorithm from the classifiers available for regression in LearningApplicationBase. It was listed, but the underlying MachineLearningModel does not support regression: before this fix, calling TrainRegression -classifier sharkkm ... failed with 2019-04-23 17:55:21 (FATAL) TrainRegression: itk::ERROR: Regression mode not implemented.

  • added 1 commit

  • Status: 1 failing test

  • Cédric Traizet added 63 commits

  • In the Shark KMeans algorithm, to prevent the centroids from converging to the same points, a centroid that has no associated points in the training data at the end of an iteration is reinitialized to a random point from the training data. This randomness makes the output of the algorithm platform dependent, and I don't know how we can control the random seed used (if that is possible at all).

    With the input centroid file I used in the test, all the points of the training data set were associated with a single one of the five centroids, so the other centroids were reinitialized after this iteration. Changing the baseline to something more coherent seems to correct the test.
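
The behaviour described above can be mimicked with a small sketch of one k-means iteration in which an empty cluster is reseeded from a random training sample. This is not Shark's code: the function names are made up, and passing an explicitly seeded generator is an assumption used here to show how reproducibility could be obtained in principle.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_step(samples, centroids, rng):
    """One assignment/update iteration; empty clusters are reseeded at random."""
    clusters = [[] for _ in centroids]
    for s in samples:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(s, centroids[i]))
        clusters[nearest].append(s)
    updated = []
    for members in clusters:
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # No sample was assigned: pick a random training sample instead.
            updated.append(list(rng.choice(samples)))
    return updated


samples = [[0.0], [1.0], [2.0]]
# The second centroid is far from every sample, so its cluster stays empty.
bad_centroids = [[1.0], [100.0]]

run_a = kmeans_step(samples, bad_centroids, random.Random(0))
run_b = kmeans_step(samples, bad_centroids, random.Random(0))
```

With the same seed both runs pick the same replacement sample; with an uncontrolled generator the second centroid, and therefore the final labels, may differ from one run or platform to another.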

  • Yannick TANGUY
  • Cédric Traizet resolved all discussions

  • added 1 commit

  • Cédric Traizet mentioned in commit 98a40235
