Data augmentation

Merged Jordi Inglada requested to merge jinglada/otb:data-augmentation into develop

Summary

Implement data augmentation techniques to generate synthetic samples for classification.

Rationale

Some classifiers perform poorly when class imbalance is significant. There are different approaches to generating synthetic samples for minority classes. Three strategies are easy to implement:

  1. Replicate existing samples
  2. Jitter existing samples (add noise to each component, in an amount proportional to the variance of that component)
  3. SMOTE (http://dx.doi.org/10.1613/jair.953): take a random point on the line linking 2 close samples of the same class
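The first two strategies can be sketched as follows. This is a minimal illustration, not the code of this merge request: the names `replicateSketch`, `componentStdDev` and `jitterSketch` are hypothetical, the noise is assumed to be zero-mean Gaussian, and its scale is assumed to be the per-component standard deviation divided by a factor, in the spirit of the `-strategy.jitter.stdfactor` parameter described further down.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical type aliases for illustration only.
using Sample = std::vector<double>;
using SampleVector = std::vector<Sample>;

// Strategy 1: cycle through the input samples until `count` copies exist.
SampleVector replicateSketch(const SampleVector& in, std::size_t count)
{
  assert(!in.empty());
  SampleVector out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i)
    out.push_back(in[i % in.size()]);
  return out;
}

// Per-component standard deviation (population form) of the input samples.
// All samples are assumed to have the same number of components.
std::vector<double> componentStdDev(const SampleVector& in)
{
  assert(!in.empty());
  const std::size_t dim = in.front().size();
  const double n = static_cast<double>(in.size());
  std::vector<double> mean(dim, 0.0), sd(dim, 0.0);
  for (const auto& s : in)
    for (std::size_t j = 0; j < dim; ++j)
      mean[j] += s[j] / n;
  for (const auto& s : in)
    for (std::size_t j = 0; j < dim; ++j)
      sd[j] += (s[j] - mean[j]) * (s[j] - mean[j]) / n;
  for (auto& v : sd)
    v = std::sqrt(v);
  return sd;
}

// Strategy 2: replicate, then add zero-mean Gaussian noise to each component.
// Assumption: noise standard deviation = component standard deviation / stdFactor.
SampleVector jitterSketch(const SampleVector& in, std::size_t count,
                          double stdFactor, unsigned seed)
{
  assert(stdFactor > 0.0);
  const std::vector<double> sd = componentStdDev(in);
  std::mt19937 rng(seed);
  std::normal_distribution<double> unitNoise(0.0, 1.0);
  SampleVector out = replicateSketch(in, count);
  for (auto& s : out)
    for (std::size_t j = 0; j < s.size(); ++j)
      s[j] += unitNoise(rng) * sd[j] / stdFactor;
  return out;
}
```

With this scaling, a component that is constant over the input samples receives no noise, and a larger factor yields synthetic samples closer to the originals.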

Implementation Details

Classes and files

The 3 strategies described above are implemented as free functions in a new otb::sampleAugmentation namespace. These functions consume and produce std::vector. A SampleAugmentationFilter, which consumes and produces ogr::DataSource objects, applies the data augmentation by calling the free functions above.
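To illustrate the free-function style on plain std::vector, here is a hedged sketch of the SMOTE strategy. The function names and signatures (`squaredDistance`, `nearestNeighbors`, `smoteSketch`) are hypothetical and are not the ones in the otb::sampleAugmentation namespace. The sketch assumes the caller has already selected the samples of a single class, and it uses a brute-force neighbor search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical type aliases for illustration only.
using Sample = std::vector<double>;
using SampleVector = std::vector<Sample>;

// Squared Euclidean distance; enough for ranking neighbors, no sqrt needed.
double squaredDistance(const Sample& a, const Sample& b)
{
  assert(a.size() == b.size());
  double d = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j)
    d += (a[j] - b[j]) * (a[j] - b[j]);
  return d;
}

// Indices of the (at most) k samples closest to in[idx], excluding idx itself.
std::vector<std::size_t> nearestNeighbors(const SampleVector& in,
                                          std::size_t idx, std::size_t k)
{
  std::vector<std::pair<double, std::size_t>> dist;
  for (std::size_t i = 0; i < in.size(); ++i)
    if (i != idx)
      dist.emplace_back(squaredDistance(in[idx], in[i]), i);
  k = std::min(k, dist.size());
  std::partial_sort(dist.begin(),
                    dist.begin() + static_cast<std::ptrdiff_t>(k), dist.end());
  std::vector<std::size_t> nn;
  for (std::size_t i = 0; i < k; ++i)
    nn.push_back(dist[i].second);
  return nn;
}

// Each synthetic sample is a random point on the segment joining a base
// sample and one of its k nearest neighbors (all inputs share one class).
SampleVector smoteSketch(const SampleVector& in, std::size_t count,
                         std::size_t k, unsigned seed)
{
  assert(in.size() >= 2 && k >= 1);
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> gap(0.0, 1.0);
  SampleVector out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i)
  {
    const std::size_t base = i % in.size();
    const std::vector<std::size_t> nn = nearestNeighbors(in, base, k);
    std::uniform_int_distribution<std::size_t> pick(0, nn.size() - 1);
    const Sample& a = in[base];
    const Sample& b = in[nn[pick(rng)]];
    const double t = gap(rng); // position along the segment a -> b
    Sample s(a.size());
    for (std::size_t j = 0; j < a.size(); ++j)
      s[j] = a[j] + t * (b[j] - a[j]);
    out.push_back(std::move(s));
  }
  return out;
}
```

Keeping the strategy as a function over vectors means it can be exercised without any OGR data source; a filter would only need to convert features to vectors and back.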

The otbOGRDataSourceWrapper class has been updated in order to correctly detect the GeoCSV format (https://giswiki.hsr.ch/GeoCSV).

Applications

The SampleAugmentation application consumes a vector file containing samples (for instance, one produced by the SampleExtraction application) and generates N samples of one user-selected class using one of the 3 strategies:

This is the Sample Extraction (SampleAugmentation) application, version 6.5.0

Generates synthetic samples from a sample data file.
Tags: Learning 

The application takes a sample data file as generated by the SampleExtraction application and generates synthetic samples to increase the number of available samples.

Parameters: 
MISSING -in                        <string>         Input samples  (mandatory)
        -out                       <string>         Output samples  (optional, off by default)
        -field                     <string>         Field Name  (mandatory, default value is )
        -layer                     <int32>          Layer Index  (optional, on by default, default value is 0)
        -label                     <int32>          Label of the class to be augmented  (mandatory, default value is 1)
        -samples                   <int32>          Number of generated samples  (mandatory, default value is 100)
        -exclude                   <string list>    Field names for excluded features.  (mandatory, default value is )
        -strategy                  <string>         Augmentation strategy [replicate/jitter/smote] (mandatory, default value is replicate)
        -strategy.jitter.stdfactor <float>          Factor for dividing the standard deviation of each feature  (mandatory, default value is 10000)
        -strategy.smote.neighbors  <int32>          Number of nearest neighbors.  (mandatory)
        -seed                      <int32>          Random seed.  (optional, off by default)
        -inxml                     <string>         Load otb application from xml file  (optional, off by default)
        -progress                  <boolean>        Report progress 
        -help                      <string list>    Display long help (empty list), or help for given parameters keys

Use -help param1 [... paramN] to see detailed documentation of those parameters.

Examples: 
otbcli_SampleAugmentation -in samples.sqlite -field class -label 3 -samples 100 -out augmented_samples.sqlite -exclude OGC_FID name class originfid -strategy smote -strategy.smote.neighbors 5

Tests

Documentation

Additional notes

The application and the filter can only generate a user-defined number of samples for a single class. It would be nice to provide an option to balance all classes (each would end up with as many samples as the majority class) or to select a set of labels and the number of samples for each class.
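As a rough sketch of the balancing idea, the number of synthetic samples needed per class can be derived from the label counts. This is not part of the merge request; the function name `samplesNeededToBalance` is hypothetical and labels are assumed to be integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

// For each label, the number of synthetic samples that would bring the class
// up to the size of the majority class (0 for the majority class itself).
std::map<int, std::size_t> samplesNeededToBalance(const std::vector<int>& labels)
{
  std::map<int, std::size_t> counts;
  for (int label : labels)
    ++counts[label];

  std::size_t majority = 0;
  for (const auto& kv : counts)
    majority = std::max(majority, kv.second);

  std::map<int, std::size_t> needed;
  for (const auto& kv : counts)
    needed[kv.first] = majority - kv.second;
  return needed;
}
```

Until such an option exists, each (label, count) pair could be passed to a separate run of the application through its -label and -samples parameters.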

Edited by Jordi Inglada

Merged by Julien Michel 6 years ago (Mar 28, 2018 9:01am UTC)

Activity

  • Jordi Inglada added 40 commits

    • 50e631da...751d1724 - 37 commits from branch orfeotoolbox:develop
    • 900ee0d7 - Merge branch 'develop' into data-augmentation
    • 3075e433 - ENH: correct doc name
    • 79ce383f - ENH: remove the option to update the input file

  • Julien Michel resolved all discussions

  • Luc Hermitte
  • Jordi Inglada added 5 commits

    • b61960b4 - ENH: pass vector as const&
    • 9e823ab6 - ENH: compute normalization factor outside of the loop
    • eaf9352b - ENH: square distance is enough for sorting
    • a55ea065 - ENH: avoid copy by moving
    • 68b293a4 - ENH: use find_if instead of hand made loop

  • Jordi Inglada added 70 commits

  • Jordi Inglada added 1 commit

    • 0db768ff - ENH: parallelise the inner loop for cache friendliness
