Data augmentation

Jordi Inglada requested to merge jinglada/otb:data-augmentation into develop

Summary

Implement data augmentation techniques to generate synthetic samples for classification.

Rationale

Some classifiers perform poorly when class imbalance is severe. There are different approaches to generating synthetic samples for minority classes. Three strategies are easy to implement:

  1. Replicate existing samples
  2. Jitter existing samples (add noise to each component, in an amount proportional to the variance of that component)
  3. SMOTE (http://dx.doi.org/10.1613/jair.953): take a random point on the line linking 2 close samples of the same class
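
The three strategies above can be sketched as follows. This is an illustrative Python sketch, not the C++ code of this merge request: the function names, the brute-force neighbor search, and the use of Gaussian noise are assumptions. The jitter noise scale (per-feature standard deviation divided by a factor) follows the description of the stdfactor parameter in the application help.

```python
import math
import random
from statistics import pstdev


def replicate(samples, n):
    """Return n samples by cycling through copies of the existing ones."""
    return [list(samples[i % len(samples)]) for i in range(n)]


def jitter(samples, n, std_factor, rng):
    """Replicate, then add zero-mean Gaussian noise to each component.

    Assumption: the noise standard deviation of a component is the
    standard deviation of that component divided by std_factor.
    """
    dims = len(samples[0])
    sigmas = [pstdev([s[d] for s in samples]) / std_factor for d in range(dims)]
    return [[x + rng.gauss(0.0, sg) for x, sg in zip(s, sigmas)]
            for s in replicate(samples, n)]


def smote(samples, n, neighbors, rng):
    """Draw a random point on the segment linking a sample to one of its
    nearest neighbors (all samples are assumed to share the same class)."""
    out = []
    for i in range(n):
        idx = i % len(samples)
        base = samples[idx]
        # Brute-force search over the other samples, closest first.
        others = [s for j, s in enumerate(samples) if j != idx]
        nearest = sorted(others, key=lambda s: math.dist(base, s))[:neighbors]
        other = rng.choice(nearest)
        t = rng.random()  # position along the segment, in [0, 1)
        out.append([b + t * (o - b) for b, o in zip(base, other)])
    return out
```

Each function takes the feature vectors of a single class as a list of lists and returns n new vectors; passing a seeded random.Random instance makes the jitter and SMOTE outputs reproducible.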

Implementation Details

Classes and files

The 3 strategies described above are implemented as free functions in a new otb::sampleAugmentation namespace. These functions consume and produce std::vector objects. A SampleAugmentationFilter, which consumes and produces ogr::DataSource objects, applies the data augmentation by calling these free functions.

The otbOGRDataSourceWrapper class has been updated in order to correctly detect the GeoCSV format (https://giswiki.hsr.ch/GeoCSV).

Applications

The SampleAugmentation application consumes a vector file containing samples (for instance, one produced by the SampleExtraction application) and generates N synthetic samples for one user-selected class, using one of the 3 strategies. Its help output is:

This is the Sample Extraction (SampleAugmentation) application, version 6.5.0

Generates synthetic samples from a sample data file.
Tags: Learning 

The application takes a sample data file as generated by the SampleExtraction application and generates synthetic samples to increase the number of available samples.

Parameters: 
MISSING -in                        <string>         Input samples  (mandatory)
        -out                       <string>         Output samples  (optional, off by default)
        -field                     <string>         Field Name  (mandatory, default value is )
        -layer                     <int32>          Layer Index  (optional, on by default, default value is 0)
        -label                     <int32>          Label of the class to be augmented  (mandatory, default value is 1)
        -samples                   <int32>          Number of generated samples  (mandatory, default value is 100)
        -exclude                   <string list>    Field names for excluded features.  (mandatory, default value is )
        -strategy                  <string>         Augmentation strategy [replicate/jitter/smote] (mandatory, default value is replicate)
        -strategy.jitter.stdfactor <float>          Factor for dividing the standard deviation of each feature  (mandatory, default value is 10000)
        -strategy.smote.neighbors  <int32>          Number of nearest neighbors.  (mandatory)
        -seed                      <int32>          Random seed.  (optional, off by default)
        -inxml                     <string>         Load otb application from xml file  (optional, off by default)
        -progress                  <boolean>        Report progress 
        -help                      <string list>    Display long help (empty list), or help for given parameters keys

Use -help param1 [... paramN] to see detailed documentation of those parameters.

Examples: 
otbcli_SampleAugmentation -in samples.sqlite -field class -label 3 -samples 100 -out augmented_samples.sqlite -exclude OGC_FID name class originfid -strategy smote -strategy.smote.neighbors 5

Tests

Documentation

Additional notes

Both the application and the filter can only generate a user-defined number of samples for a single class. It would be nice to provide an option to balance all classes (so that every class has as many samples as the majority class), or to select a set of labels and the number of samples to generate for each of them.
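
Until such an option exists, the number of samples to request for each class can be computed outside the application. The sketch below is a hypothetical helper, not part of this merge request: it takes the list of class labels of the existing samples and returns, for every minority class, how many synthetic samples would bring it up to the size of the majority class.

```python
from collections import Counter


def samples_needed_to_balance(labels):
    """Map each minority label to the number of samples it is missing
    compared with the most frequent label."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {label: majority - c for label, c in counts.items() if c < majority}
```

Each entry of the result could then drive one run of the application, with the key passed as -label and the value as -samples.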

Edited by Jordi Inglada
