
ENH: optimize memory usage of FindKNNIndices in SMOTE

Laurențiu Nicola requested to merge sample-augmentation-memory into develop

std::vector::resize doesn't release the vector's allocated capacity, so on an 80k rows x 614 columns dataset this was blowing past 120 GB of RAM.
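To illustrate the underlying behaviour, here is a minimal sketch (the function name `CapacityAfterResize` is made up for this example and is not part of the patch). It grows a vector to `n` elements, resizes it down to `k`, and reports the capacity that is still held:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only: returns the capacity a vector
// still holds after being created with n elements and resized down to k.
std::size_t CapacityAfterResize(std::size_t n, std::size_t k)
{
  std::vector<double> v(n); // allocates storage for n elements
  v.resize(k);              // changes size() only; storage is not released
  return v.capacity();      // still at least n when k <= n
}
```

If one such vector is kept per sample, each of them keeps the full-size buffer alive even though only the first `k` entries are used, which is presumably how the memory usage added up.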

We could simply call shrink_to_fit instead, but switching to a priority_queue doesn't seem to cost any extra CPU, so I think this approach is fine.
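For reference, the general idea of a bounded priority_queue can be sketched as follows. This is not the actual FindKNNIndices code: the function name `KNearestIndices` and the assumption that the distances are already available in a flat vector are mine. The heap never holds more than `k` entries, so memory per query stays at O(k):

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch: returns the indices of the k smallest distances,
// nearest first, keeping at most k entries in memory at any time.
std::vector<std::size_t> KNearestIndices(const std::vector<double>& distances,
                                         std::size_t k)
{
  if (k == 0)
    return {};

  using Entry = std::pair<double, std::size_t>; // (distance, index)
  std::priority_queue<Entry> heap;              // max-heap: worst candidate on top

  for (std::size_t i = 0; i < distances.size(); ++i)
  {
    if (heap.size() < k)
    {
      heap.emplace(distances[i], i);
    }
    else if (distances[i] < heap.top().first)
    {
      // Replace the current worst candidate with a closer one.
      heap.pop();
      heap.emplace(distances[i], i);
    }
  }

  // Pop from worst to best, filling the result from the back.
  std::vector<std::size_t> result(heap.size());
  for (std::size_t pos = heap.size(); pos > 0; --pos)
  {
    result[pos - 1] = heap.top().second;
    heap.pop();
  }
  return result;
}
```

Each insertion costs O(log k), so for small `k` the run time should stay close to that of a full scan, which matches the timings reported here.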

Testing on a subset of 20k rows x 614 columns I get:

# before
268.52user 1.83system 4:31.07elapsed 99%CPU (0avgtext+0avgdata 7089172maxresident)k
1592inputs+503296outputs (3major+1763580minor)pagefaults 0swaps

# after
269.46user 0.39system 4:30.49elapsed 99%CPU (0avgtext+0avgdata 771308maxresident)k
0inputs+503072outputs (0major+183565minor)pagefaults 0swaps

That is roughly a 9x drop in peak resident memory for the same run time, and the difference is even more visible on larger inputs.

Of course, we still keep the input and output in RAM.

Based on !988 (merged) because I'm not patient enough to wait for the file to load.
