Script to compute normalization stats from dataset
We need a utility to compute the data for the SampleNormInfo class used by normalize_measures.
We can build a script that iterates through a CachedDataset, fills a SampleNormInfo instance and serializes it to disk.
We may want to use different stats for loc (mean, median) and scale (std, q95-q05, max-min).
We will need to compute the stats online (in a single pass), since we can’t hold the whole dataset in memory. This is straightforward for mean, min and max. For std, we can use Welford's algorithm.
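A minimal sketch of the single-pass update, assuming scalar values arrive one at a time. The RunningStats class and its attribute names are hypothetical and not part of the existing code; the nested lists only stand in for batches read from the dataset.

```python
import math


class RunningStats:
    """Online mean, std, min and max for a stream of scalars (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x):
        # Welford's update: adjust the mean first, then accumulate m2 using
        # the deviation from the old mean times the deviation from the new one.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self):
        # Population standard deviation; divide by (count - 1) instead
        # for the sample version.
        return math.sqrt(self.m2 / self.count) if self.count else float("nan")


stats = RunningStats()
# Stand-in for iterating over the dataset batch by batch.
for batch in ([2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]):
    for value in batch:
        stats.update(value)
```

Only five numbers are kept per measure, so memory use does not depend on the dataset size. One instance per channel or band would be needed if the stats are computed per feature.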
However, I don’t know how to do this for quantiles (including the median) other than approximating with reservoir sampling.
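The reservoir sampling idea could look like the sketch below: keep a fixed-size uniform random sample of the stream (Algorithm R) and read the quantiles off that sample at the end. The ReservoirQuantiles class, its capacity and seed parameters, and the linear interpolation between order statistics are all assumptions made for illustration.

```python
import random


class ReservoirQuantiles:
    """Approximate quantiles from a fixed-size uniform sample (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def update(self, x):
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Fill the reservoir with the first `capacity` values.
            self.sample.append(x)
        else:
            # Keep the new value with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        if not self.sample:
            raise ValueError("no values seen yet")
        ordered = sorted(self.sample)
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac


reservoir = ReservoirQuantiles(capacity=1000)
for value in range(100_000):
    reservoir.update(value)

median = reservoir.quantile(0.5)
scale = reservoir.quantile(0.95) - reservoir.quantile(0.05)
```

The result is an estimate whose error shrinks as the reservoir grows. When the stream is shorter than the capacity, the sample holds every value and the quantiles are exact.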
Edited by Jordi Inglada