Script to compute normalization stats from dataset

We need a utility to compute the data for the SampleNormInfo class used by normalize_measures.

We can build a script that iterates through a CachedDataset, fills a SampleNormInfo instance, and serializes it to disk.

We may want to support different statistics for location (mean or median) and for scale (std, q95 - q05, or max - min).

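We will need to compute the stats online (in a single pass, updating them sample by sample), since we can’t hold the whole dataset in memory. This is straightforward for mean, min and max. For std, we can use Welford's algorithm, which updates the mean and the sum of squared deviations incrementally and is numerically stable.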
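As a rough illustration of the online updates mentioned above, here is a minimal sketch of Welford's algorithm together with running min and max. The `RunningStats` class and its field names are hypothetical, not part of the existing code base, and it handles one scalar measure at a time; a real version would probably keep one accumulator per measure or work on arrays.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical single-pass accumulator for one scalar measure."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's update: adjust the mean first, then accumulate m2
        # using the deviation from both the old and the new mean.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def std(self) -> float:
        # Population standard deviation; divide by (count - 1) instead
        # if the sample estimate is preferred.
        if self.count == 0:
            return float("nan")
        return math.sqrt(self.m2 / self.count)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Only five numbers are kept per measure, so memory use does not depend on the dataset size.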

However, I don’t know how to compute quantiles (including the median) online, other than approximating them from a fixed-size reservoir sample.
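For discussion, the reservoir sampling idea could look roughly like the sketch below: keep a uniform random sample of fixed size while streaming, then estimate quantiles from the sorted sample. The `ReservoirSampler` class, its `capacity` and `seed` parameters, and the linear interpolation in `quantile` are all assumptions made for illustration. The estimates are approximate, and their accuracy depends on the reservoir size.

```python
import random


class ReservoirSampler:
    """Hypothetical fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self._rng = random.Random(seed)

    def update(self, x):
        self.seen += 1
        if len(self.samples) < self.capacity:
            # Fill the reservoir with the first `capacity` values.
            self.samples.append(x)
        else:
            # Keep the new value with probability capacity / seen,
            # replacing a uniformly chosen slot.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = x

    def quantile(self, q):
        # Estimate the q-quantile (0 <= q <= 1) from the sample,
        # interpolating linearly between neighbouring order statistics.
        if not self.samples:
            raise ValueError("no samples seen yet")
        ordered = sorted(self.samples)
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac


sampler = ReservoirSampler(capacity=100, seed=0)
for value in range(11):
    sampler.update(float(value))
median = sampler.quantile(0.5)
```

With this in place, the median would serve as loc and `sampler.quantile(0.95) - sampler.quantile(0.05)` as the q95 - q05 scale. When the stream is shorter than the reservoir, as in this toy usage, the result is exact.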

Edited by Jordi Inglada