PerceiverIO context redundancy: estimations and solutions

The code in !26 (merged) makes it possible to experiment with training a ScalarSITSPerceiver on multi-temporal, multi-modal data.

My first tests show that, with the default parameters (32 learnable queries, 128 channels per query, 4 heads, 6 layers) and the current context encoding (Fourier with 10 frequencies for each context subtoken), a sample with shapes (T, C, W, H):

  • SENTINEL2 torch.Size([138, 14, 100, 100])
  • SENTINEL1A torch.Size([116, 3, 100, 100])
  • SENTINEL1D torch.Size([115, 3, 100, 100])
  • no AGERA5 and no DEM

causes an out-of-memory error on an H100 GPU. With only S2 and S1A, the sample goes through. With S2 alone and 60 frequencies for the Fourier encoding of the geo and time info, we again get an out-of-memory error.
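To make the pressure concrete, here is a back-of-envelope count of the context tokens for the failing sample. The helper and the one-token-per-(date, pixel) assumption are mine, not the actual tokenization, but they give the order of magnitude:

```python
# Hypothetical helper: count context tokens per modality, assuming one token
# per (date, pixel) cell, i.e. T * W * H tokens for a (T, C, W, H) sample.
def context_tokens(shapes):
    return {name: t * w * h for name, (t, c, w, h) in shapes.items()}

shapes = {
    "SENTINEL2": (138, 14, 100, 100),
    "SENTINEL1A": (116, 3, 100, 100),
    "SENTINEL1D": (115, 3, 100, 100),
}
tokens = context_tokens(shapes)
total = sum(tokens.values())
print(f"total tokens: {total:,}")  # total tokens: 3,690,000
# A 128-channel fp32 context tensor alone:
print(f"context tensor: {total * 128 * 4 / 2**30:.2f} GiB")  # 1.76 GiB
```

Close to 4M context tokens means that the encoded context tensor alone is around 1.8 GiB in fp32; with activations kept for backprop over 6 layers plus the attention buffers, several such tensors coexist, which is consistent with the OOM.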

Granted, around 100 dates per modality is a lot, but this is what we get by randomly sampling half of the available dates over a 2-year period.
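A cheaper sampling scheme could keep both long-term and short-term correlations with far fewer dates per sample. A sketch of such mixed-span sampling, with all numbers and names purely illustrative:

```python
import random

# Sketch of mixed-span date sampling: some samples draw their dates from a
# short window (short-term correlations), the rest from the whole period
# (long-term correlations), with far fewer dates per sample than ~100.
def sample_dates(all_dates, n_dates=32, short_window=60, p_short=0.5,
                 rng=random):
    dates = sorted(all_dates)
    if rng.random() < p_short:
        # concentrate this sample over a random `short_window`-day span
        start = rng.choice([d for d in dates if d <= dates[-1] - short_window])
        pool = [d for d in dates if start <= d <= start + short_window]
    else:
        pool = dates  # spread this sample over the full period
    return sorted(rng.sample(pool, min(n_dates, len(pool))))

two_years = list(range(0, 730, 5))  # one acquisition every 5 days
print(len(sample_dates(two_years, rng=random.Random(0))))  # 32
```

At 32 dates instead of ~130 per modality, the context shrinks by roughly 4x before touching the spatial dimensions.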

Some thoughts:

  • we can sample datasets with fewer dates per sample and still keep both long-term and short-term correlations available if some samples are spread over wide time spans and others are concentrated over short ones;
  • the data volume can be reduced by keeping the 6 S2 20 m bands at their native resolution;
  • we can use context embeddings other than Fourier for categorical information like the modality;
  • we need to explore efficient ways of storing the redundant parts of the context;
  • we can explore ways of compressing the context before the cross-attention computation.
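On the modality point: modality is categorical, so a single learned embedding row per modality could replace the Fourier subtoken, and its channel count would no longer grow with the number of frequencies. A sketch with illustrative names, not the actual ScalarSITSPerceiver API:

```python
import torch
import torch.nn as nn

# Hypothetical learned modality embedding: every token of a given modality
# shares one nn.Embedding row, instead of carrying Fourier-encoded channels.
MODALITIES = ["SENTINEL2", "SENTINEL1A", "SENTINEL1D", "AGERA5", "DEM"]

class ModalityEmbedding(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.index = {m: i for i, m in enumerate(MODALITIES)}
        self.table = nn.Embedding(len(MODALITIES), dim)

    def forward(self, modality: str, n_tokens: int) -> torch.Tensor:
        # Broadcast the shared row to all tokens of this modality.
        idx = torch.full((n_tokens,), self.index[modality], dtype=torch.long)
        return self.table(idx)  # (n_tokens, dim)

emb = ModalityEmbedding(dim=16)
print(emb("SENTINEL1A", n_tokens=3).shape)  # torch.Size([3, 16])
```

Since the row is identical for all tokens of a modality, it could even be stored once and broadcast lazily, which also addresses the redundant-storage point.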

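On compressing the context before cross-attention, one simple option would be a strided convolution that merges neighbouring (date, pixel) cells into single tokens before they reach the Perceiver. A sketch, not project code; strides and sizes are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical context compressor: a strided Conv3d merges blocks of
# (t_stride, s_stride, s_stride) neighbouring (date, pixel) cells into
# single tokens, dividing the context length by t_stride * s_stride**2
# (8x with the defaults below).
class ContextCompressor(nn.Module):
    def __init__(self, channels: int, t_stride: int = 2, s_stride: int = 2):
        super().__init__()
        k = (t_stride, s_stride, s_stride)
        self.conv = nn.Conv3d(channels, channels, kernel_size=k, stride=k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, W, H) -> flattened token sequence (B, N, C)
        x = self.conv(x)
        b, c, t, w, h = x.shape
        return x.permute(0, 2, 3, 4, 1).reshape(b, t * w * h, c)

x = torch.randn(1, 8, 12, 10, 10)  # toy context: 12 dates, 10x10 pixels
out = ContextCompressor(channels=8)(x)
print(out.shape)  # torch.Size([1, 150, 8]), i.e. 1200 tokens -> 150
```

Applied per modality, a 2x temporal and 2x spatial stride alone would bring the ~3.7M tokens of the failing sample down to under 500k before any other trick.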
Any other ideas or suggestions?

Edited by Jordi Inglada