Implement masked autoencoding training for ScalarSitsPerceiver

Decide whether this needs a new LightningModule or whether it can be implemented as an option of the current one
List masking strategies to implement and define their parameters:
1. Temporal: drop dates in SITS
2. Spectral: drop spectral bands (either the same for all dates or different bands for different dates)
3. Spatial: drop pixels or subpatches (either the same for all dates or different regions for different dates)
4. Modality: drop data sources (again, all dates or only some of them)
5. Token: drop individual tokens
6. A combination of the above
Decide when to perform the masking:
1. At the dataset level: generate a new cached dataset containing both the masked and unmasked versions of each sample. This does not seem easy to implement for spatial masking.
2. During tokenization (do not generate the tokens that should be masked)
3. After tokenization (remove the masked tokens). This seems the easiest option if we do it before positional embedding, since at that point the subtokens (date, band, location, etc.) are still available as explicit values and contain the filtering criteria.
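A minimal sketch of option 3, to make the idea concrete. It assumes each token comes with integer subtokens held in a dict keyed by `"date"` and `"band"`. The function name `mask_tokens`, its parameters, and the array layout are all hypothetical and not taken from the current ScalarSitsPerceiver code. NumPy is used only to keep the sketch framework-independent; the real implementation would presumably operate on tensors.

```python
import numpy as np


def mask_tokens(tokens, subtokens, drop_dates=(), drop_bands=()):
    """Remove tokens whose date or band subtoken is in the drop sets.

    tokens:    (n_tokens, dim) array of token values (hypothetical layout).
    subtokens: dict of 1-D integer arrays of length n_tokens,
               with keys "date" and "band" (hypothetical layout).
    Returns the kept tokens, the kept subtokens, and the boolean keep mask.
    """
    drop = np.isin(subtokens["date"], list(drop_dates))
    drop |= np.isin(subtokens["band"], list(drop_bands))
    keep = ~drop
    kept_subtokens = {name: values[keep] for name, values in subtokens.items()}
    return tokens[keep], kept_subtokens, keep


# Toy sample: 3 dates x 2 bands = 6 tokens of dimension 1.
subtokens = {
    "date": np.array([0, 0, 1, 1, 2, 2]),
    "band": np.array([0, 1, 0, 1, 0, 1]),
}
tokens = np.arange(6, dtype=float).reshape(6, 1)

# Temporal masking: drop date 1 entirely.
kept, kept_sub, keep = mask_tokens(tokens, subtokens, drop_dates=(1,))

# Temporal + spectral masking combined: drop date 1 and band 0.
kept_combo, _, keep_combo = mask_tokens(
    tokens, subtokens, drop_dates=(1,), drop_bands=(0,)
)
```

The returned `keep` mask can be stored alongside the sample so that the loss knows which tokens were hidden. Spatial and modality masking would follow the same pattern with additional subtoken keys.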
Implement the masking
Implement the masked reconstruction loss computation
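As a starting point for the loss, here is a minimal sketch that averages the squared error over masked tokens only. Mean squared error is an assumption, since the loss type is not decided above. The name `masked_mse` and the `(n_tokens, dim)` layout are hypothetical. NumPy is used for illustration; the same logic would need to be written with the framework's tensors to be differentiable.

```python
import numpy as np


def masked_mse(prediction, target, masked):
    """Mean squared error computed over masked tokens only.

    prediction, target: (n_tokens, dim) arrays (hypothetical layout).
    masked: (n_tokens,) boolean array, True where the token was hidden
            from the encoder and must be reconstructed.
    """
    masked = np.asarray(masked, dtype=bool)
    if not masked.any():
        # Nothing was masked, so there is nothing to reconstruct.
        return 0.0
    diff = prediction[masked] - target[masked]
    return float(np.mean(diff ** 2))


# Toy example: 4 tokens of dimension 2, tokens 1 and 3 were masked.
prediction = np.zeros((4, 2))
target = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
masked = np.array([False, True, False, True])

loss = masked_mse(prediction, target, masked)  # (4 + 4 + 16 + 16) / 4 = 10.0
```

Restricting the average to masked tokens keeps visible tokens from dominating the loss, since reconstructing them is trivial for the model.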