Implement padding and masking for batch_size > 1 in ScalarSITSPerceiver
When using a batch_size > 1, the perceiver will receive sequences of possibly different lengths (numbers of tokens). To store them in a single tensor, the sequences need to be padded to the maximum length, and attention operations need to use a padding mask that marks valid tokens.
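As a minimal sketch of the padding step (function name and shapes are illustrative, not taken from the codebase): variable-length token sequences are stacked into one zero-padded tensor, with a boolean mask marking which positions hold real tokens.

```python
import torch

def pad_and_mask(sequences, max_length=None):
    """Pad a list of (num_tokens_i, dim) tensors to a common length.

    Returns:
        batch: (batch_size, max_length, dim) tensor, zero-padded.
        mask:  (batch_size, max_length) bool tensor, True where the
               position holds a valid (non-padding) token.
    """
    if max_length is None:
        max_length = max(s.shape[0] for s in sequences)
    dim = sequences[0].shape[1]
    batch = torch.zeros(len(sequences), max_length, dim)
    mask = torch.zeros(len(sequences), max_length, dtype=torch.bool)
    for i, s in enumerate(sequences):
        n = s.shape[0]
        batch[i, :n] = s
        mask[i, :n] = True
    return batch, mask
```

The mask produced here is what the attention layers would consume downstream.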
The PerceiverIO implementation that we use already knows how to exploit such a mask.
The pad mask can be built during the tokenization step (a max_length parameter will need to be added). All operations following tokenization (e.g. positional embedding) can use the mask to avoid useless computation on padding tokens.
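To illustrate how such a mask is typically consumed by attention (a generic sketch, not the actual PerceiverIO code): padded key positions are set to -inf in the attention logits before the softmax, so they receive zero weight and cannot influence the output.

```python
import torch

def masked_cross_attention(q, k, v, pad_mask):
    """Scaled dot-product attention that ignores padded key positions.

    q:        (batch, num_queries, dim) latent queries.
    k, v:     (batch, num_tokens, dim) padded input tokens.
    pad_mask: (batch, num_tokens) bool, True where the token is valid.
    """
    scale = q.shape[-1] ** 0.5
    scores = torch.einsum("bqd,bkd->bqk", q, k) / scale
    # Broadcast the mask over the query dimension; padded keys get -inf.
    scores = scores.masked_fill(~pad_mask[:, None, :], float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)
```

With this masking, the values stored at padded positions are irrelevant: changing them does not change the attention output.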