Issue when caching samples from VALDIR
Running the script sample_dataset.py
with the config SITSSamplingConfig.base_dir = Path("/work/CESBIO/projects/RELEO/datasets/MMDC_OE/val")
generates some samples before exiting with the following error:
```
File "/work/scratch/data/faucher/releo_mvp/src/releo_mvp/data/dataset_utils.py", line 433, in get_ds_roi
    ds = open_mfdataset(files, combine="by_coords")
ValueError: Resulting object does not have monotonic global indexes along dimension t
```
I found that the error can also occur along other dimensions, such as y.
It would seem that the coordinates in the NetCDF files are overlapping or unordered.
The script sample_dataset.py
with the config SITSSamplingConfig.base_dir = Path("/work/CESBIO/projects/RELEO/datasets/MMDC_OE/train")
works properly. Comparing the NetCDF files of TRAINDIR and VALDIR, we see that the NetCDF files of TRAINDIR only have 1-digit roi ids, while VALDIR also contains 2-digit roi ids.
Looking at the function get_ds_roi() that raises the error:
```python
def get_ds_roi(roi: SensorROI, base_dir: Path = TRAINDIR) -> XDataset:
    """Get an xarray dataset for the sensor data for ROI"""
    sensor_pattern = SensorFolderPattern[roi.sensor]
    pattern = f"{roi.roi.tile}/*/*{sensor_pattern[0]}/*{roi.roi.id}{sensor_pattern[1]}.nc"
    files = sorted([p.absolute() for p in Path(base_dir).rglob(pattern)])
    ds = open_mfdataset(files, combine="by_coords")
    return ds
```
we see that the files are sorted before being passed to open_mfdataset(), which results in an unnatural order like the following: ["openEO_19_clip.nc", "openEO_2_clip.nc", "openEO_20_clip.nc"]. I tried sorting in natural order instead, i.e. ["openEO_2_clip.nc", "openEO_19_clip.nc", "openEO_20_clip.nc"], but the ValueError still occurred.
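For reference, the natural-order sort I tried can be reproduced with a key along these lines (a sketch; the helper name natural_key is mine):

```python
import re

def natural_key(name: str):
    # Split the name into digit and non-digit runs so that the
    # numeric parts compare as integers rather than as strings.
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]

names = ["openEO_19_clip.nc", "openEO_2_clip.nc", "openEO_20_clip.nc"]
print(sorted(names, key=natural_key))
# → ['openEO_2_clip.nc', 'openEO_19_clip.nc', 'openEO_20_clip.nc']
```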
This experiment suggests that the error does not come from unordered coordinates. It more likely comes from overlapping or repeated coordinates.
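To illustrate the overlapping-coordinates hypothesis with hypothetical values: if the y coordinates of two files overlap, their concatenation cannot form a monotonic global index, which is exactly what combine="by_coords" requires. A minimal sketch (the coordinate values below are made up):

```python
def is_monotonic(values):
    """True if the sequence is strictly increasing."""
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical y coordinates of two NetCDF files with overlapping extents
file_a_y = [0.0, 10.0, 20.0, 30.0]
file_b_y = [20.0, 30.0, 40.0]

print(is_monotonic(file_a_y + file_b_y))  # False: 20.0 appears again after 30.0
```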
I tried keeping, from the listed files, only the paths pointing to NetCDF files with a 1-digit roi id, as in TRAINDIR. With this change, sample_dataset.py
runs successfully.
Is this behavior expected?
As a temporary fix, here are the modifications I made to keep only the 1-digit roi paths:
```python
def get_ds_roi(roi: SensorROI, base_dir: Path = TRAINDIR) -> XDataset:
    """Get an xarray dataset for the sensor data for ROI"""
    sensor_pattern = SensorFolderPattern[roi.sensor]
    # pattern = f"{roi.roi.tile}/*/*{sensor_pattern[0]}/*{roi.roi.id}{sensor_pattern[1]}.nc"  # original
    pattern = f"{roi.roi.tile}/*/*{sensor_pattern[0]}/*_{roi.roi.id}{sensor_pattern[1]}.nc"  # temporary fix
    files = sorted([p.absolute() for p in Path(base_dir).rglob(pattern)])
    ds = open_mfdataset(files, combine="by_coords")
    return ds
```
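This change matters because, without the underscore anchor, the glob tail *{roi.roi.id} over-matches as soon as 2-digit ids exist: for roi id 2 it also picks up files of other ROIs whose id merely ends in 2, so two ROIs with overlapping coordinates get opened together. A small demonstration with fnmatch (hypothetical file names):

```python
from fnmatch import fnmatch

files = ["openEO_2_clip.nc", "openEO_12_clip.nc", "openEO_20_clip.nc"]

# Original pattern for roi id 2: also matches the roi id 12 file
print([f for f in files if fnmatch(f, "*2_clip.nc")])
# → ['openEO_2_clip.nc', 'openEO_12_clip.nc']

# Fixed pattern: the leading "_" anchors the roi id
print([f for f in files if fnmatch(f, "*_2_clip.nc")])
# → ['openEO_2_clip.nc']
```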
```python
def discover_dataset(base_dir: Path) -> DatasetDescription:
    """Generate a dataset description inspecting the file system"""
    available_sensors: set[SensorTag] = set()
    available_rois: set[ROIId] = set()
    files: DatasetFiles = {}
    sensor_patterns = SensorFolderPattern
    for tile in discover_tiles(base_dir):
        years = discover_years(base_dir / tile)
        for year in years:
            for sensor in AvailableSensors:
                sensor_pattern = sensor_patterns[sensor.name]
                matching_rois = list(
                    base_dir.glob(
                        # f"{tile}/{year}/{sensor_pattern[0]}/openEO_*{sensor_pattern[1]}.nc"  # original
                        f"{tile}/{year}/{sensor_pattern[0]}/openEO_?{sensor_pattern[1]}.nc"  # temporary fix
                    )
                )
                suffix = f"{sensor_pattern[1]}.nc"
                if matching_rois:
                    available_sensors.add(sensor.name)
                    for roi in matching_rois:
                        roi_id = roi_id_from_file_name(roi, tile, suffix)
                        available_rois.add(roi_id)
                        files[SensorROI(sensor.name, roi_id)] = roi
    return DatasetDescription(
        list(available_sensors), list(available_rois), files, base_dir
    )
```