CI runners seem to stop working halfway and still report successful execution
Some tests that are known to fail are still reported as successful.
For instance:
- https://gitlab.orfeo-toolbox.org/s1-tiling/s1tiling/-/jobs/44026
- https://gitlab.orfeo-toolbox.org/s1-tiling/s1tiling/-/jobs/44000
- https://gitlab.orfeo-toolbox.org/s1-tiling/s1tiling/-/jobs/43872 -- this one is not on master
The expected result should look like: https://gitlab.orfeo-toolbox.org/s1-tiling/s1tiling/-/jobs/44035
We see that pytest seems to halt halfway and the job still reports that everything is fine, but the final report summary is missing:
---------- generated xml file: /builds/s1-tiling/s1tiling/report.xml -----------
=========================== short test summary info ============================
FAILED tests/test_0200306-NR.py::test_33NWB_202001_NR[baselinedir0-srtmdir0-False-outputdir0-tmpdir0-False]
=================== 1 failed, 1 passed in 691.43s (0:11:31) ====================
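A truncated job log could be detected by looking for that closing line. The following is only a minimal sketch: the helper name `has_pytest_summary` is hypothetical, and it assumes the job log is available as plain text, without ANSI colour codes or trailing carriage returns.

```python
import re

# Matches pytest's closing line, e.g.
# "=================== 1 failed, 1 passed in 691.43s (0:11:31) ===================="
# Assumption: the log is plain text (no ANSI escape sequences around the line).
PYTEST_SUMMARY = re.compile(
    r"^=+ .*\bin \d+(?:\.\d+)?s(?: \(\d+:\d{2}:\d{2}\))? =+$",
    re.MULTILINE,
)


def has_pytest_summary(log_text: str) -> bool:
    """Return True when the log contains pytest's final summary line."""
    return PYTEST_SUMMARY.search(log_text) is not None
```

A job whose log fails this check most likely stopped before pytest finished, whatever status the runner reported.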
In some other jobs, it seems that nothing has been run at all.
Actually, we can see that the log halts in the middle of the pip execution:
Building wheel for S1Tiling (setup.py): started
The expected following line
Building wheel for S1Tiling (setup.py): finished with status 'done'
is nowhere to be found.
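The same kind of check could be applied to the pip output: every "started" line should have a matching "finished" line. This is a sketch under the same assumptions (plain-text log, hypothetical helper name `unfinished_wheel_builds`), relying only on the two message formats quoted above.

```python
import re

# Message formats taken from the pip log lines quoted in this issue.
STARTED = re.compile(r"Building wheel for (\S+) \([^)]*\): started")
FINISHED = re.compile(r"Building wheel for (\S+) \([^)]*\): finished with status")


def unfinished_wheel_builds(log_text: str) -> list:
    """Return the packages whose wheel build started but never finished."""
    started = [m.group(1) for m in STARTED.finditer(log_text)]
    finished = {m.group(1) for m in FINISHED.finditer(log_text)}
    return [name for name in started if name not in finished]
```

A non-empty result means the log stopped while pip was still building, which matches the symptom described here.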
After further investigation, it may be related to a communication issue between the job and the CI when Kubernetes is used. See the related issues:
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/3175
- https://github.com/kubernetes/kubernetes/issues/66661
- https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4119
- https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1775/diffs#c5104256d4f07908327fd933b95e7ee542874483
whose final resolution seems to be currently tracked under https://gitlab.com/gitlab-org/gitlab-runner/-/issues/26654.
Should we make sure our jobs spend their time logging, and/or should we use the new FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false option?