Describe the bug
Training FastConformer-Transducer for 76 epochs produced 5 .ckpt files, 1 -last.ckpt file, and 136 (sic!) .nemo files.
I found that the number of saved .nemo checkpoints is greater than or equal to the number of validation runs during training that yielded an improvement.
Steps/Code to reproduce bug
The easiest way to reproduce the bug (and to check the behavior in tests) is to alter the test test_nemo_checkpoint_always_save_nemo in tests/core/test_exp_manager.py in the following way (see the sketch after this list):
- set "save_top_k": 2 in the checkpoint callback params
- check the number of checkpoints
In this setup, I expect the number of checkpoints to be either 1 or 2, but not 3.
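For illustration, a minimal sketch of the altered test, assuming the shape of the existing test_nemo_checkpoint_always_save_nemo and the ExampleModel helper from the same test file (both assumptions; the real test may set other fields):

```python
# Sketch of the altered test; ExampleModel and the exact exp_manager
# fields are assumed to match tests/core/test_exp_manager.py.
import pytorch_lightning as pl
from nemo.utils.exp_manager import exp_manager


def test_nemo_checkpoint_save_top_k(tmp_path):
    trainer = pl.Trainer(
        accelerator="cpu", enable_checkpointing=False, logger=False, max_epochs=4
    )
    exp_manager(
        trainer,
        {
            # changed: save_top_k: 2
            "checkpoint_callback_params": {"save_top_k": 2, "always_save_nemo": True},
            "explicit_log_dir": str(tmp_path / "test"),
        },
    )
    model = ExampleModel()  # small model already defined in the test file
    trainer.fit(model)

    # Expected: 1 or 2 .nemo files; the bug produces 3.
    nemo_files = list((tmp_path / "test" / "checkpoints").glob("*.nemo"))
    assert len(nemo_files) <= 2
```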
Expected behavior
1 .nemo checkpoint (as previous NeMo versions produced) or save_top_k .nemo checkpoints.
For FastConformer, I expect 1 or 5 checkpoints, since save_top_k: 5 and always_save_nemo: True are set in the config examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml.
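For context, the relevant part of that config looks roughly like this (only the two fields above are stated in the issue; the surrounding exp_manager structure is the usual NeMo layout and may differ in detail):

```yaml
exp_manager:
  checkpoint_callback_params:
    save_top_k: 5
    always_save_nemo: True
```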
Environment overview (please complete the following information)
Reproducible in Docker built from the main branch, and also locally on macOS.
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
OS version
PyTorch version
Python version
Additional context
Add any other context about the problem here.
Example: GPU model
Yes, I checked, and this is connected (@artbataev, thanks for a good repro, it helped a lot!).
In #9015 I introduced a mechanism that makes a backup of an existing .nemo checkpoint instead of blindly overwriting it.
Maybe we should remove the backup checkpoint after the new one is successfully created?
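A rough sketch of that idea (hypothetical; the actual save logic lives in NeMo's checkpoint callback, and the backup naming here is an assumption):

```python
import os


def save_nemo_with_backup(model, nemo_path: str) -> None:
    """Sketch of the proposed flow; `model` is any NeMo ModelPT."""
    backup_path = nemo_path + ".bak"  # hypothetical backup name
    if os.path.exists(nemo_path):
        # Keep a backup so a failed save does not destroy the old
        # checkpoint (the mechanism introduced in #9015).
        os.replace(nemo_path, backup_path)
    try:
        model.save_to(nemo_path)  # ModelPT.save_to writes the .nemo archive
    except Exception:
        # Restore the backup if saving failed.
        if os.path.exists(backup_path):
            os.replace(backup_path, nemo_path)
        raise
    # Proposed fix: drop the backup once the new .nemo exists, so stale
    # backups do not accumulate (136 files in the report above).
    if os.path.exists(backup_path):
        os.remove(backup_path)
```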