
Enormous number of .nemo checkpoints produced in training #9232

Closed
artbataev opened this issue May 17, 2024 · 4 comments · Fixed by #9281
Labels
bug Something isn't working

Comments

artbataev (Collaborator) commented May 17, 2024

Describe the bug

Training FastConformer-Transducer for 76 epochs produced 5 .ckpt files, 1 -last.ckpt file, and 136 (sic!) .nemo files.
I found that the number of saved .nemo checkpoints is greater than or equal to the number of validation runs during training in which the monitored metric improved.

Steps/Code to reproduce bug

The easiest way to reproduce the bug (and check the behavior in tests) is to alter the test test_nemo_checkpoint_always_save_nemo in tests/core/test_exp_manager.py in the following way:

  • "save_top_k:2"
  • check number of checkpoints

In this setup, I expect the number of checkpoints to be either 1 or 2, but not 3.

AssertionError: assert 3 == 1

Full test:

# This is a method of the test class in tests/core/test_exp_manager.py; it relies
# (approximately) on these imports and on the ExampleModel helper defined in that file:
#   from pathlib import Path
#   import pytorch_lightning as pl
#   import torch
#   from nemo.utils.exp_manager import exp_manager
def test_nemo_checkpoint_always_save_nemo(self, tmp_path):
    test_trainer = pl.Trainer(accelerator='cpu', enable_checkpointing=False, logger=False, max_epochs=4)
    exp_manager(
        test_trainer,
        {
            "checkpoint_callback_params": {"save_best_model": True, "always_save_nemo": True, "save_top_k": 2},
            "explicit_log_dir": str(tmp_path / "test"),
        },
    )
    model = ExampleModel()
    test_trainer.fit(model)

    assert Path(str(tmp_path / "test" / "checkpoints" / "default.nemo")).exists()
    # Check the number of `.nemo` checkpoints; currently this fails with 3 instead of 1.
    assert len(list((tmp_path / "test/checkpoints").glob("default*.nemo"))) == 1

    model = ExampleModel.restore_from(str(tmp_path / "test" / "checkpoints" / "default.nemo"))
    assert float(model(torch.tensor([1.0, 1.0], device=model.device))) == 0.0

Expected behavior

Either 1 .nemo checkpoint (as previous NeMo versions produced) or save_top_k .nemo checkpoints.

For FastConformer, I expect 1 or 5 checkpoints, since the config examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml sets save_top_k: 5 and always_save_nemo: True.
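
As a rough illustration only (not part of the original report), a check along these lines captures the expected invariant; the count_nemo_checkpoints helper and the checkpoint directory path below are hypothetical:

# Sketch of the invariant the reporter expects: with always_save_nemo=True, the
# checkpoints directory should hold at most max(1, save_top_k) .nemo files,
# never one per improved validation run.
from pathlib import Path

def count_nemo_checkpoints(ckpt_dir: str) -> int:  # hypothetical helper
    return len(list(Path(ckpt_dir).glob("*.nemo")))

save_top_k = 5  # value from fast-conformer_transducer_bpe.yaml, per the report
num_nemo = count_nemo_checkpoints("exp_dir/checkpoints")  # placeholder path
assert num_nemo <= max(1, save_top_k), f"unexpected number of .nemo files: {num_nemo}"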

Environment overview (please complete the following information)

Reproducible in a Docker image built from the main branch, and also locally on macOS.

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

artbataev added the bug label on May 17, 2024
titu1994 (Collaborator)

@athitten

anteju (Collaborator) commented May 20, 2024

@mikolajblaz, maybe the above is related to #9015?

mikolajblaz (Collaborator) commented May 21, 2024

Yes, I checked and this is connected (@artbataev thanks for a good repro, it helped a lot!).
In #9015 I introduced a mechanism that makes a .nemo backup of an existing checkpoint instead of blindly overwriting it.
Maybe we should remove the backup checkpoint after the new one is successfully created?
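
For illustration only, a minimal sketch of that backup-then-delete pattern and the proposed fix; save_nemo_with_backup, save_fn, and the .bak suffix are hypothetical names, not NeMo's actual API:

import os
import shutil

def save_nemo_with_backup(save_fn, nemo_path: str) -> None:
    """Back up an existing .nemo before writing the new one (as described for #9015),
    then drop the backup once the new file is written successfully."""
    backup_path = nemo_path + ".bak"  # hypothetical backup naming
    if os.path.exists(nemo_path):
        shutil.move(nemo_path, backup_path)  # keep the old .nemo instead of overwriting it
    try:
        save_fn(nemo_path)  # e.g. model.save_to(nemo_path)
    except Exception:
        if os.path.exists(backup_path):
            shutil.move(backup_path, nemo_path)  # restore the old checkpoint on failure
        raise
    # Proposed fix: remove the backup after the new .nemo is successfully created,
    # so backups do not accumulate with every improved validation run.
    if os.path.exists(backup_path):
        os.remove(backup_path)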

titu1994 (Collaborator)

Let's delete the backups.

anteju linked a pull request (#9281) on May 22, 2024 that will close this issue.