
Enormous number of .nemo checkpoints produced in training #9232

Closed
artbataev opened this issue May 17, 2024 · 4 comments · Fixed by #9281
Labels
bug Something isn't working

Comments

artbataev (Collaborator) commented May 17, 2024

Describe the bug

Training FastConformer-Transducer for 76 epochs produced 5 .ckpt files, 1 -last.ckpt file, and 136 (sic!) .nemo files.
I found that the number of saved .nemo checkpoints is greater than or equal to the number of validation runs during training in which the monitored metric improved.

Steps/Code to reproduce bug

The easiest way to reproduce the bug (and check the behavior in tests) is to alter the test test_nemo_checkpoint_always_save_nemo in tests/core/test_exp_manager.py in the following way:

  • "save_top_k:2"
  • check number of checkpoints

In this setup, I expect the number of checkpoints to be either 1 or 2, but not 3.

AssertionError: assert 3 == 1

Full test:

# This is a method of the test class in tests/core/test_exp_manager.py; it relies
# (approximately) on these imports and on the ExampleModel helper defined in that file:
#   from pathlib import Path
#   import pytorch_lightning as pl
#   import torch
#   from nemo.utils.exp_manager import exp_manager
def test_nemo_checkpoint_always_save_nemo(self, tmp_path):
    test_trainer = pl.Trainer(accelerator='cpu', enable_checkpointing=False, logger=False, max_epochs=4)
    exp_manager(
        test_trainer,
        {
            "checkpoint_callback_params": {"save_best_model": True, "always_save_nemo": True, "save_top_k": 2},
            "explicit_log_dir": str(tmp_path / "test"),
        },
    )
    model = ExampleModel()
    test_trainer.fit(model)

    assert Path(str(tmp_path / "test" / "checkpoints" / "default.nemo")).exists()
    # Check the number of `.nemo` checkpoints; currently this fails with 3 instead of 1.
    assert len(list((tmp_path / "test/checkpoints").glob("default*.nemo"))) == 1

    model = ExampleModel.restore_from(str(tmp_path / "test" / "checkpoints" / "default.nemo"))
    assert float(model(torch.tensor([1.0, 1.0], device=model.device))) == 0.0

Expected behavior

Either 1 .nemo checkpoint (as previous NeMo versions produced) or save_top_k .nemo checkpoints.

For FastConformer, I expect 1 or 5 checkpoints, since the config examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml sets save_top_k: 5 and always_save_nemo: True.
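
As a rough illustration only (not part of the original report), a check along these lines captures the expected invariant; the count_nemo_checkpoints helper and the checkpoint directory path below are hypothetical:

# Sketch of the invariant the reporter expects: with always_save_nemo=True, the
# checkpoints directory should hold at most max(1, save_top_k) .nemo files,
# never one per improved validation run.
from pathlib import Path

def count_nemo_checkpoints(ckpt_dir: str) -> int:  # hypothetical helper
    return len(list(Path(ckpt_dir).glob("*.nemo")))

save_top_k = 5  # value from fast-conformer_transducer_bpe.yaml, per the report
num_nemo = count_nemo_checkpoints("exp_dir/checkpoints")  # placeholder path
assert num_nemo <= max(1, save_top_k), f"unexpected number of .nemo files: {num_nemo}"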

Environment overview (please complete the following information)

Reproducible in a Docker image built from the main branch, and also locally on macOS.

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

artbataev added the bug label on May 17, 2024
titu1994 (Collaborator)

@athitten

anteju (Collaborator) commented May 20, 2024

@mikolajblaz, maybe the above is related to #9015?

mikolajblaz (Collaborator) commented May 21, 2024

Yes, I checked and this is connected (@artbataev thanks for a good repro, it helped a lot!).
In #9015 I introduced a mechanism that makes a .nemo backup of an existing checkpoint instead of blindly overwriting it.
Maybe we should remove the backup checkpoint after the new one is successfully created?
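
For illustration only, a minimal sketch of that backup-then-delete pattern and the proposed fix; save_nemo_with_backup, save_fn, and the .bak suffix are hypothetical names, not NeMo's actual API:

import os
import shutil

def save_nemo_with_backup(save_fn, nemo_path: str) -> None:
    """Back up an existing .nemo before writing the new one (as described for #9015),
    then drop the backup once the new file is written successfully."""
    backup_path = nemo_path + ".bak"  # hypothetical backup naming
    if os.path.exists(nemo_path):
        shutil.move(nemo_path, backup_path)  # keep the old .nemo instead of overwriting it
    try:
        save_fn(nemo_path)  # e.g. model.save_to(nemo_path)
    except Exception:
        if os.path.exists(backup_path):
            shutil.move(backup_path, nemo_path)  # restore the old checkpoint on failure
        raise
    # Proposed fix: remove the backup after the new .nemo is successfully created,
    # so backups do not accumulate with every improved validation run.
    if os.path.exists(backup_path):
        os.remove(backup_path)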

titu1994 (Collaborator)

Let's delete the backups.

anteju linked a pull request (#9281) on May 22, 2024 that will close this issue.