Improve the consistency between ONNX and torch #835

Open
wants to merge 2 commits into
base: main

@kyakuno kyakuno commented Mar 21, 2024

Thank you for making available an excellent repository on speech synthesis.

When I used the included script to convert the model to ONNX, I found that the inference accuracy was lower than with torch. I investigated and identified the causes.

There are four issues.

1. Missing exponential sampling in the ONNX version of multinomial_sample_one_no_sync

In torch, it is as follows:

    q = torch.empty_like(probs_sort).exponential_(1)

However, in onnx, it was:

    q = torch.random_like(probs_sort)

Therefore, I corrected it to:

    lambda_ = 1.0
    q = -torch.log(torch.rand_like(probs_sort)) / lambda_
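
The correction above relies on inverse-transform sampling: if U is uniform on (0, 1), then -ln(U) / lambda follows an exponential distribution with rate lambda. The following is a minimal torch-free sketch of why that matters for sampling; the function names here are hypothetical and this is not the repository's code. It assumes the sampler picks the argmax of probs / q, which selects index i with probability proportional to probs[i] only when q is exponentially distributed.

```python
import math
import random

def sample_exponential(rng, lambda_=1.0):
    # Inverse-transform sampling: U ~ Uniform(0, 1) gives -ln(U) / lambda ~ Exp(lambda).
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return -math.log(u) / lambda_

def multinomial_sample_one(probs, rng):
    # Divide each probability by an independent Exp(1) draw and take the argmax.
    # Index i wins with probability probs[i] / sum(probs).
    scores = [p / sample_exponential(rng) for p in probs]
    return max(range(len(probs)), key=scores.__getitem__)

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
num_draws = 20000
counts = [0] * len(probs)
for _ in range(num_draws):
    counts[multinomial_sample_one(probs, rng)] += 1
freqs = [c / num_draws for c in counts]  # empirical frequencies, close to probs
```

With plain uniform noise in place of the exponential draws, the argmax no longer follows the categorical distribution given by probs, which is consistent with the quality drop described above.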

2. Correction to the SinePositionalEmbedding's pe

In torch:

    tensor([[[ 0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00, 1.0000e+00],

In onnx:

    tensor([[[ 8.4147e-01, 5.4030e-01, 8.2186e-01, ..., 1.0000e+00, 1.0366e-04, 1.0000e+00],

It was not in the [sin, cos, sin, cos] pattern and thus was corrected.
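
For reference, the interleaved [sin, cos, sin, cos] layout can be sketched as follows. This assumes the standard Transformer sinusoidal formula with base 10000; the function name is hypothetical and the repository's SinePositionalEmbedding may differ in details such as scaling. At position 0 every sin slot is 0 and every cos slot is 1, matching the torch output quoted above.

```python
import math

def sine_positional_embedding(num_positions, dim):
    # For each even index i: pe[pos][i] = sin(pos * w), pe[pos][i + 1] = cos(pos * w),
    # where w = 10000 ** (-i / dim).
    pe = []
    for pos in range(num_positions):
        row = [0.0] * dim
        for i in range(0, dim, 2):
            angle = pos * math.exp(-math.log(10000.0) * i / dim)
            row[i] = math.sin(angle)
            if i + 1 < dim:
                row[i + 1] = math.cos(angle)
        pe.append(row)
    return pe

pe = sine_positional_embedding(num_positions=4, dim=8)
print(pe[0])  # row for position 0: sin slots are 0.0, cos slots are 1.0
```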

3. Introduction of noise_scale in vq_decode

While torch multiplies by the noise_scale, onnx did not do so, hence it was corrected.
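
As a rough illustration of what the missing factor changes, here is a minimal sketch that assumes VITS-style latent sampling, z = mean + eps * exp(log_std) * noise_scale. That formula, the function name, and the argument names are assumptions for illustration and are not taken from this PR's diff.

```python
import math
import random

def sample_latent(mean, log_std, noise_scale, rng):
    # z = mean + eps * exp(log_std) * noise_scale, with eps ~ N(0, 1) per element.
    # Leaving out noise_scale is equivalent to always using noise_scale = 1.0.
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
        for m, s in zip(mean, log_std)
    ]

mean = [0.5, -1.0, 2.0]
log_std = [0.0, -0.5, 0.3]
z_full = sample_latent(mean, log_std, 1.0, random.Random(0))
z_half = sample_latent(mean, log_std, 0.5, random.Random(0))  # same noise, half the spread
```

Under this assumption, an export that drops the factor samples with more variance than the torch path whenever noise_scale is below 1.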

4. Removal of EOS in first_stage_decode

In torch, EOS in first_stage_decode is ignored, but it was not ignored in onnx, so it was corrected.
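
One way to picture "ignoring EOS" is to drop the end-of-sequence token from the generated ids before they are passed on. The sketch below is only an illustration: the EOS id, the function name, and the choice to cut at the first EOS are all hypothetical and not taken from the PR.

```python
EOS = 1024  # hypothetical EOS token id, chosen for illustration only

def strip_eos(tokens, eos=EOS):
    # Keep everything before the first EOS; if there is no EOS, keep all tokens.
    if eos in tokens:
        return tokens[: tokens.index(eos)]
    return list(tokens)

generated = [12, 845, 33, 1024]
semantic_tokens = strip_eos(generated)  # the EOS id is not passed to the next stage
```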

Moreover, cnhubert was not exported to ONNX, so I exported it.
Additionally, I have included a test inference script.

This significantly improves the inference results with ONNX.

You can confirm the differences in the generated audio in the following wav file.
before_after.zip

@ZhangJianBeiJing

Awesome, I had the same problem too. Thank you for sharing; this cleared up my doubts.

@DonkeyHang
DonkeyHang commented May 8, 2024

Soooooo good! I had the same problem, thx bro.
