Improve the consistency between ONNX and torch #835

kyakuno · 2024-03-21T06:30:05Z

Thank you for making available an excellent repository on speech synthesis.

After converting the model to ONNX with the included script, I found that the inference quality was lower than with torch, so I investigated and identified the causes.

There are four issues.

1. Missing exponential sampling in the ONNX version of multinomial_sample_one_no_sync

In torch, it is as follows:

    q = torch.empty_like(probs_sort).exponential_(1)

However, in onnx, it was:

    q = torch.random_like(probs_sort)

Therefore, I corrected it to:

    lambda_ = 1.0
    q = -torch.log(torch.rand_like(probs_sort)) / lambda_
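The reason the exponential noise matters can be sketched without torch. If each q[i] is drawn from Exp(1), then argmax(probs[i] / q[i]) returns index i with probability probs[i], and -log(U) / lambda_ with U uniform on (0, 1) is the standard inverse-transform way to draw from Exp(lambda_). The helper names below are hypothetical and only illustrate the idea; they are not the repository's code.

```python
import math
import random


def exponential_sample(rng, lambda_=1.0):
    """Draw from Exp(lambda_) by inverse transform: -log(U) / lambda_."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return -math.log(u) / lambda_


def multinomial_sample_one(probs, rng):
    """Return index i with probability probs[i] via argmax(probs[i] / q[i])."""
    q = [exponential_sample(rng) for _ in probs]
    ratios = [p / e for p, e in zip(probs, q)]
    return max(range(len(probs)), key=ratios.__getitem__)


def empirical_frequencies(probs, num_draws, seed=0):
    """Sample repeatedly and report how often each index was chosen."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(num_draws):
        counts[multinomial_sample_one(probs, rng)] += 1
    return [c / num_draws for c in counts]
```

Calling `empirical_frequencies([0.1, 0.2, 0.7], 20000)` should give frequencies close to the input probabilities. Noise from a different distribution does not have this property in general.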

2. Correction to the SinePositionalEmbedding's pe

In torch:

    tensor([[[ 0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00, 1.0000e+00],

In onnx:

    tensor([[[ 8.4147e-01, 5.4030e-01, 8.2186e-01, ..., 1.0000e+00, 1.0366e-04, 1.0000e+00],

It was not in the [sin, cos, sin, cos] pattern and thus was corrected.
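For reference, here is a minimal sketch of an interleaved sinusoidal table in plain Python. It assumes the common base of 10000 and an even embedding dimension; the function name is hypothetical and this is not the repository's SinePositionalEmbedding. In this layout the row for position 0 is [0, 1, 0, 1, ...], which agrees with the torch output quoted above.

```python
import math


def sine_positional_embedding(num_positions, dim):
    """Build a table whose rows are [sin, cos, sin, cos, ...] per position."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Frequency decays geometrically from 1 down to about 1/10000.
            freq = math.exp(-math.log(10000.0) * i / dim)
            row.append(math.sin(pos * freq))  # even index: sin
            row.append(math.cos(pos * freq))  # odd index: cos
        pe.append(row)
    return pe
```

As a purely numerical observation, the first two ONNX values quoted above, 8.4147e-01 and 5.4030e-01, coincide with sin(1) and cos(1), which is what this sketch produces for position 1. Whether a position offset was the actual cause is not stated in this description.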

3. Introduction of noise_scale in vq_decode

While torch multiplies by the noise_scale, onnx did not do so, hence it was corrected.
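A rough sketch of what the scaling does, assuming a VITS-style prior in which the latent is the mean plus Gaussian noise scaled by exp(log-std). The names `m_p`, `logs_p` and `sample_prior` are assumptions for illustration, not taken from this PR.

```python
import math
import random


def sample_prior(m_p, logs_p, noise_scale, rng):
    """z_p = m_p + eps * exp(logs_p) * noise_scale, with eps ~ N(0, 1)."""
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(logs) * noise_scale
        for m, logs in zip(m_p, logs_p)
    ]
```

With `noise_scale=0.0` the result is exactly the mean, and with `noise_scale=1.0` the noise is applied at full strength. Omitting the factor is equivalent to always using 1.0, so a model run with a smaller value in torch would produce noisier latents in ONNX.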

4. Removal of EOS in first_stage_decode

In torch, EOS in first_stage_decode is ignored, but it was not ignored in onnx, so it was corrected.
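One plausible reading of "ignoring" EOS is that the EOS logit is dropped before sampling, so the first decoded token can never be EOS. The sketch below shows that idea only. The EOS id of 1024 and its position as the last vocabulary index are assumptions, and the function name is hypothetical.

```python
import math

EOS = 1024  # hypothetical EOS id, assumed to be the last vocabulary index


def probs_without_eos(logits, eos_id=EOS):
    """Softmax over all tokens except EOS, so EOS cannot be chosen."""
    kept = [x for i, x in enumerate(logits) if i != eos_id]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in kept]
    total = sum(exps)
    return [e / total for e in exps]
```

Because EOS is assumed to be the last index, the positions in the returned list still line up with the remaining token ids.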

Moreover, cnhubert was not exported to ONNX, so I exported it.
Additionally, I have included a test inference script.

This significantly improves the inference results with ONNX.

You can confirm the differences in the generated audio in the following wav file.
before_after.zip

ZhangJianBeiJing · 2024-03-21T08:42:33Z

Awesome, I had the same problem. Thank you for sharing; this cleared up my doubts.

DonkeyHang · 2024-05-08T02:15:37Z

Soooooo good! I had same problem, thx bro.

kyakuno added 2 commits March 21, 2024 15:16

Improve the consistency between ONNX and torch

825588b

Added inference code

7632175
