
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 #30914

Closed
mosama1994 opened this issue May 20, 2024 · 6 comments

Comments

@mosama1994

System Info

transformers version = 4.40.0
python = 3.10.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am running the code at this link to train Llama 3: https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py
line 101: torch_dtype = torch.bfloat16
line 102: quant_storage_dtype = torch.bfloat16

When I use just float16, it runs fine. But when I use bfloat16 it gives me this error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
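For context, here is a rough sketch of how these two values are typically wired into model loading in an FSDP + QLoRA setup like the linked script; the model id and the remaining BitsAndBytesConfig arguments are illustrative assumptions, not copied from the script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

torch_dtype = torch.bfloat16          # line 101 in the linked script
quant_storage_dtype = torch.bfloat16  # line 102 in the linked script

# 4-bit quantization config; bnb_4bit_quant_storage controls the dtype the
# quantized weights are stored in (relevant for FSDP wrapping).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_quant_storage=quant_storage_dtype,
)

# Placeholder model id; non-quantized layers are loaded in quant_storage_dtype.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=quant_storage_dtype,
    quantization_config=quantization_config,
)
```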

Expected behavior

Using bfloat16 for loading is causing the issue in this code, and I am not sure why. Please help ASAP.

@amyeroberts
Collaborator

cc @younesbelkada @pacman100

@RUFFY-369

Hi @mosama1994, use this context manager before your training block in the file you mentioned:
with torch.cuda.amp.autocast():
Or try moving the context accordingly; I don't have the traceback for your error, but it may solve your issue.
Here is the reference for the proposed solution (a solved issue on the PEFT repo).
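For illustration, here is a minimal standalone sketch (not the original script, and assuming a CUDA device is available) of the kind of mismatch autocast resolves: a float32 input hitting a bfloat16 linear layer triggers the same error, and the suggested context manager makes it run:

```python
import torch
import torch.nn as nn

# A bfloat16 linear layer fed a float32 input reproduces the reported error.
linear = nn.Linear(8, 8).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(2, 8, device="cuda")  # float32 input

# Calling linear(x) directly raises the same kind of error as reported above:
# RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16

# Under autocast, the matmul inputs are cast to a common dtype and it runs.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = linear(x)
print(out.dtype)  # torch.bfloat16
```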

Cheers!

@mosama1994
Author

I had already seen this. How do I use it? Do I just wrap the model loading code? Also, why does this issue still exist? The PEFT repo mentions it should have been solved a year ago.

@RUFFY-369

RUFFY-369 commented May 20, 2024

I had already seen this. How do I use it? Do I just wrap the model loading code?

@mosama1994 Basically, the context manager is used to wrap forward() and is closed before backward().

Also, why does this issue still exist? The PEFT repo mentions it should have been solved a year ago.

I just checked and found that the solution was never merged into the main branch (see the commit history), which is why this error still exists.

So if you want to use autocast, you have to do so in the trainer; otherwise it would be simpler to use the PR solution mentioned in the previous message. Modify your src/peft/tuners/lora.py file with the PR's change and check whether it solves the problem.
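For illustration, a minimal sketch of that pattern with a toy model and optimizer (purely illustrative, not the Trainer internals or the linked script; assumes a CUDA device): autocast wraps the forward pass and the loss computation, while backward() and the optimizer step run outside the context.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2).to(device="cuda", dtype=torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(4, 8, device="cuda")
labels = torch.randint(0, 2, (4,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    logits = model(x)                               # forward runs under autocast
    loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                     # backward runs outside the context
optimizer.step()
```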

@younesbelkada
Contributor

Hi @mosama1994
Thanks for the issue! The solution proposed by @RUFFY-369 is correct. You can also activate autocast through the bf16=True argument in TrainingArguments, which you can set through the --bf16 command-line argument if you use the script you shared.
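For reference, a minimal sketch of that alternative; output_dir and the other values here are placeholders, only bf16=True is the relevant setting:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-fsdp-qlora",   # placeholder path
    bf16=True,                        # enable bfloat16 mixed precision (autocast) in the Trainer
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
)
```

With the linked script, the same option is exposed via the --bf16 command-line flag.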

@mosama1994
Author

Hi @younesbelkada, the solution does work: I wrapped the code in autocast and it is working now. I already have bf16 set to True in the TrainingArguments, and that is exactly the setting under which the issue arises. With fp16=True and bf16=False the code runs, but with bf16=True the error appears. Since the solution works I am closing this, but it should be mentioned in the documentation or fixed. Thanks
