
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 #30914

Closed
mosama1994 opened this issue May 20, 2024 · 6 comments

Comments

@mosama1994

System Info

transformers version = 4.40.0
python = 3.10.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am running the code at this link to train Llama 3: https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py
line 101: torch_dtype = torch.bfloat16
line 102: quant_storage_dtype = torch.bfloat16

When I use just float16, it runs fine. But when I use bfloat16 it gives me this error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
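For context, here is a rough sketch of how these two values are typically wired into model loading in an FSDP + QLoRA setup like the linked script; the model id and the remaining BitsAndBytesConfig arguments are illustrative assumptions, not copied from the script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

torch_dtype = torch.bfloat16          # line 101 in the linked script
quant_storage_dtype = torch.bfloat16  # line 102 in the linked script

# 4-bit quantization config; bnb_4bit_quant_storage controls the dtype the
# quantized weights are stored in (relevant for FSDP wrapping).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_quant_storage=quant_storage_dtype,
)

# Placeholder model id; non-quantized layers are loaded in quant_storage_dtype.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=quant_storage_dtype,
    quantization_config=quantization_config,
)
```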

Expected behavior

Using bfloat16 for loading is causing the issue in this code, and I am not sure why. Please help ASAP.

@amyeroberts
Collaborator

cc @younesbelkada @pacman100

@RUFFY-369

Hi @mosama1994, use this context manager before your training block in the file you mentioned:
with torch.cuda.amp.autocast():
Or try moving the context accordingly; I don't have the traceback for your error, but it may solve your issue.
Here is the reference for the proposed solution (a solved issue on the PEFT repo).
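For illustration, here is a minimal standalone sketch (not the original script, and assuming a CUDA device is available) of the kind of mismatch autocast resolves: a float32 input hitting a bfloat16 linear layer triggers the same error, and the suggested context manager makes it run:

```python
import torch
import torch.nn as nn

# A bfloat16 linear layer fed a float32 input reproduces the reported error.
linear = nn.Linear(8, 8).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(2, 8, device="cuda")  # float32 input

# Calling linear(x) directly raises the same kind of error as reported above:
# RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16

# Under autocast, the matmul inputs are cast to a common dtype and it runs.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = linear(x)
print(out.dtype)  # torch.bfloat16
```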

Cheers!

@mosama1994
Author

I had already seen this. How do I use it? Do I just wrap the model loading code? Also, why does this issue still exist? The PEFT repo mentions it should have been solved a year ago.

@RUFFY-369

RUFFY-369 commented May 20, 2024

I had already seen this. How do I use it? Do I just wrap the model loading code?

@mosama1994 Basically, the context manager is used to wrap forward() and is closed before backward().

Also, why does this issue still exist? The PEFT repo mentions it should have been solved a year ago.

I just checked and found that the solution was never merged into the main branch (see the commit history), which is why this error still exists.

So if you want to use autocast, you have to do so in the trainer; otherwise it would be simpler to use the PR solution mentioned in the previous message. Modify your src/peft/tuners/lora.py file with the PR's change and check whether it solves the problem.
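For illustration, a minimal sketch of that pattern with a toy model and optimizer (purely illustrative, not the Trainer internals or the linked script; assumes a CUDA device): autocast wraps the forward pass and the loss computation, while backward() and the optimizer step run outside the context.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2).to(device="cuda", dtype=torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(4, 8, device="cuda")
labels = torch.randint(0, 2, (4,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    logits = model(x)                               # forward runs under autocast
    loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                     # backward runs outside the context
optimizer.step()
```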

@younesbelkada
Contributor

Hi @mosama1994
Thanks for the issue! The solution proposed by @RUFFY-369 is correct. You can also activate autocast through the bf16=True argument in TrainingArguments, which you can set through the --bf16 command-line argument if you use the script you shared.
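For reference, a minimal sketch of that alternative; output_dir and the other values here are placeholders, only bf16=True is the relevant setting:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-fsdp-qlora",   # placeholder path
    bf16=True,                        # enable bfloat16 mixed precision (autocast) in the Trainer
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
)
```

With the linked script, the same option is exposed via the --bf16 command-line flag.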

@mosama1994
Author

Hi @younesbelkada, the solution does work: I wrapped the code in autocast and it is working now. I already have bf16 set to True in the TrainingArguments, and that is exactly the setting under which the issue arises. With fp16=True and bf16=False the code runs, but with bf16=True the error appears. Since the solution works I am closing this, but it should be mentioned in the documentation or fixed. Thanks
